Extract GPS coordinates from .docx file with python

2024/10/9 2:30:58

I have a tedious task to do, and I need some help from Python. Please see this Word document.

I need to extract the text and GPS coordinates from each row. There are currently over 100 coordinates in 10 docx files. My "hefty" Python knowledge got me this far:

from docx import Document
import re

main_file = Document("D:/DOCUMENTS/Google_Link/1  Category I/1  Category I.docx")
table = main_file.tables[1]  # this is same for every document

data = []
keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)
    if i == 0:
        keys = tuple(text)
        continue
    row_data = tuple(text)
    data.append(row_data)

regexReference = re.compile("(C.-)\w+")
colReference = [item[1] for item in data]
listReference = filter(regexReference.match, colReference)
for i in listReference:
    print i.encode('UTF-8')

With this I can print the 16 reference ids from column 2. Please guide me to print something like this for each row:

C1-20701-17-1
some site, some region
The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires some repair/maintenance works including electrical wiring and electrical lights and appliances like ceiling fans supplies. Detail specification of the works are attached
x = 91°38'28.2"E
y = 22°40'34.3"N

These XY locations and descriptions will be used afterwards to create KML files to attach to each document. I'd prefer a separate variable for each part of the section above (ref id, location, description, x and y) so that I can automate that step as well.
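
For the KML step, note that KML stores points as decimal degrees in longitude,latitude order, so strings such as 22°40'34.3"N need converting first. Here is a minimal Python 3 sketch, assuming the coordinates always follow the degree-minute-second format shown above. `dms_to_decimal` and `placemark` are hypothetical helper names, not part of any library.

```python
import re
from xml.sax.saxutils import escape

# Assumption: coordinates look exactly like 22°40'34.3"N (degrees, minutes,
# seconds, then a hemisphere letter). Other layouts are rejected.
DMS_RE = re.compile(r"""(\d{1,3})°(\d{1,2})'(\d{1,2}(?:\.\d+)?)"([NSEW])""")


def dms_to_decimal(dms):
    """Convert a string like 22°40'34.3"N to signed decimal degrees."""
    match = DMS_RE.fullmatch(dms.strip())
    if match is None:
        raise ValueError("unrecognised coordinate: {!r}".format(dms))
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60.0 + float(seconds) / 3600.0
    # South and west are negative in decimal notation.
    return -value if hemisphere in "SW" else value


def placemark(name, description, x, y):
    """Build one KML <Placemark>; x is the longitude, y the latitude."""
    return (
        "<Placemark>"
        "<name>{}</name>"
        "<description>{}</description>"
        "<Point><coordinates>{:.6f},{:.6f}</coordinates></Point>"
        "</Placemark>"
    ).format(escape(name), escape(description),
             dms_to_decimal(x), dms_to_decimal(y))


print(placemark("C1-20701-17-1", "Repair works at the CMC Office",
                "91°38'28.2\"E", "22°40'34.3\"N"))
```

The placemarks would still need wrapping in a `<kml><Document>…</Document></kml>` envelope before the file is valid KML.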

demo docx

Answer

I'm not sure whether this will work if some files follow a different pattern (P.S. I'm using Python 2.7.11):

# -*- coding: utf-8 -*-
from docx import Document
import sys
import os
import re

reload(sys)
sys.setdefaultencoding('utf8')

for root, dirs, files in os.walk("."):
    for name in files:
        doc_file = os.path.join(root, name)
        if doc_file.endswith('docx'):
            main_file = Document(doc_file)
            table = main_file.tables[1]  # this is same for every document
            data = []
            keys = None
            for i, row in enumerate(table.rows):
                text = (cell.text for cell in row.cells)
                if i == 0:
                    keys = tuple(text)
                    continue
                row_data = tuple(text)
                data.append(row_data)

            regexReference = re.compile("(C.-[0-9-]+)")
            regexCoordinate = re.compile(r'(N-(.{,12})([0-9]|\')|[0-9].{,12}N)[;, ]+(E-(.{,12})([0-9]|\')|[0-9].{,12}E)')
            result = []
            for item in data:
                tmp = dict()
                matchReference = regexReference.search(item[1])
                matchCoordinate = regexCoordinate.search(unicode(item[2]))
                if matchReference:
                    tmp['reference'] = matchReference.group()
                if matchCoordinate:
                    tmp['x'] = matchCoordinate.group(1)
                    tmp['y'] = matchCoordinate.group(4)
                tmp['description'] = unicode(item[2])
                tmp['location'] = unicode(item[3])
                result.append(tmp)

            for rs in result:
                if 'reference' in rs:
                    for k, v in rs.iteritems():
                        print('{} = {}'.format(k, v))
                    print

# Output:
# --------------------------------
# y = 91°38'28.2"E
# x = 22°40'34.3"N
# description = The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires some repair/maintenance works including electrical wiring and electrical lights and appliances like ceiling fans supplies. Detail specification of the works are attached.
# reference = C1-20701-17-1
# location = xxxxx Site, c Region
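
For anyone on Python 3, the per-row extraction can be sketched without `reload(sys)`, `unicode` or `iteritems`. This is only a rough sketch under a few assumptions. The column order (index 1 = reference id, 2 = description, 3 = location) is copied from the code above. The coordinate regex is a simplified one that handles only the 22°40'34.3"N style from the question, not the `N-…` variants. `parse_row` is a hypothetical helper that takes the cell texts of one row as plain strings. Unlike the code above, it stores the E/W value in `x` and the N/S value in `y`, as the question asks.

```python
import re

REFERENCE_RE = re.compile(r"C.-[0-9-]+")
# Simplified assumption: only the 22°40'34.3"N notation is recognised.
DMS_RE = re.compile(r"""\d{1,3}°\d{1,2}'\d{1,2}(?:\.\d+)?"[NSEW]""")


def parse_row(cells):
    """Return named fields for one table row, or None if it has no reference id."""
    ref = REFERENCE_RE.search(cells[1])
    if ref is None:
        return None
    record = {
        "reference": ref.group(),
        "description": cells[2],
        "location": cells[3],
        "x": None,  # longitude (E/W), filled in below if found
        "y": None,  # latitude (N/S), filled in below if found
    }
    for coord in DMS_RE.findall(cells[2]):
        if coord[-1] in "EW":
            record["x"] = coord
        else:
            record["y"] = coord
    return record


# Stand-in for tuple(cell.text for cell in row.cells) from python-docx.
row = (
    "1",
    "C1-20701-17-1",
    "The existing CMC Office at Bariyodhala (22°40'34.3\"N; 91°38'28.2\"E) "
    "requires some repair/maintenance works.",
    "some site, some region",
)
record = parse_row(row)
for key in ("reference", "location", "description", "x", "y"):
    print("{} = {}".format(key, record[key]))
```

Rows without a coordinate keep `x` and `y` as `None`, so they are easy to filter out before writing KML.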