Python Text to Data Frame with Specific Pattern

2024/11/16 16:30:42

I am trying to convert a bunch of text files into a data frame using Pandas.

Thanks to Stack Overflow's amazing community, I almost got the desired output (OP: Python Text File to Data Frame with Specific Pattern).

Basically I need to turn a text with specific patterns (but sometimes missing data) into a data frame using Pandas.

Here is an example:

Number 01600 London                           Register  4314Some random text...************************************* B ***************************************1 SHARE: 73/1284John SmithBORN: 1960-01-01 ADDR: Streetname 3/2   1000f 4222/2001h 1334/2000i 5774/20004 SHARE: 58/1284Boris MorganBORN:            ADDR: Streetname 4   20005 SHARE: 23/1284James KleinBORN:            ADDR:      c 4222/1988 Supporting Textf 4222/2000 Extra Text
************************************* C ***************************************
More random text...

From the example above, I need to transform the text between ***B*** and ***C*** into a data frame with the following output:

Number Register City Id Share Name Born Address c f h i
01600 4314 London 1 73/1284 John Smith 1960-01-01 Streetname 3/2 1000 NaN 4222/2001 1334/2000 5774/2000
01600 4314 London 4 58/1284 Boris Morgan NaN Streetname 4 2000 NaN NaN NaN NaN
01600 4314 London 5 23/1284 James Klein NaN NaN 4222/1988 Supporting Text 4222/2000 Extra Text NaN NaN

Some of the patterns:

  • The first row for the group contains the word SHARE; before this word is the Id and after it is the Share.

  • The second row contains the name of the person (and should be extracted entirely to the Name variable).

  • The third row contains the birthdate (BORN) and the address (ADDR). Sometimes this information is missing - in these cases, the variables Born and Address should be NaN.

  • When it exists, the fourth row and beyond (which lasts until the next group is reached) starts with a lowercase letter. Each of these rows should be extracted until the end of the paragraph to a variable with the name being the leading lowercase letter.

The code below works when the birthdate and address are available, and when the fourth row and beyond contains only one chunk of information (in the previous example, SHARE: 73/1284 from John Smith has rows f, h and i - all with only one chunk of information and SHARE: 23/1284 from James Klein contains multiple chunks).

import pandas as pdtext = '''Number 01600 London                           Register  4314Some random text...************************************* B ***************************************1 SHARE: 73/1284John SmithBORN: 1960-01-01 ADDR: Streetname 3/2   1000f 4222/2001h 1334/2000i 5774/20004 SHARE: 58/1284Boris MorganBORN:            ADDR: Streetname 4   20005 SHARE: 23/1284James KleinBORN:            ADDR:      c 4222/1988 Supporting Textf 4222/2000 Extra Text
************************************* C ***************************************
More random text...'''text = [i.strip() for i in text.splitlines()] # create a list of linesdata = []# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]# create a list of the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]for i in items:d = {'Number': number, 'Register': register, 'City': city, 'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1], 'Name': i[1], 'Born': i[2].split()[1], }items = list(s.split() for s in i[3:])merged_items = []for i in items:if len(i[0]) == 1 and i[0].isalpha():merged_items.append(i)else:merged_items[-1][-1] = merged_items[-1][-1] + i[0]d.update({name: value for name,value in merged_items})data.append(d)#load the list of dicts as a dataframe
df = pd.DataFrame(data)

Does anyone know how to fix these issues? Thanks in advance.

Answer

I wouldn't use the same variable i for both inner and outer loops. Changing your for loop to the following should work cleaner:

for i in items:d = {'Number': number, 'Register': register, 'City': city, 'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1], 'Name': i[1], }if "ADDR" in i[2]:born, address = i[2].split("ADDR:")d['Born'] = born.replace("BORN:", "").strip()d['Address'] = address.strip()else:d['Born']: i[2].split()[1]if len(i)>3:for j in i[3:]:key, value = j.split(" ", 1)d[key] = valuedata.append(d)#load the list of dicts as a dataframe
df = pd.DataFrame(data)
https://en.xdnf.cn/q/119504.html

Related Q&A

Python Multiprocess OpenCV Webcam Get Request [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.This question was caused by a typo or a problem that can no longer be reproduced. While similar q…

error when trying to run my tensorflow code

This is a follow up question from my latest post: Put input in a tensorflow neural network I precoded a neural network using tensorflow with the MNIST dataset, and with the help of @FinnE was able to c…

ValueError: invalid literal for int() with base 10: Height (mm)

import csv from decimal import *def mean(data_set):return Decimal(sum(data_set)) / len(data_set)def variance(data_set):mean_res = mean(data_set)differences = []squared_res = []for elem in data_set:diff…

Having trouble in getting page source with beautifulsoup

I am trying to get the HTML source of a web page using beautifulsoup. import bs4 as bs import requests import urllib.request sourceUrl=https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-…

Cannot convert from pandas._libs.tslibs.timestamps.Timestamp to datetime.datetime

Im trying to convert from pandas._libs.tslibs.timestamps.Timestamp to datetime.datetime but the change is not saved: type(df_unix.loc[45,LastLoginDate])OUTPUT: pandas._libs.tslibs.timestamps.Timestampt…

string index out of range Python, Django

im using Django to develop a web application. When i try and run it on my web form i am receiving string index out of range error. However, when i hardcode a dictionary in to a python test file it work…

PyQt UI event control segregation

I am a beginner in python but my OOPS concept from Java and Android are strong enough to motivate me in making some tool in python. I am using PyQt for developing the application. In my application th…

How to re-run process Linux after crash? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.This question does not appear to be about a specific programming problem, a software algorithm, or s…

Number of pairs

I am trying to write a code that takes m. a, a list of integers n. b, an integer and returns the number of pairs (m,n) with m,n in a such that |m-n|<=b. So far, Ive got this def nearest_pairs(a, b):…

xml.etree.ElementTree.ParseError: not well-formed

I have the following code:from xml.etree import ElementTreefile_path = some_file_pathdocument = ElementTree.parse(file_path, ElementTree.XMLParser(encoding=utf-8))If my XML looks like the following it …