Matching several string matches from lists and making a new row for each match

2024/10/6 22:24:38

I have a data frame with text in one of the columns and I am using regex formatted strings to see if I can find any matches from three lists. However, when there are multiple matches from list 1, I want to make make a duplicate column for each of the matches. The one caveat is that the matches must be in consecutive order, with elements from lists list_2 and list_3 being optional.

I have an example below for what I would like the desired output to be.

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']

sample DF:

text match_1 match_2 match_3
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz chest bike day
aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa nest NaN NaN
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like test like hay

desired output:

text match_1 match_2 match_3
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz chest bike day
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz test mike NaN
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz west NaN NaN
aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa nest NaN NaN
aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa nest bike may
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like test like hay
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like west NaN NaN
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like west like NaN

I hope my description above was not too confusing. My current method is unable to match for text that has several matches from list_1 (as shown in the example above) with the optional matches from list_2 and list_3 being consecutive.

Thanks for your all your efforts!

Answer

You can build a regex programmatically from your word lists, using nested levels of optional parts to allow for possibly missing 2nd, 3rd etc. matches:

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']
word_list = [list_1, list_2, list_3]
pattern = r'\b' + r'(?:\b\s+'.join(fr"(?P<match_{i+1}>{'|'.join(w)})" for i, w in enumerate(word_list)) + r'\b' + ''.join(')?' for _ in range(1, len(word_list)))

For your sample data, this gives:

\b(?P<match_1>chest|test|west|nest)(?:\b\s+(?P<match_2>mike|bike|like|pike)(?:\b\s+(?P<match_3>hay|day|may|say)\b)?)?

You can see this working on regex101.

You can then use that regex with extractall to find all matches in each text value, joining that result back to the original column.

out = df[['text']].join(df['text'].str.extractall(pattern).droplevel(1)).reset_index(drop=True)

For your sample data that gives the following result:

                                                text match_1 match_2 match_3
0   zzz zzz zz chest bike day zzzz z test mike zz...   chest    bike     day
1   zzz zzz zz chest bike day zzzz z test mike zz...    test    mike     NaN
2   zzz zzz zz chest bike day zzzz z test mike zz...    west     NaN     NaN
3   aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest     NaN     NaN
4   aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest    bike     may
5   ggg gg ggg ggg ggg test like hay ggg gg west ...    test    like     hay
6   ggg gg ggg ggg ggg test like hay ggg gg west ...    west     NaN     NaN
7   ggg gg ggg ggg ggg test like hay ggg gg west ...    west    like     NaN

Note that using variables list_1, list_2 is not good programming practice, you should use a list of lists instead (like word_list above).

https://en.xdnf.cn/q/118895.html

Related Q&A

Join and format array of objects in Python

I want to join and format values and array of objects to a string in python. Is there any way for me to do that?url = "https://google.com", search = "thai food", search_res = [{&q…

Copying text from file to specified Excel column [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.Want to improve this question? Update the question so it focuses on one problem only by editing this post.Closed 6…

Name error: Variable not defined

Program calculates the shortest route from point, to line, then to second point. Also I need to say how long is from the start of the line, to where point crosses. My code so far: from math import sqrt…

Error while deploying flask app on apache

I have a file manage.py, import os from app import create_app app = create_app(os.getenv(FLASK_CONFIG) or default) if __name__ == __main__:app.run()manage.py is working fine when tested in debug mode. …

Selenium Python get_element by ID failing

Can someone help me understand why my code fails to find the element by ID. Code below:from selenium import webdriver driver=webdriver.Firefox() driver.get(https://app.waitwhile.com/checkin/lltest3/use…

Pipelining POST requests with python-requests

Assuming that I can verify that a bunch of POST requests are in fact logically independent, how can I set up HTTP pipelining using python-requests and force it to allow POST requests in the pipeline?D…

How to take a whole matrix as a input in Python?

I want to take a whole matrix as an input in Python and store it in a dataframe. Pandas can do it automatically with read_csv function but it requires a CSV file. I want to input/copy-paste a matrix di…

Cannot create environment in anaconda, update conda , install packages

CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://repo.anaconda.com/pkgs/free/win-64/repodata.json.bz2 Elapsed: -An HTTP error occurred when trying to retrieve this URL. HTTP errors are often …

Inverted Triangle in Python-not running

I have to create a program to print an inverted triangle in python. When I was running it in Sublime Text 3 it did not run. By that, I mean that it did not even print a syntax error. def triangle():x =…

How to do Data profile to a table using pandas_profiling

When Im trying to do data profiling one sql server table by using pandas_profiling throwing an error like An attempt has been made to start a new process before thecurrent process has finished its boot…