Question 1

I have a data frame with text in one of the columns, and I am using regex patterns to look for matches from three lists. When a text contains multiple matches from list_1, I want to create a duplicate row for each of the matches. The one caveat is that the matches must be consecutive, with elements from list_2 and list_3 being optional.

I have an example below for what I would like the desired output to be.

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']

sample DF:

text	match_1	match_2	match_3
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz	chest	bike	day
aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa	nest	NaN	NaN
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like	test	like	hay

desired output:

text	match_1	match_2	match_3
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz	chest	bike	day
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz	test	mike	NaN
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz	west	NaN	NaN
aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa	nest	NaN	NaN
aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa	nest	bike	may
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like	test	like	hay
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like	west	NaN	NaN
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like	west	like	NaN

I hope my description above was not too confusing. My current method is unable to match for text that has several matches from list_1 (as shown in the example above) with the optional matches from list_2 and list_3 being consecutive.

Thanks for all your efforts!

Answer

You can build a regex programmatically from your word lists, using nested levels of optional parts to allow for possibly missing 2nd, 3rd etc. matches:

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']
word_list = [list_1, list_2, list_3]
pattern = r'\b' + r'(?:\b\s+'.join(fr"(?P<match_{i+1}>{'|'.join(w)})" for i, w in enumerate(word_list)) + r'\b' + ''.join(')?' for _ in range(1, len(word_list)))

For your sample data, this gives:

\b(?P<match_1>chest|test|west|nest)(?:\b\s+(?P<match_2>mike|bike|like|pike)(?:\b\s+(?P<match_3>hay|day|may|say)\b)?)?

You can see this working on regex101.
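If you would rather check the pattern locally, the sketch below rebuilds it with a small helper and runs it over the first sample text using only the standard `re` module. The helper name `build_pattern` and the use of `re.escape` are my own additions for illustration (escaping only matters if your words contain regex metacharacters); the resulting pattern should be identical to the one shown above.

```python
import re

word_list = [
    ['chest', 'test', 'west', 'nest'],
    ['mike', 'bike', 'like', 'pike'],
    ['hay', 'day', 'may', 'say'],
]

def build_pattern(word_list):
    # Hypothetical helper: one named group per list, match_1, match_2, ...
    groups = [
        "(?P<match_{}>{})".format(i, "|".join(map(re.escape, words)))
        for i, words in enumerate(word_list, start=1)
    ]
    # Each later group is wrapped in a nested optional non-capturing group.
    opening = r"\b" + r"(?:\b\s+".join(groups) + r"\b"
    closing = ")?" * (len(word_list) - 1)
    return opening + closing

pattern = build_pattern(word_list)
text = "zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz"

# One dict per match; groups that did not take part in the match are None.
matches = [m.groupdict() for m in re.finditer(pattern, text)]
for m in matches:
    print(m)
```

Each dict corresponds to one output row for that text, with `None` where pandas would show NaN.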

You can then use that regex with extractall to find all matches in each text value, joining that result back to the original column.

out = df[['text']].join(df['text'].str.extractall(pattern).droplevel(1)).reset_index(drop=True)

For your sample data that gives the following result:

                                                text match_1 match_2 match_3
0   zzz zzz zz chest bike day zzzz z test mike zz...   chest    bike     day
1   zzz zzz zz chest bike day zzzz z test mike zz...    test    mike     NaN
2   zzz zzz zz chest bike day zzzz z test mike zz...    west     NaN     NaN
3   aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest     NaN     NaN
4   aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest    bike     may
5   ggg gg ggg ggg ggg test like hay ggg gg west ...    test    like     hay
6   ggg gg ggg ggg ggg test like hay ggg gg west ...    west     NaN     NaN
7   ggg gg ggg ggg ggg test like hay ggg gg west ...    west    like     NaN
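For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It assumes the data frame has a single `text` column holding the three sample strings from the question, and it writes the pattern out literally instead of building it from the lists.

```python
import pandas as pd

# Assumed input: only the 'text' column from the question's sample data.
df = pd.DataFrame({"text": [
    "zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz",
    "aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa",
    "ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like",
]})

pattern = (
    r"\b(?P<match_1>chest|test|west|nest)"
    r"(?:\b\s+(?P<match_2>mike|bike|like|pike)"
    r"(?:\b\s+(?P<match_3>hay|day|may|say)\b)?)?"
)

# extractall returns one row per match, indexed by (original row, match number).
matches = df["text"].str.extractall(pattern)

# Dropping the match-number level leaves the original row label, so the join
# repeats each text once per match found in it.
out = df[["text"]].join(matches.droplevel(1)).reset_index(drop=True)
print(out)
```

Because `join` defaults to a left join, a text with no match from list_1 should still appear once in `out`, with NaN in all three match columns.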

Note that using numbered variables such as list_1, list_2 is not good programming practice; you should use a list of lists instead (like word_list above).

Matching several string matches from lists and making a new row for each match
