I have a data frame with text in one of the columns and I am using regex formatted strings to see if I can find any matches from three lists. However, when there are multiple matches from list 1, I want to make make a duplicate column for each of the matches. The one caveat is that the matches must be in consecutive order, with elements from lists list_2 and list_3 being optional.
I have an example below for what I would like the desired output to be.
list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']
sample DF:
text | match_1 | match_2 | match_3 |
---|---|---|---|
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | chest | bike | day |
aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa | nest | NaN | NaN |
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | test | like | hay |
desired output:
text | match_1 | match_2 | match_3 |
---|---|---|---|
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | chest | bike | day |
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | test | mike | NaN |
zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | west | NaN | NaN |
aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa | nest | NaN | NaN |
aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa | nest | bike | may |
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | test | like | hay |
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | west | NaN | NaN |
ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | west | like | NaN |
I hope my description above was not too confusing. My current method is unable to match for text that has several matches from list_1 (as shown in the example above) with the optional matches from list_2 and list_3 being consecutive.
Thanks for your all your efforts!