I need a regex function that counts the instances in a string where 'Word 1' is within N words of 'Word 2'.
For example, here is my dataframe and objective:
Pandas Dataframe Input
import pandas as pd

data = [['ABC123', 'This is the first example sentence the end of sentence one'],
        ['ABC456', 'This is the second example sentence one more sentence to come'],
        ['ABC789', 'There are no more example sentences']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)

Record ID | String
----------|-----------------------
ABC123 | This is the first example sentence the end of sentence one
ABC456 | This is the second example sentence one more sentence to come
ABC789 | There are no more example sentences
Word 1 = 'sentence'
Word 2 = 'the'
N (maximum number of words apart) = 3
Desired Dataframe Output
output_data = [['ABC123', 3], ['ABC456', 1], ['ABC789', 0]]
df = pd.DataFrame(output_data, columns=['Record ID', 'Occurrences Identified'])
print(df)

Record ID | Occurrences Identified
----------|-----------------------
ABC123 | 3
ABC456 | 1
ABC789 | 0
I think the regex part will take the general form below, but I'm not sure how to apply it to my use case in Python, and I'm not sure where to start with an enumerating function.
\b(?:'sentence'\W+(?:\w+\W+){0,3}?'the'|'the'\W+(?:\w+\W+){0,3}?'sentence')\b
I am also interested in simpler non-regex solutions, if any.
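To make the counting rule concrete, here is a minimal non-regex sketch. It assumes that one "occurrence" means one ('Word 1', 'Word 2') pair whose word positions differ by at most N, in either order, and that the text contains no punctuation to strip. The function name `count_near_pairs` is hypothetical, not something from pandas or the standard library.

```python
def count_near_pairs(text, word1, word2, max_dist):
    """Count (word1, word2) pairs whose word positions differ by at most max_dist.

    Assumption: words are separated by whitespace and matching is
    case-insensitive on whole words only.
    """
    words = text.lower().split()
    pos1 = [i for i, w in enumerate(words) if w == word1]
    pos2 = [i for i, w in enumerate(words) if w == word2]
    # Every pair of positions that is close enough counts once.
    return sum(1 for a in pos1 for b in pos2 if abs(a - b) <= max_dist)


rows = [
    ('ABC123', 'This is the first example sentence the end of sentence one'),
    ('ABC456', 'This is the second example sentence one more sentence to come'),
    ('ABC789', 'There are no more example sentences'),
]

results = [(record_id, count_near_pairs(s, 'sentence', 'the', 3))
           for record_id, s in rows]
for record_id, n in results:
    print(record_id, n)
# ABC123 3
# ABC456 1
# ABC789 0
```

On the DataFrame above, the same function could probably be applied per row with something like `df['String'].apply(lambda s: count_near_pairs(s, 'sentence', 'the', 3))`.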
Data = pd.read_sql(query, engine)

# Convert to Pandas DataFrame
nearwordDF = pd.DataFrame(Data)

# Remove non-alpha characters and make all lowercase
nearwordDF['text'] = nearwordDF['text'].str.replace(',', ' ')
nearwordDF['text'] = nearwordDF['text'].str.replace('.', '')
nearwordDF['text'] = nearwordDF['text'].str.replace('?', '')
nearwordDF['text'] = nearwordDF['text'].str.replace('\r', '')
nearwordDF['text'] = nearwordDF['text'].str.lower()

print(nearwordDF)
--------------------------
id text
ABC123 how much money do i have in my money account
ABC456 where is my money
ABC789 hello how are you today what is your name
DEF123 my money market fund is my only money of my account

import re
import pandas as pd

output = []
for i in nearwordDF:
    regex = r'(?:my(?:\s\w+){0,2})\s(?=money)|(?:money(?:\s\w+){0,2})\s(?=my)'
    nearwordDF = re.findall(regex, i[1])
    output.append([i[0], len(nearwordDF)])

df = pd.DataFrame(output, columns=['Record ID', 'Occurrences'])
print(df)

-----------------------------------
# Output
Record ID Occurrences
i 0
t 0
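The 'i' and 't' rows are most likely a side effect of the loop: iterating over a DataFrame directly yields its column labels ('id' and 'text'), so `i[0]` is the first character of each label and `i[1]` is a single letter that the regex never matches. The loop also rebinds `nearwordDF` to the list returned by `re.findall`. A minimal sketch of a row-wise version follows, reusing the same pattern unchanged and assuming a small hand-built DataFrame in place of the `pd.read_sql` result:

```python
import re
import pandas as pd

# Stand-in for the cleaned query result (assumption: same columns as above).
nearwordDF = pd.DataFrame({
    'id': ['ABC123', 'ABC456', 'ABC789', 'DEF123'],
    'text': [
        'how much money do i have in my money account',
        'where is my money',
        'hello how are you today what is your name',
        'my money market fund is my only money of my account',
    ],
})

# Same pattern as in the attempt above, compiled once outside the loop.
pattern = re.compile(
    r'(?:my(?:\s\w+){0,2})\s(?=money)|(?:money(?:\s\w+){0,2})\s(?=my)'
)

# Apply the pattern to each string in the 'text' column,
# rather than iterating over the DataFrame itself.
counts = nearwordDF['text'].apply(lambda s: len(pattern.findall(s)))

df = pd.DataFrame({'Record ID': nearwordDF['id'], 'Occurrences': counts})
print(df)
```

Note that `re.findall` returns non-overlapping matches, so this counts regex matches per row, which is not necessarily the same as counting every nearby pair of words.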