Question 1

I need a regex-based function that counts the instances in a string where 'Word 1' is within N words of 'Word 2'.

For example, here is my dataframe and objective:

Pandas Dataframe Input

import pandas as pd

data = [['ABC123', 'This is the first example sentence the end of sentence one'],
        ['ABC456', 'This is the second example sentence one more sentence to come'],
        ['ABC789', 'There are no more example sentences']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)

Record ID | String
----------|-----------------------
ABC123    | This is the first example sentence the end of sentence one
ABC456    | This is the second example sentence one more sentence to come
ABC789    | There are no more example sentences

Word 1 = 'sentence'
Word 2 = 'the'
Maximum distance in words (N) = 3

Desired Dataframe Output

output_data = [['ABC123', 3], ['ABC456', 1], ['ABC789', 0]]
df = pd.DataFrame(output_data, columns=['Record ID', 'Occurrences Identified'])
print(df)

Record ID | Occurrences Identified
----------|-----------------------
ABC123    | 3
ABC456    | 1
ABC789    | 0

I think the regex will take roughly the form below, but I'm not sure how to apply it to my use case in Python, and I'm not sure where to start with a function that counts the matches.

\b(?:'sentence'\W+(?:\w+\W+){0,3}?'the'|'the'\W+(?:\w+\W+){0,3}?'sentence')\b
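
One way to apply a pattern of this shape in Python, sketched under a few assumptions: the quotes around the words are dropped (inside a pattern they would be matched literally), "within N words" is read as a difference in word positions of at most N, every (Word 1, Word 2) pair is counted, and matching is case-sensitive. The function name `count_near_regex` and its parameters are hypothetical. Because `re.findall` never returns overlapping matches, the sketch wraps the pattern in a zero-width lookahead and runs it once per gap size:

```python
import re

import pandas as pd


def count_near_regex(text, word_1, word_2, max_distance):
    """Count (word_1, word_2) pairs whose word positions differ by at most max_distance."""
    w1, w2 = re.escape(word_1), re.escape(word_2)
    total = 0
    # gap = number of words strictly between the two target words
    for gap in range(max_distance):
        between = rf"(?:\w+\W+){{{gap}}}"
        # The lookahead consumes nothing, so overlapping occurrences are all found.
        pattern = rf"(?=\b(?:{w1}\W+{between}{w2}|{w2}\W+{between}{w1})\b)"
        total += len(re.findall(pattern, text))
    return total


data = [['ABC123', 'This is the first example sentence the end of sentence one'],
        ['ABC456', 'This is the second example sentence one more sentence to come'],
        ['ABC789', 'There are no more example sentences']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
df['Occurrences Identified'] = df['String'].apply(
    lambda s: count_near_regex(s, 'sentence', 'the', 3)
)
print(df[['Record ID', 'Occurrences Identified']])
```

Fixing the gap size per pass means each starting word can match at most once per pass, which is what lets the total count pairs rather than just first matches.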

I am also interested in simpler non-regex solutions, if any.

Data = pd.read_sql(query, engine)

# Convert to Pandas DataFrame
nearwordDF = pd.DataFrame(Data)

# Remove non-alpha characters and make all lowercase
nearwordDF['text'] = nearwordDF['text'].str.replace(',', ' ')
nearwordDF['text'] = nearwordDF['text'].str.replace('.', '')
nearwordDF['text'] = nearwordDF['text'].str.replace('?', '')
nearwordDF['text'] = nearwordDF['text'].str.replace('\r', '')
nearwordDF['text'] = nearwordDF['text'].str.lower()
print(nearwordDF)

--------------------------
id        text
ABC123    how much money do i have in my money account
ABC456    where is my money
ABC789    hello  how are you today what is your name
DEF123    my money market fund is my only money of my account

import re
import pandas as pd

output = []
for i in nearwordDF:
    regex = r'(?:my(?:\s\w+){0,2})\s(?=money)|(?:money(?:\s\w+){0,2})\s(?=my)'
    nearwordDF = re.findall(regex, i[1])
    output.append([i[0], len(nearwordDF)])

df = pd.DataFrame(output, columns=['Record ID', 'Occurrences'])
print(df)

-----------------------------------
# Output
Record ID    Occurrences
i            0
t            0
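
The `i` and `t` rows arise because iterating over a DataFrame directly yields its column labels (`'id'` and `'text'`), so `i[0]` is just the first character of each label. Here is a minimal sketch of one way to iterate over rows instead, assuming the cleaned frame has the `id` and `text` columns printed above. The sample rows are copied from that printout as a stand-in for the SQL result, the regex is unchanged from the attempt, and the name `matches` is introduced here so that `nearwordDF` is not overwritten inside the loop:

```python
import re

import pandas as pd

# Sample rows copied from the printed frame (stand-in for the SQL query result).
nearwordDF = pd.DataFrame(
    [
        ['ABC123', 'how much money do i have in my money account'],
        ['ABC456', 'where is my money'],
        ['ABC789', 'hello  how are you today what is your name'],
        ['DEF123', 'my money market fund is my only money of my account'],
    ],
    columns=['id', 'text'],
)

regex = r'(?:my(?:\s\w+){0,2})\s(?=money)|(?:money(?:\s\w+){0,2})\s(?=my)'

output = []
# itertuples yields one tuple per row, so record_id and text are actual cell values.
for record_id, text in nearwordDF.itertuples(index=False):
    matches = re.findall(regex, text)
    output.append([record_id, len(matches)])

df = pd.DataFrame(output, columns=['Record ID', 'Occurrences'])
print(df)
```

This only fixes the iteration. Whether the regex counts what is intended is a separate question, since `re.findall` returns non-overlapping matches.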

Answer

Maybe regex is not the right solution here.

If you split your input string into a list, you can then locate the indices of words 1 and 2, and calculate how far away they are from each other:

string = 'This is the first example sentence the end of sentence one'
string_list = string.split(' ')
indices_word_1 = [i for i, x in enumerate(string_list) if x == "sentence"]
indices_word_2 = [i for i, x in enumerate(string_list) if x == "the"]
result = 0
for i in indices_word_1:
    for j in indices_word_2:
        _distance = abs(i - j)
        if _distance <= 3:
            result += 1

In this case the result is 3.

@tshobe, here is one way to implement my suggestion:

import pandas


def check_occurences(string, word_1='sentence', word_2='the', allowed_distance=3):
    string_list = string.split(' ')
    indices_word_1 = [i for i, x in enumerate(string_list) if x == word_1]
    indices_word_2 = [i for i, x in enumerate(string_list) if x == word_2]
    result = 0
    for i in indices_word_1:
        for j in indices_word_2:
            _distance = abs(i - j)
            if _distance <= allowed_distance:
                result += 1
    return result


def main():
    data = [['ABC123', 'This is the first example sentence the end of sentence one'],
            ['ABC456', 'This is the second example sentence one more sentence to come'],
            ['ABC789', 'There are no more example sentences']]
    df = pandas.DataFrame(data, columns=['Record ID', 'String'])
    results_df = pandas.DataFrame(columns=['Record ID', 'Occurrences'])
    results_df['Record ID'] = df['Record ID']
    results_df['Occurrences'] = df['String'].apply(lambda x: check_occurences(x))
    print(results_df)


if __name__ == "__main__":
    main()
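
A variation on the same idea, sketched under the assumption that punctuation, letter case, and repeated spaces should not affect the count. `split(' ')` produces empty strings for double spaces and keeps punctuation attached to words, so this version tokenizes with `re.findall(r'\w+', ...)` instead. The function name `count_near` and its parameters are hypothetical and not part of the answer above:

```python
import re


def count_near(text, word_1, word_2, max_distance=3):
    """Count (word_1, word_2) pairs at most max_distance word positions apart."""
    # Lowercase the text, then keep only runs of word characters as tokens.
    tokens = re.findall(r"\w+", text.lower())
    w1, w2 = word_1.lower(), word_2.lower()
    positions_1 = [i for i, tok in enumerate(tokens) if tok == w1]
    positions_2 = [i for i, tok in enumerate(tokens) if tok == w2]
    # Compare every position of word_1 against every position of word_2.
    return sum(
        1
        for i in positions_1
        for j in positions_2
        if abs(i - j) <= max_distance
    )


print(count_near('This is the first example sentence, the end of sentence one.',
                 'sentence', 'the'))
```

It can be applied to a column the same way as above, for example `df['String'].apply(lambda s: count_near(s, 'sentence', 'the', 3))`.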

Find word near other word, within N# of words
