Find word near other word, within N# of words

2024/10/7 14:22:41

I need a regex function that counts the instances in a string where 'Word 1' is within N words of 'Word 2'.

For example, here is my dataframe and objective:

Pandas Dataframe Input

import pandas as pd

data = [['ABC123', 'This is the first example sentence the end of sentence one'],
        ['ABC456', 'This is the second example sentence one more sentence to come'],
        ['ABC789', 'There are no more example sentences']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)

Record ID | String
ABC123    | This is the first example sentence the end of sentence one
ABC456    | This is the second example sentence one more sentence to come
ABC789    | There are no more example sentences

Word 1 = 'sentence'
Word 2 = 'the'
Within N# of words (displaced) = 3

Desired Dataframe Output

output_data = [['ABC123', 3], ['ABC456', 1], ['ABC789', 0]]
df = pd.DataFrame(output_data, columns=['Record ID', 'Occurrences Identified'])
print(df)

Record ID | Occurrences Identified
ABC123    | 3
ABC456    | 1
ABC789    | 0

I think the regex part will take the general form of this, but I'm not sure how to apply it to my use case in Python and ... I'm not sure where to start with an enumerate function.


I am also interested in simpler non-regex solutions, if any.

Data = pd.read_sql(query, engine)
# Convert to Pandas DataFrame
nearwordDF = pd.DataFrame(Data)
# Remove non-alpha characters and make all lowercase
nearwordDF['text'] = nearwordDF['text'].str.replace(',', ' ')
nearwordDF['text'] = nearwordDF['text'].str.replace('.', '')
nearwordDF['text'] = nearwordDF['text'].str.replace('?', '')
nearwordDF['text'] = nearwordDF['text'].str.replace('\r', '')
nearwordDF['text'] = nearwordDF['text'].str.lower()
print(nearwordDF)

id        text
ABC123    how much money do i have in my money account
ABC456    where is my money
ABC789    hello  how are you today what is your name
DEF123    my money market fund is my only money of my account

import re
import pandas as pd

output = []
for i in nearwordDF:
    regex = r'(?:my(?:\s\w+){0,2})\s(?=money)|(?:money(?:\s\w+){0,2})\s(?=my)'
    nearwordDF = re.findall(regex, i[1])
    output.append([i[0], len(nearwordDF)])
df = pd.DataFrame(output, columns=['Record ID', 'Occurrences'])

# Output
Record ID    Occurrences
i            0
t            0

Maybe regex is not the right solution here.

If you split your input string into a list, you can then locate the indices of words 1 and 2, and calculate how far away they are from each other:

string = 'This is the first example sentence the end of sentence one'
string_list = string.split(' ')
indices_word_1 = [i for i, x in enumerate(string_list) if x == "sentence"]
indices_word_2 = [i for i, x in enumerate(string_list) if x == "the"]
result = 0
for i in indices_word_1:
    for j in indices_word_2:
        _distance = abs(i - j)
        if _distance <= 3:
            result += 1

In this case the result is 3.
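
One caveat with split(' '): it is sensitive to case, punctuation and repeated spaces (two consecutive spaces produce an empty token, which shifts the distances). Below is a minimal sketch of the same idea with a more forgiving tokenizer. The name count_near and the choice of \w+ as the definition of a "word" are my own assumptions, not something from the question.

```python
import re


def count_near(text, word_1, word_2, max_distance):
    """Count (word_1, word_2) pairs whose token positions differ by at most max_distance."""
    # Assumption: a "word" is a run of letters, digits or underscores,
    # and matching is case-insensitive.
    tokens = re.findall(r"\w+", text.lower())
    positions_1 = [i for i, tok in enumerate(tokens) if tok == word_1.lower()]
    positions_2 = [i for i, tok in enumerate(tokens) if tok == word_2.lower()]
    # Same pairwise distance check as the nested loops above.
    return sum(1 for i in positions_1 for j in positions_2 if abs(i - j) <= max_distance)


print(count_near('This is the first example sentence the end of sentence one',
                 'sentence', 'the', 3))
```

This still counts every qualifying pair, so one occurrence of a word can contribute to several matches, exactly as in the loop version.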

@tshobe, here is one way to implement my suggestion:

import pandas


def check_occurences(string, word_1='sentence', word_2='the', allowed_distance=3):
    string_list = string.split(' ')
    indices_word_1 = [i for i, x in enumerate(string_list) if x == word_1]
    indices_word_2 = [i for i, x in enumerate(string_list) if x == word_2]
    result = 0
    for i in indices_word_1:
        for j in indices_word_2:
            _distance = abs(i - j)
            if _distance <= allowed_distance:
                result += 1
    return result


def main():
    data = [['ABC123', 'This is the first example sentence the end of sentence one'],
            ['ABC456', 'This is the second example sentence one more sentence to come'],
            ['ABC789', 'There are no more example sentences']]
    df = pandas.DataFrame(data, columns=['Record ID', 'String'])
    results_df = pandas.DataFrame(columns=['Record ID', 'Occurrences'])
    results_df['Record ID'] = df['Record ID']
    results_df['Occurrences'] = df['String'].apply(lambda x: check_occurences(x))
    print(results_df)


if __name__ == "__main__":
    main()
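
As for the my/money attempt in the question: the loop printed "i" and "t" because iterating over a DataFrame directly yields its column labels ('id' and 'text'), not its rows, so i[0] and i[1] were single characters of those labels. Here is a rough sketch of the same counting applied per row instead. The helper name count_pairs, the hard-coded sample frame (standing in for the SQL query result) and the distance of 3 (my reading of the {0,2} in the regex as "up to two words in between") are all assumptions.

```python
import pandas as pd


def count_pairs(text, word_1, word_2, max_distance):
    # str.split() with no argument collapses runs of whitespace, so the
    # double space in the third row does not create an empty token.
    tokens = text.split()
    positions_1 = [i for i, tok in enumerate(tokens) if tok == word_1]
    positions_2 = [i for i, tok in enumerate(tokens) if tok == word_2]
    return sum(1 for i in positions_1 for j in positions_2 if abs(i - j) <= max_distance)


# Sample rows copied from the question (already lowercased, punctuation removed).
nearword_df = pd.DataFrame({
    'id': ['ABC123', 'ABC456', 'ABC789', 'DEF123'],
    'text': [
        'how much money do i have in my money account',
        'where is my money',
        'hello  how are you today what is your name',
        'my money market fund is my only money of my account',
    ],
})

# Apply the helper to each row's text rather than iterating over the frame itself.
result = pd.DataFrame({
    'Record ID': nearword_df['id'],
    'Occurrences': nearword_df['text'].apply(count_pairs, args=('my', 'money', 3)),
})
print(result)
```

For these four rows the counts come out as 1, 1, 0 and 3.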

