Find word near other word, within N# of words

2024/10/7 14:22:41

I need a regex-based function that counts the instances in a string where 'Word 1' is within N words of 'Word 2'.

For example, here is my dataframe and objective:

Pandas Dataframe Input

import pandas as pd

data = [['ABC123', 'This is the first example sentence the end of sentence one'],
        ['ABC456', 'This is the second example sentence one more sentence to come'],
        ['ABC789', 'There are no more example sentences']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)

Record ID | String
----------|-----------------------
ABC123    | This is the first example sentence the end of sentence one
ABC456    | This is the second example sentence one more sentence to come
ABC789    | There are no more example sentences

Word 1 = 'sentence'
Word 2 = 'the'
Within N# of words (displaced) = 3

Desired Dataframe Output

output_data = [['ABC123', 3], ['ABC456', 1], ['ABC789', 0]]
df = pd.DataFrame(output_data, columns=['Record ID', 'Occurrences Identified'])
print(df)

Record ID | Occurrences Identified
----------|-----------------------
ABC123    | 3
ABC456    | 1
ABC789    | 0

I think the regex will take the general form below, but I'm not sure how to apply it to my use case in Python, and I'm not sure where to start with an enumerate function.

\b(?:'sentence'\W+(?:\w+\W+){0,3}?'the'|'the'\W+(?:\w+\W+){0,3}?'sentence')\b

I am also interested in simpler non-regex solutions, if any.

Data = pd.read_sql(query, engine)

# Convert to Pandas DataFrame
nearwordDF = pd.DataFrame(Data)

# Remove non-alpha characters and make all lowercase
nearwordDF['text'] = nearwordDF['text'].str.replace(',', ' ')
nearwordDF['text'] = nearwordDF['text'].str.replace('.', '')
nearwordDF['text'] = nearwordDF['text'].str.replace('?', '')
nearwordDF['text'] = nearwordDF['text'].str.replace('\r', '')
nearwordDF['text'] = nearwordDF['text'].str.lower()
print(nearwordDF)

--------------------------
id        text
ABC123    how much money do i have in my money account
ABC456    where is my money
ABC789    hello  how are you today what is your name
DEF123    my money market fund is my only money of my account

import re
import pandas as pd

output = []
for i in nearwordDF:
    regex = r'(?:my(?:\s\w+){0,2})\s(?=money)|(?:money(?:\s\w+){0,2})\s(?=my)'
    nearwordDF = re.findall(regex, i[1])
    output.append([i[0], len(nearwordDF)])

df = pd.DataFrame(output, columns=['Record ID', 'Occurrences'])
print(df)

-----------------------------------
# Output
Record ID    Occurrences
i            0
t            0

Answer

Maybe regex is not the right solution here.
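
One reason a regex is awkward for counting here, shown as a small sketch: `re.findall` returns non-overlapping matches, so a word that belongs to two qualifying pairs is consumed by the first match. The pattern below is an adaptation of the one in the question (quotes removed, and `{0,2}` intervening words assumed to correspond to a distance of 3); it is an illustration, not a recommended solution.

```python
import re

text = 'This is the first example sentence the end of sentence one'

# Assumption: at most 2 words between the pair, i.e. a distance of at most 3.
pattern = re.compile(
    r"\b(?:sentence\W+(?:\w+\W+){0,2}?the"
    r"|the\W+(?:\w+\W+){0,2}?sentence)\b"
)

# findall() scans left to right and never reuses text it has already matched.
matches = pattern.findall(text)
print(matches)
# ['the first example sentence', 'the end of sentence']
```

Only two matches are found, because the adjacent pair "sentence the" in the middle overlaps the first match and is skipped, while the desired count for this string is 3.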

If you split your input string into a list, you can then locate the indices of words 1 and 2, and calculate how far away they are from each other:

string = 'This is the first example sentence the end of sentence one'
string_list = string.split(' ')
indices_word_1 = [i for i, x in enumerate(string_list) if x == "sentence"]
indices_word_2 = [i for i, x in enumerate(string_list) if x == "the"]
result = 0
for i in indices_word_1:
    for j in indices_word_2:
        _distance = abs(i - j)
        if _distance <= 3:
            result += 1

In this case the result is 3.
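
Splitting on a single space keeps punctuation attached to words and is case-sensitive, so "sentence," or "The" would not be counted. A minimal sketch of one way around that, assuming it is acceptable to lowercase the text and treat any run of word characters as a word (the name `count_near` and its parameters are hypothetical, not from the original post):

```python
import re


def count_near(text, word_1, word_2, max_distance):
    """Count (word_1, word_2) pairs whose positions differ by at most max_distance."""
    # \w+ drops punctuation; note it also splits contractions such as "don't".
    tokens = re.findall(r"\w+", text.lower())
    positions_1 = [i for i, tok in enumerate(tokens) if tok == word_1.lower()]
    positions_2 = [i for i, tok in enumerate(tokens) if tok == word_2.lower()]
    return sum(
        1
        for i in positions_1
        for j in positions_2
        if abs(i - j) <= max_distance
    )


# Same words as the example above, with added punctuation and mixed case.
print(count_near('This is THE first example sentence, the end of sentence one.',
                 'sentence', 'the', 3))
# 3
```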

@tshobe, here is one way to implement my suggestion:

import pandas


def check_occurences(string, word_1='sentence', word_2='the', allowed_distance=3):
    string_list = string.split(' ')
    indices_word_1 = [i for i, x in enumerate(string_list) if x == word_1]
    indices_word_2 = [i for i, x in enumerate(string_list) if x == word_2]
    result = 0
    for i in indices_word_1:
        for j in indices_word_2:
            _distance = abs(i - j)
            if _distance <= allowed_distance:
                result += 1
    return result


def main():
    data = [['ABC123', 'This is the first example sentence the end of sentence one'],
            ['ABC456', 'This is the second example sentence one more sentence to come'],
            ['ABC789', 'There are no more example sentences']]
    df = pandas.DataFrame(data, columns=['Record ID', 'String'])
    results_df = pandas.DataFrame(columns=['Record ID', 'Occurrences'])
    results_df['Record ID'] = df['Record ID']
    results_df['Occurrences'] = df['String'].apply(lambda x: check_occurences(x))
    print(results_df)


if __name__ == "__main__":
    main()
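
A note on the loop in the question, which printed `i` and `t` as record IDs: iterating directly over a DataFrame yields its column labels (`'id'` and `'text'`), not its rows, so `i[0]` is just the first character of each label. The sketch below applies the same index-distance idea to the question's second dataset through `Series.apply`. The helper name `count_pairs` is hypothetical, and the distance of 3 is an assumption read off the `{0,2}` in the question's regex.

```python
import pandas as pd


def count_pairs(text, word_1, word_2, allowed_distance):
    """Count (word_1, word_2) pairs at most allowed_distance words apart."""
    words = text.split()  # split() with no argument collapses repeated spaces
    first = [i for i, w in enumerate(words) if w == word_1]
    second = [i for i, w in enumerate(words) if w == word_2]
    return sum(1 for i in first for j in second if abs(i - j) <= allowed_distance)


nearword_df = pd.DataFrame({
    'id': ['ABC123', 'ABC456', 'ABC789', 'DEF123'],
    'text': [
        'how much money do i have in my money account',
        'where is my money',
        'hello  how are you today what is your name',
        'my money market fund is my only money of my account',
    ],
})

# Iterating over the DataFrame itself yields column labels, not rows.
column_labels = list(nearword_df)
print(column_labels)  # ['id', 'text']

# apply() calls count_pairs once per value in the 'text' column.
nearword_df['Occurrences'] = nearword_df['text'].apply(
    count_pairs, args=('my', 'money', 3)
)
print(nearword_df[['id', 'Occurrences']])
```

With these assumptions the counts come out as 1, 1, 0 and 3 for the four records.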