Extracting information from pandas dataframe

2024/10/14 7:16:47

I have the dataframe below. I want to build a rule engine to extract tokens that match a pattern, e.g. "UNITED STATES". What is the best way to do it? Is there anything like regex or CGUL for this kind of task? Any suggestions would be appreciated.

WORD_INDEX  WORD_TOKEN  WORD_POS
0           TRUMP       PROPN
1           IS          ADP
2           THE         ADP
3           PRESIDENT   NOUN
4           OF          ADP
5           THE         ADP
6           UNITED      NOUN
7           STATES      NOUN

I want to start with WORD_POS and find the matching WORD_TOKEN. Any idea how to do that? For example, I want to find the WORD_TOKENs where the WORD_POS is NOUN and the next WORD_POS is also NOUN.

Answer

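You may want to use the str.contains string method, which treats its pattern as a regular expression by default (regex=True). For example:

# No capturing group: a pattern like '(UNITED|STATES)' makes pandas warn about match groups.
mask = df['WORD_TOKEN'].str.contains('UNITED|STATES')
print(df[mask])

This matches any token containing "UNITED" or "STATES". The match is case-sensitive unless you pass case=False, so it does not match "united" or "states" as written. Note that it filters on the token text only and does not look at WORD_POS.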
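For the part of the question about starting from WORD_POS, one possible approach is to compare each row's tag with the next row's tag using shift. The sketch below rebuilds the dataframe from the question and finds NOUN followed by NOUN. The names is_noun, starts_pair, next_token and phrases are illustrative, not from the original post, and the code assumes the rows are already in sentence order.

```python
import pandas as pd

# Dataframe from the question, rebuilt by hand.
df = pd.DataFrame({
    "WORD_INDEX": range(8),
    "WORD_TOKEN": ["TRUMP", "IS", "THE", "PRESIDENT", "OF", "THE", "UNITED", "STATES"],
    "WORD_POS": ["PROPN", "ADP", "ADP", "NOUN", "ADP", "ADP", "NOUN", "NOUN"],
})

is_noun = df["WORD_POS"].eq("NOUN")

# True where this row is a NOUN and the following row is also a NOUN.
# shift(-1) pulls the next row's value up; the last row has no successor,
# so it is filled with False.
starts_pair = is_noun & is_noun.shift(-1, fill_value=False)

# Join each matching token with the token that follows it.
next_token = df["WORD_TOKEN"].shift(-1)
phrases = (df.loc[starts_pair, "WORD_TOKEN"] + " " + next_token[starts_pair]).tolist()

print(phrases)  # ['UNITED STATES']
```

A longer rule (for example three tags in a row) could be written the same way by combining additional shifted masks such as shift(-2).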

