Removing named entities from a document using spaCy

2024/10/15 13:15:17

I am trying to remove the words that spaCy considers named entities from a document, i.e. remove "Sweden" and "Nokia" from the example string below. I could not find a way around the problem that entities are stored as spans: comparing them with single tokens from a spaCy doc raises an error.

In a later step, this process is supposed to be a function applied to several text documents stored in a pandas data frame.

I would appreciate any kind of help and advice on how to maybe better post questions as this is my first one here.


nlp = spacy.load('en')
text_data = u'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)
text_no_namedentities = []
for word in document:
    if word not in document.ents:
        text_no_namedentities.append(word)
return " ".join(text_no_namedentities)

It creates the following error:

TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got spacy.tokens.span.Span)

Answer

The following approach compares token text against entity text. It does not handle entities that cover multiple tokens.

import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
    if item.text in ents:
        pass
    else:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

Output

'New York is in'

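Here "USA" is removed correctly, but "New York" is not: the entity text "New York" never equals the text of a single token ("New" or "York"), so both tokens are kept.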
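If you want to keep working with document.ents, one possible workaround is to collect the token indices each entity span covers and skip those tokens. The sketch below is only an illustration: the function name remove_entity_spans is made up for this example, and it assumes that doc is a spaCy Doc, relying on the Span.start and Span.end attributes (token indices, end exclusive) and on Token.i.

```python
def remove_entity_spans(doc):
    """Join the tokens of `doc` that are not covered by any entity span.

    Hypothetical helper for illustration; `doc` is assumed to be a spaCy Doc.
    """
    # Each entity is a Span; ent.start and ent.end are token indices
    # into the Doc, with ent.end exclusive.
    entity_token_ids = set()
    for ent in doc.ents:
        entity_token_ids.update(range(ent.start, ent.end))

    # token.i is the index of the token within the Doc.
    return " ".join(
        token.text for token in doc if token.i not in entity_token_ids
    )
```

Called as remove_entity_spans(nlp(text_data)), this compares integer indices rather than a Token with a Span, so the TypeError from the question does not occur and multi-token entities are dropped as a whole.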

Solution

import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
print(" ".join([token.text for token in document if not token.ent_type_]))

Output

'is in'
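The question also mentions applying this to several documents stored in a pandas data frame. A minimal sketch of how that might look is given below. The function names, the column names "text" and "text_no_entities", and the demo function are assumptions made for this example, not part of the original answer. It uses nlp.pipe, which processes an iterable of strings and yields one Doc per string.

```python
def remove_entities(doc):
    """Join the tokens of `doc` that are not part of a named entity."""
    # token.ent_type_ is an empty string for tokens outside any entity.
    return " ".join(token.text for token in doc if not token.ent_type_)


def remove_entities_from_texts(texts, nlp):
    """Apply remove_entities to every string in `texts`.

    `nlp` is assumed to be a loaded spaCy pipeline; nlp.pipe batches
    the texts instead of calling nlp() once per row.
    """
    return [remove_entities(doc) for doc in nlp.pipe(texts)]


def demo():
    # Assumes spaCy, pandas and the en_core_web_sm model are installed.
    import pandas as pd
    import spacy

    nlp = spacy.load("en_core_web_sm")
    # Hypothetical data frame with one text column.
    df = pd.DataFrame({"text": ["New York is in USA"]})
    df["text_no_entities"] = remove_entities_from_texts(df["text"], nlp)
    return df
```

Joining on a single space does not restore the original spacing around punctuation. If that matters, token.text_with_whitespace could be concatenated instead of token.text.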

