Determine if a text extract from spacy is a complete sentence

2024/10/12 3:23:47

We are working on sentences extracted from a PDF. The problem is that the extracted text includes titles, footers, the table of contents, etc. Is there a way to determine whether a sentence we get when we pass the document to spaCy is a complete sentence? Is there a way to filter out sentence fragments such as titles?

Answer

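As a working heuristic, a complete sentence contains at least one subject, one predicate, and one object, and closes with punctuation. Subject and object are almost always nouns or pronouns, and the predicate is built around a verb. (Intransitive sentences such as "Birds fly." have no object, so this rule will reject some valid sentences.)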

Thus you need to check if your sentence contains two nouns, one verb and closes with punctuation:

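import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I. Introduction\nAlfred likes apples! A car runs over a red light.")
for sent in doc.sents:
    # Only consider spans that start with a title-cased token and end with punctuation
    if sent[0].is_title and sent[-1].is_punct:
        has_noun = 2  # nouns still required (subject and object)
        has_verb = 1  # verbs still required (predicate)
        for token in sent:
            if token.pos_ in ["NOUN", "PROPN", "PRON"]:
                has_noun -= 1
            elif token.pos_ == "VERB":
                has_verb -= 1
        # Both counters reach zero or below once the requirements are met
        if has_noun < 1 and has_verb < 1:
            # Span.string was removed in spaCy 3; Span.text is the replacement
            print(sent.text.strip())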

Update

I would also advise checking that the sentence starts with an uppercase letter; I have added that check to the code. Furthermore, I would like to point out that what I wrote holds for English and German; I don't know how well it applies to other languages.
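If you want to reuse the check, for example in a list comprehension over doc.sents, the same heuristic can be wrapped in a function. The following is only a sketch: the name is_complete_sentence, the NOUN_TAGS constant, and the min_nouns / min_verbs parameters are made up for illustration and are not part of spaCy. The function only assumes that each token exposes pos_, is_title and is_punct, as spaCy tokens do.

```python
# Hypothetical helper, not a spaCy API: wraps the noun/verb counting heuristic.
NOUN_TAGS = {"NOUN", "PROPN", "PRON"}


def is_complete_sentence(sent, min_nouns=2, min_verbs=1):
    """Return True if `sent` passes the heuristic described in the answer.

    `sent` is any iterable of token-like objects with the attributes
    `pos_`, `is_title` and `is_punct` (a spaCy Span works).
    """
    tokens = list(sent)
    if not tokens:
        return False
    # Must start with a title-cased token and end with punctuation
    if not (tokens[0].is_title and tokens[-1].is_punct):
        return False
    # Count nouns/pronouns (subject and object) and verbs (predicate)
    nouns = sum(1 for t in tokens if t.pos_ in NOUN_TAGS)
    verbs = sum(1 for t in tokens if t.pos_ == "VERB")
    return nouns >= min_nouns and verbs >= min_verbs
```

With the doc from the code above, `[s.text.strip() for s in doc.sents if is_complete_sentence(s)]` would collect the spans that pass. Lowering min_nouns to 1 is one way to also accept intransitive sentences, at the cost of letting more titles through.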

https://en.xdnf.cn/q/69698.html
