Question 1

We are working on sentences extracted from a PDF. The problem is that the extracted text includes the title, footers, table of contents, etc. Is there a way to determine whether a sentence we get when we pass the document to spaCy is a complete sentence? Is there a way to filter out sentence fragments such as titles?

Answer

A complete sentence contains at least one subject, one predicate, and one object, and closes with punctuation. The subject and object are almost always nouns (or pronouns), and the predicate is always a verb.

So you need to check whether the sentence contains at least two nouns and at least one verb, and closes with punctuation:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I. Introduction\nAlfred likes apples! A car runs over a red light.")
for sent in doc.sents:
    # Must start with a title-cased token and end with punctuation.
    if sent[0].is_title and sent[-1].is_punct:
        has_noun = 2  # nouns still required
        has_verb = 1  # verbs still required
        for token in sent:
            if token.pos_ in ["NOUN", "PROPN", "PRON"]:
                has_noun -= 1
            elif token.pos_ == "VERB":
                has_verb -= 1
        if has_noun < 1 and has_verb < 1:
            # Span.text replaces the older Span.string attribute.
            print(sent.text.strip())

Update

I would also advise checking whether the sentence starts with an uppercase letter; I have added that check to the code. Furthermore, I would like to point out that what I wrote holds for English and German; I don't know how it works in other languages.
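If you want to reuse the check on many documents, the same heuristic can be pulled out into a small predicate. The following is only a sketch under a few assumptions: the names looks_complete and filter_sentences are made up for illustration, tokens are passed as (text, pos) pairs using the same tags as token.pos_, and the final-punctuation test looks at the "PUNCT" tag instead of is_punct, which may differ in edge cases.

```python
NOUN_TAGS = {"NOUN", "PROPN", "PRON"}


def looks_complete(tokens):
    """Heuristic check on (text, pos) pairs (hypothetical helper).

    Returns True if the first token is title-cased, the last token is
    tagged PUNCT, and the tokens contain at least two nouns/pronouns
    and at least one verb.
    """
    tokens = list(tokens)
    if not tokens:
        return False
    first_text, _ = tokens[0]
    _, last_pos = tokens[-1]
    if not first_text.istitle() or last_pos != "PUNCT":
        return False
    nouns = sum(1 for _, pos in tokens if pos in NOUN_TAGS)
    verbs = sum(1 for _, pos in tokens if pos == "VERB")
    return nouns >= 2 and verbs >= 1


def filter_sentences(doc):
    """Keep only the sentences of a spaCy Doc that pass the heuristic."""
    return [
        sent.text.strip()
        for sent in doc.sents
        if looks_complete((t.text, t.pos_) for t in sent)
    ]


# Hand-tagged examples, so the predicate can be tried without a model.
full = [("Alfred", "PROPN"), ("likes", "VERB"), ("apples", "NOUN"), ("!", "PUNCT")]
heading = [("Table", "NOUN"), ("of", "ADP"), ("Contents", "NOUN")]
print(looks_complete(full), looks_complete(heading))
```

With a loaded pipeline you would call filter_sentences(nlp(text)). Keeping the predicate separate from spaCy makes it easy to adjust the thresholds or add further conditions for your PDFs.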

Determine if a text extract from spacy is a complete sentence
