We are working on sentences extracted from a PDF. The problem is that the extracted text includes the title, footers, table of contents, etc. Is there a way to determine whether a sentence we get when we pass the document to spaCy is a complete sentence? Is there a way to filter out sentence fragments such as titles?
A complete sentence contains at least one subject, one predicate, one object, and closes with punctuation.
Subject and object are almost always nouns or pronouns, and the predicate is always a verb.
Thus you need to check whether your sentence contains at least two nouns, at least one verb, and ends with punctuation:
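    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I. Introduction\nAlfred likes apples! A car runs over a red light.")

    for sent in doc.sents:
        # Must start with a title-cased token and end with punctuation.
        if sent[0].is_title and sent[-1].is_punct:
            has_noun = 2  # nouns still required (subject + object)
            has_verb = 1  # verbs still required (predicate)
            for token in sent:
                if token.pos_ in ["NOUN", "PROPN", "PRON"]:
                    has_noun -= 1
                elif token.pos_ == "VERB":
                    has_verb -= 1
            if has_noun < 1 and has_verb < 1:
                # Span.string was removed in spaCy 3; use Span.text instead.
                print(sent.text.strip())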
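If you want to reuse the check, it can be wrapped in a small function. The following is only a sketch of the same heuristic: `looks_like_sentence`, `min_nouns`, `min_verbs` and the `fake` helper are names I made up for illustration. The function only relies on the `pos_`, `is_title` and `is_punct` attributes that spaCy tokens provide, so the demo uses simple stand-in objects instead of loading a model.

```python
from types import SimpleNamespace

NOUN_TAGS = {"NOUN", "PROPN", "PRON"}


def looks_like_sentence(tokens, min_nouns=2, min_verbs=1):
    """Heuristic: title-cased start, final punctuation, enough nouns and verbs.

    `tokens` is any sequence of objects with .pos_, .is_title and .is_punct,
    for example a spaCy sentence span.
    """
    tokens = list(tokens)
    if not tokens:
        return False
    if not (tokens[0].is_title and tokens[-1].is_punct):
        return False
    nouns = sum(1 for t in tokens if t.pos_ in NOUN_TAGS)
    verbs = sum(1 for t in tokens if t.pos_ == "VERB")
    return nouns >= min_nouns and verbs >= min_verbs


def fake(pos, is_title=False, is_punct=False):
    # Hypothetical stand-in for a spaCy token, used only for this demo.
    return SimpleNamespace(pos_=pos, is_title=is_title, is_punct=is_punct)


# Roughly "Alfred likes apples!" and "I. Introduction" as hand-tagged tokens.
sentence = [fake("PROPN", is_title=True), fake("VERB"), fake("NOUN"),
            fake("PUNCT", is_punct=True)]
heading = [fake("PRON", is_title=True), fake("PUNCT", is_punct=True),
           fake("NOUN", is_title=True)]

print(looks_like_sentence(sentence))
print(looks_like_sentence(heading))
```

With a real pipeline you would pass the sentence spans directly, e.g. `[s.text.strip() for s in doc.sents if looks_like_sentence(s)]`. The actual result then depends on how the model splits and tags your text.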
Update
I would also advise checking whether the sentence starts with an upper-case letter; I added this modification to the code. Furthermore, I would like to point out that what I wrote holds for English and German. I don't know how it is in other languages.
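One caveat about the upper-case check: spaCy documents `Token.is_title` as equivalent to `token.text.istitle()`, so a sentence that starts with an all-caps word such as an acronym would not pass it. If that matters for your documents, a plain first-character test may be a better fit. A minimal sketch, where `starts_with_uppercase` is a hypothetical helper name:

```python
def starts_with_uppercase(text):
    """True if the first non-whitespace character is an upper-case letter."""
    stripped = text.lstrip()
    return bool(stripped) and stripped[0].isupper()


samples = ["Alfred likes apples!", "NASA launched a rocket.", "table of contents"]
for sample in samples:
    first_word = sample.split()[0]
    # Compare the first-character test with str.istitle() on the first word.
    print(sample, starts_with_uppercase(sample), first_word.istitle())
```

In the loop above you could then replace `sent[0].is_title` with `starts_with_uppercase(sent.text)`.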