I have a list of tokens extracted from a PDF source. I can preprocess and tokenize the text, but I want to loop through the tokens and convert each token in the list to its lemma from the WordNet corpus. My token list looks like this:
['0000', 'Everyone', 'age', 'remembers', 'Þ', 'rst', 'heard', 'contest', 'I', 'sitting', 'hideout', 'watching', ...]
There are no lemmas for tokens like 'Everyone', '0000', 'Þ' and many more, which I need to eliminate. But for words like 'age', 'remembers', 'heard' etc., the token list is supposed to end up like:
['age', 'remember', 'hear', ...]
I can look up the lemma of a single word through its synsets with this code:
syns = wn.synsets("heard")
print(syns[0].lemmas()[0].name())
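Applied token by token, I imagine the same check would look something like this sketch (the tokens list here is just a made-up sample; tokens WordNet does not know, such as '0000' or 'Þ', return an empty synset list and can be skipped):

from nltk.corpus import wordnet

tokens = ['0000', 'age', 'remembers', 'heard', 'Þ']
lemmas = []
for tok in tokens:
    syns = wordnet.synsets(tok)  # empty list for tokens WordNet does not know
    if syns:
        lemmas.append(syns[0].lemmas()[0].name())
print(lemmas)  # should print something like ['age', 'remember', 'hear']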
At this point I have created a clean_text() function in Python for the preprocessing. It looks like:
def clean_text(text):
    # Eliminating punctuation
    text = "".join([word for word in text if word not in string.punctuation])
    # tokenizing
    tokens = re.split("\W+", text)
    # lemmatizing and removing stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # converting token list into synsets
    syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
    return text
I am getting this error:
syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
AttributeError: 'list' object has no attribute 'lower'
From the traceback, I think synsets() expects a single string (it calls .lower() on its argument), but text at that point is a whole list of tokens. How do I get the lemma for each token in the list?
The full code:
import string
import re
from wordcloud import WordCloud
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import wordnet
import PyPDF4
import matplotlib
import numpy as np
from PIL import Image

stopwords = nltk.corpus.stopwords.words('english')
moreStopwords = ['clin97803078874365pallr1indd'] # additional stopwords to be removed manually.
wn = nltk.WordNetLemmatizer()
data = PyPDF4.PdfFileReader(open('ReadyPlayerOne.pdf', 'rb'))
pageData = ''
for page in data.pages:
    pageData += page.extractText()
# print(pageData)

def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split("\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    syns = [text.lemmas()[0].name() for text in wordnet.synsets(text)]
    return syns

print(clean_text(pageData))
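For reference, this is the per-token shape I think the last step needs, reusing string, re, wn, stopwords and wordnet from above (a sketch only; the empty-synsets check is my assumption for dropping tokens like '0000' and 'Þ', and I am not sure it is the idiomatic approach):

def clean_text(text):
    # Eliminating punctuation
    text = "".join([ch for ch in text if ch not in string.punctuation])
    # Tokenizing
    tokens = re.split(r"\W+", text)
    # Lemmatizing and removing stopwords
    words = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # Looking up synsets one token at a time instead of passing the whole list
    syns = []
    for word in words:
        word_syns = wordnet.synsets(word)
        if word_syns:  # skip tokens WordNet has no synsets for
            syns.append(word_syns[0].lemmas()[0].name())
    return syns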