I have a list of tokens extracted from a PDF source. I can preprocess and tokenize the text, but I want to loop through the tokens and convert each token in the list to its lemma from the WordNet corpus. My token list looks like this:
['0000', 'Everyone', 'age', 'remembers', 'Þ', 'rst', 'heard', 'contest', 'I', 'sitting', 'hideout', 'watching', ...]
There are no lemmas for tokens like 'Everyone', '0000', 'Þ' and many more, which I need to eliminate. But for words like 'age', 'remembers', 'heard' etc., the token list is supposed to end up like:
['age', 'remember', 'hear', ...]
I can look up the lemma of a single word through its synsets with this code:
syns = wn.synsets("heard")
print(syns[0].lemmas()[0].name())
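Applied token by token, I imagine the same check would look something like this sketch (the tokens list here is just a made-up sample; tokens WordNet does not know, such as '0000' or 'Þ', return an empty synset list and can be skipped):

from nltk.corpus import wordnet

tokens = ['0000', 'age', 'remembers', 'heard', 'Þ']
lemmas = []
for tok in tokens:
    syns = wordnet.synsets(tok)  # empty list for tokens WordNet does not know
    if syns:
        lemmas.append(syns[0].lemmas()[0].name())
print(lemmas)  # should print something like ['age', 'remember', 'hear']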
At this point I have created a clean_text() function in Python for the preprocessing. It looks like:
def clean_text(text):
    # Eliminating punctuation
    text = "".join([word for word in text if word not in string.punctuation])
    # tokenizing
    tokens = re.split("\W+", text)
    # lemmatizing and removing stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # converting token list into synsets
    syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
    return text
I am getting this error:
syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
AttributeError: 'list' object has no attribute 'lower'
From the traceback, I think synsets() expects a single string (it calls .lower() on its argument), but text at that point is a whole list of tokens. How do I get the lemma for each token in the list?
The full code:
import string
import re
from wordcloud import WordCloud
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import wordnet
import PyPDF4
import matplotlib
import numpy as np
from PIL import Image

stopwords = nltk.corpus.stopwords.words('english')
moreStopwords = ['clin97803078874365pallr1indd'] # additional stopwords to be removed manually.
wn = nltk.WordNetLemmatizer()
data = PyPDF4.PdfFileReader(open('ReadyPlayerOne.pdf', 'rb'))
pageData = ''
for page in data.pages:
    pageData += page.extractText()
# print(pageData)

def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split("\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    syns = [text.lemmas()[0].name() for text in wordnet.synsets(text)]
    return syns

print(clean_text(pageData))
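For reference, this is the per-token shape I think the last step needs, reusing string, re, wn, stopwords and wordnet from above (a sketch only; the empty-synsets check is my assumption for dropping tokens like '0000' and 'Þ', and I am not sure it is the idiomatic approach):

def clean_text(text):
    # Eliminating punctuation
    text = "".join([ch for ch in text if ch not in string.punctuation])
    # Tokenizing
    tokens = re.split(r"\W+", text)
    # Lemmatizing and removing stopwords
    words = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # Looking up synsets one token at a time instead of passing the whole list
    syns = []
    for word in words:
        word_syns = wordnet.synsets(word)
        if word_syns:  # skip tokens WordNet has no synsets for
            syns.append(word_syns[0].lemmas()[0].name())
    return syns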