How to convert a token list into a WordNet lemma list using NLTK?


I have a list of tokens extracted from a PDF source. I am able to preprocess the text and tokenize it, but I want to loop through the tokens and convert each token in the list to its lemma in the WordNet corpus. My token list looks like this:

['0000', 'Everyone', 'age', 'remembers', 'Þ', 'rst', 'heard', 'contest', 'I', 'sitting', 'hideout', 'watching', ...]

There are no lemmas for tokens like 'Everyone', '0000', 'Þ' and many more, which I need to eliminate. But for words like 'age', 'remembers', 'heard' etc., the token list is supposed to look like:

['age', 'remember', 'hear', ...]
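
For reference, WordNetLemmatizer.lemmatize() treats every word as a noun unless a part of speech is passed, so verbs like 'remembers' and 'heard' come back unchanged by default, and tokens WordNet has no entry for can be dropped by checking synsets. A minimal sketch of that (assuming NLTK's tagger data is downloaded; not from the original post):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to a WordNet POS constant; noun is the fallback.
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

tokens = ['0000', 'Everyone', 'age', 'remembers', 'heard']
# Drop tokens WordNet does not know ('0000', 'Everyone', 'Þ', ...).
known = [t for t in tokens if wordnet.synsets(t)]
tagged = nltk.pos_tag(known)
print([lemmatizer.lemmatize(word, wordnet_pos(tag)) for word, tag in tagged])
# expected: ['age', 'remember', 'hear']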

I am checking the synonyms with this code:

syns = wordnet.synsets("heard")
print(syns[0].lemmas()[0].name())  # prints 'hear'
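
This works for 'heard' because wordnet.synsets() runs WordNet's morphy stemmer on its argument before looking anything up; morphy can also be called directly (a small illustration, added here for context):

from nltk.corpus import wordnet

print(wordnet.morphy('heard', wordnet.VERB))      # 'hear' (irregular form, via the exception list)
print(wordnet.morphy('remembers', wordnet.VERB))  # 'remember' (regular '-s' detachment rule)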

At this point I have created the function clean_text() in Python for preprocessing. It looks like this:

def clean_text(text):
    # Eliminating punctuation
    text = "".join([word for word in text if word not in string.punctuation])
    # tokenizing
    tokens = re.split(r"\W+", text)
    # lemmatizing and removing stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # converting token list into synsets
    syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
    return text

I am getting the error:

syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
AttributeError: 'list' object has no attribute 'lower'

How to get the token list for each lemma?

The full code:

import string
import re
from wordcloud import WordCloud
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import wordnet
import PyPDF4
import matplotlib
import numpy as np
from PIL import Image

stopwords = nltk.corpus.stopwords.words('english')
moreStopwords = ['clin97803078874365pallr1indd']  # additional stopwords to be removed manually.
wn = nltk.WordNetLemmatizer()

data = PyPDF4.PdfFileReader(open('ReadyPlayerOne.pdf', 'rb'))
pageData = ''
for page in data.pages:
    pageData += page.extractText()
# print(pageData)

def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split(r"\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    syns = [text.lemmas()[0].name() for text in wordnet.synsets(text)]
    return syns

print(clean_text(pageData))
Answer

You are calling wordnet.synsets(text) with a list of words (check what text is at that point), but it should be called with a single word. The preprocessing inside wordnet.synsets tries to apply .lower() to its argument, hence the error (AttributeError: 'list' object has no attribute 'lower').
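
A minimal reproduction of the failure, for illustration:

from nltk.corpus import wordnet

wordnet.synsets("heard")             # fine: the argument is a single string
wordnet.synsets(["heard", "grass"])  # AttributeError: 'list' object has no attribute 'lower'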

Below is a working version of clean_text with this problem fixed:

import string
import re
import nltk
from nltk.corpus import wordnet

stopwords = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()

def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split(r"\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    lemmas = []
    for token in text:
        lemmas += [synset.lemmas()[0].name() for synset in wordnet.synsets(token)]
    return lemmas

text = "The grass was greener."
print(clean_text(text))

Returns:

['grass', 'Grass', 'supergrass', 'eatage', 'pot', 'grass', 'grass', 'grass', 'grass', 'grass', 'denounce', 'green', 'green', 'green', 'green', 'fleeceable']
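
Note that the result has one entry per matching synset, so duplicates such as 'grass' pile up. If only distinct lemmas are wanted, a small follow-up (an addition, not part of the original answer):

lemmas = clean_text("The grass was greener.")
unique_lemmas = list(dict.fromkeys(lemmas))  # de-duplicates while preserving first-seen order
print(unique_lemmas)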