AttributeError: 'list' object has no attribute 'lower' : clustering

2024/11/18 1:48:46

I'm trying to do clustering with pandas and sklearn.

import pandas
import pprint
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pandas.read_csv('text.csv', encoding='utf-8')
dataset_list = dataset.values.tolist()

vectors = TfidfVectorizer()
X = vectors.fit_transform(dataset_list)

clusters_number = 20
model = KMeans(n_clusters = clusters_number, init = 'k-means++', max_iter = 300, n_init = 1)
model.fit(X)

centers = model.cluster_centers_
labels = model.labels_

clusters = {}
for comment, label in zip(dataset_list, labels):
    print ('Comment:', comment)
    print ('Label:', label)
    try:
        clusters[str(label)].append(comment)
    except:
        clusters[str(label)] = [comment]
pprint.pprint(clusters)

But I get the following error, even though I never call lower():

  File "clustering.py", line 19, in <module>
    X = vetorizer.fit_transform(dataset_list)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

I don't understand: my text (text.csv) is already lowercase, and at no point did I call lower().

Data:

hello wish to cancel order thank you confirmation

hello would like to cancel order made today store house world

dimensions bed not compatible would like to know how to pass cancellation refund send today cordially

hello possible cancel order cordially

hello wants to cancel order request refund

hello wish to cancel this order can indicate process cordially

hello seen date delivery would like to cancel order thank you

hello wants to cancel matching order good delivery n ° 111111

hi would like to cancel this order

hello order product store cancel act doublon advance thank you cordially

hello wishes to cancel order thank you kindly refund greetings

hello possible cancel order please thank you in advance forward cordially

Answer

The error is in this line:

dataset_list = dataset.values.tolist()

You see, dataset is a pandas DataFrame, so dataset.values returns a 2-d array of shape (n_rows, n_columns), which here is (n_rows, 1) even though there is only one column. Calling tolist() on that array then produces a list of lists, something like this:

print(dataset_list)

[['hello wish to cancel order thank you confirmation'],
 ['hello would like to cancel order made today store house world'],
 ['dimensions bed not compatible would like to know how to pass cancellation refund send today cordially'],
 .........]

As you can see, there are two levels of square brackets here: each sentence is wrapped in its own one-element list.

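Now, TfidfVectorizer expects a flat list of sentences, not a list of lists. It treats each element as one document and, as the last frame of your traceback shows, calls lower() on it during preprocessing. That is why the error mentions lower() even though your own code never calls it: each element here is a list, not a string.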
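To make the failure concrete, here is a minimal sketch in plain Python that reproduces the same AttributeError without scikit-learn. The `preprocess` function is a hypothetical stand-in for the vectorizer's lowercasing step, not its actual code, and the two sample sentences are taken from the question's data.

```python
# Hypothetical stand-in for the vectorizer's preprocessing step:
# it simply lowercases the document it receives.
def preprocess(doc):
    return doc.lower()

flat_docs = [
    "hello wish to cancel order thank you confirmation",
    "hi would like to cancel this order",
]
# Same nesting that dataset.values.tolist() produces for a one-column DataFrame.
nested_docs = [[doc] for doc in flat_docs]

# Each element is a str, so .lower() works.
flat_result = [preprocess(doc) for doc in flat_docs]

# Each element is a one-element list, so .lower() raises AttributeError.
try:
    [preprocess(doc) for doc in nested_docs]
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(flat_result)
print(error_message)
```

The second call fails on the very first element, because a list has no lower() method, which matches the last line of the traceback.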

So you just need to do this:

# Use ravel to convert 2-d to 1-d array
dataset_list = dataset.values.ravel().tolist()

OR

# Replace `column_name` with your actual column header;
# selecting a single column gives a Series instead of a DataFrame
dataset_list = dataset['column_name'].values.tolist()
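As a quick check, the sketch below compares the nested and flat results side by side. It assumes a two-line in-memory stand-in for text.csv; `header=None` and the column name `comment` are assumptions for illustration, since the real file's header is not shown.

```python
import io

import pandas as pd

# Hypothetical stand-in for text.csv: one sentence per line, no header row.
csv_text = (
    "hello wish to cancel order\n"
    "hi would like to cancel this order\n"
)
# header=None and the column name "comment" are assumptions for this sketch.
dataset = pd.read_csv(io.StringIO(csv_text), header=None, names=["comment"])

# 2-d array -> list of one-element lists (the shape that triggers the error).
nested = dataset.values.tolist()

# ravel() flattens the 2-d array to 1-d -> flat list of strings.
flat_ravel = dataset.values.ravel().tolist()

# Selecting the column gives a Series, whose tolist() is already flat.
flat_column = dataset["comment"].tolist()

print(nested)
print(flat_ravel)
print(flat_column == flat_ravel)
```

Either flat list can then be passed to TfidfVectorizer().fit_transform in place of the nested one.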
