I am trying to tokenize each sentence of my pandas Series.
I tried to do it as shown in the documentation, using apply, but it didn't work:
x.apply(nltk.word_tokenize)
If I just use nltk.word_tokenize(x)
it doesn't work either, because x
is not a string. Does anyone have any ideas?
Edited: x is a pandas Series with sentences:
0 A very, very, very slow-moving, aimless movie ...
1 Not sure who was more lost - the flat characte...
2 Attempting artiness with black & white and cle...
With x.apply(nltk.word_tokenize) it returns exactly the same:
0 A very, very, very slow-moving, aimless movie ...
1 Not sure who was more lost - the flat characte...
2 Attempting artiness with black & white and cle...
With nltk.word_tokenize(x) the error is:
TypeError: expected string or bytes-like object
Question: are you saving your intermediate results? x.apply()
returns a new Series
with the transformation applied to each element; it does not modify the original Series
in place. See below for an example of how this might be affecting your code...
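To see the copy semantics in isolation, here is a minimal sketch that does not depend on nltk; str.upper stands in for any per-element transformation such as word_tokenize.

```python
import pandas as pd

s = pd.Series(['hello', 'world'])

# apply() returns a NEW Series; s itself is untouched.
s.apply(str.upper)
print(list(s))          # ['hello', 'world'] -- unchanged

# Reassigning keeps the transformed copy.
s = s.apply(str.upper)
print(list(s))          # ['HELLO', 'WORLD']
```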
We'll start by confirming that word_tokenize()
works on a sample snippet of text.
>>> import pandas as pd
>>> from nltk import word_tokenize
>>> word_tokenize('hello how are you') # confirming that word_tokenize works.
['hello', 'how', 'are', 'you']
Then let's create a Series to play with.
>>> s = pd.Series(['hello how are you','lorem ipsum isumming lorems','more stuff in a line'])
>>> print(s)
0 hello how are you
1 lorem ipsum isumming lorems
2 more stuff in a line
dtype: object
Executing word_tokenize
via the apply()
function at an interactive Python prompt shows that it tokenizes each element...
But nothing indicates that the result is a new Series, not a permanent change to s
>>> s.apply(word_tokenize)
0 [hello, how, are, you]
1 [lorem, ipsum, isumming, lorems]
2 [more, stuff, in, a, line]
dtype: object
In fact, we can print s
to show that it is unchanged...
>>> print(s)
0 hello how are you
1 lorem ipsum isumming lorems
2 more stuff in a line
dtype: object
If, instead, we assign the result of the apply()
call to a name, in this case wt
, the results are saved permanently, which we can see by printing wt
.
>>> wt = s.apply(word_tokenize)
>>> print(wt)
0 [hello, how, are, you]
1 [lorem, ipsum, isumming, lorems]
2 [more, stuff, in, a, line]
dtype: object
Doing this at an interactive prompt makes such a condition easy to spot, but in a script the fact that apply() produced a new, unsaved Series can pass silently and without indication.