Generate N-Grams from strings with pandas

2024/9/22 5:29:36

I have a DataFrame df like this:

Pattern    String                                       
101        hi, how are you?
104        what are you doing?
108        Python is good to learn.

I want to create n-grams for the String column. I've created unigrams using split() and stack():

new = df.String.str.split(expand=True).stack()

However, I also want to create higher-order n-grams (bigrams, trigrams, 4-grams, etc.).

Answer

Do a little preprocessing on your text column, and then a little shifting + concatenation:

import pandas as pd

# generate unigrams (regex=True is needed because str.replace defaults to literal matching in pandas >= 2.0)
unigrams = (df['String'].str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True).stack())
# generate bigrams by concatenating unigram columns
bigrams = unigrams + ' ' + unigrams.shift(-1)
# generate trigrams by concatenating unigram and bigram columns
trigrams = bigrams + ' ' + unigrams.shift(-2)
# concatenate all series vertically, and remove NaNs
pd.concat([unigrams, bigrams, trigrams]).dropna().reset_index(drop=True)

0                   hi
1                  how
2                  are
3                  you
4                 what
5                  are
6                  you
7                doing
8               python
9                   is
10                good
11                  to
12               learn
13              hi how
14             how are
15             are you
16            you what
17            what are
18             are you
19           you doing
20        doing python
21           python is
22             is good
23             good to
24            to learn
25          hi how are
26         how are you
27        are you what
28        you what are
29        what are you
30       are you doing
31    you doing python
32     doing python is
33      python is good
34          is good to
35       good to learn
dtype: object
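
Note that the output above includes n-grams that span two rows, such as "you what" and "doing python", because shift() moves across the whole stacked series. If that is not what you want, one possible variation is to shift within each original row instead. The following is only a sketch: the helper name row_ngrams is made up for illustration, the sample DataFrame is rebuilt from the question, and the extra dropna() after stack() is a defensive assumption so the result does not depend on how a given pandas version treats missing cells.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Pattern": [101, 104, 108],
    "String": ["hi, how are you?", "what are you doing?", "Python is good to learn."],
})

# Same preprocessing as in the answer: lowercase, strip punctuation, one token per row.
tokens = (
    df["String"]
    .str.lower()
    .str.replace(r"[^a-z\s]", "", regex=True)
    .str.split(expand=True)
    .stack()
    .dropna()
)

def row_ngrams(tokens, n):
    """Hypothetical helper: build n-grams that never cross a row boundary."""
    by_row = tokens.groupby(level=0)  # level 0 is the original row label
    result = tokens
    for k in range(1, n):
        # shift within each row, so the last tokens of a row become NaN
        result = result + " " + by_row.shift(-k)
    return result.dropna()

bigrams = row_ngrams(tokens, 2)
trigrams = row_ngrams(tokens, 3)
print(pd.concat([tokens, bigrams, trigrams]).reset_index(drop=True))
```

The same helper should work for 4-grams and beyond by passing a larger n; rows with fewer than n tokens simply contribute nothing.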