Tokenise text and create more rows for each row in dataframe

2024/10/7 20:25:35

I want to do this with Python and pandas.

Let's suppose that I have the following:

file_id   text
1         I am the first document. I am a nice document.
2         I am the second document. I am an even nicer document.

and I finally want to have the following:

file_id   text
1         I am the first document
1         I am a nice document
2         I am the second document
2         I am an even nicer document

So I want the text of each file to be split at every full stop, with a new row created for each of the resulting sentences.

What is the most efficient way to do this?

Answer

Use:

s = (df.pop('text')
       .str.strip('.')
       .str.split(r'\.\s+', expand=True)
       .stack()
       .rename('text')
       .reset_index(level=1, drop=True))
df = df.join(s).reset_index(drop=True)
print(df)

   file_id                         text
0        1      I am the first document
1        1         I am a nice document
2        2     I am the second document
3        2  I am an even nicer document

Explanation:

First, use DataFrame.pop to extract the column, remove the trailing . with Series.str.strip, and split with Series.str.split, escaping the . because it is a special regex character. Then reshape the result into a Series with DataFrame.stack, drop the inner index level with reset_index(level=1, drop=True), and rename the Series so that DataFrame.join can attach it back to the original DataFrame.
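For anyone who wants to try the steps end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and wraps the chain above in a function. The function names split_with_stack and split_with_explode are made up for illustration. The second function is not part of the answer above: it swaps in DataFrame.explode (available from pandas 0.25), which may be a simpler alternative if your pandas version has it.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "file_id": [1, 2],
    "text": [
        "I am the first document. I am a nice document.",
        "I am the second document. I am an even nicer document.",
    ],
})


def split_with_stack(frame):
    """Hypothetical helper: the pop/split/stack/join chain from the answer."""
    out = frame.copy()  # work on a copy so the caller's frame keeps its 'text' column
    s = (
        out.pop("text")                     # take the column out of the frame
        .str.strip(".")                     # drop the trailing full stop
        .str.split(r"\.\s+", expand=True)   # one column per sentence
        .stack()                            # columns -> rows (MultiIndex)
        .rename("text")
        .reset_index(level=1, drop=True)    # keep only the original row index
    )
    return out.join(s).reset_index(drop=True)


def split_with_explode(frame):
    """Hypothetical helper: alternative using DataFrame.explode (an assumption,
    not the method used in the answer)."""
    out = frame.copy()
    # Without expand=True, each cell becomes a list of sentences.
    out["text"] = out["text"].str.strip(".").str.split(r"\.\s+")
    # explode gives every list element its own row and repeats file_id.
    return out.explode("text").reset_index(drop=True)


result_stack = split_with_stack(df)
result_explode = split_with_explode(df)
print(result_stack)
```

Both helpers leave the input DataFrame untouched and, on the sample data, return the same four rows.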

https://en.xdnf.cn/q/118786.html
