Question 1

I want to do this with Python and pandas.

Let's suppose that I have the following:

file_id   text
1         I am the first document. I am a nice document.
2         I am the second document. I am an even nicer document.

and I finally want to have the following:

file_id   text
1         I am the first document
1         I am a nice document
2         I am the second document
2         I am an even nicer document

So I want the text of each file to be split at every full stop, and a new row to be created for each of the resulting tokens.

What is the most efficient way to do this?

Answer

Use:

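# Pull out the 'text' column, drop the trailing full stop, split on ". ",
# and stack the pieces into one Series aligned to the original row index.
s = (df.pop('text')
       .str.strip('.')
       .str.split(r'\.\s+', expand=True)
       .stack()
       .rename('text')
       .reset_index(level=1, drop=True))

# Join back so that each sentence gets its own row with the matching file_id.
df = df.join(s).reset_index(drop=True)
print(df)

   file_id                         text
0        1      I am the first document
1        1         I am a nice document
2        2     I am the second document
3        2  I am an even nicer document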

Explanation:

First, use DataFrame.pop to extract the column, remove the trailing . with Series.str.strip, and split with Series.str.split, escaping the . because it is a special regex character. Then reshape the result into a Series with DataFrame.stack, drop the inner index level with Series.reset_index, and name it with Series.rename so that DataFrame.join can attach it to the original.
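
As an alternative to the stack-based reshaping, the same result can probably be reached with DataFrame.explode, which is available in pandas 0.25 and later. The following is a minimal sketch rather than the answer's own method: the helper name split_sentences is hypothetical, the sample data is copied from the question, and it assumes every sentence ends with a full stop followed by whitespace.

```python
import pandas as pd

# Sample data mirroring the question.
df = pd.DataFrame({
    "file_id": [1, 2],
    "text": [
        "I am the first document. I am a nice document.",
        "I am the second document. I am an even nicer document.",
    ],
})


def split_sentences(frame, column="text"):
    """Return a copy of frame with one row per sentence in column.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    out = frame.copy()
    out[column] = (
        out[column]
        .str.rstrip(".")       # drop the trailing full stop
        .str.split(r"\.\s+")   # split on ". "; the dot is escaped for the regex
    )
    # explode turns each list element into its own row and repeats file_id.
    return out.explode(column).reset_index(drop=True)


result = split_sentences(df)
print(result)
```

Because explode works on a column of lists, rows with different numbers of sentences do not need padding, which is the reason this sketch avoids expand=True.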

Tokenise text and create more rows for each row in dataframe
