How to clean a string to get value_counts for words of interest by date?

Question 1

I have the following data generated from a groupby('Datetime') and value_counts()

Datetime        0          
01/01/2020  Paul            803              2
01/02/2020  Paul            210982360967     1
01/03/2020  religion        3..
02/28/2020  l              18
02/29/2020  Paul           78march          22
03/01/2020  church         63l              21

I would like to remove a specific name (in this case 'Paul') and all the numbers (03 and 10982360967 in this specific example). I do not know why the single character 'l' is still there, since I had tried to remove stopwords, including single letters and numbers. Do you know how I could further clean this selection?

Expected output to avoid confusion:

Datetime        0          
01/03/2020  religion        3..
02/29/2020  march          22
03/01/2020  church         63

I removed Paul, 03, 109..., and l.

Raw data:

Datetime        Corpus          
01/03/2020      Paul: examples of religion
01/03/2020      Paul:shinto is a religion 03
01/03/2020      don't talk to me about religion, Paul 03
...
02/29/2020     march is the third month of the year 10982360967
02/29/2020     during march, there are some cold days.
...
03/01/2020     she is at church right now
...

I cannot put all the raw data as I have more than 100 sentences.

The code I used is:

df.Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)

Since I got a KeyError, I had to edit the code as follows:

df.set_index('Datetime').Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
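
A note on the KeyError, as a minimal sketch rather than a diagnosis of the original data: `df.Corpus` is a bare Series, so `groupby('Datetime')` has no column or index level of that name to look up. The small frame below is a hypothetical stand-in for the raw data. Grouping the DataFrame first and then selecting the column should give the same counts without `set_index`.

```python
import pandas as pd

# Hypothetical miniature of the raw data from the question.
df = pd.DataFrame({
    'Datetime': ['01/03/2020', '01/03/2020', '02/29/2020'],
    'Corpus': ['religion', 'religion', 'march'],
})

# A bare Series has no 'Datetime' column or index level, so this raises KeyError.
try:
    df.Corpus.groupby('Datetime').value_counts()
    raised = False
except KeyError:
    raised = True

# Grouping the DataFrame first and then selecting the column avoids set_index.
counts = df.groupby('Datetime')['Corpus'].value_counts()

# The workaround from the question: 'Datetime' becomes the index name,
# which Series.groupby can then resolve.
counts_via_index = df.set_index('Datetime').Corpus.groupby('Datetime').value_counts()

print(raised)
print(counts)
```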

To extract the words, I used str.extractall.

Question 2

Cleaning strings is a multi-step process

Create dataframe

import pandas as pd
from nltk.corpus import stopwords
import string

# data and dataframe
data = {'Datetime': ['01/03/2020', '01/03/2020', '01/03/2020',
                     '02/29/2020', '02/29/2020', '03/01/2020'],
        'Corpus': ['Paul: Examples of religion',
                   'Paul:shinto is a religion 03',
                   "don't talk to me about religion, Paul 03",
                   'march is the third month of the year 10982360967',
                   'during march, there are some cold days.',
                   'she is at church right now']}

test = pd.DataFrame(data)
test.Datetime = pd.to_datetime(test.Datetime)

|    | Datetime            | Corpus                                           |
|---:|:--------------------|:-------------------------------------------------|
|  0 | 2020-01-03 00:00:00 | Paul: Examples of religion                       |
|  1 | 2020-01-03 00:00:00 | Paul:shinto is a religion 03                     |
|  2 | 2020-01-03 00:00:00 | don't talk to me about religion, Paul 03         |
|  3 | 2020-02-29 00:00:00 | march is the third month of the year 10982360967 |
|  4 | 2020-02-29 00:00:00 | during march, there are some cold days.          |
|  5 | 2020-03-01 00:00:00 | she is at church right now                       |

Clean `Corpus`

Add extra words to the remove_words list
- They should be lowercase
Some cleaning steps could be combined, but I do not recommend that
- Step-by-step makes it easier to determine if you've made a mistake
This is a small example of text cleaning.
- There are entire books on the subject.
- There's no context analysis
  - example = 'We march to the church in March.'
  - value_count for 'march' in example.lower() is 2
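
To make the limitation concrete, here is a minimal sketch using only the standard library (it is not part of the pandas pipeline in this answer). Once the text is lowercased and punctuation is dropped, the verb "march" and the month "March" become the same token.

```python
import re
from collections import Counter

example = 'We march to the church in March.'

# Lowercase, then keep only runs of letters so 'March.' becomes 'march'.
words = re.findall(r'[a-z]+', example.lower())
counts = Counter(words)

# The verb and the month are counted together.
print(counts['march'])
```

Telling the two apart would need something beyond string cleaning, such as part-of-speech tagging, which is outside the scope of this answer.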

# words to remove
remove_words = list(stopwords.words('english'))

# extra words to remove
additional_remove_words = ['paul', 'shinto', 'examples', 'talk', 'third', 'month', 'year', 'cold', 'days', 'right']
remove_words.extend(additional_remove_words)  # add other words to exclude in lowercase

# punctuation to remove
punctuation = string.punctuation
punc = r'[{}]'.format(punctuation)

test.dropna(inplace=True)  # drop any na rows

# clean text now
# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
test.Corpus = test.Corpus.str.replace(r'\d+', '', regex=True)  # remove numbers
test.Corpus = test.Corpus.str.replace(punc, ' ', regex=True)  # replace punctuation with a space
test.Corpus = test.Corpus.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into one space
test.Corpus = test.Corpus.str.strip()  # remove whitespace from beginning and end of string
test.Corpus = test.Corpus.str.lower()  # convert all to lowercase
test.Corpus = test.Corpus.apply(lambda x: [word for word in x.split() if word not in remove_words])  # remove words

|    | Datetime            | Corpus       |
|---:|:--------------------|:-------------|
|  0 | 2020-01-03 00:00:00 | ['religion'] |
|  1 | 2020-01-03 00:00:00 | ['religion'] |
|  2 | 2020-01-03 00:00:00 | ['religion'] |
|  3 | 2020-02-29 00:00:00 | ['march']    |
|  4 | 2020-02-29 00:00:00 | ['march']    |
|  5 | 2020-03-01 00:00:00 | ['church']   |
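
The same steps can also be sketched as one function on a plain string, which makes each step easy to test in isolation. This is only an illustration: `clean_text` is a hypothetical helper name, and `STOPWORDS` is a small stand-in set (an assumption) so the snippet runs without downloading the NLTK stopword list.

```python
import re
import string

# Small stand-in stopword set (an assumption); the answer uses NLTK's English list.
STOPWORDS = {'is', 'a', 'to', 'me', 'about', 'don', 't', 'paul', 'shinto', 'talk'}

# Character class matching any single punctuation character.
PUNC = '[{}]'.format(re.escape(string.punctuation))


def clean_text(text, remove_words=STOPWORDS):
    """Return the lowercase words of `text` that survive cleaning."""
    text = re.sub(r'\d+', '', text)    # remove numbers
    text = re.sub(PUNC, ' ', text)     # replace punctuation with a space
    text = re.sub(r'\s+', ' ', text)   # collapse repeated whitespace
    text = text.strip().lower()        # trim and lowercase
    return [word for word in text.split() if word not in remove_words]


print(clean_text("don't talk to me about religion, Paul 03"))
print(clean_text('Paul:shinto is a religion 03'))
```

With a DataFrame, a function like this could be applied with `test.Corpus.apply(clean_text)`, at the cost of losing the step-by-step intermediate columns recommended above.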

Explode `Corpus` & `groupby`

# explode list
test = test.explode('Corpus')

# dropna in case there are empty rows from filtering
test.dropna(inplace=True)

# groupby
test.groupby('Datetime').agg({'Corpus': 'value_counts'}).rename(columns={'Corpus': 'word_count'})

                       word_count
Datetime   Corpus              
2020-01-03 religion           3
2020-02-29 march              2
2020-03-01 church             1
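
The question also kept only the top two words per date with `head(2)`. One way that could look after the explode step is sketched below; the frame is hypothetical sample data (one cleaned word per row), and `groupby(...)['Corpus'].value_counts()` is used here as an alternative spelling of the aggregation above, not the answer's original call.

```python
import pandas as pd

# Hypothetical exploded frame: one cleaned word per row.
words = pd.DataFrame({
    'Datetime': pd.to_datetime(['01/03/2020'] * 6 + ['02/29/2020'] * 2),
    'Corpus': ['religion', 'religion', 'religion', 'church', 'church', 'march',
               'march', 'march'],
})

# Count each word per date; counts are sorted in descending order within each date.
word_count = words.groupby('Datetime')['Corpus'].value_counts()

# Keep at most the two most frequent words for each date.
top_two = word_count.groupby(level='Datetime').head(2)

print(top_two)
```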
