How to clean a string to get value_counts for words of interest by date?

2024/10/7 20:30:40

I have the following data generated from a groupby('Datetime') and value_counts()

Datetime        0          
01/01/2020  Paul            803              2
01/02/2020  Paul            210982360967     1
01/03/2020  religion        3..
02/28/2020  l              18
02/29/2020  Paul           78march          22
03/01/2020  church         63l              21

I would like to remove a specific name (in this case, 'Paul') and all the numbers (03 and 10982360967 in this example). I do not know why the character 'l' appears, since I had already tried to remove stopwords, including single letters and numbers. Do you know how I could clean this selection further?

Expected output to avoid confusion:

Datetime        0          
01/03/2020  religion        3..
02/29/2020  march          22
03/01/2020  church         63

I removed Paul, 03, 109..., and l.

Raw data:

Datetime        Corpus          
01/03/2020      Paul: examples of religion
01/03/2020      Paul:shinto is a religion 03
01/03/2020      don't talk to me about religion, Paul 03
02/29/2020     march is the third month of the year 10982360967
02/29/2020     during march, there are some cold days.
03/01/2020     she is at church right now

I cannot put all the raw data as I have more than 100 sentences.

The code I used is:


Since I got a Key error, I had to edit the code as follows:


To extract the words I used str.extractall


Cleaning strings is a multi-step process

Create dataframe

import pandas as pd
from nltk.corpus import stopwords
import string

# data and dataframe
data = {'Datetime': ['01/03/2020', '01/03/2020', '01/03/2020', '02/29/2020', '02/29/2020', '03/01/2020'],
        'Corpus': ['Paul: Examples of religion',
                   'Paul:shinto is a religion 03',
                   "don't talk to me about religion, Paul 03",
                   'march is the third month of the year 10982360967',
                   'during march, there are some cold days.',
                   'she is at church right now']}

test = pd.DataFrame(data)
test.Datetime = pd.to_datetime(test.Datetime)

|    | Datetime            | Corpus                                           |
|  0 | 2020-01-03 00:00:00 | Paul: Examples of religion                       |
|  1 | 2020-01-03 00:00:00 | Paul:shinto is a religion 03                     |
|  2 | 2020-01-03 00:00:00 | don't talk to me about religion, Paul 03         |
|  3 | 2020-02-29 00:00:00 | march is the third month of the year 10982360967 |
|  4 | 2020-02-29 00:00:00 | during march, there are some cold days.          |
|  5 | 2020-03-01 00:00:00 | she is at church right now                       |

Clean Corpus

  • Add extra words to the remove_words list
    • They should be lowercase
  • Some cleaning steps could be combined, but I do not recommend that
    • Step-by-step makes it easier to determine if you've made a mistake
  • This is a small example of text cleaning.
    • There are entire books on the subject.
    • There's no context analysis
      • example = 'We march to the church in March.'
      • the count for 'march' in example.lower() is 2, even though one is a verb and the other is a month
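The 'march' caveat above can be checked with a few lines of plain Python. This is only a minimal sketch using the standard library, not part of the pandas pipeline; the tokenizing regex is an assumption chosen for illustration.

```python
import re
from collections import Counter

example = 'We march to the church in March.'

# Lowercase first, then keep runs of letters as tokens (a simplistic tokenizer).
tokens = re.findall(r'[a-z]+', example.lower())
counts = Counter(tokens)

# The verb and the month collapse into the same token after lowercasing.
print(counts['march'])
```

Telling the two senses apart would need part-of-speech tagging or similar context analysis, which is outside the scope of this answer.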
# words to remove
remove_words = list(stopwords.words('english'))

# extra words to remove (in lowercase)
additional_remove_words = ['paul', 'shinto', 'examples', 'talk', 'third', 'month', 'year', 'cold', 'days', 'right']
remove_words.extend(additional_remove_words)

# punctuation to remove
punctuation = string.punctuation
punc = r'[{}]'.format(punctuation)

test.dropna(inplace=True)  # drop any na rows

# clean text now
# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default
test.Corpus = test.Corpus.str.replace(r'\d+', '', regex=True)  # remove numbers
test.Corpus = test.Corpus.str.replace(punc, ' ', regex=True)  # remove punctuation
test.Corpus = test.Corpus.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into one space
test.Corpus = test.Corpus.str.strip()  # remove whitespace from beginning and end of string
test.Corpus = test.Corpus.str.lower()  # convert all to lowercase
test.Corpus = test.Corpus.apply(lambda x: [word for word in x.split() if word not in remove_words])  # remove words

|    | Datetime            | Corpus       |
|  0 | 2020-01-03 00:00:00 | ['religion'] |
|  1 | 2020-01-03 00:00:00 | ['religion'] |
|  2 | 2020-01-03 00:00:00 | ['religion'] |
|  3 | 2020-02-29 00:00:00 | ['march']    |
|  4 | 2020-02-29 00:00:00 | ['march']    |
|  5 | 2020-03-01 00:00:00 | ['church']   |
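To test the cleaning steps on a single sentence before running them on the whole column, they can be collected in one function. The sketch below is one possible way to do that, not the code used above: `clean_text` and its `min_length` argument are hypothetical names, and the short hand-written stop list stands in for `stopwords.words('english')` so the example runs without NLTK data. `min_length` is an addition meant to drop stray single letters such as the 'l' mentioned in the question.

```python
import re
import string

# Character class matching any single punctuation character.
PUNCT_RE = re.compile('[{}]'.format(re.escape(string.punctuation)))


def clean_text(text, remove_words, min_length=1):
    """Return the tokens of `text` that survive the cleaning steps.

    Hypothetical helper: mirrors the step-by-step cleaning shown above.
    """
    text = re.sub(r'\d+', '', text)            # remove numbers
    text = PUNCT_RE.sub(' ', text)             # replace punctuation with a space
    text = re.sub(r'\s+', ' ', text).strip()   # collapse and trim whitespace
    text = text.lower()                        # normalize case
    return [w for w in text.split()
            if w not in remove_words and len(w) >= min_length]


# Assumption: a tiny stand-in for the NLTK stop list plus the extra words.
remove_words = {'is', 'a', 'of', 'the', 'to', 'me', 'about', 'don', 't',
                'paul', 'shinto', 'examples', 'talk'}

print(clean_text("don't talk to me about religion, Paul 03", remove_words))
print(clean_text('l church 63l', set(), min_length=2))
```

With pandas, the same function could be applied as `test.Corpus.apply(clean_text, remove_words=set(remove_words))`; a set makes the membership test faster than a list on larger corpora.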

Explode Corpus & groupby

# explode list
test = test.explode('Corpus')

# dropna in case there are empty rows from filtering
test.dropna(inplace=True)

# groupby
test.groupby('Datetime').agg({'Corpus': 'value_counts'}).rename(columns={'Corpus': 'word_count'})

                     word_count
Datetime   Corpus              
2020-01-03 religion           3
2020-02-29 march              2
2020-03-01 church             1
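If a flat table is easier to work with than the MultiIndex result, one alternative is to call `value_counts` directly on the grouped column and reset the index. This is a sketch of that variant, not the approach used above; the small `tokens` frame is a hand-built stand-in for the exploded `test` dataframe, and `word_counts` is an illustrative name.

```python
import pandas as pd

# Stand-in for the cleaned, exploded dataframe (one token per row).
tokens = pd.DataFrame({
    'Datetime': pd.to_datetime(['01/03/2020', '01/03/2020', '01/03/2020',
                                '02/29/2020', '02/29/2020', '03/01/2020']),
    'Corpus': ['religion', 'religion', 'religion', 'march', 'march', 'church'],
})

# Count each word per date, then turn the index levels back into columns.
word_counts = (
    tokens.groupby('Datetime')['Corpus']
          .value_counts()
          .reset_index(name='word_count')
)

print(word_counts)
```

Passing `name='word_count'` to `reset_index` names the count column explicitly, which avoids a clash with the existing `Corpus` level.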

