I have the following data generated from a groupby('Datetime') and value_counts():
Datetime 0
01/01/2020 Paul 803 2
01/02/2020 Paul 210982360967 1
01/03/2020 religion 3..
02/28/2020 l 18
02/29/2020 Paul 78march 22
03/01/2020 church 63l 21
I would like to remove a specific name (in this case 'Paul') and all the numbers (03 and 10982360967 in this specific example). I do not know why there is a stray character 'l', as I had tried to remove stopwords, including single letters and numbers.
Do you know how I could further clean this selection?
Expected output to avoid confusion:
Datetime 0
01/03/2020 religion 3..
02/29/2020 march 22
03/01/2020 church 63
I removed Paul, 03, 109..., and l.
Raw data:
Datetime Corpus
01/03/2020 Paul: examples of religion
01/03/2020 Paul:shinto is a religion 03
01/03/2020 don't talk to me about religion, Paul 03
...
02/29/2020 march is the third month of the year 10982360967
02/29/2020 during march, there are some cold days.
...
03/01/2020 she is at church right now
...
I cannot put all the raw data as I have more than 100 sentences.
The code I used is:
df.Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
Since I got a KeyError, I had to edit the code as follows:
df.set_index('Datetime').Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
To extract the words I used str.extractall
Cleaning strings is a multi-step process
Create dataframe
import pandas as pd
from nltk.corpus import stopwords
import string

# data and dataframe
data = {'Datetime': ['01/03/2020', '01/03/2020', '01/03/2020', '02/29/2020', '02/29/2020', '03/01/2020'],
        'Corpus': ['Paul: Examples of religion',
                   'Paul:shinto is a religion 03',
                   "don't talk to me about religion, Paul 03",
                   'march is the third month of the year 10982360967',
                   'during march, there are some cold days.',
                   'she is at church right now']}
test = pd.DataFrame(data)
test.Datetime = pd.to_datetime(test.Datetime)

| | Datetime | Corpus |
|---:|:--------------------|:-------------------------------------------------|
| 0 | 2020-01-03 00:00:00 | Paul: Examples of religion |
| 1 | 2020-01-03 00:00:00 | Paul:shinto is a religion 03 |
| 2 | 2020-01-03 00:00:00 | don't talk to me about religion, Paul 03 |
| 3 | 2020-02-29 00:00:00 | march is the third month of the year 10982360967 |
| 4 | 2020-02-29 00:00:00 | during march, there are some cold days. |
| 5 | 2020-03-01 00:00:00 | she is at church right now |
Clean Corpus
- Add extra words to the remove_words list
- Some cleaning steps could be combined, but I do not recommend that
- Step-by-step makes it easier to determine if you've made a mistake
- This is a small example of text cleaning.
- There are entire books on the subject.
- There's no context analysis: with example = 'We march to the church in March.', the value_count for 'march' in example.lower() is 2, because the verb and the month are counted as the same word.
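The context caveat above can be checked with a few lines of plain Python. This is only an illustrative sketch; the variable names `table`, `words`, and `march_count` are made up for the example and are not part of the cleaning code in this answer.

```python
import string

example = 'We march to the church in March.'

# strip punctuation, lowercase, and split into words
table = str.maketrans('', '', string.punctuation)
words = example.lower().translate(table).split()

# the verb "march" and the month "March" become the same token
march_count = words.count('march')
print(march_count)  # 2
```

Telling the two meanings apart would need something beyond simple token counting, such as part-of-speech tagging.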
# words to remove
remove_words = list(stopwords.words('english'))

# extra words to remove; add other words to exclude in lowercase
additional_remove_words = ['paul', 'shinto', 'examples', 'talk', 'third', 'month', 'year', 'cold', 'days', 'right']
remove_words.extend(additional_remove_words)

# punctuation to remove
punctuation = string.punctuation
punc = r'[{}]'.format(punctuation)

test.dropna(inplace=True)  # drop any na rows

# clean text now
# regex=True is passed explicitly: since pandas 2.0, str.replace treats the pattern as a literal string by default
test.Corpus = test.Corpus.str.replace(r'\d+', '', regex=True)  # remove numbers
test.Corpus = test.Corpus.str.replace(punc, ' ', regex=True)  # replace punctuation with a space
test.Corpus = test.Corpus.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into one space
test.Corpus = test.Corpus.str.strip()  # remove whitespace from beginning and end of string
test.Corpus = test.Corpus.str.lower()  # convert all to lowercase
test.Corpus = test.Corpus.apply(lambda x: [word for word in x.split() if word not in remove_words])  # remove words

| | Datetime | Corpus |
|---:|:--------------------|:-------------|
| 0 | 2020-01-03 00:00:00 | ['religion'] |
| 1 | 2020-01-03 00:00:00 | ['religion'] |
| 2 | 2020-01-03 00:00:00 | ['religion'] |
| 3 | 2020-02-29 00:00:00 | ['march'] |
| 4 | 2020-02-29 00:00:00 | ['march'] |
| 5 | 2020-03-01 00:00:00 | ['church'] |
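If it helps to test the cleaning logic outside of pandas, the same steps can be sketched as a single plain-Python function. Everything named here is an assumption made for illustration: `clean_text`, `STOPWORDS`, `EXTRA_WORDS`, and `PUNCT_RE` are hypothetical, and the tiny hand-written stopword set only stands in for NLTK's English list. The `len(w) > 1` filter is an extra step that is not in the code above; it is one possible way to drop a stray single character such as the 'l' mentioned in the question, which could be what remains of a token like '63l' once the digits are stripped.

```python
import re
import string

# Assumption: a tiny hand-written stopword set standing in for NLTK's English list
STOPWORDS = {'is', 'a', 'to', 'me', 'about', 'don', 't', 'of', 'the'}
# Assumption: extra words to drop, all lowercase (same idea as remove_words)
EXTRA_WORDS = {'paul', 'shinto', 'examples', 'talk'}

# character class matching any single punctuation character
PUNCT_RE = re.compile('[{}]'.format(re.escape(string.punctuation)))


def clean_text(text, remove_words=STOPWORDS | EXTRA_WORDS):
    """Return the lowercase words of `text` that survive cleaning."""
    text = re.sub(r'\d+', '', text)   # remove numbers
    text = PUNCT_RE.sub(' ', text)    # replace punctuation with a space
    words = text.lower().split()      # lowercase; split() also collapses whitespace
    # drop unwanted words and stray single characters
    return [w for w in words if w not in remove_words and len(w) > 1]


print(clean_text("don't talk to me about religion, Paul 03"))  # ['religion']
print(clean_text('church 63l'))                                # ['church']
```

A function like this could be applied to the column with `test.Corpus.apply(clean_text)`, although the vectorized `str` methods shown above do the same job step by step.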
Explode Corpus & groupby
# explode list
test = test.explode('Corpus')

# dropna in case there are empty rows from filtering
test.dropna(inplace=True)

# groupby
test.groupby('Datetime').agg({'Corpus': 'value_counts'}).rename(columns={'Corpus': 'word_count'})

                     word_count
Datetime   Corpus
2020-01-03 religion           3
2020-02-29 march              2
2020-03-01 church             1
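The question also keeps only the top two entries per date with head(2). A minimal sketch of how that could be combined with the explode step is shown below. The `cleaned` frame is hypothetical, already-cleaned data invented for the example (the word 'temple' does not come from the question), and `exploded`, `counts`, and `top_words` are illustrative names.

```python
import pandas as pd

# Hypothetical already-cleaned data: one list of kept words per row
cleaned = pd.DataFrame({
    'Datetime': pd.to_datetime(
        ['01/03/2020', '01/03/2020', '02/29/2020', '03/01/2020'],
        format='%m/%d/%Y'),
    'Corpus': [['religion'], ['religion', 'temple'], ['march'], ['church']],
})

# one row per word; rows whose list was empty become NaN and are dropped
exploded = cleaned.explode('Corpus').dropna(subset=['Corpus'])

# count each word per date, then keep the two most frequent words per date
counts = exploded.groupby('Datetime')['Corpus'].value_counts()
top_words = counts.groupby(level=0).head(2).rename('word_count')

print(top_words)
```

Because value_counts sorts each group in descending order, head(2) on the grouped result keeps the most frequent words for each date.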