How to clean a string to get value_counts for words of interest by date?

2024/10/7 20:30:40

I have the following data, generated from a groupby('Datetime') and value_counts():

Datetime        0          
01/01/2020  Paul            803              2
01/02/2020  Paul            210982360967     1
01/03/2020  religion        3..
02/28/2020  l              18
02/29/2020  Paul           78march          22
03/01/2020  church         63l              21

I would like to remove a specific name (in this case 'Paul') and all the numbers (03 and 10982360967 in this example). I do not know why the character 'l' appears, since I tried to remove stopwords, including single letters and numbers. Do you know how I could clean this selection further?

Expected output to avoid confusion:

Datetime        0          
01/03/2020  religion        3..
02/29/2020  march          22
03/01/2020  church         63

I removed Paul, 03, 109..., and l.

Raw data:

Datetime        Corpus          
01/03/2020      Paul: examples of religion
01/03/2020      Paul:shinto is a religion 03
01/03/2020      don't talk to me about religion, Paul 03
...
02/29/2020     march is the third month of the year 10982360967
02/29/2020     during march, there are some cold days.
...
03/01/2020     she is at church right now
...

I cannot put all the raw data as I have more than 100 sentences.

The code I used is:

df.Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)

Since I got a KeyError, I had to edit the code as follows:

df.set_index('Datetime').Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)

To extract the words I used str.extractall.

Answer

Cleaning strings is a multi-step process

Create dataframe

import pandas as pd
from nltk.corpus import stopwords
import string

# data and dataframe
data = {'Datetime': ['01/03/2020', '01/03/2020', '01/03/2020', '02/29/2020', '02/29/2020', '03/01/2020'],
        'Corpus': ['Paul: Examples of religion',
                   'Paul:shinto is a religion 03',
                   "don't talk to me about religion, Paul 03",
                   'march is the third month of the year 10982360967',
                   'during march, there are some cold days.',
                   'she is at church right now']}

test = pd.DataFrame(data)
test.Datetime = pd.to_datetime(test.Datetime)

|    | Datetime            | Corpus                                           |
|---:|:--------------------|:-------------------------------------------------|
|  0 | 2020-01-03 00:00:00 | Paul: Examples of religion                       |
|  1 | 2020-01-03 00:00:00 | Paul:shinto is a religion 03                     |
|  2 | 2020-01-03 00:00:00 | don't talk to me about religion, Paul 03         |
|  3 | 2020-02-29 00:00:00 | march is the third month of the year 10982360967 |
|  4 | 2020-02-29 00:00:00 | during march, there are some cold days.          |
|  5 | 2020-03-01 00:00:00 | she is at church right now                       |

Clean Corpus

  • Add extra words to the remove_words list
    • They should be lowercase
  • Some cleaning steps could be combined, but I do not recommend that
    • Step-by-step makes it easier to determine if you've made a mistake
  • This is a small example of text cleaning.
    • There are entire books on the subject.
    • There's no context analysis
      • example = 'We march to the church in March.'
      • value_count for 'march' in example.lower() is 2
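The point about missing context analysis can be checked with a minimal sketch that uses only the standard library. It is an illustration, not part of the cleaning pipeline itself; the variable names `table`, `tokens` and `counts` are assumptions introduced here.

```python
import string
from collections import Counter

example = 'We march to the church in March.'

# Lowercase first, then strip punctuation so that 'March.' and 'march' match.
table = str.maketrans('', '', string.punctuation)
tokens = example.lower().translate(table).split()

counts = Counter(tokens)
print(counts['march'])  # 2: the verb and the month are counted together
```

Telling the verb from the month would need part-of-speech tagging or similar, which is outside the scope of this answer.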
# words to remove
remove_words = list(stopwords.words('english'))

# extra words to remove (add other words to exclude, in lowercase)
additional_remove_words = ['paul', 'shinto', 'examples', 'talk', 'third', 'month', 'year', 'cold', 'days', 'right']
remove_words.extend(additional_remove_words)

# punctuation to remove
punctuation = string.punctuation
punc = r'[{}]'.format(punctuation)

test.dropna(inplace=True)  # drop any na rows

# clean text now
# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default
test.Corpus = test.Corpus.str.replace(r'\d+', '', regex=True)  # remove numbers
test.Corpus = test.Corpus.str.replace(punc, ' ', regex=True)  # remove punctuation
test.Corpus = test.Corpus.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into one space
test.Corpus = test.Corpus.str.strip()  # remove whitespace from beginning and end of string
test.Corpus = test.Corpus.str.lower()  # convert all to lowercase
test.Corpus = test.Corpus.apply(lambda x: [word for word in x.split() if word not in remove_words])  # remove words

|    | Datetime            | Corpus       |
|---:|:--------------------|:-------------|
|  0 | 2020-01-03 00:00:00 | ['religion'] |
|  1 | 2020-01-03 00:00:00 | ['religion'] |
|  2 | 2020-01-03 00:00:00 | ['religion'] |
|  3 | 2020-02-29 00:00:00 | ['march']    |
|  4 | 2020-02-29 00:00:00 | ['march']    |
|  5 | 2020-03-01 00:00:00 | ['church']   |
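To try the cleaning steps on one string at a time, they can also be sketched as a plain function. This is only an illustration under stated assumptions: `clean_text` and `REMOVE_WORDS` are hypothetical names, and the small hard-coded set stands in for the NLTK stop word list plus the extra words, so the snippet runs without NLTK.

```python
import re
import string

# Hypothetical stand-in for stopwords.words('english') plus the extra words.
REMOVE_WORDS = {'paul', 'shinto', 'is', 'a', 'to', 'me', 'about', 'talk', 'don', 't'}

# Character class matching any single punctuation character.
PUNC = '[{}]'.format(re.escape(string.punctuation))


def clean_text(text, remove_words=REMOVE_WORDS):
    """Return the lowercase words of `text` that survive the cleaning steps."""
    text = re.sub(r'\d+', '', text)   # remove numbers
    text = re.sub(PUNC, ' ', text)    # replace punctuation with a space
    text = text.lower()               # convert all to lowercase
    # split() also discards leading, trailing and repeated whitespace
    return [word for word in text.split() if word not in remove_words]


print(clean_text('Paul:shinto is a religion 03'))              # ['religion']
print(clean_text("don't talk to me about religion, Paul 03"))  # ['religion']
```

Checking single strings like this makes it easier to see which step is responsible when an unexpected token, such as a stray 'l', shows up in the counts.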

Explode Corpus & groupby

# explode list
test = test.explode('Corpus')

# dropna in case there are empty rows from filtering
test.dropna(inplace=True)

# groupby
test.groupby('Datetime').agg({'Corpus': 'value_counts'}).rename(columns={'Corpus': 'word_count'})

                     word_count
Datetime   Corpus              
2020-01-03 religion           3
2020-02-29 march              2
2020-03-01 church             1
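The question also kept only the top two words per date with head(2). A possible variation, sketched here on a small hand-built frame that mirrors the exploded data above (the names `words`, `word_count` and `top2` are assumptions for this example), selects the column before calling value_counts and then takes the first two rows of each date:

```python
import pandas as pd

# Hypothetical exploded frame: one cleaned word per row, as in the result above.
words = pd.DataFrame({
    'Datetime': pd.to_datetime(['01/03/2020', '01/03/2020', '01/03/2020',
                                '02/29/2020', '02/29/2020', '03/01/2020']),
    'Corpus': ['religion', 'religion', 'religion', 'march', 'march', 'church'],
})

# Count each word within each date; counts are sorted descending per date.
word_count = words.groupby('Datetime')['Corpus'].value_counts()

# Keep at most the two most frequent words for each date.
top2 = word_count.groupby(level='Datetime').head(2)
print(top2)
```

Grouping the DataFrame and then selecting 'Corpus' also avoids the KeyError from the question, which arises because a bare Series has no 'Datetime' column to group by.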
