Cleaning big data using Python

2024/9/20 12:22:34

I have to clean an input data file in Python. Because of typos, a data field may contain a string instead of a number. I would like to identify every field that holds a string and fill it with NaN using pandas. I would also like to log the index of each such field.

The crudest way is to loop through every field and check whether it is a number, but this takes a lot of time when the data is big.

My csv file contains data similar to the following table:

Country  Count  Sales
USA         1   65000
UK          3    4000
IND         8       g
SPA         3    9000
NTH         5   80000

Assume that I have 60,000 such rows in the data.

Ideally, I would like to identify that the row IND has an invalid value in the Sales column. Any suggestions on how to do this efficiently?

Answer

There is a na_values argument to read_csv:

na_values : list-like or dict, default None
       Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

In [1]: df = pd.read_csv('city.csv', sep=r'\s+', na_values=['g'])

In [2]: df
Out[2]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000
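
The dict form of na_values mentioned in the docstring above can be sketched as follows. This is a minimal example, not part of the original answer: it assumes the sample table is read from an in-memory string (the variable name raw is made up for illustration) rather than from city.csv, and it only works when the bad strings are known in advance.

```python
import io

import pandas as pd

# Inline copy of the sample table so the sketch runs without city.csv.
# "raw" is a hypothetical name used only for this illustration.
raw = """Country  Count  Sales
USA         1   65000
UK          3    4000
IND         8       g
SPA         3    9000
NTH         5   80000
"""

# Treat 'g' as missing in the Sales column only; other columns are unaffected.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", na_values={"Sales": ["g"]})

print(df)
```

Because 'g' is listed only under Sales, a literal 'g' in another column would be left as it is.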

Using pandas.isnull, you can select only the rows with NaN in the 'Sales' column, or just the 'Country' values of those rows:

In [3]: df[pd.isnull(df['Sales'])]
Out[3]:
  Country  Count  Sales
2     IND      8    NaN

In [4]: df[pd.isnull(df['Sales'])]['Country']
Out[4]:
2    IND
Name: Country
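
The question also asks to log the index of the bad fields. One possible way to do that from the same isnull mask is sketched below. The logger name "cleaning" and the variable bad_index are assumptions made for this example, and the DataFrame is rebuilt by hand so the snippet runs on its own.

```python
import logging

import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cleaning")  # hypothetical logger name

# Same data as the sample table, with the bad Sales value already set to NaN.
df = pd.DataFrame({
    "Country": ["USA", "UK", "IND", "SPA", "NTH"],
    "Count": [1, 3, 8, 3, 5],
    "Sales": [65000, 4000, np.nan, 9000, 80000],
})

# Index labels of every row whose Sales value is missing.
bad_index = df.index[df["Sales"].isnull()].tolist()

logger.info("Rows with missing Sales: %s", bad_index)
```

The mask is computed for the whole column at once, so no Python-level loop over the rows is needed.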

If the data is already in a DataFrame, you could use apply to convert the strings that are numbers into integers (using str.isdigit) and replace everything else with NaN:

In [10]: import numpy as np

In [11]: df = pd.DataFrame({'Count': {0: 1, 1: 3, 2: 8, 3: 3, 4: 5}, 'Country': {0: 'USA', 1: 'UK', 2: 'IND', 3: 'SPA', 4: 'NTH'}, 'Sales': {0: '65000', 1: '4000', 2: 'g', 3: '9000', 4: '80000'}})

In [12]: df
Out[12]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8      g
3     SPA      3   9000
4     NTH      5  80000

In [13]: df['Sales'] = df['Sales'].apply(lambda x: int(x) if str.isdigit(x) else np.nan)

In [14]: df
Out[14]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000
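
The apply call above still runs a Python function once per row. As an alternative that the original answer does not use, pandas.to_numeric with errors='coerce' converts the whole column in one vectorized call and turns anything unparseable into NaN, without needing to know the bad strings in advance. The sketch below assumes the same sample data; the names converted, invalid and invalid_index are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "UK", "IND", "SPA", "NTH"],
    "Count": [1, 3, 8, 3, 5],
    "Sales": ["65000", "4000", "g", "9000", "80000"],
})

# Parse the whole column at once; values that are not numbers become NaN.
converted = pd.to_numeric(df["Sales"], errors="coerce")

# A value is invalid if it was present before but failed to parse.
invalid = converted.isna() & df["Sales"].notna()
invalid_index = df.index[invalid].tolist()

df["Sales"] = converted

print(invalid_index)
print(df)
```

Unlike str.isdigit, this also accepts negative numbers and decimals such as '-5' or '12.5', which may or may not be what the data calls for.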