Question 1

I have to clean an input data file in Python. Because of typos, a data field may contain a string instead of a number. I would like to identify all fields that contain a string and fill them with NaN using pandas. I would also like to log the index of those fields.

The crudest way is to loop through every field and check whether it is a number, but this takes a lot of time when the data is big.

My csv file contains data similar to the following table:

Country  Count  Sales
USA         1   65000
UK          3    4000
IND         8       g
SPA         3    9000
NTH         5   80000

.... Assume that I have 60,000 such rows in the data.

Ideally, I would like to identify that the row IND has an invalid value in the Sales column. Any suggestions on how to do this efficiently?

Answer

There is a na_values argument to read_csv:

na_values : list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

df = pd.read_csv('city.csv', sep=r'\s+', na_values=['g'])

In [2]: df
Out[2]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000
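
As the documentation excerpt says, na_values can also be a dict, so a string is treated as missing only in the columns you name. A minimal sketch, assuming the sample data from the question; io.StringIO stands in for 'city.csv', and the names raw and df_percol are only for illustration:

```python
import io

import pandas as pd

# Sample rows copied from the question; StringIO replaces the real file.
raw = """Country  Count  Sales
USA         1   65000
UK          3    4000
IND         8       g
SPA         3    9000
NTH         5   80000
"""

# Treat 'g' as NaN only in the Sales column, not in Country or Count.
df_percol = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",
    na_values={"Sales": ["g"]},
)
print(df_percol)
```

This only works when you know the bad strings in advance; an unexpected typo would still leave the column as strings.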

Using pandas.isnull, you can select only those rows with NaN in the 'Sales' column, or just the 'Country' values of those rows:

In [3]: df[pd.isnull(df['Sales'])]
Out[3]:
  Country  Count  Sales
2     IND      8    NaN

In [4]: df[pd.isnull(df['Sales'])]['Country']
Out[4]:
2    IND
Name: Country
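
To log the index of those rows, as the question asks, the same null mask can be applied to the DataFrame's index. A rough sketch, assuming the standard logging module is acceptable; the logger name "cleaning" and the variable names are hypothetical:

```python
import logging

import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cleaning")  # hypothetical logger name

# Same data as above, after 'g' has been read in as NaN.
df_nan = pd.DataFrame({
    "Country": ["USA", "UK", "IND", "SPA", "NTH"],
    "Count": [1, 3, 8, 3, 5],
    "Sales": [65000, 4000, np.nan, 9000, 80000],
})

# Index labels of the rows whose Sales value is missing.
bad_index = df_nan.index[pd.isnull(df_nan["Sales"])].tolist()
log.info("Rows with invalid Sales: %s", bad_index)
```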

If the data is already in a DataFrame, you could use apply to convert the strings that are numbers into integers (using str.isdigit) and replace everything else with NaN:

df = pd.DataFrame({'Count': {0: 1, 1: 3, 2: 8, 3: 3, 4: 5}, 'Country': {0: 'USA', 1: 'UK', 2: 'IND', 3: 'SPA', 4: 'NTH'}, 'Sales': {0: '65000', 1: '4000', 2: 'g', 3: '9000', 4: '80000'}})

In [12]: df
Out[12]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8      g
3     SPA      3   9000
4     NTH      5  80000

In [13]: df['Sales'] = df['Sales'].apply(lambda x: int(x) if str.isdigit(x) else np.nan)

In [14]: df
Out[14]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000
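
Note that str.isdigit rejects negative numbers and decimals such as '-5' or '3.5'. A different technique from the apply call above, and not part of the original answer, is pandas.to_numeric with errors='coerce', which converts the whole column at once and turns anything unparseable into NaN. A minimal sketch under that assumption; the names df_raw, converted, and invalid_rows are only for illustration:

```python
import pandas as pd

df_raw = pd.DataFrame({
    "Country": ["USA", "UK", "IND", "SPA", "NTH"],
    "Count": [1, 3, 8, 3, 5],
    "Sales": ["65000", "4000", "g", "9000", "80000"],
})

# Unparseable strings become NaN instead of raising an error.
converted = pd.to_numeric(df_raw["Sales"], errors="coerce")

# A row is a typo if it had a value before conversion but is NaN afterwards.
invalid_mask = converted.isna() & df_raw["Sales"].notna()
invalid_rows = df_raw.index[invalid_mask].tolist()

df_raw["Sales"] = converted
print(invalid_rows)
print(df_raw)
```

Comparing the mask before and after conversion keeps genuinely empty cells separate from typos, so only the latter end up in invalid_rows.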

cleaning big data using python
