How to gracefully fall back to a `NaN` value while reading integers from a CSV with Pandas?

2024/10/12 23:24:37

While using read_csv with Pandas, if I want a given column to be converted to a specific type, a malformed value interrupts the whole operation without any indication of which value was the offending one.

For example, running something like:

import pandas as pd
import numpy as np

df = pd.read_csv('my.csv', dtype={'my_column': np.int64})

Will lead to a stack trace ending with the error:

ValueError: cannot safely convert passed user dtype of <i8 for object dtyped data in column ...

If I had the row number or the offending value in the error message, I could add it to the list of known NaN values, but as it stands there is nothing I can do.

Is there a way to tell the parser to ignore failures and return a np.nan in that case?

Post Scriptum: Funnily enough, after parsing without any type suggestion (no dtype argument), df['my_column'].value_counts() seems to infer the dtype correctly and to place np.nan automatically, even though the actual dtype of the series is a generic object, which will fail on almost every plotting and statistical operation.

Answer

Thanks to the comments I realised that there is no NaN for integers (NumPy integer dtypes such as int64 cannot represent it), which was very surprising to me. So I switched to converting to float:

import pandas as pd
import numpy as np

df = pd.read_csv('my.csv', dtype={'my_column': np.float64})

This gave me an understandable error message that included the value of the failing conversion, so I could add that value to na_values:

df = pd.read_csv('my.csv', dtype={ 'my_column': np.float64 }, na_values=['n/a'])

This way I could finally import the CSV in a form that works with visualisation and statistical functions:

>>> df['session_planned_os'].dtype
dtype('float64')
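
To try the na_values approach without a file on disk, here is a minimal sketch. The CSV text, the 'other' column and the 'unknown' marker are made up for illustration (they are not from my real data); substitute whatever value the error message reports for your file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory CSV standing in for 'my.csv'.
# 'unknown' plays the role of the malformed value in the numeric column.
csv_text = "my_column,other\n1,a\nunknown,b\n3,c\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"my_column": np.float64},
    na_values=["unknown"],  # treat this marker as missing instead of failing
)

print(df["my_column"].dtype)         # float dtype, so NaN can be stored
print(df["my_column"].isna().sum())  # number of rows that became NaN
```

The column ends up as float64 with the marker replaced by NaN, which is what plotting and statistics functions expect.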

Once you are able to spot the right na_values, you can remove the dtype argument from read_csv. Type inference will now happen correctly:

df = pd.read_csv('my.csv', na_values=['n/a'])
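
If you cannot list the bad values up front, a different route (not the one I used above, just a sketch) is to read the column as text and convert it afterwards with pd.to_numeric(..., errors="coerce"), which turns anything unparseable into NaN. The same step also lets you list the offending values. The CSV text and the 'oops' value below are made up for illustration; the final cast to pandas' nullable "Int64" dtype is optional and assumes the remaining values are whole numbers.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV; 'oops' is the malformed entry.
csv_text = "my_column\n1\noops\n3\n"

# Read the column as plain strings so parsing itself cannot fail.
raw = pd.read_csv(io.StringIO(csv_text), dtype={"my_column": str})

# errors="coerce" replaces unparseable entries with NaN instead of raising.
numeric = pd.to_numeric(raw["my_column"], errors="coerce")

# Entries that were present in the file but could not be converted.
offending = raw.loc[numeric.isna() & raw["my_column"].notna(), "my_column"]
print(offending.tolist())

# Optional: nullable integer dtype, which can hold integers and missing values.
as_int = numeric.astype("Int64")
print(as_int.dtype)
```

Once the offending values are known, they can be passed to na_values as shown earlier, or the coerced column can simply be assigned back to the DataFrame.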