pandas cut a series with nan values

2024/10/4 1:22:49

I would like to apply the pandas cut function to a series that includes NaNs. The desired behavior is that it buckets the non-NaN elements and returns NaN for the NaN-elements.

import pandas as pd
numbers_with_nan = pd.Series([3,1,2,pd.NaT,3])
numbers_without_nan = numbers_with_nan.dropna()

The cutting works fine for the series without NaNs:

pd.cut(numbers_without_nan, bins=[1,2,3], include_lowest=True)
0      (2.0, 3.0]
1    (0.999, 2.0]
2    (0.999, 2.0]
4      (2.0, 3.0]

When I cut the series that contains NaNs, element 3 is correctly returned as NaN, but the last element gets the wrong bin assigned:

pd.cut(numbers_with_nan, bins=[1,2,3], include_lowest=True)
0      (2.0, 3.0]
1    (0.999, 2.0]
2    (0.999, 2.0]
3             NaN
4    (0.999, 2.0]

How can I get the following output?

0      (2.0, 3.0]
1    (0.999, 2.0]
2    (0.999, 2.0]
3             NaN
4      (2.0, 3.0]

Answer

This is strange. The problem isn't pd.NaT itself; it's that your series has object dtype rather than a regular numeric dtype such as float or int.
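One way to see the dtype difference is to inspect the series directly. This is a minimal sketch; the variable names `with_nat` and `with_nan` are made up for illustration, and it assumes a reasonably recent pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Mixing integers with pd.NaT leaves pandas no common numeric type,
# so the values are stored as generic Python objects.
with_nat = pd.Series([3, 1, 2, pd.NaT, 3])
print(with_nat.dtype)  # object

# The same data with np.nan as the missing marker is stored as floats.
with_nan = pd.Series([3, 1, 2, np.nan, 3])
print(with_nan.dtype)  # float64
```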

A quick fix is to replace pd.NaT with np.nan via fillna. This triggers series conversion from object to float64 dtype, and may also lead to better performance.

import numpy as np

s = pd.Series([3, 1, 2, pd.NaT, 3])
res = pd.cut(s.fillna(np.nan), bins=[1, 2, 3], include_lowest=True)
print(res)
0    (2, 3]
1    [1, 2]
2    [1, 2]
3       NaN
4    (2, 3]
dtype: category
Categories (2, object): [[1, 2] < (2, 3]]

A more generalized solution is to convert to numeric explicitly beforehand:

s = pd.to_numeric(s, errors='coerce')
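Put together, the explicit conversion followed by the cut might look like the sketch below. It checks the category codes rather than the printed intervals, because the interval labels are formatted differently across pandas versions. The names `s_numeric` and `codes` are illustrative only.

```python
import pandas as pd

s = pd.Series([3, 1, 2, pd.NaT, 3])

# Coerce anything non-numeric (here pd.NaT) to NaN; the result is float64.
s_numeric = pd.to_numeric(s, errors="coerce")
print(s_numeric.dtype)  # float64

res = pd.cut(s_numeric, bins=[1, 2, 3], include_lowest=True)

# Category codes: 0 = lower bin, 1 = upper bin, -1 = missing value.
codes = res.cat.codes.tolist()
print(codes)  # [1, 0, 0, -1, 1]
```

With the numeric series, the first and last elements land in the same bin, which is the output the question asks for.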