Why does testing `NaN == NaN` not work for dropping from a pandas dataFrame?

2024/10/14 3:18:31

Please explain how NaN's are treated in pandas because the following logic seems "broken" to me, I tried various ways (shown below) to drop the empty values.

My dataframe, which I load from a CSV file using read.csv, has a column comments, which is empty most of the time.

The column marked_results.comments looks like this; all the rest of the column is NaN, so pandas loads empty entries as NaNs, so far so good:

0       VP
1       VP
2       VP
3     TEST
4      NaN
5      NaN
....

Now I try to drop those entries, only this works:

  • marked_results.comments.isnull()

All these don't work:

  • marked_results.comments.dropna() only gives the same column, nothing gets dropped, confusing.
  • marked_results.comments == NaN only gives a series of all Falses. Nothing was NaNs... confusing.
  • likewise marked_results.comments == nan

I also tried:

comments_values = marked_results.comments.unique()array(['VP', 'TEST', nan], dtype=object)# Ah, gotya! so now ive tried:
marked_results.comments == comments_values[2]
# but still all the results are Falses!!!
Answer

You should use isnull and notnull to test for NaN (these are more robust using pandas dtypes than numpy), see "values considered missing" in the docs.

Using the Series method dropna on a column won't affect the original dataframe, but do what you want:

In [11]: df
Out[11]:comments
0       VP
1       VP
2       VP
3     TEST
4      NaN
5      NaNIn [12]: df.comments.dropna()
Out[12]:
0      VP
1      VP
2      VP
3    TEST
Name: comments, dtype: object

The dropna DataFrame method has a subset argument (to drop rows which have NaNs in specific columns):

In [13]: df.dropna(subset=['comments'])
Out[13]:comments
0       VP
1       VP
2       VP
3     TESTIn [14]: df = df.dropna(subset=['comments'])
https://en.xdnf.cn/q/69462.html

Related Q&A

Python dynamic import methods from file [duplicate]

This question already has answers here:How can I import a module dynamically given its name as string?(10 answers)How can I import a module dynamically given the full path?(37 answers)Closed last yea…

Python list does not shuffle in a loop

Im trying to create an randomized list of keys by iterating:import randomkeys = [1, 2, 3, 4, 5] random.shuffle(keys) print keysThis works perfect. However, if I put it in a loop and capture the output:…

Python extract max value from nested dictionary

I have a nested dictionary of the form:{2015-01-01: {time: 8, capacity: 5}, 2015-01-02: {time: 8, capacity: 7},2015-01-03: {time: 8, capacity: 8} etc}The dictionary is created from a csv file using dic…

Exponential Decay on Python Pandas DataFrame

Im trying to efficiently compute a running sum, with exponential decay, of each column of a Pandas DataFrame. The DataFrame contains a daily score for each country in the world. The DataFrame looks lik…

TensorFlow - why doesnt this sofmax regression learn anything?

I am aiming to do big things with TensorFlow, but Im trying to start small. I have small greyscale squares (with a little noise) and I want to classify them according to their colour (e.g. 3 categories…

Extended example to understand CUDA, Numba, Cupy, etc

Mostly all examples of Numba, CuPy and etc available online are simple array additions, showing the speedup from going to cpu singles core/thread to a gpu. And commands documentations mostly lack good …

Python 2 newline tokens in tokenize module

I am using the tokenize module in Python and wonder why there are 2 different newline tokens:NEWLINE = 4 NL = 54Any examples of code that would produce both tokens would be appreciated.

Prevent encoding errors in Python

I have scripts which print out messages by the logging system or sometimes print commands. On the Windows console I get error messages likeTraceback (most recent call last):File "C:\Python32\lib\l…

How do I get the operating system name in a friendly manner using Python 2.5?

I tried:print os.nameAnd the output I got was::ntHowever, I want output more like "Windows 98", or "Linux".After suggestions in this question, I also tried:import os print os.name i…

Extend dataclass __repr__ programmatically

Suppose I have a dataclass with a set method. How do I extend the repr method so that it also updates whenever the set method is called: from dataclasses import dataclass @dataclass class State:A: int …