Fastest way to drop rows / get subset with difference from large DataFrame in Pandas

2024/9/7 17:56:28

Question

I'm looking for the fastest way to drop a set of rows whose indices I already have, or equivalently to select the rows whose indices are not in that set (which yields the same dataset), from a large Pandas DataFrame.

So far I have two solutions, both of which seem relatively slow to me:

df.loc[df.index.difference(indices)]

which takes ~115 sec on my dataset, and

df.drop(indices)

which takes ~215 sec on my dataset.

Is there a faster way to do this? Preferably in Pandas.

Performance of proposed Solutions

~41 sec: df[~df.index.isin(indices)] by @jezrael

Answer

I believe you can create a boolean mask, invert it with ~, and filter by boolean indexing:

df1 = df[~df.index.isin(indices)]

As @user3471881 mentioned, if you plan to modify the filtered df later, it is necessary to add copy to avoid chained-indexing problems (such as SettingWithCopyWarning):

df1 = df[~df.index.isin(indices)].copy()
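
To make the mask-and-copy step concrete, here is a minimal sketch on toy data. The DataFrame contents, the column name value, and the chosen index labels are made up for illustration and are not from the question's dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the large DataFrame
df = pd.DataFrame({"value": range(6)}, index=[10, 11, 12, 13, 14, 15])
indices = [11, 14]  # index labels of the rows to drop

mask = df.index.isin(indices)  # True where the row should be dropped
df1 = df[~mask].copy()         # keep the other rows as an independent copy

# Because df1 is a copy, this assignment changes df1 only; df is untouched
df1["value"] = df1["value"] * 2
```

After this runs, df1 holds the rows labelled 10, 12, 13 and 15, and the original df still has all six rows with their original values.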

The speed of this filtering depends on the number of matched indices and also on the length of the DataFrame.

So another possible solution is to create an array/list of the indices to keep, so that inverting is not necessary:

df1 = df[df.index.isin(need_indices)]
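
As a sanity check, the following sketch compares the approaches discussed here on a small random DataFrame and confirms that they select the same rows. The data, the column name x, the sizes, and the way need_indices is built are assumptions for illustration; actual timings will depend on your own data, so none are claimed here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(1000)})  # default integer index 0..999

# Hypothetical set of 300 distinct index labels to drop
indices = rng.choice(df.index.to_numpy(), size=300, replace=False)

by_difference = df.loc[df.index.difference(indices)]  # first approach from the question
by_drop = df.drop(indices)                            # second approach from the question
by_mask = df[~df.index.isin(indices)]                 # inverted boolean mask

# Indices to keep, so that no inversion is needed
need_indices = df.index.difference(indices)
by_keep = df[df.index.isin(need_indices)]

# All four results should contain the same rows in the same order
same = (
    by_mask.equals(by_drop)
    and by_mask.equals(by_difference)
    and by_mask.equals(by_keep)
)
```

To compare speed on your own data, each of the four expressions can be wrapped in timeit.timeit (or %timeit in IPython) with the real df and indices.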
