Fastest way to drop rows / get subset with difference from large DataFrame in Pandas

2024/9/7 17:56:28

Question

I'm looking for the fastest way to drop a set of rows whose indices I already have, or equivalently to select the rows whose indices are not in that set (which yields the same result), from a large Pandas DataFrame.

So far I have two solutions, which seem relatively slow to me:

  1. df.loc[df.index.difference(indices)]

    which takes ~115 sec on my dataset

  2. df.drop(indices)

    which takes ~215 sec on my dataset

Is there a faster way to do this? Preferably in Pandas.

Performance of proposed Solutions

  • ~41 sec: df[~df.index.isin(indices)] by @jezrael

Answer

I believe you can create a boolean mask with isin, invert it with ~, and filter by boolean indexing:

df1 = df[~df.index.isin(indices)]
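As a minimal sketch of how the mask works, assuming a small toy DataFrame (the data and labels below are made up for illustration and are not from the question):

```python
import pandas as pd

# Hypothetical toy data; the DataFrame in the question is much larger.
df = pd.DataFrame({"value": range(6)}, index=list("abcdef"))
indices = ["b", "d", "e"]  # index labels of the rows to drop

# isin gives True for rows whose label is in `indices`;
# ~ flips the mask so that True marks the rows to keep.
mask = ~df.index.isin(indices)
df1 = df[mask]

# On this data the result holds the same rows as df.drop(indices).
same_as_drop = df1.equals(df.drop(indices))
kept_labels = df1.index.tolist()
```

One behavioural difference to keep in mind: df.drop(indices) raises a KeyError by default when a label is missing from the index, while the isin mask simply ignores labels that are not present.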

As @user3471881 mentioned, if you plan to modify the filtered df later, add .copy() to avoid chained-indexing problems (SettingWithCopyWarning):

df1 = df[~df.index.isin(indices)].copy()
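A short sketch of what the copy buys you, again on made-up toy data: the filtered frame can be modified freely and the original stays untouched.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[0, 1, 2, 3])
indices = [1, 2]  # rows to drop

# Explicit copy: df1 is an independent DataFrame.
df1 = df[~df.index.isin(indices)].copy()
df1["value"] = df1["value"] * 2  # modifies df1 only

original_values = df["value"].tolist()
filtered_values = df1["value"].tolist()
```

Here original_values is still [10, 20, 30, 40], while filtered_values is [20, 80].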

The speed of this filtering depends on the number of matched indices and also on the length of the DataFrame.
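To see how both factors affect your own data, a rough timing sketch such as the following may help. The helper name time_isin_filter, the sizes, and the random selection of indices are all assumptions for illustration; absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_isin_filter(n_rows, n_drop, repeats=3):
    """Best-of-`repeats` time in seconds to filter `n_drop` of `n_rows` rows."""
    df = pd.DataFrame({"value": np.arange(n_rows)})
    rng = np.random.default_rng(0)
    # Pick `n_drop` distinct row labels to remove (hypothetical workload).
    indices = rng.choice(n_rows, size=n_drop, replace=False)
    runs = timeit.repeat(lambda: df[~df.index.isin(indices)],
                         number=1, repeat=repeats)
    return min(runs)


# Vary both the DataFrame length and the number of indices to drop.
timings = {
    (n_rows, n_drop): time_isin_filter(n_rows, n_drop)
    for n_rows in (10_000, 100_000)
    for n_drop in (100, 5_000)
}
```

Printing timings after running this gives one measurement per (length, number of dropped indices) pair.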

Another possible solution is to build an array/list of the indices to keep, so that inverting is not necessary:

df1 = df[df.index.isin(need_indices)]
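A small sketch comparing the two variants, assuming the labels to keep (need_indices) are already known; the toy data is hypothetical:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"value": range(5)}, index=[10, 11, 12, 13, 14])
indices = [11, 13]           # rows to drop
need_indices = [10, 12, 14]  # rows to keep, assumed known in advance

df_keep = df[df.index.isin(need_indices)]  # no inversion needed
df_drop = df[~df.index.isin(indices)]      # inverted mask

same = df_keep.equals(df_drop)
kept = df_keep.index.tolist()
```

Both expressions select the same rows here, so which one to use comes down to which list you already have at hand.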