pandas data frame - select rows and clear memory?

2024/10/10 16:24:22

I have a large pandas dataframe (size = 3 GB):

import pandas as pd
x = pd.read_table('big_table.txt', sep='\t', header=0, index_col=0)

Because I'm working under memory constraints, I subset the dataframe:

rows = calculate_rows() # a function that calculates what rows I need
cols = calculate_cols() # a function that calculates what cols I need
x = x.iloc[rows, cols]

The functions that calculate the rows and columns are not important, but they are DEFINITELY a smaller subset of the original rows and columns. However, when I do this operation, memory usage increases by a lot! The original goal was to shrink the memory footprint to less than 3GB, but instead, memory usage goes well over 6GB.

I'm guessing this is because Python creates a local copy of the dataframe in memory, but doesn't clean it up. There may also be other things that are happening... So my question is how do I subset a large dataframe and clean up the space? I can't find a function that selects rows/cols in place.

I have read a lot of Stack Overflow, but can't find much on this topic. It could be I'm not using the right keywords, so if you have suggestions, that could also help. Thanks!

Answer

You are much better off doing something like this:

Pass usecols to read_csv so that only the columns you want are read in the first place.

Then read the file in chunks (the chunksize argument). From each chunk, select the rows you want and set them aside, then concatenate the results at the end.

Pseudo-code ish:

reader = pd.read_csv('big_table.txt', sep='\t', header=0, index_col=0, usecols=the_columns_i_want_to_use, chunksize=10000)
df = pd.concat([chunk.iloc[rows_that_I_want_] for chunk in reader])

This keeps memory usage roughly constant at the size of one chunk, plus the size of the rows selected so far. The peak is about twice the size of the selected rows, which happens while the pieces are being concatenated. After the concat, usage drops back to the size of the selected rows.
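A runnable version of that idea might look like the sketch below. It is only an illustration under some assumptions: the in-memory table, the column names (id, a, b, c) and the wanted row labels are all made up, and rows are selected by index label rather than by position, because positions passed to iloc are relative to each chunk rather than to the whole file.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'big_table.txt': a small tab-separated table
# whose first column ("id") holds the row labels.
raw = "id\ta\tb\tc\n" + "".join(
    f"r{i}\t{i}\t{i * 2}\t{i * 3}\n" for i in range(10)
)

wanted_cols = ["id", "a", "c"]    # the index column has to be listed in usecols too
wanted_rows = {"r1", "r4", "r9"}  # row labels to keep (assumed to be known up front)

reader = pd.read_csv(
    io.StringIO(raw),   # replace with the real file path
    sep="\t",
    header=0,
    index_col="id",
    usecols=wanted_cols,
    chunksize=4,        # tiny on purpose; something like 10000 suits a real file
)

# Keep only the matching rows of each chunk, so at most one full chunk
# is held in memory alongside the rows selected so far.
pieces = [chunk[chunk.index.isin(wanted_rows)] for chunk in reader]
df = pd.concat(pieces)

print(df)
```

If the rows you need are only known by position, one option is to track a running offset per chunk and convert the global positions into chunk-local ones before calling iloc.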

