Question 1

I have a large pandas dataframe (size = 3 GB):

x = pd.read_table('big_table.txt', sep='\t', header=0, index_col=0)

Because I'm working under memory constraints, I subset the dataframe:

rows = calculate_rows() # a function that calculates what rows I need
cols = calculate_cols() # a function that calculates what cols I need
x = x.iloc[rows, cols]

The functions that calculate the rows and columns are not important, but they are DEFINITELY a smaller subset of the original rows and columns. However, when I do this operation, memory usage increases by a lot! The original goal was to shrink the memory footprint to less than 3GB, but instead, memory usage goes well over 6GB.

I'm guessing this is because Python creates a local copy of the dataframe in memory, but doesn't clean it up. There may also be other things that are happening... So my question is how do I subset a large dataframe and clean up the space? I can't find a function that selects rows/cols in place.

I have read a lot of Stack Overflow, but can't find much on this topic. It could be I'm not using the right keywords, so if you have suggestions, that could also help. Thanks!

Answer

You are much better off doing something like this:

Pass usecols to read_csv so that only the columns you want are read in the first place.

Then read the file in chunks (the chunksize argument). From each chunk, select the rows you want and set them aside, then concatenate the results at the end.

Pseudo-code ish:

reader = pd.read_csv('big_table.txt', sep='\t', header=0, index_col=0, usecols=the_columns_i_want_to_use, chunksize=10000)
# rows_that_I_want_ has to be expressed relative to each chunk, not to the whole file
df = pd.concat([chunk.iloc[rows_that_I_want_] for chunk in reader])

Memory usage stays roughly constant at the size of one chunk plus the rows selected so far. It peaks at about twice the size of the selected rows during the final concat, then drops back to the size of the selected rows once the concat finishes.
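For something you can actually run, here is a minimal sketch of the same idea. It assumes the rows you want can be identified by index label rather than by position, which sidesteps the problem of positions changing from chunk to chunk. The in-memory table and the names wanted_cols and wanted_rows are made up for illustration and stand in for big_table.txt and your own selection logic.

```python
import io

import pandas as pd

# Tiny stand-in for big_table.txt: tab-separated, first column is the index.
raw = "id\ta\tb\tc\n" + "".join(
    f"r{i}\t{i}\t{i * 2}\t{i * 3}\n" for i in range(10)
)

# Hypothetical selections; the index column must be listed in usecols too.
wanted_cols = ["id", "a", "c"]
wanted_rows = {"r1", "r4", "r8"}

reader = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=0,
    index_col="id",
    usecols=wanted_cols,  # column "b" is never loaded
    chunksize=4,          # deliberately small so several chunks are read
)

# Keep only the wanted rows of each chunk; filtering by label means the
# result does not depend on where the chunk boundaries fall.
pieces = [chunk[chunk.index.isin(wanted_rows)] for chunk in reader]
df = pd.concat(pieces)

print(df)
```

With a real file you would pass the path instead of the StringIO object and use a much larger chunksize. If your rows are only known by position, you would have to translate those positions into chunk-relative ones yourself.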

pandas data frame - select rows and clear memory?
