Why do pandas and dask perform better when importing from CSV compared to HDF5?

2024/10/3 0:26:52

I am working with a system that currently operates with large (>5GB) .csv files. To increase performance, I am testing (A) different methods to create dataframes from disk (pandas VS dask) as well as (B) different ways to store results to disk (.csv VS hdf5 files).

In order to benchmark performance, I did the following:

def dask_read_from_hdf():results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns = ['Security'])analyzed_stocks_dd_hdf =  results_dd_hdf.Security.unique()hdf.close()def pandas_read_from_hdf():results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns = ['Security'])analyzed_stocks_pd_hdf =  results_pd_hdf.Security.unique()hdf.close()def dask_read_from_csv():results_dd_csv = dd.read_csv(results_path, sep = ",", usecols = [0], header = 1, names = ["Security"])analyzed_stocks_dd_csv =  results_dd_csv.Security.unique()def pandas_read_from_csv():results_pd_csv = pd.read_csv(results_path, sep = ",", usecols = [0], header = 1, names = ["Security"])analyzed_stocks_pd_csv =  results_pd_csv.Security.unique()print "dask hdf performance"
%timeit dask_read_from_hdf()
gc.collect()
print""
print "pandas hdf performance"
%timeit pandas_read_from_hdf()
gc.collect()
print""
print "dask csv performance"
%timeit dask_read_from_csv()
gc.collect()
print""
print "pandas csv performance"
%timeit pandas_read_from_csv()
gc.collect()

My findings are:

dask hdf performance
10 loops, best of 3: 133 ms per looppandas hdf performance
1 loop, best of 3: 1.42 s per loopdask csv performance
1 loop, best of 3: 7.88 ms per looppandas csv performance
1 loop, best of 3: 827 ms per loop

When hdf5 storage can be accessed faster than .csv, and when dask creates dataframes faster than pandas, why is dask from hdf5 slower than dask from csv? Am I doing something wrong?

When does it make sense for performance to create dask dataframes from HDF5 storage objects?

Answer

HDF5 is most efficient when working with numerical data, I'm guessing you are reading a single string column, which is its weakpoint.

Performance of string data with HDF5 can be dramatically improved by using a Categorical to store your strings, assuming relatively low cardinality (high number of repeated values)

It's from a little while back, but a good blog post here going through exactly these considerations. http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization

You may also look at using parquet - it is similar to HDF5 in that it is a binary format, but is column oriented, so a single column selection like this will likely be faster.

Recently (2016-2017) there has been significant work to implement a fast native reader of parquet->pandas, and the next major release of pandas (0.21) will have to_parquet and pd.read_parquet functions built in.

https://arrow.apache.org/docs/python/parquet.html

https://fastparquet.readthedocs.io/en/latest/

https://matthewrocklin.com/blog//work/2017/06/28/use-parquet

https://en.xdnf.cn/q/70794.html

Related Q&A

is there any pool for ThreadingMixIn and ForkingMixIn for SocketServer?

I was trying to make an http proxy using BaseHttpServer which is based on SocketServer which got 2 asynchronous Mixins (ThreadingMixIn and ForkingMixIn)the problem with those two that they work on each…

Python - Read data from netCDF file with time as seconds since beginning of measurement

I need to extract values from a netCDf file. I am pretty new to python and even newer this file format. I need to extract time series data at a specific location (lat, lon). I have found that there is …

PyQt Multiline Text Input Box

I am working with PyQt and am attempting to build a multiline text input box for users. However, when I run the code below, I get a box that only allows for a single line of text to be entered. How to …

Calculate the sum of model properties in Django

I have a model Order which has a property that calculates an order_total based on OrderItems linked by foreign key.I would like to calculate the sum of a number of Order instances order_total propertie…

Set Host-header when using Python and urllib2

Im using my own resolver and would like to use urllib2 to just connect to the IP (no resolving in urllib2) and I would like set the HTTP Host-header myself. But urllib2 is just ignoring my Host-header:…

Full-featured date and time library

Im wondering if anyone knows of a good date and time library that has correctly-implemented features like the following:Microsecond resolution Daylight savings Example: it knows that 2:30am did not exi…

Mean of a correlation matrix - pandas data fram

I have a large correlation matrix in a pandas python DataFrame: df (342, 342).How do I take the mean, sd, etc. of all of the numbers in the upper triangle not including the 1s along the diagonal?Thank…

How to set imshow scale

Im fed up with matplotlib in that its so hard to plot images in specified size.Ive two images in 32*32, 20*20 sizes. I just want to plot them in its original size, or in proportion to its original size…

Python distutils gcc path

Im trying to cross-compile the pycrypto package, and Im getting closer and closer however, Ive hit an issue I just cant figure out.I want distutils to use the cross-compile specific gcc- so I set the C…

TypeError: builtin_function_or_method object has no attribute __getitem__

Ive got simple python functions.def readMainTemplate(templateFile):template = open(templateFile, r)data = template.read()index1 = data.index[[] #originally I passed it into data[]index2 = data.index[]]…