Question 1

I am working with a system that currently operates on large (>5GB) .csv files. To improve performance, I am testing (A) different methods of creating dataframes from disk (pandas vs. dask) and (B) different ways of storing results to disk (.csv vs. HDF5 files).

In order to benchmark performance, I did the following:

def dask_read_from_hdf():
    results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns = ['Security'])
    analyzed_stocks_dd_hdf =  results_dd_hdf.Security.unique()
    hdf.close()

def pandas_read_from_hdf():
    results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns = ['Security'])
    analyzed_stocks_pd_hdf =  results_pd_hdf.Security.unique()
    hdf.close()

def dask_read_from_csv():
    results_dd_csv = dd.read_csv(results_path, sep = ",", usecols = [0], header = 1, names = ["Security"])
    analyzed_stocks_dd_csv =  results_dd_csv.Security.unique()

def pandas_read_from_csv():
    results_pd_csv = pd.read_csv(results_path, sep = ",", usecols = [0], header = 1, names = ["Security"])
    analyzed_stocks_pd_csv =  results_pd_csv.Security.unique()

print "dask hdf performance"
%timeit dask_read_from_hdf()
gc.collect()
print""
print "pandas hdf performance"
%timeit pandas_read_from_hdf()
gc.collect()
print""
print "dask csv performance"
%timeit dask_read_from_csv()
gc.collect()
print""
print "pandas csv performance"
%timeit pandas_read_from_csv()
gc.collect()

My findings are:

dask hdf performance
10 loops, best of 3: 133 ms per loop

pandas hdf performance
1 loop, best of 3: 1.42 s per loop

dask csv performance
1 loop, best of 3: 7.88 ms per loop

pandas csv performance
1 loop, best of 3: 827 ms per loop

If HDF5 storage can be accessed faster than .csv, and dask creates dataframes faster than pandas, why is dask from HDF5 slower than dask from .csv? Am I doing something wrong?

When does it make sense for performance to create dask dataframes from HDF5 storage objects?

Answer

HDF5 is most efficient when working with numerical data. I'm guessing you are reading a single string column, which is its weak point.

Performance of string data with HDF5 can be dramatically improved by using a Categorical to store your strings, assuming relatively low cardinality (a high number of repeated values).
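As a rough illustration of the Categorical idea, here is a minimal sketch using pandas only. The ticker names, row count, and cardinality are invented for the example and are not taken from the question's data; the point is just that a low-cardinality string column becomes much smaller once it is stored as integer codes plus a small table of labels.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200,000 rows drawn from only 500 distinct tickers,
# i.e. low cardinality, which is the case where a Categorical helps.
rng = np.random.default_rng(0)
tickers = np.array(["SEC%04d" % i for i in range(500)], dtype=object)
df = pd.DataFrame({"Security": tickers[rng.integers(0, 500, size=200_000)]})

# Memory used while the column holds plain strings.
as_strings = df["Security"].memory_usage(deep=True)

# Convert to a Categorical: each row now stores a small integer code.
df["Security"] = df["Security"].astype("category")
as_category = df["Security"].memory_usage(deep=True)

print("bytes as strings: ", as_strings)
print("bytes as category:", as_category)

# The distinct values are available directly from the category labels.
unique_securities = df["Security"].cat.categories
```

Assuming PyTables is installed, a frame like this could then be written with df.to_hdf(..., format='table'), which is the format the pandas documentation uses for categorical columns.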

It's from a little while back, but there is a good blog post that goes through exactly these considerations: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization

You may also look at using parquet - it is similar to HDF5 in that it is a binary format, but is column oriented, so a single column selection like this will likely be faster.

Recently (2016-2017) there has been significant work to implement a fast native reader of parquet->pandas, and the next major release of pandas (0.21) will have to_parquet and pd.read_parquet functions built in.
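A minimal sketch of the single-column read with those functions follows. The helper names write_results and read_unique_securities are hypothetical, and the code assumes a pandas version that ships to_parquet / read_parquet together with either pyarrow or fastparquet installed as the engine.

```python
import pandas as pd


def write_results(df, path):
    """Store a results frame as Parquet (hypothetical helper).

    pandas picks whichever supported engine is installed
    (pyarrow or fastparquet).
    """
    df.to_parquet(path, index=False)


def read_unique_securities(path, column="Security"):
    """Load only one column from the Parquet file and return its unique values."""
    # columns=[...] asks the reader to load just that column,
    # which is where a column-oriented format is expected to pay off.
    frame = pd.read_parquet(path, columns=[column])
    return frame[column].unique()
```

For example, after write_results(results, 'results.parquet'), calling read_unique_securities('results.parquet') would return the distinct values of the Security column without loading the other columns. Whether this beats the CSV or HDF5 timings above would need to be measured on the actual data.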

https://arrow.apache.org/docs/python/parquet.html

https://fastparquet.readthedocs.io/en/latest/

https://matthewrocklin.com/blog//work/2017/06/28/use-parquet

Why do pandas and dask perform better when importing from CSV compared to HDF5?
