How to save h5py arrays with different sizes?

2024/10/12 4:34:07

This question follows on from an earlier one. I am opening a new thread because I did not really understand the answer given there, and I hope someone can explain it to me in more detail.

Basically, my problem is the same as in the linked question. Until now I have used np.vstack and created an HDF5 file from the result. Here is my example:

import numpy as np
import h5py
import glob

path = "/home/ling/test/"

def runtest():
    data1 = [np.loadtxt(file) for file in glob.glob(path + "data1/*.csv")]
    data2 = [np.loadtxt(file) for file in glob.glob(path + "data2/*.csv")]
    stack = np.vstack((data1, data2))
    h5f = h5py.File("/home/ling/test/2test.h5", "w")
    h5f.create_dataset("test_data", data=stack)
    h5f.close()

This works perfectly when all the arrays are the same size. When the sizes differ, it raises TypeError: Object dtype dtype('O') has no native HDF5 equivalent

From what I understand of the answer given there, I must save the arrays as separate datasets. But looking at the example snippet, for k,v in adict.items() and grp.create_dataset(k,data=v), k should be the name of the dataset, correct? Like test_data in my example? And what is v?

Below is what the result looks like for vstack and also stack:

[[array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.11719, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([ 0.03906,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.11719,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.15625, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.11719, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.15625,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.11719, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.11719, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.03906,  0.07812, ...,  0.03906,  0.07812,  0.     ])
  array([ 0.03906,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.11719,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])]
 [array([ 10.9375 ,  10.97656,  10.97656, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.01562,  11.01562,  11.01562, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.09375,  11.09375,  11.09375, ...,  11.09375,  11.09375,   1.     ])
  array([ 10.97656,  11.01562,  11.01562, ...,  11.13281,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.01562, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.05469,  11.13281,   1.     ])
  array([ 11.05469,  11.09375,  11.09375, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.09375,  11.05469,  11.09375, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.09375, ...,  11.05469,  11.05469,   1.     ])
  array([ 10.97656,  10.97656,  10.97656, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.09375,  11.05469,  11.09375, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.09375,  11.17188,   1.     ])
  array([ 11.09375,  11.09375,  11.09375, ...,  10.97656,  11.09375,   1.     ])
  array([ 11.09375,  11.09375,  11.09375, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.01562,  11.05469, ...,  11.01562,  11.01562,   1.     ])
  array([ 10.78125,  10.78125,  10.78125, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.13281,  11.09375,  11.13281, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.13281,  11.09375,  11.09375, ...,  11.05469,  11.05469,   1.     ])
  array([ 10.97656,  10.97656,  10.9375 , ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.09375,  11.05469, ...,  11.09375,  11.09375,   1.     ])
  array([ 10.9375 ,  10.9375 ,  10.9375 , ...,  11.09375,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.05469,  11.05469,   1.     ])
  array([ 10.9375 ,  10.89844,  10.9375 , ...,  11.05469,  11.09375,   1.     ])
  array([ 10.9375 ,  10.97656,  10.97656, ...,  11.05469,  11.05469,   1.     ])
  array([ 10.89844,  10.89844,  10.89844, ...,  11.05469,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.01562,  11.01562,   1.     ])]]

Thank you for your help and explanation.

Update

I solved the problem by using pandas. At first I used the exact suggestion by Pierre de Buyl, but it gave me an error when I tried to read the dataset with test_data = h5f["data1/file1"][:]. The error said: Unable to open object (Object 'file1' does not exist).

I checked by reading 2test.h5 with pandas.read_hdf, which showed the file as empty. I searched online for another solution and found this one, which I have modified:

import numpy as np
import glob
import pandas as pd

path = "/home/ling/test/"

def runtest():
    data1 = [np.loadtxt(file) for file in glob.glob(path + "data1/*.csv")]
    data2 = [np.loadtxt(file) for file in glob.glob(path + "data2/*.csv")]
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    combine = pd.concat([df1, df2], ignore_index=True)
    # sort the NaN to the left
    combinedf = combine.apply(lambda x: sorted(x, key=pd.notnull), axis=1)
    combinedf.to_hdf('/home/ling/test/2test.h5', key='twodata')

runtest()

For reading, I simply use

input_data = pd.read_hdf('2test.h5', 'twodata')
read_input = input_data.values
read1 = read_input[:, -1]  # read/get last column for example

Answer

The basic elements of an HDF5 file are groups (similar to directories) and datasets (similar to arrays).

NumPy will build an array from many kinds of input. When you try to create an array from disparate elements (i.e. elements of different lengths), NumPy returns an array of type 'O'; look for object_ in the NumPy reference guide. At that point there is little advantage in using NumPy, as the result behaves much like a standard Python list.
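To make the 'O' dtype concrete, here is a minimal sketch. The two rows are made-up values standing in for two CSV files of different lengths; they are not taken from the question's data.

```python
import numpy as np

# Two rows of different length (hypothetical values).
short_row = np.array([0.1, 0.2, 0.3])
long_row = np.array([1.0, 2.0, 3.0, 4.0])

# Recent NumPy releases require dtype=object to be spelled out for ragged
# input; older releases fell back to it silently.
ragged = np.array([short_row, long_row], dtype=object)
print(ragged.dtype)   # object
print(ragged.shape)   # (2,)

# Rows of equal length give an ordinary 2-D float array instead.
regular = np.vstack([short_row, short_row])
print(regular.dtype, regular.shape)   # float64 (2, 3)
```

The ragged array is a 1-D container of two Python objects, which is the form h5py refuses to write.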

HDF5 cannot store arrays of type 'O' because it does not have generic datatypes (only some support for C struct type objects).

The most obvious solution to your problem is to store your data in HDF5 datasets, with one dataset per table. You retain the advantage of collecting the data in a single file, and you have dict-like access to the elements.
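As a rough illustration of the "one dataset per table" idea and of the k, v loop quoted in the question: k is the dataset name and v is the array stored under that name. The dictionary tables, its keys and the file name sketch.h5 are all hypothetical, and the file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np

# Hypothetical stand-in for the loaded CSV files: name -> 1-D array.
tables = {
    "file1": np.arange(3, dtype=float),
    "file2": np.arange(5, dtype=float),
}

# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as h5f:
    grp = h5f.create_group("data1")
    for k, v in tables.items():
        grp.create_dataset(k, data=v)   # k: dataset name, v: its array

    # Dict-like access: list the names, then read one dataset back.
    names = sorted(grp.keys())
    lengths = {k: grp[k].shape[0] for k in names}
    second = h5f["data1/file2"][:]

print(names)
print(lengths)
```

Each dataset keeps its own length, so nothing has to be padded or stacked.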

Try the following code:

import numpy as np
import h5py
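import glob
import os

path = "/home/ling/test/"

def runtest():
    h5f = h5py.File("/home/ling/test/2test.h5", "w")
    for name in ("data1", "data2"):
        grp = h5f.create_group(name)
        for file in glob.glob(path + name + "/*.csv"):
            # Use only the file name, without directory or ".csv", as the
            # dataset name. Passing the full glob path would place the
            # dataset under /home/ling/test/... inside the file instead of
            # inside this group.
            key = os.path.splitext(os.path.basename(file))[0]
            grp.create_dataset(key, data=np.loadtxt(file))
    h5f.close()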

For reading:

h5f = h5py.File("/home/ling/test/2test.h5", "r")
test_data = h5f['data1/thefirstfilenamewithoutcsvextension'][:]
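If a read fails with an "object does not exist" error like the one in the Update, one way to check which names were actually written is to walk the file. This is only a sketch: the dataset names data1/file1 and data2/file9 and the in-memory file inspect_sketch.h5 are made up for the illustration.

```python
import h5py
import numpy as np

found = []

with h5py.File("inspect_sketch.h5", "w", driver="core", backing_store=False) as h5f:
    # Intermediate groups ("data1", "data2") are created automatically.
    h5f.create_dataset("data1/file1", data=np.zeros(4))
    h5f.create_dataset("data2/file9", data=np.zeros(7))

    def record(name, obj):
        # visititems calls this for every group and dataset in the file;
        # keep only the datasets, with their full path and shape.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape))

    h5f.visititems(record)

for name, shape in sorted(found):
    print(name, shape)
```

The printed paths are exactly the strings that can be used as keys, e.g. h5f["data1/file1"].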