This question refers to this earlier one. I am starting a new thread because I did not really understand the answer given there, and I hope someone can explain it in more detail.
Basically, my problem is the same as the one in the link. Previously, I used np.vstack and created an h5 format file from the result. Below is my example:
import numpy as np
import h5py
import glob

path = "/home/ling/test/"

def runtest():
    data1 = [np.loadtxt(file) for file in glob.glob(path + "data1/*.csv")]
    data2 = [np.loadtxt(file) for file in glob.glob(path + "data2/*.csv")]
    stack = np.vstack((data1, data2))
    h5f = h5py.File("/home/ling/test/2test.h5", "w")
    h5f.create_dataset("test_data", data=stack)
    h5f.close()
This works perfectly if all the arrays have the same size. But when the sizes differ, it throws this error: TypeError: Object dtype dtype('O') has no native HDF5 equivalent
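To make the error message easier to read, here is a minimal sketch of where dtype('O') comes from. The arrays are made-up stand-ins for the loaded CSV files, not my real data. Equal-length arrays stack into a regular float array, while arrays of different lengths can only be held together in an object array, which is the dtype HDF5 has no native equivalent for. As far as I can tell, older NumPy versions build such an object array silently, while recent versions raise an error instead, so the object array is built explicitly here.

```python
import numpy as np

# Hypothetical stand-ins for CSV contents (not the real data).
same_size = [np.zeros(4), np.ones(4)]
ragged = [np.zeros(4), np.ones(6)]

# Equal lengths: vstack produces an ordinary 2-D float array.
stacked = np.vstack(same_size)
print(stacked.dtype, stacked.shape)  # float64 (2, 4)

# Different lengths: the arrays do not fit a rectangular shape, so the only
# container NumPy can offer is an array of Python objects.
obj = np.empty(len(ragged), dtype=object)
for i, arr in enumerate(ragged):
    obj[i] = arr
print(obj.dtype)  # object
```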
From what I understand of the answer given there, I must save the arrays as separate datasets. But looking at the example snippet given, for k,v in adict.items() and grp.create_dataset(k,data=v), k should be the name of the dataset, correct? Like test_data in my example? And what is v?
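For reference, this is how I read that snippet, written out as a minimal sketch. The dictionary adict, the names file1 and file2, the group name data1, and the in-memory file are all assumptions made for illustration. In this reading, k is the dataset name and v is the NumPy array stored under that name, so each array gets its own dataset and the lengths no longer have to match.

```python
import h5py
import numpy as np

# Hypothetical stand-ins for arrays loaded from CSV files of different lengths.
adict = {"file1": np.arange(4.0), "file2": np.arange(6.0)}

# driver="core" with backing_store=False keeps the file in memory only,
# so this sketch does not write anything to disk.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as h5f:
    grp = h5f.create_group("data1")
    for k, v in adict.items():
        # k: name of the dataset inside the group, v: the array to store
        grp.create_dataset(k, data=v)

    names = sorted(grp.keys())
    shapes = {name: grp[name].shape for name in names}
    file2 = h5f["data1/file2"][:]  # read one dataset back as a NumPy array

print(names)
print(shapes)
```

With this layout a dataset is addressed as "group/name", for example h5f["data1/file2"].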
Below is what the result of vstack (that is, stack) looks like:
[[array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])array([-0.07812, -0.07812, -0.11719, ..., -0.07812, -0.07812, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([ 0.03906, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.11719, 0. ])array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])array([-0.15625, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])array([-0.11719, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.15625, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([-0.07812, -0.07812, -0.07812, ..., -0.11719, -0.07812, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([-0.07812, -0.11719, -0.07812, ..., -0.07812, -0.07812, 0. ])array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])array([ 0.07812, 0.03906, 0.07812, ..., 0.03906, 0.07812, 0. ])array([ 0.03906, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.11719, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. 
])][ array([ 10.9375 , 10.97656, 10.97656, ..., 11.05469, 11.05469, 1. ])array([ 11.01562, 11.01562, 11.01562, ..., 11.09375, 11.09375, 1. ])array([ 11.09375, 11.09375, 11.09375, ..., 11.09375, 11.09375, 1. ])array([ 10.97656, 11.01562, 11.01562, ..., 11.13281, 11.09375, 1. ])array([ 11.05469, 11.05469, 11.01562, ..., 11.09375, 11.09375, 1. ])array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.13281, 1. ])array([ 11.05469, 11.09375, 11.09375, ..., 11.09375, 11.09375, 1. ])array([ 11.09375, 11.05469, 11.09375, ..., 11.05469, 11.05469, 1. ])array([ 11.05469, 11.05469, 11.05469, ..., 11.09375, 11.09375, 1. ])array([ 11.05469, 11.05469, 11.09375, ..., 11.05469, 11.05469, 1. ])array([ 10.97656, 10.97656, 10.97656, ..., 11.05469, 11.05469, 1. ])array([ 11.09375, 11.05469, 11.09375, ..., 11.09375, 11.09375, 1. ])array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])array([ 11.05469, 11.05469, 11.05469, ..., 11.09375, 11.17188, 1. ])array([ 11.09375, 11.09375, 11.09375, ..., 10.97656, 11.09375, 1. ])array([ 11.09375, 11.09375, 11.09375, ..., 11.05469, 11.05469, 1. ])array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])array([ 11.05469, 11.01562, 11.05469, ..., 11.01562, 11.01562, 1. ])array([ 10.78125, 10.78125, 10.78125, ..., 11.05469, 11.05469, 1. ])array([ 11.13281, 11.09375, 11.13281, ..., 11.09375, 11.09375, 1. ])array([ 11.13281, 11.09375, 11.09375, ..., 11.05469, 11.05469, 1. ])array([ 10.97656, 10.97656, 10.9375 , ..., 11.05469, 11.05469, 1. ])array([ 11.05469, 11.09375, 11.05469, ..., 11.09375, 11.09375, 1. ])array([ 10.9375 , 10.9375 , 10.9375 , ..., 11.09375, 11.09375, 1. ])array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])array([ 10.9375 , 10.89844, 10.9375 , ..., 11.05469, 11.09375, 1. ])array([ 10.9375 , 10.97656, 10.97656, ..., 11.05469, 11.05469, 1. ])array([ 10.89844, 10.89844, 10.89844, ..., 11.05469, 11.09375, 1. 
])array([ 11.05469, 11.05469, 11.05469, ..., 11.01562, 11.01562, 1. ])]]
Thank you for your help and explanation.
Update
I solved the problem by using pandas. At first I used the exact suggestion by Pierre de Buyl, but it gave me an error when I tried to read the dataset back. I tried test_data = h5f["data1/file1"][:]. This gave me an error saying: Unable to open object (Object 'file1' does not exist).
I checked by reading 2test.h5 using pandas.read_hdf, and it showed that the file was empty. I searched online for another solution and found this. I have already modified it:
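For anyone who hits the same "does not exist" error, one way to check which names are actually stored in a file is to walk it with visit. This is only a sketch: the layout below (run_a, run_b, and the in-memory file) is invented for illustration and is not the content of my 2test.h5.

```python
import h5py
import numpy as np

# In-memory file with a hypothetical layout; nothing is written to disk.
with h5py.File("names_sketch.h5", "w", driver="core", backing_store=False) as h5f:
    # Intermediate groups ("data1", "data2") are created automatically.
    h5f.create_dataset("data1/run_a", data=np.arange(3.0))
    h5f.create_dataset("data2/run_b", data=np.arange(5.0))

    # visit() calls the function with the path of every group and dataset.
    found = []
    h5f.visit(found.append)

    # A membership test avoids the exception when a name is missing.
    exists = "data1/file1" in h5f

print(sorted(found))
print(exists)
```

If the name being read is not in the list, the dataset was stored under a different name or was never written.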
import numpy as np
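import glob
import pandas as pd

path = "/home/ling/test/"

def runtest():
    data1 = [np.loadtxt(file) for file in glob.glob(path + "data1/*.csv")]
    data2 = [np.loadtxt(file) for file in glob.glob(path + "data2/*.csv")]
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    # pd.concat replaces df1.append(df2, ...), which was removed in pandas 2.0
    combine = pd.concat([df1, df2], ignore_index=True)
    # sort the NaN to the left; result_type="broadcast" keeps the DataFrame
    # shape on newer pandas, where a list result would otherwise become a Series
    combinedf = combine.apply(lambda x: sorted(x, key=pd.notnull), axis=1, result_type="broadcast")
    combinedf.to_hdf('/home/ling/test/2test.h5', key='twodata')

runtest()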
For reading, I simply use
input_data = pd.read_hdf('2test.h5', 'twodata')
read_input = input_data.values
read1 = read_input[:, -1]  # read/get last column for example