I'm trying to create a .h5 file with a dataset that contains the data from a .dat file. I first approached this using numpy:
import numpy as np
import h5py
filename = 'VAL220408-invparms.dat'
datasetname = 'EM27_104_COCCON_VAL/220408'
dtvec = [float for i in range(149)]  # My data file has 149 columns
dtvec[1] = str
dtvec[2] = str  # Set the dtype of the second and third columns to str
dataset = np.genfromtxt(filename,skip_header=0,names=True,dtype=dtvec)
fh5 = h5py.File('my_data.h5', 'w')
fh5.create_dataset(datasetname,data=dataset)
fh5.flush()
fh5.close()
But when I run it I get the error:
TypeError: No conversion path for dtype: dtype('<U')
If I don't specify the dtype everything is fine: the dataset is in order and the numerical values are correct, except that the second and third columns contain only NaN, which I don't want.
I found that h5py does not support NumPy's Unicode string dtype, so I supposed that using a pandas DataFrame would work. My code using pandas looks like this:
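The NaN values appear to be what np.genfromtxt returns for any field it cannot parse as a float, which is its default dtype. A minimal sketch of that behaviour, assuming a made-up three-column sample (the column names and values are hypothetical, not taken from the real .dat file):

```python
import io
import numpy as np

# Hypothetical stand-in for the .dat file: one float column and two text columns.
sample = io.StringIO(
    "value site time\n"
    "1.5 VAL 12:00:00\n"
    "2.5 VAL 12:01:00\n"
)

# With names=True and no dtype, every column is parsed as float,
# so text fields that cannot be converted become NaN.
data = np.genfromtxt(sample, names=True)

print(data.dtype.names)  # field names taken from the header line
print(data["value"])     # the numeric column is read correctly
print(data["site"])      # the text column comes back as NaN
```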
import numpy as np
import pandas as pd
import h5py
filename = 'VAL220408-invparms.dat'
datasetname = 'EM27_104_COCCON_VAL/220408'
df = pd.read_csv(filename,header=0,sep=r"\s+")  # raw string avoids an invalid-escape warning
fh5 = h5py.File('my_data.h5', 'w')
fh5.create_dataset(datasetname,data=df)
fh5.flush()
fh5.close()
But then I get the error:
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
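As far as I can tell, this happens because a DataFrame that mixes float and text columns is converted to a single NumPy array of generic Python objects before h5py stores it. A small sketch of that conversion, again with hypothetical column names and values:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-in for the .dat file: one float column and two text columns.
sample = io.StringIO(
    "value site time\n"
    "1.5 VAL 12:00:00\n"
    "2.5 VAL 12:01:00\n"
)
df = pd.read_csv(sample, header=0, sep=r"\s+")

# A mixed float/text frame collapses into one 2-D array of Python objects,
# which is the dtype('O') named in the error message.
arr = np.asarray(df)
print(arr.dtype)
print(arr.shape)
```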
Then I found that pandas has a function that writes a dataframe to a .h5 file, so instead of using the h5py library I used:
df.to_hdf('my_data.h5','datasetname',format='table',mode='a')
BUT the data is all messed up in many tables inside the .h5 file. 😫
I would really like some help getting the second and third columns stored as what they really are: strings.
I'm using Python 3.8
Thank you very much for reading.
I just figured it out.
The h5py docs say to declare strings with h5py's own string dtype:
h5py.string_dtype(encoding='utf-8', length=None)
So in my first piece of code I replaced the two str assignments with:
dtvec[1] = h5py.string_dtype(encoding='utf-8', length=None)
dtvec[2] = h5py.string_dtype(encoding='utf-8', length=None)
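To show the idea end to end without the original file, here is a minimal sketch that builds a small structured array by hand instead of calling np.genfromtxt. The three field names and the sample rows are hypothetical, and the file is kept in memory (driver="core" with backing_store=False) so nothing is written to disk:

```python
import h5py
import numpy as np

# Variable-length UTF-8 string type that h5py knows how to store
str_dt = h5py.string_dtype(encoding="utf-8", length=None)

# Hypothetical three-column layout; the real file has 149 columns
row_dt = np.dtype([("value", np.float64), ("site", str_dt), ("time", str_dt)])
rows = np.array(
    [(1.5, "VAL", "12:00:00"), (2.5, "VAL", "12:01:00")],
    dtype=row_dt,
)

# In-memory HDF5 file so the sketch leaves no files behind
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as fh5:
    dset = fh5.create_dataset("EM27_104_COCCON_VAL/220408", data=rows)
    stored = dset[()]  # read the whole dataset back as a structured array


def as_text(item):
    # Depending on the h5py version, strings may come back as bytes or str
    return item.decode("utf-8") if isinstance(item, bytes) else item


sites = [as_text(s) for s in stored["site"]]
times = [as_text(t) for t in stored["time"]]
print(sites)
print(times)
```

Reading the string fields back may give bytes rather than str, which is why the sketch decodes them before printing.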
Hope this is helpful to someone reading this question.