python pandas numpy h5py genfromtxt

h5py doesn't support NumPy dtype('U') (Unicode) and pandas doesn't support NumPy dtype('O')


I'm trying to create a .h5 file with a dataset that contains the data from a .dat file. First, I tried this with NumPy:

import numpy as np
import h5py

filename = 'VAL220408-invparms.dat'
datasetname = 'EM27_104_COCCON_VAL/220408'

dtvec = [float for i in range(149)]  # my data file has 149 columns
dtvec[1] = str
dtvec[2] = str  # the second and third columns hold strings

dataset = np.genfromtxt(filename,skip_header=0,names=True,dtype=dtvec)

fh5 = h5py.File('my_data.h5', 'w')
fh5.create_dataset(datasetname,data=dataset)
fh5.flush()
fh5.close()

But when I run it, I get this error:

TypeError: No conversion path for dtype: dtype('<U')

If I don't specify the dtype, the file is written without errors: the dataset is in order and the numerical values are correct, but the second and third columns contain only NaN, which is not what I want.

I found that h5py does not support NumPy's Unicode string dtype, so I assumed that using a pandas DataFrame would work. My pandas code looks like this:

import numpy as np
import pandas as pd
import h5py

filename = 'VAL220408-invparms.dat'
datasetname = 'EM27_104_COCCON_VAL/220408'

df = pd.read_csv(filename, header=0, sep=r"\s+")  # raw string avoids an invalid-escape warning

fh5 = h5py.File('my_data.h5', 'w')
fh5.create_dataset(datasetname,data=df)
fh5.flush()
fh5.close()

But then I get the error:

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Then I found that pandas has a method that writes a DataFrame to a .h5 file, so instead of using the h5py library I did:

df.to_hdf('my_data.h5','datasetname',format='table',mode='a')

But the data ends up scattered across many tables inside the .h5 file. 😫

I would really like some help getting the second and third columns stored as what they really are: strings.

I'm using Python 3.8

Thank you very much for reading.


Solution

  • I just figured it out.

    The h5py docs say to declare strings with h5py's own string dtype:

    h5py.string_dtype(encoding='utf-8', length=None)
    

    So in my first piece of code I replaced the two str entries with:

    dtvec[1] = h5py.string_dtype(encoding='utf-8', length=None)
    dtvec[2] = h5py.string_dtype(encoding='utf-8', length=None)
    

    Hope this is helpful to someone reading this question.
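
For anyone who wants to try the idea without the original .dat file, here is a minimal sketch of the same approach on a small structured array built by hand. The four column names and the two rows are made up for illustration (the real file has 149 columns), and the file goes to a temporary directory. It does not use genfromtxt; it only shows that a compound dataset whose string fields use h5py.string_dtype can be written and read back.

```python
import os
import tempfile

import h5py
import numpy as np

# Variable-length UTF-8 string type that h5py knows how to store.
str_dt = h5py.string_dtype(encoding='utf-8', length=None)

# Hypothetical 4-column layout standing in for the real 149 columns.
row_dt = np.dtype([
    ('JulianDate', 'f8'),
    ('UTdate', str_dt),   # second column: string
    ('UTtime', str_dt),   # third column: string
    ('XCO2', 'f8'),
])

# Made-up rows, for illustration only.
rows = np.array(
    [(2459677.9, '220408', '10:15:00', 417.2),
     (2459678.0, '220408', '10:16:00', 417.5)],
    dtype=row_dt,
)

path = os.path.join(tempfile.mkdtemp(), 'my_data.h5')
datasetname = 'EM27_104_COCCON_VAL/220408'

# The context manager flushes and closes the file on exit.
with h5py.File(path, 'w') as fh5:
    fh5.create_dataset(datasetname, data=rows)

# Read the whole compound dataset back into a NumPy structured array.
with h5py.File(path, 'r') as fh5:
    stored = fh5[datasetname][()]


def as_text(value):
    """Return str whether h5py handed back bytes or str for a string field."""
    return value.decode('utf-8') if isinstance(value, bytes) else value


dates = [as_text(v) for v in stored['UTdate']]
times = [as_text(v) for v in stored['UTtime']]
print(dates, times)
```

Depending on the h5py version, the string fields may come back as bytes rather than str, which is why the sketch decodes them before printing.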