python matlab hdf5

Loading hdf5 matlab strings into Python


I'm running into trouble reading an HDF5 MATLAB 7.3 file with Python. I'm using h5py 2.0.1.

I can read all the matrices that are stored in the file, but I cannot read a list of strings. h5py shows the strings as a dataset of shape (1, 894) with type |O4. This dataset contains object references, which I tried to dereference using the h5file[obj_ref] syntax.

This yields something like dataset "FFb": shape (4, 1) type "<u2". I interpreted that as an array of four characters, which seems to be the ASCII representation of the string.

Is there an easy way to get the strings out?

Is there any package providing matlab to python hdf5 support?


Solution

  • I assume you mean it is a cell array of strings in MATLAB? This output looks normal: the dataset is an array of objects (|O4 is the NumPy object datatype). Each object is an array of 2-byte integers (<u2 is the NumPy little-endian unsigned 2-byte integer datatype). h5py has no way of knowing that the dataset is a cell array of strings; it could just as well be a cell array of arbitrary 16-bit integers.

    The easiest way to get the strings out would be to use a list comprehension with unichr to convert the character codes, like this:

    strlist = [u''.join(unichr(c) for c in h5file[obj_ref]) for obj_ref in dataset]
    

    What this does is iterate over the dataset (for obj_ref in dataset) to create a new list. For each object reference, it dereferences the object (h5file[obj_ref]) to get an array of integers. It converts each integer into a character (unichr(c)) and joins those characters together into a Unicode string (u''.join()).

    Note that this produces a list of Unicode strings. If you are absolutely sure that every string contains only ASCII characters, you can replace u'' with '' and unichr with chr.
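    On Python 3, unichr no longer exists and chr produces Unicode directly, so the same decoding step simplifies. A minimal sketch of just the conversion, using plain lists of 16-bit character codes to stand in for the dereferenced `<u2` arrays (the values here are illustrative):

    ```python
    # Stand-ins for the results of h5file[obj_ref][:]: each entry is a
    # sequence of 16-bit character codes, since MATLAB stores chars as uint16.
    deref = [[70, 70, 98], [104, 105]]  # codes for "FFb" and "hi"

    # Python 3: chr() maps a code point straight to a Unicode character,
    # so no unichr is needed.
    strlist = ["".join(chr(c) for c in codes) for codes in deref]

    print(strlist)  # ['FFb', 'hi']
    ```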

    Caveat: I don't have h5py; this post is based on my experiences with MATLAB and NumPy. You may need to adjust the syntax or iteration order to suit your dataset.
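    To see the whole pipeline end to end, here is a hedged round-trip sketch that builds a small HDF5 file mimicking the layout described above (a dataset of object references, each pointing to an (n, 1) array of uint16 character codes) and then decodes it. The file path, dataset names ("names", "str0", ...), and string contents are all made up for the demo; only the reference-dereferencing pattern reflects what the question reports:

    ```python
    import os
    import tempfile

    import h5py
    import numpy as np

    words = ["FFb", "spam", "eggs"]
    path = os.path.join(tempfile.mkdtemp(), "demo_cellstr.h5")

    # Write: one (n, 1) uint16 dataset per string, plus a (1, k) dataset
    # of object references pointing at them.
    with h5py.File(path, "w") as f:
        refs = [f.create_dataset("str%d" % i,
                                 data=np.array([[ord(c)] for c in w], dtype="<u2")).ref
                for i, w in enumerate(words)]
        names = f.create_dataset("names", (1, len(words)), dtype=h5py.ref_dtype)
        for i, r in enumerate(refs):
            names[0, i] = r

    # Read: dereference each reference, flatten the (n, 1) char array,
    # and join the 16-bit codes into a Python string.
    with h5py.File(path, "r") as f:
        strlist = ["".join(chr(int(c)) for c in f[ref][:].flatten())
                   for ref in f["names"][0]]

    print(strlist)  # ['FFb', 'spam', 'eggs']
    ```

    Note the indexing into row 0 of the reference dataset (f["names"][0]) and the flatten() call: both account for the (1, 894) and (4, 1) shapes reported in the question.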