Tags: python, pandas, numpy-ndarray, mat-file

Extract data fields from inconsistent MATLAB data formats into pandas dataframe


Is there a uniform way in Python to comb through the field names of .mat files and extract the corresponding data, regardless of how the data is laid out?

These .mat files include:

I don't necessarily know n in advance, but I know n is consistent across the variables (field names) extracted.

I want to create a function that takes m field names and a .mat file path and returns an n x m pandas DataFrame.

My code so far:

import scipy.io as sio
import numpy as np
import pandas as pd

def mat2df(mat_file, var_list): 
    #mat_file is a file path and var_list is a list of strings corresponding to structure field names
    df = pd.DataFrame()
    surface_mat = sio.loadmat(mat_file)
    for i in list(surface_mat):
        if "__" not in i and "readme" not in i: #strip away top dict layer of mat file
            mat = surface_mat[i] #mat is an ndarray
            if mat.dtype.names is not None: #if mat is a 1Xn structure
                for j in mat.dtype.names:
                    if j in var_list: #if variable is named by user    
                        karray = np.reshape(np.transpose(mat[j]),(-1))
                        #append dataframe column
                        df[j] = pd.Series(karray, index=range(len(karray)),dtype=mat[j][0][0].dtype)

            elif mat[0][0].dtype.names is not None: # if mat is a 1Xn cell array of 1X1 structures
                for j in mat[0][0].dtype.names:
                    if j in var_list: #if variable is named by user
                        karray = np.array([])
                        for k in range(len(mat[0])):
                            karray = np.append(karray,mat[0][k][j][0][0])
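                        #append dataframe column; the dtype is taken from the first cell's field,
                        #because mat is an object array here and cannot be indexed by field name
                        df[j] = pd.Series(karray, index=range(len(karray)),dtype=mat[0][0][j][0][0].dtype)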

            else: #any other layout (deeper nesting, plain arrays) is not handled yet
                raise NotImplementedError(
                    "Current MATLAB data format not yet supported.\n"
                    "Current support covers structures and cell arrays of structures."
                )
    return df

This code only covers the first two .mat file types listed. Is there an approach that avoids writing a new if statement for each possible layer of nested structure or cell array?


Solution

  • I made a Python package that does this for me. I hope anyone else stumbling upon this question can use it too.
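
For readers who want to see the general idea, here is a minimal sketch of one recursive approach. It is not the package mentioned above, and the names collect_fields and mat2df_recursive are made up for illustration. It assumes loadmat is called with its default settings, so that structs arrive as NumPy arrays with named fields and cell arrays arrive as object arrays. The walker descends through any mix of the two until it meets a requested field name, so no separate branch is needed per nesting depth.

```python
import numpy as np
import pandas as pd
import scipy.io as sio


def collect_fields(obj, wanted, out):
    """Walk nested loadmat output and gather the values of the wanted fields.

    `out` maps each field name to a list of 1-D arrays found so far.
    (Hypothetical helper, written for this sketch only.)
    """
    if isinstance(obj, dict):
        # Top-level dict returned by loadmat: skip metadata keys like __header__.
        for key, value in obj.items():
            if not key.startswith("__"):
                collect_fields(value, wanted, out)
    elif isinstance(obj, np.ndarray) and obj.dtype.names is not None:
        # MATLAB struct (or struct array): one entry per field name.
        for name in obj.dtype.names:
            if name in wanted:
                # Column-major order mirrors how MATLAB linearises arrays.
                for element in obj[name].ravel(order="F"):
                    out.setdefault(name, []).append(np.ravel(element))
            else:
                collect_fields(obj[name], wanted, out)
    elif isinstance(obj, np.ndarray) and obj.dtype == object:
        # MATLAB cell array, or the object array that holds a struct field.
        for element in obj.ravel(order="F"):
            collect_fields(element, wanted, out)
    # Plain numeric or character arrays carry no field names, so they are ignored.


def mat2df_recursive(mat_file, var_list):
    """Return a DataFrame with one column per requested field that was found."""
    out = {}
    collect_fields(sio.loadmat(mat_file), set(var_list), out)
    # pd.Series pads with NaN if the columns end up with unequal lengths.
    return pd.DataFrame(
        {name: pd.Series(np.concatenate(out[name])) for name in var_list if name in out}
    )
```

Calling mat2df_recursive(path, ["depth", "temp"]) would then return one column per field that was found, in the order given. Fields that never appear in the file are left out rather than raising, which may or may not be what you want.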