I'm trying to combine two arrays: one of shape (5000, 2) containing integers and one of shape (5000, 7) containing floats. I need to assign names to act as column headers when the data is written to HDF5 (.h5). However, when I try to assign names and data types, every column in the array gets repeated 9 times.
My code is as follows:
namesList = ['EID', 'Domain', 'LAM ID1', 'LAM ID2','LAM ID3', 'LAM ID4', 'LAM ID5', 'LAM ID6', 'LAM ID7']
formatsList = ['int', 'int', 'float', 'float', 'float', 'float', 'float', 'float', 'float']
ds_dt = np.dtype({'names':namesList, 'formats':formatsList})
Final_Lam_Strength = np.concatenate((LAM_Strength_RFs_Data, LAM_Strength_RFs), axis=1).astype(ds_dt)
Thanks
Load data directly to HDF5
If your goal is to load the data to HDF5 with h5py, there's no need to duplicate the data in another array. You can do it directly by creating the dataset then adding the data. The procedure is shown below with some simple data I created:
import h5py
import numpy as np

namesList = ['EID', 'Domain', 'LAM ID1', 'LAM ID2','LAM ID3', 'LAM ID4', 'LAM ID5', 'LAM ID6', 'LAM ID7']
formatsList = ['int', 'int', 'float', 'float', 'float', 'float', 'float', 'float', 'float']
ds_dt = np.dtype({'names':namesList, 'formats':formatsList})

# the simple data
nrows, nints, nfloats = 5,2,7
LAM_Strength_RFs_Data = np.arange(nrows*nints).reshape(nrows,nints)
LAM_Strength_RFs = np.arange(nrows*nfloats).reshape(nrows,nfloats)

# create the dataset with the compound dtype, then fill it one field (column) at a time
with h5py.File('SO_77346149.h5', 'w') as h5f:
    ds = h5f.create_dataset('Final_Lam_Strength',shape=(nrows,),dtype=ds_dt)
    for i in range(nints):
        ds[namesList[i]] = LAM_Strength_RFs_Data[:,i]
    for i in range(nfloats):
        ds[namesList[i+2]] = LAM_Strength_RFs[:,i]
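To check that the field names really ended up as column headers, the file can be read back. The sketch below repeats the field-by-field write in a temporary directory and then inspects the dataset. The file name `demo.h5` and the variable names `int_data` and `float_data` are placeholders chosen for this illustration, not part of the original post.

```python
import os
import tempfile

import h5py
import numpy as np

namesList = ['EID', 'Domain'] + [f'LAM ID{i}' for i in range(1, 8)]
formatsList = ['int', 'int'] + ['float'] * 7
ds_dt = np.dtype({'names': namesList, 'formats': formatsList})

# small stand-in data (placeholder names)
nrows, nints, nfloats = 5, 2, 7
int_data = np.arange(nrows * nints).reshape(nrows, nints)
float_data = np.arange(nrows * nfloats, dtype=float).reshape(nrows, nfloats)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo.h5')  # placeholder file name

    # write one field (column) at a time, as in the code above
    with h5py.File(path, 'w') as h5f:
        ds = h5f.create_dataset('Final_Lam_Strength', shape=(nrows,), dtype=ds_dt)
        for i in range(nints):
            ds[namesList[i]] = int_data[:, i]
        for i in range(nfloats):
            ds[namesList[nints + i]] = float_data[:, i]

    # read the whole dataset back as a NumPy structured array
    with h5py.File(path, 'r') as h5f:
        read_back = h5f['Final_Lam_Strength'][:]

print(read_back.dtype.names)   # field names stored with the dataset
print(read_back['Domain'])     # one column, selected by name
```

The dataset has one row per record and nine named fields, rather than a 2-D block of values.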
Create NumPy structured array
Now, if you really need a NumPy array, create it with np.empty(), setting the shape to the number of rows and the dtype to ds_dt. Then load the data using the named fields and column references.
This continues with the data from the example above:
Final_Lam_Strength = np.empty(shape=(nrows,),dtype=ds_dt)
print(Final_Lam_Strength.dtype, Final_Lam_Strength.shape)
for i in range(nints):
    Final_Lam_Strength[namesList[i]] = LAM_Strength_RFs_Data[:,i]
for i in range(nfloats):
    Final_Lam_Strength[namesList[i+2]] = LAM_Strength_RFs[:,i]
print(Final_Lam_Strength[0]) # first row
print(Final_Lam_Strength[-1]) # last row
print(Final_Lam_Strength['Domain']) # 'Domain' column
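For context on the behaviour described in the question: calling `.astype(ds_dt)` on a plain 2-D array does not map columns to fields. NumPy casts each element separately and assigns its value to all nine fields, so the result keeps its `(nrows, 9)` shape and every value appears nine times. As an alternative to the loops above (this is not part of the original answer), `numpy.lib.recfunctions.unstructured_to_structured` maps the last axis onto the fields instead. A minimal sketch, assuming the integer IDs are small enough to survive the round trip through float64 that `np.concatenate` introduces; `int_data` and `float_data` are placeholder names:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

namesList = ['EID', 'Domain'] + [f'LAM ID{i}' for i in range(1, 8)]
formatsList = ['int', 'int'] + ['float'] * 7
ds_dt = np.dtype({'names': namesList, 'formats': formatsList})

nrows, nints, nfloats = 5, 2, 7
int_data = np.arange(nrows * nints).reshape(nrows, nints)
float_data = np.arange(nrows * nfloats, dtype=float).reshape(nrows, nfloats)

# concatenating int and float arrays upcasts everything to float64
combined = np.concatenate((int_data, float_data), axis=1)

# what the question did: each element becomes a full 9-field record
repeated = combined.astype(ds_dt)
print(repeated.shape)    # (5, 9): still 2-D
print(repeated[0, 1])    # one value copied into every field

# column-to-field mapping: the last axis (9 columns) becomes the 9 fields
structured = rfn.unstructured_to_structured(combined, dtype=ds_dt)
print(structured.shape)  # (5,)
print(structured[0])     # first row, one value per field
```

Filling the fields one by one, as in the answer, avoids the intermediate float64 copy, which matters if the integer columns hold values that float64 cannot represent exactly.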
Create NumPy record array
The function np.core.records.fromarrays (also exposed under the public name np.rec.fromarrays) is mentioned in the comments above. For completeness, here is an alternative approach that uses it. You do NOT need to create the empty array before calling this function.
# np.rec.fromarrays is the public name for np.core.records.fromarrays
arrayList = [LAM_Strength_RFs_Data[:,0], LAM_Strength_RFs_Data[:,1]] + \
            [LAM_Strength_RFs[:,i] for i in range(nfloats)]
Final_Lam_Strength = np.rec.fromarrays(arrayList, dtype=ds_dt)
print(Final_Lam_Strength.dtype, Final_Lam_Strength.shape)
print(Final_Lam_Strength[0]) # first row
print(Final_Lam_Strength[-1]) # last row
print(Final_Lam_Strength.Domain) # access 'Domain' column by attribute name
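Once the data is in a structured or record array, h5py can also create the dataset straight from it, taking the compound dtype (and therefore the field names) from the array. The sketch below is one way to do that, using a temporary file; the file name `demo_onecall.h5` and the variable names are placeholders for this illustration.

```python
import os
import tempfile

import h5py
import numpy as np

namesList = ['EID', 'Domain'] + [f'LAM ID{i}' for i in range(1, 8)]
formatsList = ['int', 'int'] + ['float'] * 7
ds_dt = np.dtype({'names': namesList, 'formats': formatsList})

nrows, nints, nfloats = 5, 2, 7
int_data = np.arange(nrows * nints).reshape(nrows, nints)
float_data = np.arange(nrows * nfloats, dtype=float).reshape(nrows, nfloats)

# build the record array from a list of 1-D column arrays
columns = [int_data[:, i] for i in range(nints)] + \
          [float_data[:, i] for i in range(nfloats)]
rec = np.rec.fromarrays(columns, dtype=ds_dt)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo_onecall.h5')  # placeholder file name

    # shape and dtype are inferred from the array passed as data
    with h5py.File(path, 'w') as h5f:
        h5f.create_dataset('Final_Lam_Strength', data=rec)

    with h5py.File(path, 'r') as h5f:
        stored_names = h5f['Final_Lam_Strength'].dtype.names
        stored_eid = h5f['Final_Lam_Strength']['EID']  # read a single field

print(stored_names)
print(stored_eid)
```

This route holds the full array in memory before writing, so it trades the memory savings of the direct approach for a shorter write step.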
Notes on structured vs record arrays
The first NumPy method creates a structured array, and the second creates a record array. They are similar but not identical: record arrays also allow access to a field (column) by attribute (dot notation). The example print statements show the difference. Also, np.core.records.fromarrays requires an intermediate data structure (the list of arrays). This won't be a problem with your data, but could be an issue if you have a lot of data (say 10E6 rows and 200 columns).
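The two array types can also be converted into each other without copying, since a record array is a view class on top of the same structured data. A small sketch with made-up values (the three-field dtype is a shortened stand-in for the full one). Note that dot notation only works for field names that are valid Python identifiers, so a field such as 'LAM ID1' still needs bracket notation.

```python
import numpy as np

# shortened stand-in dtype with made-up values
dt = np.dtype({'names': ['EID', 'Domain', 'LAM ID1'],
               'formats': ['int', 'int', 'float']})
structured = np.zeros(3, dtype=dt)
structured['EID'] = [10, 20, 30]
structured['LAM ID1'] = [0.5, 1.5, 2.5]

# view the same memory through the record-array interface
rec = structured.view(np.recarray)
print(rec.EID)           # dot notation works on the record array
print(rec['LAM ID1'])    # names with spaces still need brackets

# writing through the view changes the original structured array
rec.Domain[:] = 7
print(structured['Domain'])
```

Because `view` shares memory, this gives attribute access on an existing structured array without the list of intermediate column arrays.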