r rhdf5

How to read multiple files and create a single data frame from them in R?


Objective

I have 100 hdf5 files in a folder. For a reproducible example let's consider only 2 files, namely:

> list.files(pattern="*.hdf5")
[1] "Cars_20160601_01.hdf5" "Cars_20160601_02.hdf5"  

Each hdf5 file contains 2 groups, data and frame. I want to extract 2 objects from the data group: VDS_Veh_Speed and VDS_Chassis_CG_Position. The frame group contains 3 objects, of which only the object frame is relevant.
I want to read these files and extract the relevant variables described above.

What I tried:

# Create a list of all the hdf5 files
temp = list.files(pattern="*.hdf5")

# Read all files and create data frames from each using the file name as df name
for (i in unique(temp)){
  data <- h5read(file = i, name = "data") # ED data
  frame <- h5read(file = i, name = "frame") # Frame numbers
  ED <- data.frame(frames = frame$frame, 
                   speed.kph.ED = round(data$VDS_Veh_Speed*1.46667*0.3048*3.6,2),
                   pedal_pos = data$CFS_Accelerator_Pedal_Position)#fps

  df <- h5read(file = i, name = "data/VDS_Chassis_CG_Position")
  df <- as.data.frame(df)
  colnames(df) <- c("y", "x", "z")
  df$speed <- ED$speed.kph.ED 
  df$pedal_pos <- ED$pedal_pos
  df$file.ID <- i
  assign(i, df)
}  

Now, because all the data frames are in the global environment, I removed the extra objects and kept only the new dfs:

# Remove extra objects
rm(data, df, ED, frame, i, temp)

Finally, I made a list of the dfs in the environment and then created a single data frame:

DF_obj <- lapply(ls(), get)
fdc <- do.call("rbind", DF_obj)   

This works for me. But, as mentioned in the comments, assign should be avoided. Also, I have to manually use rm(), without which this code won't work. Is there any way to avoid assign in this context?

If you need the data files, here is the link to the 2 mentioned above: https://1drv.ms/f/s!AsMFpkDhWcnw6g7StJp9dzZ-nCr4


Solution

  • The answer is basically the same as your code, with a couple of minor changes. We use a list and assign to its elements with ordinary assignment (<-), rather than using assign() to create data frames in your global environment. This avoids potential bugs and name clashes, and removes the need for extensive clean-up.

    library(rhdf5)  # provides h5read()

    # pattern is a regular expression, not a glob
    temp = list.files(pattern = "\\.hdf5$")
    df_list = list()  # initialize a list
    
    # Read all files into a list of data frames
    for (i in unique(temp)){
      data <- h5read(file = i, name = "data") # ED data
      frame <- h5read(file = i, name = "frame") # Frame numbers
      ED <- data.frame(frames = frame$frame, 
                       speed.kph.ED = round(data$VDS_Veh_Speed*1.46667*0.3048*3.6,2),
                       pedal_pos = data$CFS_Accelerator_Pedal_Position)#fps
    
      df <- h5read(file = i, name = "data/VDS_Chassis_CG_Position")
      df <- as.data.frame(df)
      colnames(df) <- c("y", "x", "z")
      df$speed <- ED$speed.kph.ED 
      df$pedal_pos <- ED$pedal_pos
    
      # assign to the list. We can take care of the id cols automatically
      df_list[[i]] <- df
    } 
    
    # df_list is already named by file, because it was indexed by file name above
    fdc <- data.table::rbindlist(df_list, idcol = "file.ID")
    

    Using data.table::rbindlist is generally faster than do.call(rbind, ...), and with idcol it builds the ID column for us from the names of the list.
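
The same named-list pattern can also be written without an explicit loop. The sketch below is only an illustration: read_one() is a hypothetical helper that returns toy data in place of the h5read() steps, so it runs without the hdf5 files. In real use its body would be the per-file reading code from the loop.

```r
library(data.table)

# Hypothetical stand-in for the per-file h5read() steps:
# returns a small data frame shaped like the one built in the loop.
read_one <- function(file) {
  data.frame(y = c(1, 2), x = c(3, 4), z = c(5, 6),
             speed = c(50.0, 51.5), pedal_pos = c(0.1, 0.2))
}

files <- c("Cars_20160601_01.hdf5", "Cars_20160601_02.hdf5")

# lapply() returns one data frame per file; setNames() attaches the file names
df_list <- setNames(lapply(files, read_one), files)

# idcol turns the list names into a file.ID column
fdc <- rbindlist(df_list, idcol = "file.ID")
```

Because nothing is created in the global environment except df_list and fdc, there is no rm() step and no reliance on ls().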