rimportrbind

How to import/rbind multiple files with different indexed columns?


I have multiple files that have the same column variables (in wide format). But, they are indexed with different numbers at the end of the column names. The problem is that I can't import the files at once and rbind simultaneously.

Is it possible to streamline the following process:

  1. Import the files as a list.
  2. Use pivot_longer() to change from wide to long, names_pattern() to remove the index of the column names and into a new column variable.
  3. rbind into a single data frame?

Below is where I am at. It works, but for column names that are the same. Any solutions would be greatly appreciated.

library(tidyr)

subdir <- "01_Unprocessed" #name of folder in subdirectory  
files <- list.files(path = subdir, pattern = "hobo.csv", full.names = T)
names(files) <- tools::file_path_sans_ext(basename(files))
hobo <- do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = F))) #where column names are the same

#change from wide to long and remove index from column names into a new column variable. 
hobo <- pivot_longer(hobo, 
             cols = -date.time,
             names_pattern = "([^_]+)\\_([^.]+)$",
             names_to = c('variable', 'id'),
             values_to = 'value')

Sample of two files with different column indexes.

#df1 
dput(df1)
structure(list(date.time = "11/25/24 09:00:00", wind.speed_20524321 = 0L, 
    gust.speed_20524321 = 0L, wind.direction_20524321 = 44L, 
    wind.speed_20524319 = 0L, gust.speed_20524319 = 0L, wind.direction_20524319 = 44L), class = "data.frame", row.names = c(NA, 
-1L))

#df2
dput(df2)
structure(list(date.time = "11/25/24 09:00:00", wind.speed_20524300 = 0L, 
    gust.speed_20524300 = 0L, wind.direction_20524300 = 44L, 
    wind.speed_20524319 = 0L, gust.speed_20524319 = 0L, wind.direction_20524319 = 44L), class = "data.frame", row.names = c(NA, 
-1L))

UPDATE

I have up to 40 files and would like to find an alternative solution to make this process faster. The expected output is to have one single dataframe with four columns:

  1. Date and time
  2. ID (the column index number)
  3. Variables names (wind speed, gust speed, and wind direction)
  4. Value

UPDATE Files are 70-80MB in size.


Solution

  • This has been asked many times. The solution is to use the names_sep argument with c('.value', 'id') so it knows you have multiple sets of columns to pivot.

    library(tidyr)
    
    rbind(pivot_longer(df1, 
                       cols = -date.time,
                       names_sep="_",
                       names_to = c('.value', 'id')),
          pivot_longer(df2, 
                       cols = -date.time,
                       names_sep="_",
                       names_to = c('.value', 'id')))
    

    Gives:

    # A tibble: 4 × 5
      date.time         id       wind.speed gust.speed wind.direction
      <chr>             <chr>         <int>      <int>          <int>
    1 11/25/24 09:00:00 20524321          0          0             44
    2 11/25/24 09:00:00 20524319          0          0             44
    3 11/25/24 09:00:00 20524300          0          0             44
    4 11/25/24 09:00:00 20524319          0          0             44
    

    And for brevity, in case you have many data frames to rbind, you can use do.call with lapply:

    do.call(rbind, lapply(list(df1, df2), pivot_longer, 
                          cols = -date.time,
                          names_sep="_",
                          names_to = c('.value', 'id')))
    

    So all you need to do is add the other data frames into your list above.

    And if your data frames are all in the form df1, df2, df3, ..., df40, then you can capture them with mget:

    df_list <- mget(ls(pattern="df\\d+"))
    

    And replace list(df1, df2) above with df_list