I have multiple files that have the same column variables (in wide format). But, they are indexed with different numbers at the end of the column names. The problem is that I can't import the files at once and rbind
simultaneously.
Is it possible to streamline the following process:
pivot_longer()
to change from wide to long, names_pattern()
to remove the index of the column names and into a new column variable.rbind
into a single data frame?Below is where I am at. It works, but for column names that are the same. Any solutions would be greatly appreciated.
library(tidyr)
subdir <- "01_Unprocessed" #name of folder in subdirectory
files <- list.files(path = subdir, pattern = "hobo.csv", full.names = T)
names(files) <- tools::file_path_sans_ext(basename(files))
hobo <- do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = F))) #where column names are the same
#change from wide to long and remove index from column names into a new column variable.
hobo <- pivot_longer(hobo,
cols = -date.time,
names_pattern = "([^_]+)\\_([^.]+)$",
names_to = c('variable', 'id'),
values_to = 'value')
Sample of two files with different column indexes.
#df1
dput(df1)
structure(list(date.time = "11/25/24 09:00:00", wind.speed_20524321 = 0L,
gust.speed_20524321 = 0L, wind.direction_20524321 = 44L,
wind.speed_20524319 = 0L, gust.speed_20524319 = 0L, wind.direction_20524319 = 44L), class = "data.frame", row.names = c(NA,
-1L))
#df2
dput(df2)
structure(list(date.time = "11/25/24 09:00:00", wind.speed_20524300 = 0L,
gust.speed_20524300 = 0L, wind.direction_20524300 = 44L,
wind.speed_20524319 = 0L, gust.speed_20524319 = 0L, wind.direction_20524319 = 44L), class = "data.frame", row.names = c(NA,
-1L))
UPDATE
I have up to 40 files and would like to find an alternative solution to make this process faster. The expected output is to have one single dataframe with four columns:
UPDATE Files are 70-80MB in size.
This has been asked many times. The solution is to use the names_sep
argument with c('.value', 'id')
so it knows you have multiple sets of columns to pivot.
library(tidyr)
rbind(pivot_longer(df1,
cols = -date.time,
names_sep="_",
names_to = c('.value', 'id')),
pivot_longer(df2,
cols = -date.time,
names_sep="_",
names_to = c('.value', 'id')))
Gives:
# A tibble: 4 × 5
date.time id wind.speed gust.speed wind.direction
<chr> <chr> <int> <int> <int>
1 11/25/24 09:00:00 20524321 0 0 44
2 11/25/24 09:00:00 20524319 0 0 44
3 11/25/24 09:00:00 20524300 0 0 44
4 11/25/24 09:00:00 20524319 0 0 44
And for brevity, in case you have many data frames to rbind, you can use do.call
with lapply
:
do.call(rbind, lapply(list(df1, df2), pivot_longer,
cols = -date.time,
names_sep="_",
names_to = c('.value', 'id')))
So all you need to do is add the other data frames into your list above.
And if your data frames are all in the form df1, df2, df3, ..., df40, then you can capture them with mget
:
df_list <- mget(ls(pattern="df\\d+"))
And replace list(df1, df2)
above with df_list