rdata.table

combining multiple .csv files and retaining folder name


I know this has been done before but I'm struggling to get it to work with my code. I have multiple .csv files with different headings/number of columns. The columns are in the same order and I really just need the first 3 columns and then I need to just replace the column headings so they are the same.

y_path   <- "Y:\\Public\\xx\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\"
all_csv  <- list.files(path = y_path, pattern = '.csv', full.names = TRUE)
open_csv <- lapply(all_csv, \(x) fread(x,skip=2)) 
one_df <- data.table::rbindlist(open_csv, fill=TRUE, idcol='names(all_csv)') 
one_df
df1<-one_df %>% select(1:3)
df1
colnames(df1) <- c("numb","datetime","temp") #change column names

output:

        numb             datetime   temp
       <int>               <char>  <num>
    1:     1      6/30/2014 13:37 17.272
    2:     2      6/30/2014 14:37 18.438
    3:     3      6/30/2014 15:37 18.152
    4:     4      6/30/2014 16:37 18.176
    5:     5      6/30/2014 17:37 18.200
   ---                                  
46230:   915 07/29/24 01:41:52 PM 11.722
46231:   916 07/29/24 02:41:52 PM 12.401
46232:   917 07/29/24 03:41:52 PM 13.269
46233:   918 07/29/24 04:10:09 PM     NA
46234:   919 07/29/24 04:10:12 PM     NA

The problem is I really want to include the path because ideally I will grab all these files from within the WaterTemp folder instead of individually running code for each subfolder separately. Once I have the pathway I should be able to pull the folder level (site) name from it. I'm not sure how with the way I have coded this to include the pathway name in a new column.

What I want

        numb             datetime   temp Site
       <int>               <char>  <num> <char>
    1:     1      6/30/2014 13:37 17.272 Y:\\Public\\Fisher Group\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\
    2:     2      6/30/2014 14:37 18.438 Y:\\Public\\Fisher Group\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\
    3:     3      6/30/2014 15:37 18.152 Y:\\Public\\Fisher Group\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\
    4:     4      6/30/2014 16:37 18.176 Y:\\Public\\Fisher Group\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\
    5:     5      6/30/2014 17:37 18.200 Y:\\Public\\Fisher Group\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\
   ---                                  
46230:   915 07/29/24 01:41:52 PM 11.722 Y:\\Public\\Fisher Group\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\
46231:   916 07/29/24 02:41:52 PM 12.401 Y:\\Public\\Fisher Group\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\
46232:   917 07/29/24 03:41:52 PM 13.269 Y:\\Public\\Fisher Group\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\
46233:   918 07/29/24 04:10:09 PM     NA Y:\\Public\\Fisher Group\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\
46234:   919 07/29/24 04:10:12 PM     NA Y:\\Public\\Fisher Group\\DataLoggers\\WaterTemp\\DarkCanyon\\DC Trib 1A\\

Thank you, Amanda


Solution

  • You need to change the column names before using rbindlist. Further, idcol= is great, but unless I'm missing something, your all_csv is not named, so the idcol is going to capture the index within the list, not the filename itself.

    Perhaps something like this:

    one_df <- lapply(
      setNames(nm = all_csv),
      \(x) setnames(fread(x, skip = 2, select = 1:3), c("numb","datetime","temp")
    ) |>
      rbindlist(idcol = "names(all_csv)")
    

    I'm using setNames(nm=x) as short-hand of setNames(x,nm=x), once the list/vector has names, lapply retains them.

    Edit (slight code-golf), motivated by Friede:

    one_df <- lapply(
      setNames(nm = all_csv),
      \(x) fread(x, skip = 2, col.names = c("numb", "datetime", "temp"), select = 1:3)
    ) |>
      rbindlist(idcol = "names(all_csv)")