
read_delim for multiple files with different number of columns


I am trying to read multiple text files with read_delim. However, the files differ in how many columns they have, and I am only interested in a few columns that are common to all of them.

When I specify the columns with col_select, read_delim still throws an error saying the files have different numbers of columns. Here is a minimal example:

> df = read_delim(c('file1.txt', 'file2.txt'), col_select = 1)
Error: Files must all have 3 columns:
* File 2 has 2 columns

Reading a single file, by contrast, works and returns only the first column:

> df = read_delim('file1.txt', col_select = 1)
New names:
• `test2` -> `test2...2`
• `test2` -> `test2...3`
Rows: 1 Columns: 1
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (1): test1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Content of file1.txt:

test1 test2 test3
1 2 3

Content of file2.txt:

test1 test2
1 2

Does anyone know how to read in text files that differ in their number of columns?


Solution

  • read_delim appears to check that all files have the same number of columns and errors before column selection happens, so you likely need to read each file separately and bind the results:

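    library(readr)
    library(purrr)  # list_rbind() needs purrr >= 1.0.0
    
    # Name each path after itself so it can become the file_id column
    set_names(c('file1.txt', 'file2.txt')) %>%
      # Read each file on its own, keeping only its first column
      map(read_delim, col_select = 1, show_col_types = FALSE) %>%
      # Stack the results; the list names go into file_id
      list_rbind(names_to = "file_id")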
    
    # A tibble: 2 × 2
      file_id   test1
      <chr>     <dbl>
    1 file1.txt     1
    2 file2.txt     1
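
If you need several of the shared columns rather than just the first, the same per-file approach should also work with column names. Below is a minimal sketch, assuming the files are tab-delimited as the question's output suggests. The temporary files and the names tmp, f1, f2, files, tables and combined are made up for illustration and only mirror file1.txt and file2.txt from the question.

```r
library(readr)
library(purrr)

# Hypothetical stand-ins for file1.txt and file2.txt, written to a temp folder
tmp <- tempfile("read_delim_demo_")
dir.create(tmp)
f1 <- file.path(tmp, "file1.txt")
f2 <- file.path(tmp, "file2.txt")
writeLines(c("test1\ttest2\ttest3", "1\t2\t3"), f1)
writeLines(c("test1\ttest2", "1\t2"), f2)

# Name each path by its file name so list_rbind() can record the source
files <- set_names(c(f1, f2), c("file1.txt", "file2.txt"))

# Read each file separately, selecting the shared columns by name
tables <- map(files, function(f) {
  read_delim(f, delim = "\t", col_select = c(test1, test2), show_col_types = FALSE)
})

# Stack the per-file tibbles; file_id holds the list names
combined <- list_rbind(tables, names_to = "file_id")
combined
```

Because each file is read on its own, the cross-file column-count check should not come into play, and col_select only has to be valid for each individual file.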