r csv readr

How to read specific columns of a CSV when given the header as a vector


I have a large CSV file without a header row; the header is available to me as a separate vector. I want to use a subset of the file's columns without loading the entire file. The required columns are given by name in another vector.

Edit: in this case, the column names provided in the header list are important. This MRE only has 4 column names, but the solution should work for a large dataset with pre-specified column names. The catch is that the column names are only provided externally, not as a header in the CSV file.

1,2,3,4
5,6,7,8
9,10,11,12
header <- c("A", "B", "C", "D")
subset <- c("D", "B")

So far I have been reading the data in the following manner, which gets me the result I want, but loads the entire file first.

# Setup

library(readr)

write.table(
  structure(list(V1 = c(1L, 5L, 9L), V2 = c(2L, 6L, 10L), V3 = c(3L, 7L, 11L), V4 = c(4L, 8L, 12L)), class = "data.frame", row.names = c(NA, -3L)),
  file="sample-data.csv",
  row.names=FALSE,
  col.names=FALSE,
  sep=","
)

header <- c("A", "B", "C", "D")
subset <- c("D", "B")

# Current approach

df1 <- read_csv(
  "sample-data.csv",
  col_names = header
)[subset]

df1
# A tibble: 3 × 2
      D     B
  <dbl> <dbl>
1     4     2
2     8     6
3    12    10

How can I get the same result without loading the entire file first?

Solution

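  • You can use readr::read_csv with the col_names and col_select arguments: col_names supplies the external header, and col_select keeps only the named columns, in the order given.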

    header <- c("A", "B", "C", "D")
    subset <- c("D", "B")
    
    readr::read_csv("sample-data.csv",
                    col_names = header,
                    col_select = any_of(subset))
    
    # # A tibble: 3 × 2
    #       D     B
    #   <dbl> <dbl>
    # 1     4     2
    # 2     8     6
    # 3    12    10
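
As a quick sanity check, the sketch below compares the col_select approach with the read-everything-then-subset approach from the question. It assumes readr 2.x (where col_select is available) and writes the sample rows to a temporary file, so the `path`, `selected` and `full` names are illustrative only and not part of the original post.

```r
library(readr)

# Recreate the headerless sample file in a temporary location (illustrative path)
path <- tempfile(fileext = ".csv")
writeLines(c("1,2,3,4", "5,6,7,8", "9,10,11,12"), path)

header <- c("A", "B", "C", "D")
subset <- c("D", "B")

# Name the columns from the external header and select at read time
selected <- read_csv(
  path,
  col_names = header,
  col_select = any_of(subset),
  show_col_types = FALSE
)

# Reference result: read every column, then subset afterwards
full <- read_csv(path, col_names = header, show_col_types = FALSE)[subset]

# Both approaches should give the same columns, in the requested order
stopifnot(identical(names(selected), subset))
stopifnot(identical(names(full), subset))
stopifnot(all(selected$D == full$D), all(selected$B == full$B))
```

Note that any_of() silently skips names that are not in the header. If a missing column should be treated as an error instead, all_of(subset) is the stricter tidyselect helper.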