Tags: r, database, duckdb

Working with a large CSV file using duckdb or arrow in R


I know nothing about databases, duckdb, or arrow, which is likely why I am struggling. I have a large CSV file (7.5 GB) on my hard drive. In R, without loading it into memory, I want to:

  1. Extract column names
  2. Select a subset of columns and rows

And then load this subset into R.

How can I do this? I am struggling to piece it together from what I found online.


Solution

  • install.packages("pacman")
    pacman::p_load(duckdb, arrow, dplyr, dbplyr)

    path_to_csv <- "c:/path/to/file.csv"

    # Open the local csv file as a dataset (this does not read the data into memory)
    database <- arrow::open_dataset(sources = path_to_csv, format = "csv") %>%
       arrow::to_duckdb()

    # Final dataset: filter the rows first, then keep only the columns needed;
    # items_to_keep is a placeholder for a vector of values to match
    clean_loaded_data <- database %>%
       filter(item %in% items_to_keep) %>%
       select(column_1) %>%
       collect()
    

    Here collect() is what instructs R to run the query and load the resulting subset into memory as a data frame; everything before it is evaluated lazily. The sources argument of open_dataset() can also take a vector of paths to several .csv files, as sketched below.
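
    For example, a minimal sketch with several files (the file names here are hypothetical); provided the files share the same columns, arrow treats them as one dataset and the same filter/select/collect pipeline works unchanged:

    csv_files <- c("c:/path/to/file_2019.csv",   # hypothetical paths
                   "c:/path/to/file_2020.csv")
    database <- arrow::open_dataset(sources = csv_files, format = "csv") %>%
       arrow::to_duckdb()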
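
    To get just the column names (item 1 in the question) without reading any rows, inspect the arrow dataset before converting it to duckdb; arrow only scans enough of the file to infer the schema. A minimal sketch, reusing path_to_csv from above:

    ds <- arrow::open_dataset(sources = path_to_csv, format = "csv")
    names(ds)    # column names only; no data is loaded
    ds$schema    # column names together with their inferred types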
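
    As an alternative sketch (not part of the original answer), you can also stay entirely in duckdb and pull the same subset with plain SQL through DBI; duckdb streams the CSV, so memory use stays small. The filter values below are hypothetical:

    pacman::p_load(DBI, duckdb)
    con <- DBI::dbConnect(duckdb::duckdb())
    clean_loaded_data <- DBI::dbGetQuery(con, "
       SELECT column_1
       FROM read_csv_auto('c:/path/to/file.csv')
       WHERE item IN ('value_a', 'value_b')
    ")
    DBI::dbDisconnect(con, shutdown = TRUE)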