Tags: r, database, duckdb

Working with a large CSV file using duckdb or arrow in R


I know nothing about databases, duckdb, or arrow, which is likely why I am struggling. I have a large CSV file (7.5 GB) on my hard drive. In R, without loading it into memory, I want to:

  1. Extract column names
  2. Select a subset of columns and rows

And then load this subset into R.

How can I do this? I am struggling to piece it together from what I found online.


Solution

  • install.packages("pacman")
    pacman::p_load(duckdb, arrow, dplyr, dbplyr)

    path_to_csv <- "c:/path/to/file.csv"

    # Open the local csv file as a dataset (this does not read the data into memory)
    database <- arrow::open_dataset(sources = path_to_csv, format = "csv") %>%
       arrow::to_duckdb()

    # Final dataset: filter the rows first, then keep only the columns needed;
    # items_to_keep is a placeholder for a vector of values to match
    clean_loaded_data <- database %>%
       filter(item %in% items_to_keep) %>%
       select(column_1) %>%
       collect()
    

    Here collect() is what instructs R to run the query and load the resulting subset into memory as a data frame; everything before it is evaluated lazily. The sources argument of open_dataset() can also take a vector of paths to several .csv files, as sketched below.
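
    For example, a minimal sketch with several files (the file names here are hypothetical); provided the files share the same columns, arrow treats them as one dataset and the same filter/select/collect pipeline works unchanged:

    csv_files <- c("c:/path/to/file_2019.csv",   # hypothetical paths
                   "c:/path/to/file_2020.csv")
    database <- arrow::open_dataset(sources = csv_files, format = "csv") %>%
       arrow::to_duckdb()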
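
    To get just the column names (item 1 in the question) without reading any rows, inspect the arrow dataset before converting it to duckdb; arrow only scans enough of the file to infer the schema. A minimal sketch, reusing path_to_csv from above:

    ds <- arrow::open_dataset(sources = path_to_csv, format = "csv")
    names(ds)    # column names only; no data is loaded
    ds$schema    # column names together with their inferred types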
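
    As an alternative sketch (not part of the original answer), you can also stay entirely in duckdb and pull the same subset with plain SQL through DBI; duckdb streams the CSV, so memory use stays small. The filter values below are hypothetical:

    pacman::p_load(DBI, duckdb)
    con <- DBI::dbConnect(duckdb::duckdb())
    clean_loaded_data <- DBI::dbGetQuery(con, "
       SELECT column_1
       FROM read_csv_auto('c:/path/to/file.csv')
       WHERE item IN ('value_a', 'value_b')
    ")
    DBI::dbDisconnect(con, shutdown = TRUE)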