I know nothing about databases, duckdb, or arrow---hence the struggle, likely. I have a large CSV file (7.5 GB) on my hard drive. In R, without loading the file into memory, I want to keep a single column and filter the rows to those whose values appear in a given list, and then load this subset dataset into R.
How can I do this? I am struggling to piece it together from what I found online.
install.packages("pacman")
pacman::p_load(duckdb, arrow, dplyr, dbplyr)  # dplyr provides %>%, select(), filter(), collect()
path_to_csv <- "c:/path/to/file.csv"
# Point duckdb at the local csv file via arrow (this does not read the data into memory)
database <- arrow::open_dataset(source = path_to_csv, format = "csv") %>%
  arrow::to_duckdb()
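At this point nothing has been read from disk; database is only a lazy reference to the file. If you want a quick sanity check, a minimal sketch is to pull just the first few rows:

# Optional: collect only the first 6 rows to confirm the dataset is readable
database %>%
  head() %>%
  collect()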
# Final dataset: filter first, then select, so the filtering column is still available
clean_loaded_data <- database %>%
  filter(item %in% list) %>%  # keep only rows whose item value appears in `list`
  select(column_1) %>%        # keep only the column of interest
  collect()
where collect() instructs R to read the subset dataset into memory as a regular data frame. The source argument can also be a vector of .csv file paths, as sketched below.
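For instance, a minimal sketch of the multi-file case, assuming several CSVs with the same columns sit in a hypothetical folder c:/path/to/csv_folder/:

# Gather the paths of every .csv in the (assumed) folder
csv_paths <- list.files("c:/path/to/csv_folder", pattern = "\\.csv$", full.names = TRUE)

# Register them all as a single dataset; the files are still not read into memory
database_multi <- arrow::open_dataset(source = csv_paths, format = "csv") %>%
  arrow::to_duckdb()

From here the same filter / select / collect() pipeline applies.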