I am new to working with databases and have been taking an approach that uses packages {duckdb}, {arrow} and {dbplyr}. I am doing some joins and would like to know the number of rows of the csv files I am working with. E.g.:
```r
db1 <- arrow::open_dataset(source = paste0(path, "data1.csv"), format = "csv") %>%
  arrow::to_duckdb()

db2 <- arrow::open_dataset(source = paste0(path, "data2.csv"), format = "csv") %>%
  arrow::to_duckdb()

innerjoin_data <- inner_join(db1, db2, by = "id") %>%
  collect()
```
Now, how do I get the number of rows of the respective csv of db1 and db2? How do you do the diagnostics to check how many did not match and why, when working with large datasets that you would rather not load in the machine (as in this case)?
As @r2evans suggests, R has an `nrow()` function for this purpose. It is the standard option when working with data in memory; however, it may not work as expected on remote/database tables. Hence, if you want to do this starting from your current setup, I would use `ungroup()` and `summarise()` as follows:
```r
db1 %>%
  ungroup() %>%
  summarise(num = n())
```
And repeat for `db2` and `innerjoin_data`.
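As for diagnosing how many rows did not match and why: you can do this without collecting either table by using `anti_join()`, which keeps the rows of the left table that have no match on the right. dbplyr translates it to SQL, so the work happens inside DuckDB. A sketch, assuming the same `db1`/`db2` handles as above:

```r
# Count rows of db1 whose "id" has no match in db2
# (the count is computed in DuckDB, not in R)
db1 %>%
  anti_join(db2, by = "id") %>%
  summarise(num_unmatched = n())

# Pull a small sample of the unmatched rows into R to inspect *why*
# they failed to match (e.g. stray whitespace or type differences in id)
db1 %>%
  anti_join(db2, by = "id") %>%
  head(10) %>%
  collect()
```

Swap the arguments (`anti_join(db2, db1, by = "id")`) to count the non-matches on the other side.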
This is the dbplyr equivalent of the common SQL query:

```sql
SELECT COUNT(*) AS num
FROM db1
```
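If you want to verify the translation yourself, `show_query()` prints the SQL that dbplyr generates for any lazy table, which is also a useful sanity check before running an expensive query:

```r
# Print the SQL that DuckDB will actually execute for the row count
db1 %>%
  ungroup() %>%
  summarise(num = n()) %>%
  dplyr::show_query()
```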