rdbplyrduckdb

Check dimensions of database object using duckdb and arrow


I am new to working with databases and have been taking an approach that uses packages {duckdb}, {arrow} and {dbplyr}. I am doing some joins and would like to know the number of rows of the csv files I am working with. E.g.:

db1 <- arrow::open_dataset(source = paste0(path, "data1.csv"), format = "csv") %>%
   arrow::to_duckdb()

db2 <- arrow::open_dataset(source = paste0(path, "data2.csv"), format = "csv") %>%
   arrow::to_duckdb()

innerjoin_data <- inner_join(bd1, bd2, by = "id") %>%
   collect()

Now, how do I get the number of rows of the respective csv of db1 and db2? How do you do the diagnostics to check how many did not match and why, when working with large datasets that you would rather not load in the machine (as in this case)?


Solution

  • As @r2evans suggests, R has an nrow function for this purpose. This is the standard option when working with data in memory. However, it may not work on remote/database tables as expected.

    Hence, if you want to do this starting from your current setup, I would use ungroup and summarise as follows:

    db1 %>%
      ungroup() %>%
      summarise(num = n())
    

    And repeat for db2 and innerjoin_data.

    This is the dbplyr equivalent of the common SQL query:

    SELECT COUNT(*) AS num
    FROM db1