rdatabaseappenddbi

Append new rows / data into data base if it's not there in R


I want to create a table in DWH to build historical information about something, this table get their data from live data base, the update in my table will be only in column in my table if this happened i want to append this new raw to my data in DWH.

# write the data in the DHW for the first time and then append the new data:

dbWriteTable(DWH, Id(schema = "public", table = "test_df"), customer, row.names=FALSE, overwrite=TRUE)


# i use this method to append the data but it append all data even if it is there:
dbAppendTable( DWH, "test_df", CUSTOMER,overwrite=TRUE)

What i want is to append only the new data to my table in the DWH not all the data. Is there a way to do this??


Solution

  • There are probably much more efficient ways to do this, but they will depend on your DBMS and privileges.

    A simple approach would be:

    library(dplyr)
    library(DBI)
    
    # read the table
    old_rows <- dbReadTable(DWH, "test_df")
    
    # remove rows from CUSTOMER which are already on the table
    rows_to_load <- CUSTOMER %>%
      anti_join(old_rows)
    
    # load the new rows
    dbAppendTable(DWH, "test_df", rows_to_load)
    

    The anti_join here will check for rows which are already on the data (i.e. all the variables match with your CUSTOMER data). If you want to check for specific columns/IDs to identify duplicates then use something like:

    rows_to_load <- CUSTOMER %>%
      anti_join(old_rows, by = join_by(column1, column2, ....))
    

    NOTE: if your database table is large, then this is not a very efficient approach as it reads the whole table into memory. But it should work regardless of which DBMS you are using.

    Approach using a temporary table in the database

    This should be more efficient but how it works may depend on which DBMS you are using. It works for me on SQL Server.

    This involves writing your dataframe into a temporary table in the database, then running SQL code to insert only the new rows into the main table. Make sure your data frame has the same columns and the same column order as the database table. If the temporary data is in any way different from the data in test_df then it will be added.

    dbWriteTable(DWH, "#temptable", CUSTOMER, temporary=TRUE, overwrite = TRUE)
    
    dbExecute(DWH, "insert into test_df select * from #temptable except select * from test_df")
    

    If you have a criteria for identifying duplicates other than checking if the entire record matches (e.g. if you have unique ID numbers), then you can replace the dbExecute() with something like the below. Just replace the where clause with the appropriate expression.

    dbExecute(DWH, "insert into test_df select * from #temptable t where not exists (select 1 from test_df d where t.ID = d.ID)")