rdatabasedplyrdbplyr

dbplyr: delete row from a table in database


What is the dbplyr verbs combination that is equivalent to DBI::dbSendQuery(con, "DELETE FROM <table> WHERE <condition>").

What I want is not querying data from database, but removing data from and updating a table in database.

I want to do it in a dplyr way, but I am not sure if it is possible. I could not find anything similar in the package reference.


Solution

  • dbplyr translates dplyr commands to query database tables. I am not aware of any inbuilt way to modify existing database tables using pure dbplyr.

    This is likely a design choice.

    The DBI package is probably your best choice for modifying the database

    This is the approach I use for all my dbplyr work. Often a custom function that takes the query produced by dbplyr translation and inserting it into a DBI call (you can see examples of this in my dbplyr helpers GitHub repo).

    Two approaches to consider for this: (1) an anti-join (on all columns) followed by writing a new table, (2) the DELETE FROM syntax.

    Mock up of anti-join approach

    records_to_remove = remote_table %>%
      filter(conditions)
    
    desired_final_table = remote_table %>%
      anti_join(records_to_remove, by = colnames(remote_table))
    
    query = paste0("SELECT * INTO output_table FROM (",
                   sql_render(desired_final_table),
                   ") AS subquery")
    
    DBI::dbExecute(db_con, as.character(query))
    

    Mock up of DELETE FROM syntax

    records_to_remove = remote_table %>%
      filter(conditions)
    
    query = sql_render(records_to_remove) %>%
      as.character() %>%
      gsub(search_term = "SELECT *", replacement_term = "DELETE")
    
    DBI::dbExecute(db_con, query)
    

    If you plan to run these queries multiple times, then wrapping them in a function, with checks for validity would be recommended.

    For some use cases deleting rows will not be necessary.

    You could think of the filter command in R as deleting rows from a table. For example in R we might run:

    prepared_table = input_table %>%
      filter(colX == 1) %>%
      select(colA, colB, colZ)
    

    And think of this as deleting rows where colX == 1 before producing output:

    output = prepared_table %>%
      group_by(colA) %>%
      summarise(sumZ = sum(colZ))
    

    (Or you could use an anti-join above instead of a filter.)

    But for this type of deleting, you do not need to edit the source data, as you can just filter out the unwanted rows at runtime every time. Yes it will make your database query larger, but this is normal for working with databases.

    So combining the preparation and output in SQL is normal (something like this):

    SELECT colA, SUM(colZ) AS sumZ
    FROM (
    
      SELECT colA, colB, colZ
      FROM input_table
      WHERE colX = 1
    
    ) AS prepared_table
    GROUP BY colA
    

    So unless you need to modify the database, I would recommend filtering instead of deleting.