Tags: sql, r, mariadb, dbi, rmariadb

Is there an efficient way to update MariaDB rows from a data frame in R?


I am collecting live data from a website into a data frame in R. New rows can appear, and existing rows can share unique IDs with rows already stored. I want to send this dynamic data frame to a MariaDB table so that rows whose unique IDs already exist update the columns I specify, while rows with new unique IDs are inserted as new rows. I can get this to work with MariaDB's INSERT ... ON DUPLICATE KEY UPDATE statement and a function that generates the needed values from the dynamic data frame.

MWE:

install.packages("odbc")
install.packages("RMariaDB")
library(odbc)
library(RMariaDB)

con <- dbConnect(RMariaDB::MariaDB(),
                 host = Sys.getenv('MARIADB_DBHOST'),
                 port = Sys.getenv('MARIADB_DBPORT'),
                 username = Sys.getenv('MARIADB_DBUSER'),
                 password = Sys.getenv('MARIADB_DBPW'),
                 dbname = Sys.getenv('MARIADB_DBNAME'),
                 timeout = 5)

# Data for the database table, so the MWE works.
db_live <- data.frame(id = c(12, 22, 32),
                      car_name = c("rolls royce","nissan","mercedes benz"),
                      km = c(123,100,150), temp = c(78,60,80))


# Get table from database, id column is unique index.
db_live <- dbReadTable(con, "db_live")
print(db_live)
  id      car_name  km temp
1 12   rolls royce 123   78
2 22        nissan 100   60
3 32 mercedes benz 150   80

# Build dynamic dataframe
df_live <- data.frame(id = c(12, 22, 32, 42),
                      car_name = c("rolls royce","nissan","mercedes benz", "aston martin"),
                      km = c(140,120,200,40), temp = c(81,65,85,50))

print(df_live)
  id      car_name  km temp
1 12   rolls royce 140   81
2 22        nissan 120   65
3 32 mercedes benz 200   85
4 42  aston martin  40   50

# Create a function that generates a string of values for dbSendQuery.
gen_insert_values <- function(df) {
  df_string <- character(0)
  for (i in seq_len(nrow(df))) {
    # Values within each (...) tuple must be comma-separated for valid SQL.
    row_string <- paste(df[i, 1], paste0("'", df[i, 2], "'"), df[i, 3], df[i, 4],
                        sep = ", ")
    df_string <- c(df_string, paste0("(", row_string, ")"))
  }
  paste(df_string, collapse = ", ")
}

values <- gen_insert_values(df_live)

print(values)
"(12, 'rolls royce', 140, 81), (22, 'nissan', 120, 65), (32, 'mercedes benz', 200, 85), (42, 'aston martin', 40, 50)"

# Send query.
res <- dbSendQuery(con, paste0(
  "INSERT INTO db_live (id,car_name,km,temp) VALUES ", values,
  " ON DUPLICATE KEY UPDATE km = VALUES(km), temp = VALUES(temp);"))
dbClearResult(res)

# Check the db table after the query.

new_db_live <- dbReadTable(con, "db_live")

print(new_db_live)
  id      car_name  km temp
1 12   rolls royce 140   81
2 22        nissan 120   65
3 32 mercedes benz 200   85
4 42  aston martin  40   50

This does not seem very efficient: I have to change both the query and the function whenever I want to update more columns, and the for loop in my function can slow the script down.
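For illustration, both concerns can be addressed with vectorized `paste()`. The sketch below uses a hypothetical helper name (`gen_upsert_query`, not from the original post) and assumes the first column of the data frame is the unique key and that character columns need single-quoting:

```r
# Hypothetical generic builder: handles any number of columns and avoids
# the per-row for loop by using vectorized paste().
# Assumes the first column of 'df' is the unique key. Character columns
# are single-quoted -- note this does NOT escape embedded quotes; for
# untrusted data, prefer DBI::dbQuoteString() or parameterized queries.
gen_upsert_query <- function(df, table) {
  vals <- lapply(df, function(col) {
    if (is.character(col) || is.factor(col)) paste0("'", col, "'") else col
  })
  tuples <- paste0("(", do.call(paste, c(vals, sep = ", ")), ")")
  updates <- paste0(names(df)[-1], " = VALUES(", names(df)[-1], ")")
  paste0("INSERT INTO ", table, " (", paste(names(df), collapse = ","),
         ") VALUES ", paste(tuples, collapse = ", "),
         " ON DUPLICATE KEY UPDATE ", paste(updates, collapse = ", "), ";")
}
```

With this, `dbExecute(con, gen_upsert_query(df_live, "db_live"))` would update every non-id column without editing the function or the query by hand.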

Is there a more efficient way to solve this problem?


Solution

  • Here's a method that is likely more efficient: use a temporary table instead of manually encoding the data as a string of (a,b,c),(a,b,c) tuples.

    For the sake of a complete demonstration, I've modified the df_live data slightly so that we have one row with no change, one row with updated data, and one row that is new. This process also works unaltered with your original df_live; I just wanted to highlight the three modes.

    Technically, though, the "no change" row does update the database, but it is not obvious. If the table has a "lastmodified" field that updates with the current timestamp when something in the row is updated, then you can see a little more of what is happening.

    In fact, I'll add (just for demonstration) two fields: created and modified, which show when the row was first created and when the last update occurred. These are not required for normal UPSERTs.

    Setup

    This section should not be necessary unless your table lacks a primary key (in which case, add one).

    I'll name the main table "mydata", and upload the db_live dataset into it. I believe (without extensive testing) that MariaDB requires UPSERTs to find the duplicate or conflict rows based on pre-existing keys. This means we'll need to set a (primary) key; I'll assume your table already has this (and show how I do it with manually-uploaded data).

    db <- DBI::dbConnect(RMariaDB::MariaDB(), ...)
    db_live <- data.frame(id = c(12, 22, 32),
                          car_name = c("rolls royce","nissan","mercedes benz"),
                          km = c(123,100,150), temp = c(78,60,80))
    df_live <- data.frame(id = c(12, 22, 42),
                          car_name = c("rolls royce","nissan","aston martin"),
                          km = c(140,100,40), temp = c(81,60,50))
    df_live
    #   id     car_name  km temp
    # 1 12  rolls royce 140   81   # updated
    # 2 22       nissan 100   60   # no change
    # 3 42 aston martin  40   50   # new data
    
    DBI::dbWriteTable(db, "mydata", db_live)
    DBI::dbExecute(db, "alter table mydata add primary key (id)")
    # [1] 0
    DBI::dbExecute(db, "
      alter table mydata
        add column created timestamp not null default CURRENT_TIMESTAMP,
        add column modified timestamp null default null")
    # [1] 0
    DBI::dbExecute(db, "
      create trigger updatemodified_mydata
      before update on mydata
      for each row set NEW.modified = CURRENT_TIMESTAMP")
    # [1] 0
    
    DBI::dbGetQuery(db, "select * from mydata")
    #   id      car_name  km temp             created modified
    # 1 12   rolls royce 123   78 2021-06-29 19:11:00     <NA>
    # 2 22        nissan 100   60 2021-06-29 19:11:00     <NA>
    # 3 32 mercedes benz 150   80 2021-06-29 19:11:00     <NA>
    

    If you don't have a primary key on the main table mydata, then the "UPSERT" operation will simply insert (add) all rows without updating. I don't know if there's a way to trick mariadb into faking keys for the purpose of correctly triggering your intended "update if exists" logic.
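    One way to guard against that silently happening is to check for a primary key up front. This is a hypothetical helper (not part of the original answer); it queries `information_schema.table_constraints`, which is standard in MariaDB/MySQL:

    ```r
    # Hypothetical guard: verify the target table has a primary key before
    # attempting the UPSERT, so a missing key fails loudly instead of
    # silently inserting duplicates.
    pk_check_sql <- function(table) {
      paste0("select count(*) as n from information_schema.table_constraints",
             " where table_schema = database()",
             " and table_name = '", table, "'",
             " and constraint_type = 'PRIMARY KEY'")
    }
    has_primary_key <- function(db, table) {
      DBI::dbGetQuery(db, pk_check_sql(table))$n > 0
    }
    # if (!has_primary_key(db, "mydata")) stop("add a primary key to mydata first")
    ```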

    UPSERT

    We'll use a temporary table so that the data to be upserted will not persist; this is good for several reasons, and if you do it correctly then your DBAs will thank you :-)

    (If you aren't familiar with temp tables ... they are not visible to other users on the db, often not visible to different connections for the same user, and will be dropped when the connection is closed.)

    DBI::dbCreateTable(db, "mytemp", df_live, temporary = TRUE)
    DBI::dbWriteTable(db, "mytemp", df_live, append = TRUE, create = FALSE)
    DBI::dbGetQuery(db, "select * from mytemp")
    #   id     car_name  km temp
    # 1 12  rolls royce 140   81
    # 2 22       nissan 100   60
    # 3 42 aston martin  40   50
    
    DBI::dbExecute(db, "
      insert into mydata (id,car_name,km,temp)
      select id,car_name,km,temp from mytemp
      on duplicate key update km=VALUES(km), temp=VALUES(temp);")
    # [1] 5
    
    DBI::dbGetQuery(db, "select * from mydata")
    #   id      car_name  km temp             created            modified
    # 1 12   rolls royce 140   81 2021-06-29 19:11:00 2021-06-29 19:11:27
    # 2 22        nissan 100   60 2021-06-29 19:11:00 2021-06-29 19:11:27
    # 3 32 mercedes benz 150   80 2021-06-29 19:11:00                <NA>
    # 4 42  aston martin  40   50 2021-06-29 19:11:27                <NA>
    

    Note that even though the values for "nissan" were not different, the row was still updated, as evidenced by the modified timestamp. The changed row, "rolls royce", shows the appropriate modified time. The "mercedes benz" row was uploaded the first time and never updated, and the "aston martin" row was inserted in the second step, so its created time differs from the others.
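    The temp-table pattern above can be wrapped in a small reusable helper. The names below (`build_upsert_sql`, `upsert`) are hypothetical, and the sketch assumes the data frame's column names match the target table's and that the table already has its primary key:

    ```r
    # Sketch of a reusable temp-table UPSERT (hypothetical helper, under the
    # assumptions stated above). Builds the same INSERT ... SELECT ... ON
    # DUPLICATE KEY UPDATE statement used in the demonstration.
    build_upsert_sql <- function(table, tmp, cols, updatecols) {
      paste0("insert into ", table, " (", paste(cols, collapse = ","), ")",
             " select ", paste(cols, collapse = ","), " from ", tmp,
             " on duplicate key update ",
             paste0(updatecols, "=VALUES(", updatecols, ")", collapse = ", "))
    }

    upsert <- function(db, table, df, updatecols, tmp = "mytemp") {
      DBI::dbCreateTable(db, tmp, df, temporary = TRUE)
      DBI::dbWriteTable(db, tmp, df, append = TRUE)
      # Drop the temp table even if the upsert errors out.
      on.exit(DBI::dbExecute(db, paste("drop table", tmp)), add = TRUE)
      DBI::dbExecute(db, build_upsert_sql(table, tmp, names(df), updatecols))
    }
    # e.g. upsert(db, "mydata", df_live, updatecols = c("km", "temp"))
    ```

    Adding a column to the update set then means changing only the `updatecols` argument, not the query text.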


    Reproduction

    I did this with the mariadb:latest docker image. The steps below are purely for demonstration, and are not provided as the canonical way to manage a database (for security or performance).

    $ docker pull mariadb:latest
    Using default tag: latest
    latest: Pulling from library/mariadb
    c549ccf8d472: Pull complete
    26ea6552a462: Pull complete
    329b1f41043f: Pull complete
    9f8d09317d80: Pull complete
    2bc055a5511d: Pull complete
    e989e430508e: Pull complete
    cdba2af19f87: Pull complete
    04fe4f90eab8: Pull complete
    389c6b423e31: Pull complete
    bef640655d86: Pull complete
    Digest: sha256:0c72b63198ac53df4e84db821876c73794b00509b2d8a77100d186a13e49ac31
    Status: Downloaded newer image for mariadb:latest
    docker.io/library/mariadb:latest
    
    $ docker run -p 127.0.0.1:3306:3306  --name some-mariadb \
        -e MARIADB_ROOT_PASSWORD=mysecretpw -e MARIADB_DATABASE=mydb -d mariadb:latest
    

    In R, connecting is straightforward:

    db <- DBI::dbConnect(RMariaDB::MariaDB(), host="127.0.0.1", port=3306, 
                         username="root", password="mysecretpw", dbname="mydb")
    DBI::dbGetQuery(db, "select version() as dbver")
    #                                   dbver
    # 1 10.5.11-MariaDB-1:10.5.11+maria~focal