I am collecting live data from a website which populates a data frame in R. The rows could have the same unique id's, or new rows could be introduced. I want to send the dynamic data frame to a MariaDB database table, where the rows with existing unique id's update the columns that I have specified, rows without existing unique id's get inserted in the table as new rows. I can get this to work with the MariaDB INSERT ON DUPLICATE KEY UPDATE statement, and a function that generates the needed values from the dynamic data frame.
MWE:
install.packages("odbc")
insall.packages("RMariaDB")
library(odbc)
library(RMariaDB)
con <- dbConnect(RMariaDB::MariaDB(), host = Sys.getenv('MARIADB_DBHOST'),
port = Sys.getenv('MARIADB_DBPORT'), user = Sys.getenv('MARIADB_DBUSER'),
password = Sys.getenv('MARIADB_DBPW'), db = Sys.getenv('MARIADB_DBNAME'),
timeout = 5)
# Database table for mwe to work.
db_live <- data.frame(id = c(12, 22, 32), car_name = c("rolls royce","nissan","mercedes benz"), km = c(123,100,150), temp = c(78,60,80))
# Get table from database, id column is unique index.
db_live <- dbReadTable(con, "db_live")
print(db_live)
id car_name km temp
1 12 rols royce 123 78
2 22 nissan 100 60
3 32 mercedes benz 150 80
# Build dynamic dataframe
df_live <- data.frame(id = c(12, 22, 32, 42),
car_name = c("rolls royce","nissan","mercedes benz", "aston martin"),
km = c(140,120,200,40), temp = c(81,65,85,50))
print(df_live)
id car_name km temp
1 12 rols royce 140 81
2 22 nissan 120 65
3 32 mercedes benz 200 85
4 42 aston martin 40 50
# create function that generates a string with values for dbSendQuery.
gen_insert_values <- function(df) {
for(i in 1:nrow(df)) {
row_string <- paste(df[i,1], paste0("'",df[i,2],"'"), df[i,3], df[i,4],
collapse = ", ")
if(exists("df_string")) {
df_string <- paste0(df_string,", ",paste0("(",row_string,")"))
} else {
df_string <- paste0("(",row_string,")")
}
}
df_string
}
values <- gen_insert_values(df_live)
print(values)
"(12 'rolls royce' 140 81), (22 'nissan' 120 65), (32 'mercedes benz' 200 85), (42 'aston martin' 40 50)"
# Send query.
res <- dbSendQuery(con, paste0("INSERT INTO db_live (id,car_name,km,temp) VALUES ", values," ON DUPLICATE KEY UPDATE km = VALUES(km), temp = VALUES(temp);"))
dbClearResult(res)
#Check db table after sent query.
new_db_live <- dbReadTable(con, "db_live")
print(new_db_live)
id car_name km temp
1 12 rolls royce 140 81
2 22 nissan 120 65
3 32 mercedes benz 200 85
4 42 aston martin 40 50
This does not seem very efficient, as I have to change the query and the function in case I want to update more columns, and I include a for loop in my function which can cause the script to be slow.
Is there a more efficient way to solve this problem?
Here's a method that is likely more efficient: the use of a temporary table instead of manually encoding the data as a string of (a,b,c),(a,b,c)
data sets.
For the sake of complete-demonstration, I've modified the df_live
data slightly so that we have one row with no change, one row with updated data, and one row that is new. This process also works unaltered with your original df_live
, I just wanted to highlight the three modes.
Technically, though, the "no change" row does update the database, but it is not obvious. If the table has a "lastmodified" field that updates with the current timestamp when something in the row is updated, then you can see a little more of what is happening.
In fact, I'll add (just for demonstration) two fields: created
and modified
, which show when the row was first created and when the last update occurred. These are not required for normal UPSERTs.
This section should not be necessary, unless you don't have a primary key(s) on the table (in which case, add one).
I'll name the main table "mydata"
, and upload the db_live
dataset into it. I believe (without extensive testing) that MariaDB requires UPSERTs to find the duplicate or conflict rows based on pre-existing keys. This means we'll need to set a (primary) key; I'll assume your table already has this (and show how I do it with manually-uploaded data).
db <- DBI::dbConnect(RMariaDB::MariaDB(), ...)
db_live <- data.frame(id = c(12, 22, 32),
car_name = c("rolls royce","nissan","mercedes benz"),
km = c(123,100,150), temp = c(78,60,80))
df_live <- data.frame(id = c(12, 22, 42),
car_name = c("rolls royce","nissan","aston martin"),
km = c(140,120,40), temp = c(81,65,50))
df_live
# id car_name km temp
# 1 12 rolls royce 140 81 # updated
# 2 22 nissan 100 60 # no change
# 3 42 aston martin 40 50 # new data
DBI::dbWriteTable(db, "mydata", db_live)
DBI::dbExecute(db, "alter table mydata add primary key (id)")
# [1] 0
DBI::dbExecute(db, "
alter table mydata
add column created timestamp not null default CURRENT_TIMESTAMP,
add column modified timestamp null default null")
# [1] 0
DBI::dbExecute(db, "
create trigger updatemodified_mydata
before update on mydata
for each row set NEW.modified = CURRENT_TIMESTAMP")
# [1] 0
DBI::dbGetQuery(db, "select * from mydata")
# id car_name km temp created modified
# 1 12 rolls royce 123 78 2021-06-29 19:11:00 <NA>
# 2 22 nissan 100 60 2021-06-29 19:11:00 <NA>
# 3 32 mercedes benz 150 80 2021-06-29 19:11:00 <NA>
If you don't have a primary key
on the main table mydata
, then the "UPSERT" operation will simply insert (add) all rows without updating. I don't know if there's a way to trick mariadb into faking keys for the purpose of correctly triggering your intended "update if exists" logic.
We'll use a temporary table so that the data to be upserted will not persist; this is good for several reasons, and if you do it correctly then your DBAs will thank you :-)
(If you aren't familiar with temp tables ... they are not visible to other users on the db, often not visible to different connections for the same user, and will be dropped when the connection is closed.)
DBI::dbCreateTable(db, "mytemp", df_live, temporary = TRUE)
DBI::dbWriteTable(db, "mytemp", df_live, append = TRUE, create = FALSE)
DBI::dbGetQuery(db, "select * from mytemp")
# id car_name km temp
# 1 12 rolls royce 140 81
# 2 22 nissan 100 60
# 3 42 aston martin 40 50
DBI::dbExecute(db, "
insert into mydata (id,car_name,km,temp)
select id,car_name,km,temp from mytemp
on duplicate key update km=VALUES(km), temp=VALUES(temp);")
# [1] 5
DBI::dbGetQuery(db, "select * from mydata")
# id car_name km temp created modified
# 1 12 rolls royce 140 81 2021-06-29 19:11:00 2021-06-29 19:11:27
# 2 22 nissan 100 60 2021-06-29 19:11:00 2021-06-29 19:11:27
# 3 32 mercedes benz 150 80 2021-06-29 19:11:00 <NA>
# 4 42 aston martin 40 50 2021-06-29 19:11:27 <NA>
If you note, even though the values for "nissan"
were not different, the row was still allegedly updated as evidenced by the modified
timestamp. The "change" row we had, "rolls royce"
, shows the appropriate modified
time. The mercedes benz
was uploaded the first time and not updated, and the aston martin
was updated the second time, so its created
time is different than the others.
I did this with the mariadb:latest
docker image. These steps below are purely for demonstration, and are not provided as the canonical way to manage a database (for security or performance). Yes I'm connecting to the "mysql"
database, which is not where user data should be going, I believe ... it was hasty, please forgive me.
$ docker pull mariadb:latest
Using default tag: latest
latest: Pulling from library/mariadb
c549ccf8d472: Pull complete
26ea6552a462: Pull complete
329b1f41043f: Pull complete
9f8d09317d80: Pull complete
2bc055a5511d: Pull complete
e989e430508e: Pull complete
cdba2af19f87: Pull complete
04fe4f90eab8: Pull complete
389c6b423e31: Pull complete
bef640655d86: Pull complete
Digest: sha256:0c72b63198ac53df4e84db821876c73794b00509b2d8a77100d186a13e49ac31
Status: Downloaded newer image for mariadb:latest
docker.io/library/mariadb:latest
$ docker run -p 127.0.0.1:3306:3306 --name some-mariadb \
-e MARIADB_ROOT_PASSWORD=mysecretpw -e MARIADB_DATABASE=mydb -d mariadb:latest
In R, connecting is straight-forward:
db <- DBI::dbConnect(RMariaDB::MariaDB(), host="127.0.0.1", port=3306,
username="root", password="mysecretpw", dbname="mydb")
DBI::dbGetQuery(db, "select version() as dbver")
# dbver
# 1 10.5.11-MariaDB-1:10.5.11+maria~focal