[r] [dataframe] [google-sheets] [dplyr] [r-googlesheets]

How to append new data to a Google Sheet


I have the dataframe shown below in R, which I fetch from a database using RMySQL every 3 hours.

Query:

Select Form_No, Date, Name, Age from Test where Date(Date)>='2020-12-01';

DF:

Form_No      Date                  Name            Age
POS-1        2020-12-01 10:16:12   Alex            29
POS-2        2020-12-01 10:18:34   Mike            31
POS-3        2020-12-01 14:16:22   Paul            21
POS-4        2020-12-01 18:33:14   John            27
POS-5        2020-12-01 20:13:34   Cohe            45

Obviously, each time I run the script after 3 hours, some additional entries may have been created in the dataframe. When I use the code below to upload the data to the Google Sheet, it overwrites the previous information with the new data.

The problem is that, for each row, the reviewer of the information may have captured some observations on the sheet, and an overwrite destroys them.

The code I am using is:

library(dplyr)
library(stringr)
library(RMySQL)
library(googlesheets)

connection<-dbConnect(MySQL(), user='User', password='Password', 
                 dbname='Db name',
                 host='host info')

Query<- paste0("Select Form_No, Date, Name, Age from Test where Date(Date)>='2020-12-01';")
DF <- dbGetQuery(connection,Query)

setwd("/home/path")
write.csv(x = DF, file = "cust_data.csv", row.names = FALSE)

as <- gs_title("TargetSheet")
gs_upload("cust_data.csv", sheet_title = "TargetSheet", overwrite = TRUE)

I want to keep the sequence of Form_No and every other detail the same while uploading the new batch of information.

For example, if I run the query at 06:00 AM and there are five entries as shown in DF, all of them get uploaded to the Google Sheet. If I then run the script at 09:00 AM, the query may well return some more entries along with those five.

I need to append only those extra entries to the Google Sheet, skipping the five entries that were already uploaded last time.


Solution

  • Suppose you've saved your query results into a Google Sheet named "DatabaseOutput".

    You can read the existing data with googlesheets4::read_sheet. To ensure you get the correct column types every time, I recommend supplying the col_types = argument ("cTci" here: character, datetime, character, integer). It is very important that read_sheet returns the same column types as your SQL query returns, or the anti-join below will fail. You can find more information about the argument on the readr documentation page.

    library(googledrive)
    library(googlesheets4)
    OldSheet <- read_sheet(drive_get("DatabaseOutput"),
                           col_types = "cTci")
    OldSheet
    ## A tibble: 3 x 4
    #  Form_No Date                Name    Age
    #  <chr>   <dttm>              <chr> <dbl>
    #1 POS-1   2020-12-01 10:16:12 Alex     29
    #2 POS-2   2020-12-01 10:18:34 Mike     31
    #3 POS-3   2020-12-01 14:16:22 Paul     21
    

    Your SQL query and the read from Google Sheets may disagree on the sub-second precision of the timestamps, so I recommend flooring the Date column to whole seconds with lubridate:

    library(dplyr)
    library(lubridate)
    DF <- as_tibble(DF) %>% 
      mutate(Date = floor_date(Date, unit = "second"))
    OldSheet <- OldSheet %>%
      mutate(Date = floor_date(Date, unit = "second"))
    

    Now we can find the rows which are not in the old data with dplyr::anti_join (by default it matches on all columns the two tables have in common). If this doesn't work, double-check that the column types are all the same.

    NewRows <- anti_join(DF, OldSheet)
    NewRows
    ## A tibble: 2 x 4
    #  Form_No Date                Name    Age
    #  <chr>   <dttm>              <chr> <int>
    #1 POS-4   2020-12-01 18:33:14 John     27
    #2 POS-5   2020-12-01 20:13:34 Cohe     45
    
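    If the timestamps keep causing spurious mismatches, and (as an assumption not stated in the question) Form_No alone uniquely identifies a row, you can sidestep the type-matching problem entirely by keying only on Form_No. A minimal base-R sketch with made-up data:

    ```r
    # Assumed: Form_No is a unique key for each row.
    DF <- data.frame(
      Form_No = c("POS-1", "POS-2", "POS-3", "POS-4", "POS-5"),
      Name    = c("Alex", "Mike", "Paul", "John", "Cohe"),
      Age     = c(29, 31, 21, 27, 45)
    )
    OldSheet <- DF[1:3, ]  # rows already on the sheet

    # Keep only rows whose Form_No is not already on the sheet;
    # no datetime comparison is involved at all.
    NewRows <- DF[!(DF$Form_No %in% OldSheet$Form_No), ]
    NewRows$Form_No
    # "POS-4" "POS-5"
    ```

    The equivalent dplyr call would be anti_join(DF, OldSheet, by = "Form_No").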

    Now you can append the rows to the Google Sheet with sheet_append:

    sheet_append(drive_get("DatabaseOutput"), NewRows)
    
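    Since this runs on a 3-hour schedule, it will often find nothing new, and there is no point in calling the API when nothing changed. A small guard (append_new_rows is a hypothetical helper name, not part of googlesheets4):

    ```r
    # Skip the API call entirely when there are no new rows to append.
    append_new_rows <- function(ss, new_rows) {
      if (nrow(new_rows) == 0) {
        message("No new rows; nothing appended.")
        return(invisible(NULL))
      }
      googlesheets4::sheet_append(ss, new_rows)
    }

    # In the scheduled script:
    # append_new_rows(drive_get("DatabaseOutput"), NewRows)
    ```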
