rdatedbplyrrpostgresql

Subtracting 1 Year from Date without Lubridate


I have a dataframe in a database with a date dimension that I pulled using RPostgres. Each 'date' is in the format 'YYYY-MM-DD. I'd like to add a new column of dates (labeled 'lookback_date') that are one year earlier than the initial dates.

To be clear, if an observation's 'date' were '2000-01-01', I'd want to add a new 'lookback_date' of '1999-01-01' to that observation. Unfortunately, I can't figure out how to do that. Normally, I'd use Lubridate, but, from what I can tell, it doesn't work with dbplyr. Here's a streamlined version of my code so far. Everything in my actual code works fine up until the mutate function.

# Packages
library(dbplyr)
library(RPostgres)

# Connect to db 
drv <- dbDriver("Postgres")

# Setup connect to db
conn <- dbConnect(drv,
                  dbname = etc,)

# Define table to use in db
table <- tbl(conn, in_schema("xyz", "abc"))

#Select columns and filter
base_data <- table %>%
  #Filter for pertinent data
  filter(date > as.Date("2018-01-01") & date <= as.Date("2020-01-01"))

modified_data <- base_data %>%
mutate(lookback_date = date - 365)

Is there another way I can create this new column of dates?

Thanks!


Solution

  • You are correct that lubridate and dbplyr to not play nicely together (right now). As a result of this, I do most of my dbplyr date manipulation using fragments of sql.

    Based on this answer and this site, the postgresql syntax to add/subtract time from a date is:

    SELECT old_date + INTERVAL '1 day' AS new_date;
    

    Based on this I would try the following:

    output = base_data %>% mutate(lookback_date = date - sql("INTERVAL '1 year'"))
    

    When I do this with a simulated connection, it produces the correct syntax:

    library(dplyr)
    library(dbplyr)
    
    df = data.frame(my_num = c(1,2,3), my_dates = as.Date(c('2000-01-01','2000-02-02','2000-03-03')))
    df = tbl_lazy(df, con = simulate_postgres())
    
    output = df %>% mutate(new_date = my_dates - sql("INTERVAL '1 year'"))
    
    show_query(output)
    # <SQL>
    # SELECT `my_num`, `my_dates`, `my_dates` - INTERVAL '1 year' AS `new_date`
    # FROM `df`
    

    UPDATE: From comment, you first want to convert from date-time to date.

    It appears that dbplyr does support the translation of as.Date to PostgreSQL (as.Date is part of base R, not part of lubridate). Hence you can use the following to cast (convert) a column to date:

    library(dplyr)
    library(dbplyr)
    
    df = data.frame(my_str = c('2000-01-01','2000-02-02','2000-03-03'))
    df = tbl_lazy(df, con = simulate_postgres())
    
    output = df %>% mutate(my_date = as.Date(my_str))
    
    show_query(output)
    # <SQL>
    # SELECT `my_str`, CAST(`my_str` AS DATE) AS `my_date`
    # FROM `df`
    

    It also appears that PostgreSQL does not allow you to add an interval of one year. One alternative to this is to extract the year, month, and day from the date, add one to the year and then recombine.

    Following these two references (postgre date references and date_part fuction) and this answer, you probablywant something like the following:

    output = df %>%
      mutate(the_year = DATE_PART('year', my_date),
             the_month = DATE_PART('month', my_date),
             the_day = DATE_PART('day', my_date)) %>%
      mutate(new_date = MAKE_DATE(the_year + 1, the_month, the_day)