rdatedata-import

Converting a date 'year - month - date' to only 'year and month' in r with SQL data


I'm working on a problem where I need to merge two datasets. The first dataset is from SQL and imported using the RODBC library, while the second dataset is imported from Excel. I want to merge the two dataframes by month and year, however in order to do that, I need to convert the first DF's date column into year-month, from year-month-date.

I have tried to use as.Date(df$postingdate, format = '%Y %M' or strftime(df$postingdate,"%Y %m") as I normally would do, however the first doesn't work and the second changes the column to character. It has been a problem for days, and I have tried a number of things, mainly suggestions from the following link: [https://stackoverflow.com/questions/6242955/converting-year-and-month-yyyy-mm-format-to-a-date][1]

In the bottom I have created a df from output I get when using `dput()´ (df2) and I noticed that under posting date, the data is converted to a number, rather than the actual date (“2020-05-28”, “2020-10-09”, "2021-10-19"). Therefor I’m also unsure whatever I have problem because I use the wrong functions, or because the data is of a “unknown” data type.

A sample of the first dataset where I want to transform date into year – month:

df <- data.frame(
  Posting_Date = c("2020-05-28", "2020-10-09", "2021-10-19"), Sales = c(2702.5, 369, 4134),
  Sales_person_code = c(6L, 10L, 10L), EDI = c(1L, 1L, 1L), 
  City = c(141L, 4L, 6L), Kæde = c(12L, 12L, 12L), 
  Vinter = c(0, 0, 0), Forår = c(1, 0, 0), Sommer = c(0, 0, 0), 
  Efterår = c(0, 1, 1), Fredag = c(0, 1, 0), Lørdag = c(0, 0, 0), 
  Mandag = c(0, 0, 0), Onsdag = c(0, 0, 0), Søndag = c(0, 0, 0), 
  Tirsdag = c(0, 0, 1), Torsdag = c(1, 0, 0), 
  year_month = c("2020-05-28", "2020-10-09", "2021-10-19"))

df2 <- data.frame(
  Posting_Date = c(18410, 18544, 18919), Sales = c(2702.5, 369, 4134), 
  Sales_person_code = c(6L, 10L, 10L),EDI = c(1L, 1L, 1L), 
  City = c(141L, 4L, 6L), Kæde = c(12L, 12L, 12L), 
  Vinter = c(0, 0, 0), Forår = c(1, 0, 0), Sommer = c(0, 0, 0), 
  Efterår = c(0, 1, 1), Fredag = c(0, 1, 0), Lørdag = c(0, 0, 0), 
  Mandag = c(0, 0, 0), Onsdag = c(0, 0, 0), Søndag = c(0, 0, 0), 
  Tirsdag = c(0, 0, 1), Torsdag = c(1, 0, 0), 
  year_month = c(18410, 18544, 18919))

Thanks in advance for any help. Plz let me know if i can do anything to help you guys, helping me


Solution

  • Up front, your attempt of as.Date(df$Posting_Date, format="%Y %m") seems backwards: the function as.Date is for converting from a string to a Date-class, and its format= argument is to identify how to find the year/month/day components of the string, not how you want to convert it later. (Note that in R, a Date is shown as YYYY-MM-DD. Always. Telling R you want a date to be just year/month is saying that you want to convert it to a string, no longer date-like or number-like. lubridate and perhaps other packages allow you to have similar-to-Date like objects.)

    For df, one can just subset the strings without parsing to Date-class:

    substring(df$Posting_Date, 1, 7)
    # [1] "2020-05" "2020-10" "2021-10"
    

    If you want to do anything number-like to them, you can convert to Date-class first, and then use format(.) to convert to a string with a specific format.

    as.Date(df$Posting_Date)
    # [1] "2020-05-28" "2020-10-09" "2021-10-19"
    format(as.Date(df$Posting_Date), format = "%Y-%m")
    # [1] "2020-05" "2020-10" "2021-10"
    

    For df2, though, since it is numeric you need to specify an origin= instead of a format=. I'm inferring that these are based off of epoch, so

    as.Date(df2$Posting_Date, origin = "1970-01-01")
    # [1] "2020-05-28" "2020-10-09" "2021-10-19"
    format(as.Date(df2$Posting_Date, origin = "1970-01-01"), format = "%Y-%m")
    # [1] "2020-05" "2020-10" "2021-10"
    

    Note that R stores Date (and POSIXct, incidentally) as numbers internally:

    dput(as.Date(df2$Posting_Date, origin = "1970-01-01"))
    # structure(c(18410, 18544, 18919), class = "Date")