rdatetimedplyrdata.tablestrptime

Why do data.table and dplyr transform some dates into NA?


My data is stored in a .txt file, it looks like this:

"A" "M" "D" "Tmax" "Tmin"
"1" 1931 1 1 35.4 19.7
"2" 1931 1 2 35.8 20.7
"3" 1931 1 3 37.7 23.9
"4" 1931 1 4 39.5 25.8
"5" 1931 1 5 29 22.8
"6" 1931 1 6 27.5 18.6
"7" 1931 1 7 20.4 15.6
"8" 1931 1 8 28.2 15.1
"9" 1931 1 9 31.2 14.6
"10" 1931 1 10 31 18.5

(A = year, M = month, D = day)

Load the data with 'datos_diarios <- read.table(...)', then with the data.table syntax I create a new column named 'fecha' (date):

datos_diarios <- data.table(subset(datos_diarios, A >= 1971))
  datos_diarios[, fecha := strptime(paste0(A,'-',M,'-',D), format= '%Y-%m-%d', tz = 'America/Argentina/Buenos_Aires')]

Surprisingly, I get some NA

which(is.na(datos_diarios$fecha))
[1]  1119  6545  6863  7234  7598  7962 13513 13807

datos_diarios[1118:1120,]
      A M  D Tmax Tmin      fecha
1: 1974 1 22 25.0 21.9 1974-01-22
2: 1974 1 23 25.4 20.6       <NA>
3: 1974 1 24 29.7 20.5 1974-01-24

Also found this strange behaviour with dplyr package. Nevertheless, this problem doesn´t exist using base R

datos_diarios <- subset(datos_diarios, A >= 1971)
  datos_diarios$fecha <- strptime(paste0(datos_diarios$A,'-',datos_diarios$M,'-',datos_diarios$D),format= '%Y-%m-%d', tz = 'America/Argentina/Buenos_Aires')

> which(is.na(datos_horarios$fecha))
integer(0)

> datos_diarios[1118:1120,]
         A M  D Tmax Tmin      fecha
15728 1974 1 22 25.0 21.9 1974-01-22
15729 1974 1 23 25.4 20.6 1974-01-23
15730 1974 1 24 29.7 20.5 1974-01-24

Tried to create a column with date data using data.table & dplyr, expected no NA, got some NA.


Solution

  • As @thelatemail says in comments, this is a daylight saving time issue. https://www.timeanddate.com/time/change/argentina/buenos-aires?year=1974 :

    Clock changes in Buenos Aires, 1974, showing that DST started on Jan 23

    @thelatemail further points out that this is because is data.table (and presumably dplyr?) is automatically converting POSIXlt to POSIXct: we get a warning

    Values of type POSIXlt detected and converted to POSIXct

    Because the POSIX*t types represent time-date objects (not just date), they are defaulting to midnight, which doesn't exist on that day in that timezone. If you use as.Date() instead, everything seems OK. (Still not exactly sure why POSIXt to POSIXct conversion messes things up, since strptime() does work in base R ...)

    dd2[, date := as.Date(paste0(A,'-',M,'-',D),
        format= '%Y-%m-%d', tz = 'America/Argentina/Buenos_Aires')]