I have a large dataset of datetimes for almost a full year for every second of the year. I am trying to dplyr::left_join
a second dataset that has a datetime column with values within the time range in the first dataset. When I join the dataset, only a small number of records join (about 100 from about 45k) and I know most records should be joining. The checks I'm doing to ensure the columns are the same include:
dput(df_all_dates$date_time[1])
dput(df_subset_dates$date_time[1])
Both of these produce the following:
structure(1485781200, class = c("POSIXct", "POSIXt"), tzone = "")
I've also done the following comparison (the 10
and the 4701
in the following code reflect the same dates in the data):
as.numeric(df_all_dates$date_time[10]) # produces value 1485785900
as.numeric(df_subset_dates$date_time[4701]) # produces value 1485785900
However, in the join, the data from the df_subset_dates does not join into the resulting dataset, even though the datetime values are the same. Is there something else about datetimes that would cause these not to join? Some values do join, but I don't see any pattern as to why those records are different from the ones that do not join.
Here is the code of the actual join, if helpful:
df_all_dates %>%
left_join(df_subset_dates, by = 'date_time')
On my separate dataset, checking both 'by' columns with lubridate::seconds(date_time) showed they were formatted differently due to milliseconds, though it didn't show up in most displays. "1522267608S" vs "1522267308.443S" (these aren't supposed to match, just to show formatting)
Wrapping one or both of the columns in the following to remove milliseconds did the trick for me:
library(lubridate)
as_datetime(floor(seconds(date_time)))
I haven't extensively tested other cases, but floor() worked perfectly for my data. It might be that round() or ceiling() works for others.