I am experimenting with R to analyse some measurement data. I have a .csv file containing more than 2 million lines of measurements. Here is an example:
2014-10-22 21:07:03+00:00,7432442.0
2014-10-22 21:07:21+00:00,7432443.0
2014-10-22 21:07:39+00:00,7432444.0
2014-10-22 21:07:57+00:00,7432445.0
2014-10-22 21:08:15+00:00,7432446.0
2014-10-22 21:08:33+00:00,7432447.0
2014-10-22 21:08:52+00:00,7432448.0
2014-10-22 21:09:10+00:00,7432449.0
2014-10-22 21:09:28+00:00,7432450.0
After reading in the file, I want to convert the time column to proper timestamps using as.POSIXct(). For small files this works fine, but for large files it does not.
I made an example by reading in a big file, creating a copy of a small portion (temp) and then applying as.POSIXct() to the relevant column. I included an image of the output. As you can see, when applying it to the temp variable it correctly keeps the hours, minutes and seconds. However, when applying it to the whole file, only the date is stored. (It also takes a LOT of time, more than 2 minutes.)
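In rough terms this is what I am doing (a sketch only; the file name and column names below are placeholders):

# read the full file; column names are placeholders
data <- read.csv("measurements.csv", header = FALSE,
                 col.names = c("time", "value"), stringsAsFactors = FALSE)

# small copy: hours, minutes and seconds are kept after conversion
temp <- data[1:1000, ]
temp$time <- as.POSIXct(temp$time)

# whole file: very slow, and only the date appears to be kept
data$time <- as.POSIXct(data$time)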
What could cause this anomaly? Is it due to some system limit, since I'm running this on my laptop?
Edit
On my Windows 7 machine I run R 3.1.3, which shows this behaviour. However, on Ubuntu 14.01 running R 3.0.2, the times are kept for the large files. I just noticed there is a newer version (3.2.0) for Windows; I will update and check whether the issue persists.
You can try the code below. It reads the file with data.table's fread() and then converts the timestamp column in place:
library(data.table)
# fread() is much faster than read.csv() for a file of this size
data <- fread("C:/RData/house2_electricity_main.csv")
# convert the timestamp column to POSIXct in place
data[, V1 := as.POSIXct(V1)]
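fread() returns a data.table, and the := assignment updates the column by reference, so the 2-million-row table is not copied during the conversion.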
There was a question recently about using fasttime::fastPOSIXct instead of as.POSIXct, which can speed things up further.
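A minimal sketch of that variant (assuming the timestamps really are UTC, as the +00:00 offsets suggest; fastPOSIXct() always parses its input as GMT/UTC):

library(fasttime)
# strip the trailing "+00:00" offset first, in case the parser does not ignore it
data[, V1 := fastPOSIXct(substr(V1, 1, 19), tz = "UTC")]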
As for the title question: once you have a POSIXct column you can round or group it quite freely, e.g. with the functions year(), month() and mday():
data[, .SD, by = .(year(V1),month(V1),mday(V1))]
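For example, a per-day aggregate could look like this (V2 is assumed to be the measurement column):

# daily mean of the measurement column
data[, .(daily_mean = mean(V2)), by = .(year(V1), month(V1), mday(V1))]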