I am importing a very large fixed-width dataset into R and wish to use vroom for much better speed. However, the dates in this dataset are in numeric format with either 7 or 8 digits, depending on whether the day of the month has 1 or 2 digits (examples below).
#8 digit date (1985-03-21):
# 21031985
#7 digit date (1985-03-01):
# 1031985
I cannot see any way to specify this type of format using col_date(format = )
as one normally would. It is easy to make a function that converts these 7/8 digit numbers into dates, but doing that means materialising the imported data and removes the speed advantage that vroom provides.
I am looking for a way to have vroom interpret these numbers on its own, or a workaround that does not sacrifice vroom's speed.
Thanks very much for any help here.
Vroom can use a pipe as input. That means you can use a tool like awk
to fix the format (e.g. make it always 8 digit, which is eaasy with sprintf
). That way you can still benefit from vroom streaming the file. You could even use R - but if you are after performance, you need something that can process the file streaming and better be lightweight.
I used a test file test.csv
:
id,date,text
1,1022020,some
2,12042020,more
3,2012020,text
I could read it via (of course the awk call needs to be adjusted for your data - but essentially if you need to just adjust the column $2
means 2nd column, the ','
specifies the separator):
vroom(pipe("awk -F ',' 'BEGIN{OFS=\",\"}; NR==1{print}; NR!=1 {$2=sprintf(\"%08d\",$2);print;}' test.csv"),
col_types=cols(date=col_date(format='%d%m%Y'))
)
giving
# A tibble: 3 × 3
id date text
<int> <date> <chr>
1 1 2020-02-01 some
2 2 2020-04-12 more
3 3 2020-01-02 text