Background
I have some data manipulation to do on a very large data.table we'll call d1 (~125 million rows x 10 columns), which I read into R from .csv form using fread. The data is about car insurance: transmission-related and engine-related claims. Here's that read-in:
d1 <- fread("c:/analysis.csv",
header = TRUE,
na.strings = c("", "NA"),
data.table = TRUE,
stringsAsFactors = TRUE)
The work I need to do on d1 is fairly simple dplyr stuff: a few joins (on much smaller tables), a couple of mutates, etc. Let's call those much smaller tables d2 and d3 for reference; assume they're also data.tables, read into R in the same manner as d1 above. I've planned these operations (the joins and mutates) out on much smaller "test" subsets of d1, so I know they'll work.
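Roughly, the plan is something like the sketch below (the d3 join key and the mutate are placeholders I've made up for illustration, not my real logic):

library(dplyr)

d1 <- d1 %>%
  left_join(d2, by = "service_code") %>%
  left_join(d3, by = "problem_code") %>%    # placeholder key for d3
  mutate(customer_age =                     # placeholder derived column
           as.integer((service_dt - customer_dob) / 365.25))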
The machine I'm running R on has 32 GB of RAM, a very recent Comet Lake Intel Core i5, and a 1 TB NVMe SSD. My versions of Windows and R are both 64-bit.
For reference, here's how much space each dataset is taking:
> format(object.size(d1), "Gb")
[1] "4 Gb"
> format(object.size(d2), "Mb")
[1] "3.2 Mb"
Here's str(d1) so you can see the data types:
Classes ‘data.table’ and 'data.frame': 125640181 obs. of 10 variables:
$ id : int 1551444 ...
$ service_dt : IDate, format: "2020-11-22" ...
$ service_code : Factor w/ 41817 levels "316",..
$ problem_code : Factor w/ 39531 levels "0",
$ problem_flag : int 0 0 0 0 0 0 0 1 1 0 ...
$ problem_type : Factor w/ 2 levels "transmission","engine": 1 1 ...
$ customer_dob : IDate, format: "1976-04-14" "1980-04-25" ...
$ customer_gender_cd : Factor w/ 3 levels "F","M","U": 1 2 ...
$ customer_zip_cd : Factor w/ 8354 levels "00000","00003"
$ old_id : int 13449983 ...
And str(d2) as well:
'data.frame': 37323 obs. of 4 variables:
$ service_code : Factor w/ 36281 levels "00002081702",..: 1 2 3 ...
$ parts_used : Factor w/ 215 levels "Disc Brake Rotor (Front)",..: 136 ...
$ category : Factor w/ 5 levels "Brakes",..: 1 1 ...
$ subcategory_1 : Factor w/ 24 levels "Rotors",..: 22 20 ...
The Problem
I go to run this join ...
d1 <- left_join(d1, d2, by = c("service_code" = "service_code"))
... and I get this error: Error: cannot allocate vector of size 277.7 Gb. Note that d2 here looks like this:
> dim(d2)
[1] 37323 4
What I've Tried
So I've read several "out of memory in R" posts on StackOverflow trying to figure out what I can do about this. Using data.tables was one suggestion. I've also tried running gc() to force garbage collection before the join; that doesn't help either. If I look at memory limits, I get this:
> memory.limit()
[1] 32502
I take this to mean that R is already allowed to use essentially all of my system's RAM, and I'm not sure whether this limit can go any higher.
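From what I've read, on Windows (in R versions before 4.2, where memory.limit() still does something) you can request a limit above physical RAM and let the overflow spill to the pagefile, though I doubt that would make a 277 GB allocation workable:

# Windows-only, and a no-op from R 4.2 onwards; size is in Mb.
# Anything beyond the 32 GB of physical RAM would page out to the SSD.
memory.limit(size = 64000)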
In other posts, I read about the ff package for large datasets, but I ran into other trouble trying to use it. I've successfully converted both of my data.tables into ffdf format like so:
d1 <- as.ffdf(d1)
d2 <- as.ffdf(d2)
And then tried to run the join, but got this error: Error in UseMethod("left_join") : no applicable method for 'left_join' applied to an object of class "ffdf".
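The closest ff-native alternative I've found is the merge() method that, as far as I can tell, the ffbase package provides for ffdf objects (I believe it only supports inner and left outer joins). Something like the following, though it means abandoning dplyr syntax entirely:

library(ff)
library(ffbase)   # assuming ffbase is what supplies merge() for ffdf objects

# left join d2 onto d1 by service_code, keeping the data on disk
d1_joined <- merge(d1, d2, by = "service_code", all.x = TRUE)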
My concern is that even if I can get an ff-based join to run somehow, the rest of the work I have to do is pretty dplyr-dependent. I'm worried that equivalent functions for what I need to do won't exist in the ff universe, if that makes any sense. Is there a way to keep using ff but still work with d1 and d2 (and d3, not shown here) as data.tables?
🚨Edit🚨
After some interaction with @serkan -- see their very helpful answer below -- I discovered that d2 had duplicate values of my join key, d2$service_code. Around 1100 of them, in fact. Each duplicated key matches (and therefore multiplies) every corresponding row of d1, so in hindsight this was likely the bulk of the reason R got tripped up trying to allocate a 277.7 Gb vector on a simple left join. So, I learned something today: when left-joining, ensure that your RHS table's key variable is unique!
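For anyone who hits the same thing, here's roughly how I checked (the dedup step at the end is only appropriate if the extra rows really are redundant):

# how many extra rows of d2 share a service_code with an earlier row?
sum(duplicated(d2$service_code))        # ~1100 in my case

# look at the offending keys before deciding what to do with them
dups <- unique(d2$service_code[duplicated(d2$service_code)])
d2[d2$service_code %in% dups, ]

# keep only the first row per service_code
d2 <- d2[!duplicated(d2$service_code), ]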
I would propose dtplyr for this: it translates dplyr verbs into data.table code and only evaluates them when you collect the result, so you keep the dplyr syntax. In this fashion,
library(dplyr)
library(dtplyr)   # needs data.table installed; it does the work under the hood

large_data <- tibble(
id = 1:125640181,
value1 = 1:125640181,
value2 = 1:125640181,
value3 = 1:125640181,
value4 = 1:125640181,
value5 = 1:125640181,
value6 = 1:125640181,
value7 = 1:125640181,
value8 = 1:125640181,
value9 = 1:125640181
) %>% lazy_dt()
small_data <- tibble(
id = 1:37323,
value1 = 1:37323,
value2 = 1:37323,
value3 = 1:37323
) %>% lazy_dt()
And then join them with:
joined_data <- left_join(
large_data,
small_data, by = "id"
) %>% as_tibble()
Which gives (only the first rows shown here),
# A tibble: 6 x 13
id value1.x value2.x value3.x value4 value5 value6 value7 value8 value9 value1.y value2.y value3.y
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6 6 6 6 6
This runs fine on my machine, which also has 32 GB of RAM. And you keep all of the dplyr functions and syntax while taking advantage of the memory efficiency of data.table!
You can read more about it at https://github.com/tidyverse/dtplyr
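Applied to your actual tables, a minimal sketch would look something like this (assuming d1 and d2 are the data.tables from your question; the mutate is just a placeholder):

library(dplyr)
library(dtplyr)
library(data.table)

result <- d1 %>%
  lazy_dt() %>%                                   # record the operations lazily
  left_join(lazy_dt(d2), by = "service_code") %>%
  mutate(claim_year = year(service_dt)) %>%       # placeholder derived column
  as.data.table()                                 # compute and return a data.table

Nothing is computed until as.data.table() (or as_tibble()/collect()) is called, so the join itself is carried out by data.table rather than by copying data frames around.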