Tags: r, memory, dplyr, out-of-memory, ff

My data.table join exceeds the memory limit on a 64-bit Windows machine with 32 GB of RAM


Background

I have some data manipulation to do on a very large data.table, which we'll call d1 (~125 million rows x 10 columns), that I read into R from a .csv file using fread. The data is about car insurance -- transmission-related and engine-related claims. Here's how I read it in:

d1 <- fread("c:/analysis.csv", 
               header = TRUE, 
               na.strings = c("", "NA"),
               data.table = TRUE,
               stringsAsFactors = TRUE)

The work I need to do on d1 is fairly simple dplyr stuff -- a few joins (on much smaller tables), a couple of mutate() calls, etc. Let's call those much smaller tables d2 and d3 for reference; assume they are also data.tables read into R in the same manner as d1 above. I've planned these operations (the joins and mutates) out on much smaller "test" subsets of d1, so I know they'll work.
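To give a sense of the shape of the work, here's roughly what that pipeline looks like on one of those test subsets (the mutate is just an illustrative calculation on two of the real columns, not my actual logic, and I've left the d3 join out for brevity):

library(data.table)
library(dplyr)

# a "test" subset of the kind I prototyped on -- the real pipeline runs on all of d1
d1_test <- d1[1:100000]

d1_test <- d1_test %>%
  left_join(d2, by = "service_code") %>%                                   # pull in parts/category info from d2
  mutate(age_at_service = as.integer(service_dt - customer_dob) %/% 365L)  # illustrative example only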

The machine I'm running R on has 32 GB of RAM, a very recent Comet Lake Intel Core i5, and a 1 TB NVMe SSD. Both my Windows and R installations are 64-bit.

For reference, here's how much space each dataset is taking:

> format(object.size(d1), "Gb")
[1] "4 Gb"
> format(object.size(d2), "Mb")
[1] "3.2 Mb"

Here's str(d1) so you can see data types:

Classes ‘data.table’ and 'data.frame':  125640181 obs. of  10 variables:
 $ id                   : int  1551444  ...
 $ service_dt           : IDate, format: "2020-11-22"  ...
 $ service_code         : Factor w/ 41817 levels "316",..
 $ problem_code         : Factor w/ 39531 levels "0",
 $ problem_flag         : int  0 0 0 0 0 0 0 1 1 0 ...
 $ problem_type         : Factor w/ 2 levels "transmission","engine": 1 1 ...
 $ customer_dob         : IDate, format: "1976-04-14" "1980-04-25" ...
 $ customer_gender_cd   : Factor w/ 3 levels "F","M","U": 1 2 ...
 $ customer_zip_cd      : Factor w/ 8354 levels "00000","00003"
 $ old_id               : int  13449983 ...

And str(d2) as well:

'data.frame':   37323 obs. of  4 variables:
 $ service_code     : Factor w/ 36281 levels "00002081702",..: 1 2 3 ...
 $ parts_used       : Factor w/ 215 levels "Disc Brake Rotor (Front)",..: 136 ...
 $ category         : Factor w/ 5 levels "Brakes",..: 1 1 ...
 $ subcategory_1    : Factor w/ 24 levels "Rotors",..: 22 20 ...

The Problem

I go to run this join ...

d1 <- left_join(d1, d2, by = c("service_code" = "service_code"))

... and I get this error: Error: cannot allocate vector of size 277.7 Gb. Note that d2 here looks like this:

> dim(d2)
[1] 37323     4

What I've Tried

So I've read several "out of memory in R" posts on Stack Overflow trying to figure out what I can do about this. Using data.tables was one suggestion. I've also tried running gc() to force garbage collection before the join; that doesn't help either. If I look at memory limits, I get this:

> memory.limit()
[1] 32502

I take this to mean that R is already allowed to use essentially all of my system's RAM, and I'm not sure whether that limit can go any higher.

In other posts, I read about the ff package for large datasets, but I ran into other trouble trying to use it. I successfully converted both of my data.tables to ffdf objects like so:

d1 <- as.ffdf(d1)
d2 <- as.ffdf(d2)

I then tried to run the join, but got this error: Error in UseMethod("left_join") : no applicable method for 'left_join' applied to an object of class "ffdf".

My concern is that even if I can somehow get this particular ff code to run, the rest of the work I have to do is pretty dplyr-dependent, and I'm worried that equivalent functions for what I need won't exist in the ff universe, if that makes any sense. Is there a way to get ff's out-of-memory benefits while still working with d1 and d2 (and d3, not shown here) as data.tables?

🚨Edit🚨

After some interaction with @serkan -- see their very helpful answer below -- I discovered that d2 had duplicate values of my join key, d2$service_code. Around 1,100 of them, in fact. In hindsight, that many-to-many expansion was likely the bulk of the reason R tried to allocate a 277.7 Gb vector for a simple left join. So, I learned something today: when left-joining, make sure the key in your right-hand (lookup) table is unique!
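For anyone else who hits this, a quick check along these lines would have caught it before the join (a minimal data.table sketch using the column names above; the row-count estimate is just back-of-the-envelope):

library(data.table)

# how many rows of d2 repeat an already-seen key? (~1100 in my case)
sum(duplicated(d2$service_code))

# rough estimate of how many rows the left join would produce:
# each d1 row is repeated once per matching d2 row (or kept once if unmatched)
n1  <- d1[, .N, by = service_code]
n2  <- d2[, .N, by = service_code]
chk <- merge(n1, n2, by = "service_code", all.x = TRUE, suffixes = c("_d1", "_d2"))
chk[is.na(N_d2), N_d2 := 1L]
chk[, sum(as.numeric(N_d1) * N_d2)]   # expected row count of left_join(d1, d2)

# the fix: keep one row per service_code before joining
# (unique() keeps the first occurrence -- whether that's appropriate depends on your data)
d2 <- unique(d2, by = "service_code")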


Solution

  • I would propose dtplyr for this - in this fashion,

    library(data.table)  # the backend that dtplyr translates dplyr verbs into
    library(dtplyr)
    library(dplyr)

    large_data <- tibble(
            id = 1:125640181,
            value1 = 1:125640181,
            value2 = 1:125640181,
            value3 = 1:125640181,
            value4 = 1:125640181,
            value5 = 1:125640181,
            value6 = 1:125640181,
            value7 = 1:125640181,
            value8 = 1:125640181,
            value9 = 1:125640181
    ) %>% lazy_dt()

    small_data <- tibble(
            id = 1:37323,
            value1 = 1:37323,
            value2 = 1:37323,
            value3 = 1:37323
    ) %>% lazy_dt()
    

    And then join by,

    joined_data <- left_join(
            large_data,
            small_data, by = "id"
    ) %>% as_tibble()
    

    Which gives,

    # A tibble: 6 x 13
         id value1.x value2.x value3.x value4 value5 value6 value7 value8 value9 value1.y value2.y value3.y
      <int>    <int>    <int>    <int>  <int>  <int>  <int>  <int>  <int>  <int>    <int>    <int>    <int>
    1     1        1        1        1      1      1      1      1      1      1        1        1        1
    2     2        2        2        2      2      2      2      2      2      2        2        2        2
    3     3        3        3        3      3      3      3      3      3      3        3        3        3
    4     4        4        4        4      4      4      4      4      4      4        4        4        4
    5     5        5        5        5      5      5      5      5      5      5        5        5        5
    6     6        6        6        6      6      6      6      6      6      6        6        6        6
    

    This works on my machine, which also has 32 GB of RAM. And you can keep all of your dplyr functions and syntax while taking advantage of the memory efficiency of data.table!


    You can read more about it at https://github.com/tidyverse/dtplyr
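
    Applied to the actual tables in the question, the same pattern would look roughly like this (untested against the real data, and assuming d2 has already been deduplicated on service_code as noted in the edit to the question):

    library(data.table)
    library(dtplyr)
    library(dplyr)

    joined <- lazy_dt(d1) %>%
            left_join(lazy_dt(d2), by = "service_code") %>%
            as.data.table()   # collect the result back to a data.table rather than a tibble

    Note that dtplyr is lazy: nothing is computed until you collect with as.data.table(), as.data.frame() or as_tibble(), at which point the whole pipeline runs as translated data.table code.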