Tags: r, memory, merge

Merge two large tables with many-to-many relationships


I have searched for a couple of days for ways to get past a limitation in R:

(R error: ℹ 3357726064 rows would be returned. 2147483647 rows is the maximum number allowed.)

I have two tables with a column called rollnumber. The first table has about 110,000 rows and the second has about 2.6 million. Each table has repeated rollnumber values, but there are no duplicate rows; I checked.

I'm looking for a creative answer on how to get past this limitation. Thank you.
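
For reference, the number in the error message is just the sum, over the shared rollnumber values, of (occurrences in the first table) × (occurrences in the second table). A minimal sketch to verify that before attempting the join (assuming the tables are named table1 and table2, as in the answer below):

    library(dplyr)

    # Count how often each rollnumber appears in each table; the sum of the
    # per-key products is exactly how many rows an inner join would return.
    key_counts <- inner_join(
      count(table1, rollnumber, name = "n1"),
      count(table2, rollnumber, name = "n2"),
      by = "rollnumber"
    )
    sum(as.numeric(key_counts$n1) * as.numeric(key_counts$n2))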


Solution

  • One of the ways is chunk processing, where you process the data in chunks to avoid hitting the row limit: split the larger table into manageable chunks, join each chunk separately, and then combine the results.

    Here's the code:

    library(dplyr)

    # Process table2 (the larger table) in chunks of 500,000 rows so that
    # no single join tries to return more rows than R allows.
    chunk_size <- 500000
    num_chunks <- ceiling(nrow(table2) / chunk_size)
    results <- vector("list", num_chunks)

    for (i in seq_len(num_chunks)) {
      # Row range of the current chunk of table2
      start_row <- (i - 1) * chunk_size + 1
      end_row <- min(i * chunk_size, nrow(table2))
      chunk <- table2[start_row:end_row, ]

      # Join this chunk against table1 and store the partial result
      results[[i]] <- inner_join(table1, chunk, by = "rollnumber")
    }

    # Stack the partial joins into the final merged table
    final_result <- bind_rows(results)
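
    A note on the design choice: chunking the larger table means each partial join only multiplies table1 against 500,000 rows of table2, which is likely to stay under the limit; if a chunk dominated by high-multiplicity rollnumber values still trips it, reduce chunk_size. Also, given your numbers, the combined result would itself have roughly 3.36 billion rows, so binding everything back into one in-memory data frame will hit the same 2,147,483,647-row ceiling. In practice you would aggregate or filter each chunk_result, or write it to disk inside the loop (for example with saveRDS() or write.csv()), instead of collecting all partial results into a single table.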