rmemory-managementdata.tableffr-bigmemory

Still struggling with handling large data set


I have been reading around on this website and haven't been able to find the exact answer. If it already exists, I apologize for the repost.

I am working with data sets that are extremely large (600 million rows, 64 columns on a computer with 32 GB of RAM). I really only need much smaller subsets of this data, but am struggling to perform any functions besides simply importing one data set in with fread, and selecting the 5 columns I need. After that, I try to overwrite my dataset with the specific conditions I need, but I hit my RAM cap and get the message "Error: cannot allocate vector size of 4.5 GB. I looked at ff and bigmemory packages as alternatives, but it seems like you can't subset before importing in those packages? Is there any solution to this problem besides upgrading RAM on computer?

Tasks I am trying to perform:

>SampleTable<-fread("my.csv", header = T, sep = ",", select=c("column1", "column2", "column7", "column12", "column15"))

>SampleTable2<-SampleTable[SampleTable[,column1=="6" & column7=="1"]]

At this point, I hit my memory cap. Would it be better to try and use another package but import all 64 columns of 600 million rows? I also don't want to spend hours upon hours just to perform one import.


Solution

  • What you could do is read the CSV file in chunks:

    # Define only the subset of columns
    csv <- "my.csv"
    colnames <- names(read.csv(csv, header = TRUE, nrows = 1))
    colclasses <- rep(list(NULL), length(colnames))
    ind <- c(1, 2, 7, 12, 15)
    colclasses[ind] <- "double"
    
    # Read header and first line
    library(dplyr)
    l_df <- list()
    con <- file(csv, "rt")
    df <- read.csv(con, header = TRUE, nrows = 1, colClasses = colclasses) %>%
      filter(V1 == 6, V7 == 1)
    names(df) <- paste0("V", ind)
    l_df[[i <- 1]] <- df
    
    # Read all other lines and combine
    repeat {
      i <- i + 1
      df <- read.csv(con, header = FALSE, nrows = 9973, colClasses = colclasses)
      l_df[[i]] <- filter(df, V1 == 6, V7 == 1)
      if (nrow(df) < 9973) break
    }
    df <- do.call("rbind", l_df)
    

    9973 is an arbitrary prime number which has few chance to be a divisor of nlines - 1.