[SOLVED] Dealing with big datasets in R

I'm running into a memory problem in R: the "cannot allocate vector of size XX Gb" error. I have a set of daily files (12784 days) in NetCDF format giving sea surface temperature on a 1305x378 (longitude-latitude) grid. That gives 493290 points per day, dropping to about 245000 once NAs (points over land) are removed.

My final objective is to build a time series for each of the 245000 points from the daily files and find the temporal trend at each point. My idea was to build a big data frame with one point per row and one day per column (245000x12784) so I could apply the trend calculation to any point. But when building such a data frame, the memory problem appeared, as expected.
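For scale, here is a rough back-of-the-envelope estimate of what such a table needs in RAM. It assumes every value is stored as an 8-byte double (R's default numeric type) and ignores data frame overhead; the variable names are only illustrative.

```r
# Approximate memory needed for a points x days table of doubles
bytes_per_double <- 8
n_days <- 12784
n_sea  <- 245000        # approximate number of non-NA (sea) points
n_grid <- 1305 * 378    # full grid, land included

# Size in GiB for a table with n_rows rows and n_days columns
gib <- function(n_rows) n_rows * n_days * bytes_per_double / 2^30

gib_sea  <- gib(n_sea)   # roughly 23 GiB
gib_grid <- gib(n_grid)  # roughly 47 GiB
```

Either figure exceeds the RAM of a typical workstation, which is consistent with the allocation error.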

First I tried a script I had previously used to read the data and extract a three-column (lon-lat-sst) data frame by reading the nc file and then melting the data. This led to excessive computing time, even when tried on a small set of days, and to the memory problem. Then I tried to subset the daily files into longitudinal slices; this avoided the memory problem, but the csv output files were too big and the process was very time consuming.

Another strategy I have tried, so far without success, is to read all the nc files sequentially, extract all the daily values for each point, and find the trend. Then I would only need to save a single 245000-point data frame. But I think this would be time consuming and not the proper R way.
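As a minimal sketch of that per-point strategy: ncdf4's ncvar_get() accepts start and count arguments, so a single grid cell can be read without loading the whole field. The helper name point_series and the tiny demo files are hypothetical, and the sketch assumes analysed_sst is stored as lon x lat, optionally followed by a length-1 time dimension.

```r
library(ncdf4)

# Read one grid cell (lon index i, lat index j) from each file.
# Assumes the variable is stored as lon x lat (optionally x time).
point_series <- function(files, i, j, varname = "analysed_sst") {
  vapply(files, function(f) {
    nc <- nc_open(f)
    on.exit(nc_close(nc))
    ndims <- nc$var[[varname]]$ndims
    start <- c(i, j, rep(1, ndims - 2))  # first element of any extra dims
    count <- rep(1, ndims)               # read exactly one value
    as.numeric(ncvar_get(nc, varname, start = start, count = count))
  }, numeric(1), USE.NAMES = FALSE)
}

# Tiny made-up files so the sketch can run on its own
tmp_files <- file.path(tempdir(), sprintf("demo-sst-%d.nc", 1:3))
lon <- ncdim_def("lon", "degrees_east", c(0, 1, 2))
lat <- ncdim_def("lat", "degrees_north", c(40, 41))
v <- ncvar_def("analysed_sst", "kelvin", list(lon, lat))
for (k in seq_along(tmp_files)) {
  nc <- nc_create(tmp_files[k], v)
  ncvar_put(nc, v, matrix(290 + k, nrow = 3, ncol = 2))
  nc_close(nc)
}

series <- point_series(tmp_files, i = 2, j = 1)
```

This keeps memory use small, but it opens every file once per point, so for 245000 points it would likely be far slower than filling one file-backed matrix and working on that.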

I have been reading about the bigmemory and ff packages, trying to declare a big.matrix or a 3D array (1305 x 378 x 12784), but have had no success so far.

What would be the appropriate strategy to face the problem?

1. Extract single-point time series to calculate individual trends and populate a smaller data frame
2. Subset the daily files into slices to avoid the memory problem, but end up with a lot of data frames/files
3. Try to solve the memory problem with the bigmemory or ff packages

Thanks in advance for your help

EDIT 1 Add code to fill the matrix

library(stringr)
library(ncdf4)
library(reshape2)
library(dplyr)

# paths
ruta_datos<-"/home/meteo/PROJECTES/VERSUS/CMEMS/DATA/SST/"
ruta_treball<-"/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL/"
setwd(ruta_treball)

sst_data_full <- function(inputfile) {

  sstFile <- nc_open(inputfile)
  sst_read <- list()

  sst_read$lon <- ncvar_get(sstFile, "lon")
  sst_read$lats <- ncvar_get(sstFile, "lat")
  sst_read$sst <- ncvar_get(sstFile, "analysed_sst")

  nc_close(sstFile)

  sst_read
}

melt_sst <- function(L) {
  dimnames(L$sst) <- list(lon = L$lon, lat = L$lats)
  # Return the long-format (lon, lat, sst) data frame explicitly
  melt(L$sst, value.name = "sst")
}

# One month list file: This ends with a df of 245855 rows x 33 columns
files <- list.files(path = ruta_datos, pattern = "SST-CMEMS-198201")

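# Start from the first day, then append one sst column per additional day.
# Note: cbind() only lines up if every day has the same NA (land) mask.
sst.out <- data.frame()

for (i in seq_along(files)) {
  sst <- sst_data_full(paste0(ruta_datos, files[i]))
  msst <- melt_sst(sst)
  msst <- subset(msst, !is.na(msst$sst))

  if (i == 1) {
    sst.out <- msst
  } else {
    sst.out <- cbind(sst.out, msst$sst)
  }
}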

EDIT 2 Code used on a previous (smaller) data frame to calculate the temporal trend. The original data was a matrix of time series, each column being one series.

library(forecast)

data<-read.csv(....)

# One slope per series; column 1 is assumed to hold the date (fecha)
output <- numeric(ncol(data) - 1)

for (i in 2:ncol(data)) {
  # Daily series with a yearly period
  datos.ts <- ts(data[, i], frequency = 365)

  # Seasonal decomposition (not used in the slope below)
  datos.stl <- stl(datos.ts, s.window = 365)

  # Linear trend; the second coefficient is the slope
  datos.tslm <- tslm(datos.ts ~ trend)

  output[i - 1] <- datos.tslm$coefficients[2]
}

fecha is the name of the date variable.

EDIT 3 Working code from F. Privé's answer

library(bigmemory)

tmp <- sst_data_full(paste0(ruta_datos, files[1]))

library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files), backingfile = "/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL")

for (i in seq_along(files)) {
  mat[, i] <- sst_data_full(paste0(ruta_datos, files[i]))$sst
}

With this code a big matrix was created:

dim(mat)
[1] 493290  12783
mat[1,1]
[1] 293.05
mat[1,1:10]
[1] 293.05 293.06 292.98 292.96 292.96 293.00 292.97 292.99 292.89 292.97
ncol(mat)
[1] 12783
nrow(mat)
[1] 493290

Solution

So, to read your data into a Filebacked Big Matrix (FBM), you can do:

files <- list.files(path = "SST-CMEMS", pattern = "SST-CMEMS-198201",
                    full.names = TRUE)

tmp <- sst_data_full(files[1])

library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files))

for (i in seq_along(files)) {
  mat[, i] <- sst_data_full(files[i])$sst
}
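
Once the matrix is filled, the per-point trends can be computed a block of rows at a time, so that only one block sits in RAM. What follows is a sketch in base R and not part of bigstatsr: row_slopes and block_trends are hypothetical helper names, chunk_size is an arbitrary choice, and the days are assumed to be equally spaced. It uses the closed-form least-squares slope, sum(tc * y) / sum(tc^2) with tc the centred time index, instead of fitting one model per point. A small in-memory matrix stands in for the FBM here; the loop only relies on nrow(mat) and mat[rows, ].

```r
# Least-squares slope for every row of a numeric matrix
# (rows = points, columns = equally spaced days).
# Rows containing NA (e.g. land points) yield NA.
row_slopes <- function(block, t = seq_len(ncol(block))) {
  tc <- t - mean(t)                 # centred time index
  drop(block %*% tc) / sum(tc^2)    # slope per time step
}

# Walk over the rows in chunks so only `chunk_size` rows are loaded at once.
block_trends <- function(mat, chunk_size = 10000) {
  n <- nrow(mat)
  out <- rep(NA_real_, n)
  for (first in seq(1, n, by = chunk_size)) {
    rows <- first:min(first + chunk_size - 1, n)
    block <- mat[rows, ]
    # A single-row chunk comes back as a plain vector; restore the shape
    if (is.null(dim(block))) block <- matrix(block, nrow = length(rows))
    out[rows] <- row_slopes(block)
  }
  out
}

# Small synthetic stand-in for the real matrix
set.seed(1)
demo <- matrix(rnorm(50 * 30), nrow = 50)
demo[3, ] <- NA                     # pretend row 3 is a land point
slopes <- block_trends(demo, chunk_size = 7)
```

With 12783 columns, a chunk of 10000 rows is about 1 GB of doubles, so chunk_size should be tuned to the available memory. The slopes are per day; multiply by 365.25 for an approximate per-year trend.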