I'm having a memory problem with R giving the Can not allocate vector of size XX Gb
error message. I have a bunch of daily files (12784 days) in netcdf format giving sea surface temperature in a 1305x378 (longitude-latitude) grid. That gives 493290 points each day, decreasing to about 245000 when removing NAs (over land points).
My final objective is to build a time series for any of the 245000 points from the daily files and find the temporal trend for each point. And my idea was to build a big data frame with a point per row and a day per column (2450000x12784) so I could apply the trend calculation to any point. But then, building such data frame, the memory problem appeared, as expected.
First I tried a script I had previously used to read data and extract a three column (lon-lat-sst) dataframe by reading nc file and then melting the data. This lead to an excessive computing time when tried for a small set of days and to the memory problem. Then I tried to subset the daily files into longitudinal slices; this avoided the memory problem but the csv output files were too big and the process was very time consuming.
Another strategy I've tried without success to the moment it's been to sequentially read all the nc files and then extract all the daily values for each point and find the trend. Then I would only need to save a single 245000 points dataframe. But I think this would be time consuming and not the proper R way.
I have been reading about big.memory
and ff
packages to try to declare big.matrix or a 3D array (1305 x 378 x 12784) but had not success by now.
What would be the appropriate strategy to face the problem?
Thanks in advance for your help
EDIT 1 Add code to fill the matrix
library(stringr)
library(ncdf4)
library(reshape2)
library(dplyr)
# paths
ruta_datos<-"/home/meteo/PROJECTES/VERSUS/CMEMS/DATA/SST/"
ruta_treball<-"/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL/"
setwd(ruta_treball)
sst_data_full <- function(inputfile) {
sstFile <- nc_open(inputfile)
sst_read <- list()
sst_read$lon <- ncvar_get(sstFile, "lon")
sst_read$lats <- ncvar_get(sstFile, "lat")
sst_read$sst <- ncvar_get(sstFile, "analysed_sst")
nc_close(sstFile)
sst_read
}
melt_sst <- function(L) {
dimnames(L$sst) <- list(lon = L$lon, lat = L$lats)
sst_read <- melt(L$sst, value.name = "sst")
}
# One month list file: This ends with a df of 245855 rows x 33 columns
files <- list.files(path = ruta_datos, pattern = "SST-CMEMS-198201")
sst.out=data.frame()
for (i in 1:length(files) ) {
sst<-sst_data_full(paste0(ruta_datos,files[i],sep=""))
msst <- melt_sst(sst)
msst<-subset(msst, !is.na(msst$sst))
if ( i == 1 ) {
sst.out<-msst
} else {
sst.out<-cbind(sst.out,msst$sst)
}
}
EDIT 2 Code used in a previous (smaller) data frame to calculate temporal trend. Original data was a matrix of temporal series, being each column a series.
library(forecast)
data<-read.csv(....)
for (i in 2:length(data)){
var<-paste("V",i,sep="")
ff<-data$fecha
valor<-data[,i]
datos2<-as.data.frame(cbind(data$fecha,valor))
datos.ts<-ts(datos2$valor, frequency = 365)
datos.stl <- stl(datos.ts,s.window = 365)
datos.tslm<-tslm(datos.ts ~ trend)
summary(datos.tslm)
output[i-1]<-datos.tslm$coefficients[2]
}
fecha is date variable name
EDIT 2 Working code from F. Privé answer
library(bigmemory)
tmp <- sst_data_full(paste0(ruta_datos,files[1],sep=""))
library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files),backingfile = "/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL" )
for (i in seq_along(files)) {
mat[, i] <- sst_data_full(paste0(ruta_datos,files[i],sep=""))$sst
}
With this code a big matrix was created
dim(mat)
[1] 493290 12783
mat[1,1]
[1] 293.05
mat[1,1:10]
[1] 293.05 293.06 292.98 292.96 292.96 293.00 292.97 292.99 292.89 292.97
ncol(mat)
[1] 12783
nrow(mat)
[1] 493290
So, to your read data in a Filebacked Big Matrix (FBM), you can do
files <- list.files(path = "SST-CMEMS", pattern = "SST-CMEMS-198201*",
full.names = TRUE)
tmp <- sst_data_full(files[1])
library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files))
for (i in seq_along(files)) {
mat[, i] <- sst_data_full(files[i])$sst
}