Tags: r, dataframe, memory, quanteda, dfm

What does the Cholmod error 'problem too large' mean exactly? Problem when converting a dfm to a data frame


This is a new version of another question posted, now with a reproducible example.

I am trying to convert a document-feature matrix (dfm) built from 29,117 tweets to a data frame in R, but get the error

"Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105"

The dfm is about 21 MB, with 29,117 rows and 78,294 features (the words in the tweets, split up into columns with a 1 or 0 depending on whether the word occurs in the tweet).

## general info
memory.size(max=TRUE)
# [1] 11418.75
sessionInfo()
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 18362)

## install packages, load libraries
# install.packages(c("quanteda", "devtools"))
# devtools::install_github("quanteda/quanteda.corpora")
library("quanteda")
library(RJSONIO)
library(data.table)
library(jsonlite)
library(dplyr)
library(glmnet)

## load data, convert to a data frame, convert to a dfm

baseurl <- "https://raw.githubusercontent.com/alexlitel/congresstweets/master/data/"
d0 <- fromJSON(paste0(baseurl, "2019-10-07.json"), flatten = TRUE)
d1 <- fromJSON(paste0(baseurl, "2019-10-06.json"), flatten = TRUE)
d2 <- fromJSON(paste0(baseurl, "2019-10-05.json"), flatten = TRUE)
d3 <- fromJSON(paste0(baseurl, "2019-10-04.json"), flatten = TRUE)
d4 <- fromJSON(paste0(baseurl, "2019-10-03.json"), flatten = TRUE)
d5 <- fromJSON(paste0(baseurl, "2019-10-02.json"), flatten = TRUE)
d6 <- fromJSON(paste0(baseurl, "2019-10-01.json"), flatten = TRUE)
d7 <- fromJSON(paste0(baseurl, "2019-09-30.json"), flatten = TRUE)
d8 <- fromJSON(paste0(baseurl, "2019-09-29.json"), flatten = TRUE)
d9 <- fromJSON(paste0(baseurl, "2019-09-28.json"), flatten = TRUE)
d10 <- fromJSON(paste0(baseurl, "2019-09-27.json"), flatten = TRUE)
d11 <- fromJSON(paste0(baseurl, "2019-09-26.json"), flatten = TRUE)
d12 <- fromJSON(paste0(baseurl, "2019-09-25.json"), flatten = TRUE)

d <- rbind(d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12)

rm(d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12)
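
Side note: the thirteen fromJSON() calls and the rbind() can be collapsed into a single loop over the dates. A sketch that should be equivalent to the block above (the names dates and day are mine):

dates <- seq(as.Date("2019-09-25"), as.Date("2019-10-07"), by = "day")
d <- do.call(rbind, lapply(dates, function(day)
  fromJSON(paste0(baseurl, day, ".json"), flatten = TRUE)))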

d$text <- as.character(d$text)

dfm <- dfm(corpus(select(d, id, text)),
           remove_punct = TRUE,
           remove = c(stopwords("english"), "t.co", "https", "rt",
                      "amp", "http", "t.c", "can"))

dfm_df <- convert(dfm, to = "data.frame")

# Error in asMethod(object) : 
#   Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

The code below works on a sample of the dataset with 2,000 rows (12,577 features in the dfm, about 2 MB).

I need to convert the dfm to a data frame because I want to add variables and use them in a binary logistic (lasso) regression, such as the source, whether the tweet is a retweet, and whether it contains a URL.


d_t <- d[1:2000, 1:7]

##code control variable

#url

d_t$url <- as.integer(grepl("://", d_t$text))  # 1 if the text contains a URL

#source used
d_t$source_grp[grepl("Twitter for Android", d_t$source)] <- "Twitter for Android"
d_t$source_grp[grepl("Twitter Web Client", d_t$source)] <- "Twitter Web Client"
d_t$source_grp[grepl("Twitter for iPhone", d_t$source)] <- "Twitter for iPhone"
d_t$source_grp[grepl("Twitter for Windows", d_t$source)] <- "Twitter for Windows"
d_t$source_grp[grepl("Twitter for Samsung Tablets", d_t$source)] <- "Samsung Tablets"
d_t$source_grp[grepl("Twitter for Android Tablets", d_t$source)] <- "Android Tablets"
d_t$source_grp[grepl("Twitter for Windows Phone", d_t$source)] <- "Windows Phone"
d_t$source_grp[grepl("Twitter for BlackBerry", d_t$source)] <- "BlackBerry"
d_t$source_grp[grepl("Twitter for iPad", d_t$source)] <- "Twitter for iPad"
d_t$source_grp[grepl("Twitter for Mac", d_t$source)] <- "Twitter for Mac"
d_t$source_grp[is.na(d_t$source_grp)] <- "Other"   
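
Side note: the chain of grepl() assignments can be condensed with dplyr::case_when. A sketch with a few of the groups (a hypothetical rewrite, not in the original; note that case_when takes the first matching condition, so the more specific patterns must come before the patterns they contain):

d_t <- d_t %>%
  mutate(source_grp = case_when(
    grepl("Twitter for Android Tablets", source) ~ "Android Tablets",
    grepl("Twitter for Android", source)         ~ "Twitter for Android",
    grepl("Twitter for Windows Phone", source)   ~ "Windows Phone",
    grepl("Twitter for Windows", source)         ~ "Twitter for Windows",
    grepl("Twitter for iPhone", source)          ~ "Twitter for iPhone",
    TRUE                                         ~ "Other"
  ))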

#retweet

d_t$retweet <- ifelse(grepl("RT @", d_t$text), "1", "0") #create a variable that takes the value 1 when it is a RT

## create the x matrix and y vector

x <- model.matrix(retweet ~ .,
                  cbind(select(d_t, retweet, source_grp, url),
                        convert(dfm(corpus(select(d_t, id, text)),
                                    remove_punct = TRUE,
                                    remove = c(stopwords("english"), "t.co", "https",
                                               "rt", "amp", "http", "t.c", "can")),
                                to = "data.frame")))[, -1]

y <- d_t$retweet

lasso <- cv.glmnet(x=x, y=y, alpha=1, nfolds=5, family="binomial")


I have read other posts saying that the 'problem too large' error is caused by the amount of RAM. This dataset is not that big, and I have tried creating a virtual machine with 30 GB of RAM (on 64-bit Windows with 30 GB of free disk space), but I still get the same error. I therefore wonder whether RAM really is the problem, or whether there is a limit on the number of columns a data frame in R can have. I can load additional dfms of the same size and larger into memory without any problems.
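
As a rough check (my own back-of-the-envelope arithmetic, not output from the session above), the dense version of this dfm would need:

cells <- 29117 * 78294        # documents x features
cells                         # ~2.28 billion cells
cells * 8 / 1024^3            # ~17 GB as a dense numeric matrix
cells > .Machine$integer.max  # TRUE: also exceeds the 32-bit integer limit

so the failure may be an index limit inside the sparse-to-dense conversion rather than RAM alone.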

It is not a solution to reduce the dataset and re-run the code, as this is already a sample. I need to create a data frame (or something like it) from a dfm built from a dataset of 6 million rows (if possible).

Any help or solutions are appreciated, including other ways to add variables to the dfm without converting it to a data frame.

Thanks in advance!


Solution

  • The problem is that you are trying to convert a sparse matrix (dfm) into a dense object. In your case this has dimensions of:

    > dfm
    Document-feature matrix of: 29,117 documents, 78,294 features (100.0% sparse).
    
    > prod(dim(dfm))
    [1] 2279686398
    

    or 2.3 billion cells, which is why the error occurs. The object is extremely sparse, so it is no problem as a dfm, but it explodes when you try to record all of those zeros in a dense matrix. Most of the object is empty:

    > sparsity(dfm)
    [1] 0.9996795
    

    meaning that 99.97% of the cells are zeros. Even if you could create the data.frame, fitting a LASSO model is not going to work because of this extreme lack of information in the features.

    Solution? Trim some features.

    This works, at least on my machine:

    > dfmtrimmed <- dfm_trim(dfm, min_docfreq = 10, min_termfreq = 20, verbose = TRUE)
    Removing features occurring: 
      - fewer than 20 times: 73,573
      - in fewer than 10 documents: 70,697
      Total features removed: 73,573 (94.0%).
    > dfmtrimmed
    Document-feature matrix of: 29,117 documents, 4,721 features (99.6% sparse).
    
    > nrow(convert(dfmtrimmed, to = "data.frame"))
    [1] 29117
    

    But this is still 99.6% sparse, so it makes a lot more sense to trim more aggressively.
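
    Alternatively, the dense conversion can be skipped entirely: cv.glmnet accepts sparse matrices, so the trimmed dfm can be coerced to a plain dgCMatrix and the document-level covariates bound on while everything stays sparse. A sketch, assuming the url, source_grp and retweet variables from the question have been computed on the full d rather than the 2,000-row sample (and assuming the as() coercion applies; in older quanteda versions a dfm inherits from dgCMatrix):

    library(Matrix)

    # covariates as a sparse model matrix; drop the intercept column
    extra <- sparse.model.matrix(~ source_grp + url, data = d)[, -1]

    # coerce the trimmed dfm and bind the covariates on; nothing is densified
    x <- cbind(as(dfmtrimmed, "dgCMatrix"), extra)

    lasso <- cv.glmnet(x = x, y = factor(d$retweet), alpha = 1,
                       nfolds = 5, family = "binomial")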