This is a new version of another question posted, now with a reproducible example.
I am trying to convert a document-feature matrix (dfm) built from 29,117 tweets to a data frame in R, but I get the error
"Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105"
The dfm is about 21 MB, with 29,117 rows and 78,294 features (the words in the tweets split into columns, each cell holding a count of how often the word occurs in that tweet, which for tweets is mostly 0 or 1).
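To illustrate the structure (a made-up toy example; note the cells are term counts, so they can exceed 1):
library(quanteda)
toy <- dfm(c("red blue", "blue blue green"))  # two toy "tweets"
convert(toy, to = "data.frame")
#   document red blue green
# 1    text1   1    1     0
# 2    text2   0    2     1
# (output from quanteda 1.x; newer versions call the first column doc_id)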
##general info
memory.size(max=TRUE)
# [1] 11418.75
sessionInfo()
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 18362)
##install packages, load libraries
# install.packages(c("quanteda", "devtools"))
# devtools::install_github("quanteda/quanteda.corpora")
library("quanteda")
library(RJSONIO)
library(data.table)
library(jsonlite) #loaded after RJSONIO, so jsonlite::fromJSON() (which has the flatten argument) is the one used below
library(dplyr)
library(glmnet)
##load data, convert to a dataframe, convert to a dfm
baseurl <- "https://raw.githubusercontent.com/alexlitel/congresstweets/master/data/"
dates <- seq(as.Date("2019-10-07"), as.Date("2019-09-25"), by = "-1 day")
d <- do.call(rbind, lapply(dates, function(day) {
  fromJSON(paste0(baseurl, day, ".json"), flatten = TRUE)
}))
d$text <- as.character(d$text)
dfmat <- dfm(corpus(select(d, id, text)),
             remove_punct = TRUE,
             remove = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can"))
dfm_df <- convert(dfmat, to = "data.frame")  # this is the line that fails
#Error in asMethod(object) :
#Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
The code below works on a sample of the dataset with 2,000 rows (12,577 features in the dfm, about 2 MB).
I need to convert the dfm to a data frame because I want to add variables, such as the source, whether the tweet is a retweet, and whether it contains a URL, and use them in a binary logistic (lasso) regression.
d_t <- d[1:2000, 1:7]
##code control variable
#url
d_t$url <- as.integer(grepl("://", d_t$text)) #1 when the tweet contains a URL
#source used (order matters: later patterns overwrite earlier matches, e.g. "Windows Phone" after "Twitter for Windows")
d_t$source_grp[grepl("Twitter for Android", d_t$source)] <- "Twitter for Android"
d_t$source_grp[grepl("Twitter Web Client", d_t$source)] <- "Twitter Web Client"
d_t$source_grp[grepl("Twitter for iPhone", d_t$source)] <- "Twitter for iPhone"
d_t$source_grp[grepl("Twitter for Windows", d_t$source)] <- "Twitter for Windows"
d_t$source_grp[grepl("Twitter for Samsung Tablets", d_t$source)] <- "Samsung Tablets"
d_t$source_grp[grepl("Twitter for Android Tablets", d_t$source)] <- "Android Tablets"
d_t$source_grp[grepl("Twitter for Windows Phone", d_t$source)] <- "Windows Phone"
d_t$source_grp[grepl("Twitter for BlackBerry", d_t$source)] <- "BlackBerry"
d_t$source_grp[grepl("Twitter for iPad", d_t$source)] <- "Twitter for iPad"
d_t$source_grp[grepl("Twitter for Mac", d_t$source)] <- "Twitter for Mac"
d_t$source_grp[is.na(d_t$source_grp)] <- "Other"
#retweet
d_t$retweet <- as.integer(grepl("RT @", d_t$text)) #1 when the tweet is a retweet (numeric 0/1, as glmnet expects)
##create x and y for the lasso
dfmat_t <- dfm(corpus(select(d_t, id, text)),
               remove_punct = TRUE,
               remove = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can"))
x <- model.matrix(retweet ~ ., cbind(select(d_t, retweet, source_grp, url),
                                     convert(dfmat_t, to = "data.frame")))[, -1]
y <- d_t$retweet
lasso <- cv.glmnet(x = x, y = y, alpha = 1, nfolds = 5, family = "binomial")
I have read other posts saying that the 'problem too large' error is caused by the amount of RAM. This dataset is not that big, and I have tried running the code in a virtual machine with 30 GB of RAM (64-bit Windows with 30 GB of free space), but I still get the same error. I therefore wonder whether RAM really is the problem, or whether there is a limit on the number of columns (or cells) a data frame in R can have. I can load additional dfms of the same size and larger into memory without any problem.
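If I am counting correctly, the dense version of my dfm would need one cell per document-feature pair, which already exceeds R's 32-bit integer limit (my own back-of-the-envelope check, not something the error message says explicitly):
ndoc  <- 29117
nfeat <- 78294
ndoc * nfeat                          # ~2.28e9 cells in the dense matrix
.Machine$integer.max                  # 2147483647, i.e. ~2.15e9
ndoc * nfeat > .Machine$integer.max   # TRUE
Is this the reason for the Cholmod error, rather than RAM?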
Reducing the dataset and re-running the code is not a solution, since this is already a sample. I eventually need to create a data frame (or something like it) from a dfm built from a 6 million row dataset, if that is possible.
Any help or solutions would be appreciated, including other ways to add variables to the dfm without converting it to a data frame.
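For example, one direction I have experimented with (only a sketch, and I am not sure it is correct) is to skip the dense conversion entirely: a dfm inherits from the sparse dgCMatrix class, and glmnet accepts sparse input, so the extra variables could be attached as sparse columns to dfmat_t from the sample code above:
library(Matrix)
# sparse model matrix for the extra variables (drop the intercept column)
extra <- sparse.model.matrix(~ source_grp + url, data = d_t)[, -1]
# combine with the dfm, which is a sparse dgCMatrix underneath
x_sparse <- cbind2(as(dfmat_t, "dgCMatrix"), extra)
lasso <- cv.glmnet(x = x_sparse, y = d_t$retweet, alpha = 1, nfolds = 5, family = "binomial")
Is something like this viable, or is there a better way?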
Thanks in advance!