Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size, and object size matter?"
I have a corpus that I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package.
Sometimes the transformations do the task, but sometimes they don't. For example, tm::removeNumbers(). The very first document in the corpus has a content value of "n417", so if preprocessing is successful this document will be transformed to just "n".
Sample corpus is shown below for reproduction. Here is the code block:
library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(foreach)
library(doParallel)
library(SnowballC)
corpus <- (see below)
n <- 100 # This is the size of each chunk in the loop
# Split the corpus into pieces for looping to get around memory issues with transformation
nr <- length(corpus)
pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
lenp <- length(pieces)
rm(corpus) # Save memory
# Save pieces to rds files since not enough RAM
tmpfile <- tempfile()
for (i in seq_len(lenp)) {
  saveRDS(pieces[[i]], paste0(tmpfile, i, ".rds"))
}
rm(pieces) # Save memory
# Doparallel
registerDoParallel(cores = 12)
pieces <- foreach(i = seq_len(lenp)) %dopar% {
  piece <- readRDS(paste0(tmpfile, i, ".rds"))
  # Regular transformations
  piece <- tm_map(piece, content_transformer(removePunctuation), preserve_intra_word_dashes = TRUE)
  piece <- tm_map(piece, content_transformer(function(x, ...)
    qdap::rm_stopwords(x, stopwords = tm::stopwords("english"), separate = FALSE)))
  piece <- tm_map(piece, removeNumbers)
  saveRDS(piece, paste0(tmpfile, i, ".rds"))
  return(1) # Hack to get dopar to forget the piece to save memory since now saved to rds
}
stopImplicitCluster()
# Combine the pieces back into one corpus
corpus <- list()
corpus <- foreach(i = seq_len(lenp)) %do% {
  corpus[[i]] <- readRDS(paste0(tmpfile, i, ".rds"))
}
corpus_done <- do.call(function(...) c(..., recursive = TRUE), corpus)
And here is the link to sample data. Recreating the issue needs a sufficiently large sample of 2k documents, and this site won't let me paste that much, so please see the linked document for the data.
corpus <- VCorpus(VectorSource([paste the chr vector from link above]))
If I run my code block as above with n set to 200 and then look at the results, I can see that numbers remain where they should have been removed by tm::removeNumbers():
> lapply(1:10, function(i) print(corpus_done[[i]]$content)) %>% unlist
[1] "n417"
[1] "disturbance"
[1] "grand theft auto"
However, if I change the chunk size (the value of "n" variable) to 100:
> lapply(1:10, function(i) print(corpus_done[[i]]$content)) %>% unlist
[1] "n"
[1] "disturbance"
[1] "grand theft auto"
The numbers have been removed.
But this is inconsistent. I tried to narrow it down by testing on 150, then 125, and so on, and found that the result flipped between working and not working somewhere between chunk sizes of 120 and 125. Then, after running the function repeatedly for chunk sizes 120 to 125, it would sometimes work and sometimes not for the same chunk size.
I think there may be a relationship between this issue and three variables: the size of the corpus, the chunk size, and the number of cores in registerDoParallel(). I just don't know what it is.
What is the solution? Can this problem be reproduced with the linked sample corpus? I'm concerned because I can reproduce the error sometimes and other times I cannot. Changing the chunk size sometimes exposes the error with removeNumbers(), but not always.
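Printing the first ten documents only shows whether the first chunk failed. A minimal sketch of a fuller check, assuming the corpus is a tm VCorpus as in the script above, is to list every document that still contains a digit. The helper name docs_with_digits is hypothetical and not part of tm, and the three-document corpus is stand-in data.

```r
library(tm)

# Hypothetical helper (not part of tm): return the indices of documents
# whose content still contains at least one digit.
docs_with_digits <- function(corpus) {
  texts <- vapply(
    seq_along(corpus),
    function(i) paste(content(corpus[[i]]), collapse = " "),
    character(1)
  )
  which(grepl("[0-9]", texts))
}

# Stand-in corpus; replace with corpus_done from the script above.
demo <- VCorpus(VectorSource(c("n417", "disturbance", "grand theft auto")))

docs_with_digits(demo)                         # 1: the first document has digits
docs_with_digits(tm_map(demo, removeNumbers))  # integer(0): nothing left to fix
```

Running this on corpus_done after each attempt shows which document positions were skipped, which may help relate the failures to particular chunks or workers.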
Update
Today I resumed my session and could not replicate the error. I created a Google Docs document and experimented with differing values for corpus size, number of cores, and chunk sizes. In each case, everything was a success. So, I tried running on full data and everything worked. However, for my sanity, I tried running again on full data and it failed. Now, I'm back to where I was yesterday.
It appears as though having run the function on a larger dataset has changed something ... I don't know what! Perhaps a session variable of some sort?
So, the new information is that this bug only happens after running the function on a very large dataset. Restarting my session did not solve the problem, but resuming the session after being away for several hours did.
New information:
It might be easier to reproduce the issue on a larger corpus, since this is what seems to trigger it. corpus <- do.call(c, replicate(250, corpus, simplify = F)) will create a 500k document corpus based on the sample I provided. The function may work the first time you call it but, for me, it seems to fail the second time.
This issue is hard because I cannot reproduce it reliably; if I could, I would likely be able to identify and fix it.
New information:
As there are several things happening in this function, it was hard to know where to focus debugging efforts. I was looking at both the fact that I'm using multiple temporary RDS files to save memory and the fact that I'm doing parallel processing. I wrote two alternative versions of the script: one that still uses the RDS files and breaks the corpus up but does not do parallel processing (replaced %dopar% with just %do% and removed the registerDoParallel line), and one that uses parallel processing but does not use RDS temp files to split the small sample corpus up.
I was not able to produce the error with the single-core version of the script. Only with the version that uses %dopar% was I able to recreate the issue (though the issue is intermittent; it does not always fail with %dopar%).
So, this issue only appears when using %dopar%. The fact that I'm using temp RDS files does not appear to be part of the problem.
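Since only the %dopar% version fails, one more variant that could be worth testing is an explicit cluster instead of registerDoParallel(cores = 12). On Unix-like systems the cores form runs the workers as forked copies of the current session, whereas makeCluster() starts fresh R processes that load tm themselves. This is a diagnostic sketch under that assumption, not a confirmed fix; the two workers and the three-element docs vector are stand-ins.

```r
library(foreach)
library(doParallel)

# Assumption: 2 workers keep the demo light; the question uses 12 cores.
cl <- makeCluster(2)        # separate R processes rather than forks
registerDoParallel(cl)

docs <- c("n417", "disturbance", "grand theft auto")  # stand-in data

# .packages loads tm on every worker before the loop body runs.
cleaned <- foreach(i = seq_along(docs), .packages = "tm", .combine = c) %dopar% {
  piece <- VCorpus(VectorSource(docs[i]))
  piece <- tm_map(piece, removeNumbers)
  content(piece[[1]])
}

stopCluster(cl)
cleaned
```

If the intermittent failures disappear with this variant, that would point toward state inherited by the forked workers rather than toward chunk size or core count.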
[...] docs <- (copy from link above) corpus <- VCorpus(VectorSource(docs)) this takes the vector and turns it into a corpus. So just wrap everything in the linked doc inside of VCorpus(VectorSource([character vector goes here])) – Amplexicaul
[...] tm, I would just recommend to 1) create your own preprocessing function (learn some regex) 2) jump to another package - text2vec or quanteda - will be much faster and easier – Incorporated
[...] text2vec as well (I'm the author :-) ). Check tutorials on text2vec.org – Incorporated
[...] corpus <- VCorpus(VectorSource([paste the chr vector from link above])). Remove numbers and the other transformations. The issue in a nutshell is that everything seems to work fine when using a single core. The transformations using tm_map on corpus work. However, when using multiple cores the tm_map transformations SOMETIMES work. It's hard to recreate since it appears random. I have noticed that the issue seems to show up after running the code block on a very l... – Amplexicaul
[...] tm_map is very difficult to debug, has many poorly documented idiosyncrasies, and that using an apply approach with your own custom function vs another package is probably much better in both short and long term. – Tedi
[...] parallel at the same time as some older versions of OpenBLAS. Updating OpenBLAS fixed this issue for me. It's plausible that tm uses some BLAS functions, so this is a possible cause here as well. – Circularize