Inconsistent behaviour with tm_map transformation functions when using multiple cores

Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size, and object size matter?"

I have a corpus I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package.

Sometimes the transformations do the task, but sometimes they don't. Take tm::removeNumbers() as an example. The very first document in the corpus has a content value of "n417", so if preprocessing is successful this document should be transformed to just "n".
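For reference, this is the behaviour expected on that first document. It is a minimal sketch that assumes the tm package is installed; the one-document corpus `toy` is made up for illustration and is not the linked sample.

```r
library(tm)

# On a plain character vector, removeNumbers() strips the digits.
removeNumbers("n417")

# The same transformation applied through tm_map(), as in the loop below,
# on a toy one-document corpus built only for this illustration.
toy <- VCorpus(VectorSource("n417"))
toy <- tm_map(toy, removeNumbers)
content(toy[[1]])
```

Both calls should give "n" when the transformation works.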

A sample corpus for reproduction is linked below. Here is the code block:

library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(foreach)
library(doParallel)
library(SnowballC)

  corpus <- (see below)
  n <- 100 # This is the size of each chunk in the loop

  # Split the corpus into pieces for looping to get around memory issues with transformation
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
  lenp <- length(pieces)

  rm(corpus) # Save memory

  # Save pieces to rds files since not enough RAM
  tmpfile <- tempfile()
  for (i in seq_len(lenp)) {
    saveRDS(pieces[[i]],
            paste0(tmpfile, i, ".rds"))
  }

  rm(pieces) # Save memory

  # Doparallel
  registerDoParallel(cores = 12)
  pieces <- foreach(i = seq_len(lenp)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # Regular transformations
    piece <- tm_map(piece, content_transformer(removePunctuation), preserve_intra_word_dashes = TRUE)
    piece <- tm_map(piece, content_transformer(function(x, ...)
      qdap::rm_stopwords(x, stopwords = tm::stopwords("english"), separate = FALSE)))
    piece <- tm_map(piece, removeNumbers)
    saveRDS(piece, paste0(tmpfile, i, ".rds"))
    return(1) # Hack to get dopar to forget the piece to save memory since now saved to rds
  }

  stopImplicitCluster()

  # Combine the pieces back into one corpus
  corpus <- list()
  corpus <- foreach(i = seq_len(lenp)) %do% {
    corpus[[i]] <- readRDS(paste0(tmpfile, i, ".rds"))
  }
  corpus_done <- do.call(function(...) c(..., recursive = TRUE), corpus)

And here is the link to sample data. I need to paste a sufficiently large sample of 2k documents to recreate and this won't let me paste that much, so please see the linked document for data.

corpus <- VCorpus(VectorSource([paste the chr vector from link above]))

If I run the code block above with n set to 200 and then look at the results, I can see that numbers remain where they should have been removed by tm::removeNumbers():

> lapply(1:10, function(i) print(corpus_done[[i]]$content)) %>% unlist
[1] "n417"
[1] "disturbance"
[1] "grand theft auto"

However, if I change the chunk size (the value of the variable n) to 100:

> lapply(1:10, function(i) print(corpus_done[[i]]$content)) %>% unlist
[1] "n"
[1] "disturbance"
[1] "grand theft auto"

The numbers have been removed.

But this is inconsistent. I tried to narrow it down by testing chunk sizes of 150, then 125, and so on, and found that the switch between working and not working fell somewhere between chunk sizes of 120 and 125. Then, after iterating the function over 120:125, it would sometimes work and sometimes not for the same chunk size.

I think there may be a relationship between this issue and three variables: the size of the corpus, the chunk size, and the number of cores in registerDoParallel(). I just don't know what it is.
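To make those three variables concrete, here is the arithmetic for the chunk sizes I tried. It is only a sketch: it assumes the sample corpus has exactly 2000 documents (it is roughly 2k), and `chunk_table` is a name introduced just for this illustration.

```r
nr    <- 2000  # assumed number of documents in the sample corpus
cores <- 12    # value passed to registerDoParallel()

# Chunk sizes mentioned in this post
n_values <- c(100, 120, 125, 150, 200)

chunk_table <- data.frame(
  n        = n_values,
  pieces   = ceiling(nr / n_values),          # number of .rds files / loop iterations
  per_core = ceiling(nr / n_values) / cores   # iterations per core if spread evenly
)
print(chunk_table)
```

Under that assumption, n = 200 gives 10 pieces, which is fewer than the 12 cores, while n = 100 gives 20. Whether that ratio has anything to do with the failures is exactly what I can't tell.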

What is the solution? Can this problem be reproduced with the linked sample corpus? I'm concerned because I can reproduce the error sometimes but not other times. Changing the chunk size makes it possible to see the error with removeNumbers, but not always.
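Rather than checking the first ten documents by eye, a small helper can report which chunks still contain digits after processing. This is a sketch in base R; `chunks_with_digits` is a hypothetical helper, not part of tm, and the `texts` vector is toy data standing in for the processed corpus.

```r
# Hypothetical helper: report which chunks still contain digits after processing.
# `texts` is a character vector of document contents, `n` the chunk size used.
chunks_with_digits <- function(texts, n) {
  # Same chunk assignment as the split() call in the code block above
  chunk_id <- rep(seq_len(ceiling(length(texts) / n)),
                  each = n, length.out = length(texts))
  has_digit <- grepl("[[:digit:]]", texts)
  sort(unique(chunk_id[has_digit]))
}

# Toy input: chunk size 2, so three chunks of two documents each.
texts <- c("n417", "disturbance", "grand theft auto", "n", "unit 12", "theft")
chunks_with_digits(texts, n = 2)
```

On the real result, `texts` could be built with something like `vapply(seq_len(length(corpus_done)), function(j) paste(content(corpus_done[[j]]), collapse = " "), character(1))`. If whole chunks show up as untouched rather than scattered documents, that would suggest entire loop iterations are being skipped.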

Update

Today I resumed my session and could not replicate the error. I created a Google Docs document and experimented with differing values for corpus size, number of cores, and chunk sizes. In each case, everything was a success. So, I tried running on full data and everything worked. However, for my sanity, I tried running again on full data and it failed. Now, I'm back to where I was yesterday.

It appears as though having run the function on a larger dataset has changed something ... I don't know what! Perhaps a session variable of some sort?

So, the new information is that this bug only happens after running the function on a very large dataset. Restarting my session did not solve the problem, but resuming the session after being away for several hours did.

New information:

It might be easier to reproduce the issue on a larger corpus, since this is what seems to trigger it. corpus <- do.call(c, replicate(250, corpus, simplify = F)) will create a 500k-document corpus based on the sample I provided. The function may work the first time you call it, but for me it seems to fail the second time.

This issue is hard because I cannot reproduce it reliably; if I could, I would likely be able to identify and fix it.

New information:

As there are several things happening in this function, it was hard to know where to focus my debugging efforts. I was looking both at the fact that I'm using multiple temporary RDS files to save memory and at the fact that I'm doing parallel processing. I wrote two alternative versions of the script: one that still uses the RDS files and breaks the corpus up but does not do parallel processing (I replaced %dopar% with %do% and removed the registerDoParallel line), and one that uses parallel processing but does not use temporary RDS files to split the small sample corpus up.

I was not able to produce the error with the single-core version of the script. I was only able to recreate the issue with the version that uses %dopar% (though the issue is intermittent; it does not always fail with %dopar%).

So, this issue only appears when using %dopar%. The fact I'm using temp RDS files does not appear to be part of the problem.
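Since the loop currently returns 1 from every iteration, a failed iteration is invisible. One thing that might help narrow this down is to return a small status record instead and to keep errors in the result list. This is only a sketch, not a fix: the two toy `pieces` are made up and held in memory rather than read from RDS files, `status` and `failed` are names introduced here, and it assumes foreach, doParallel and tm are installed.

```r
library(foreach)
library(doParallel)
library(tm)

registerDoParallel(cores = 2)

# Toy pieces standing in for the .rds chunks (made up for this illustration).
pieces <- list(
  VCorpus(VectorSource(c("n417", "disturbance"))),
  VCorpus(VectorSource(c("grand theft auto", "unit 12")))
)

# Each iteration returns a status record instead of 1.
# .errorhandling = "pass" keeps an error object in the result list
# rather than stopping the whole loop.
status <- foreach(i = seq_along(pieces), .packages = "tm",
                  .errorhandling = "pass") %dopar% {
  piece <- tm_map(pieces[[i]], removeNumbers)
  texts <- vapply(seq_len(length(piece)),
                  function(j) paste(content(piece[[j]]), collapse = " "),
                  character(1))
  list(chunk = i,
       docs = length(piece),
       with_digits = sum(grepl("[[:digit:]]", texts)))
}

stopImplicitCluster()

# Flag iterations that came back empty or as an error instead of a status record.
failed <- vapply(status, function(s) {
  is.null(s) || inherits(s, "condition") || inherits(s, "try-error")
}, logical(1))
```

In the real script the same record could be returned after the saveRDS() call, so memory use stays about the same. Any chunk where `failed` is TRUE or `with_digits` is above zero would show which iterations did not apply removeNumbers.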
