Inconsistent behaviour with tm_map transformation functions when using multiple cores

Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size, and object size matter?"

I have a corpus that I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package.

Sometimes the transformations do the task, but sometimes they don't. For example, tm::removeNumbers(). The very first document in the corpus has a content value of "n417". So if preprocessing is successful then this document will be transformed to just "n".
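To make the expected result concrete, here is a minimal base-R sketch of digit removal on the first few documents. It is only an approximation for illustration: `strip_digits` is a hypothetical helper, and tm::removeNumbers() itself is not called here.

```r
# Hypothetical helper approximating digit removal with a base-R regex.
# This is NOT tm::removeNumbers(); it only illustrates the expected outcome.
strip_digits <- function(x) gsub("[[:digit:]]+", "", x)

docs <- c("n417", "disturbance", "grand theft auto")
cleaned <- strip_digits(docs)
print(cleaned)
```

A successful run should leave the first document as "n" and the others unchanged.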

Sample corpus is shown below for reproduction. Here is the code block:

library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(foreach)
library(doParallel)
library(SnowballC)

  corpus <- (see below)
  n <- 100 # This is the size of each chunk in the loop

  # Split the corpus into pieces for looping to get around memory issues with transformation
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
  lenp <- length(pieces)

  rm(corpus) # Save memory

  # Save pieces to rds files since not enough RAM
  tmpfile <- tempfile()
  for (i in seq_len(lenp)) {
    saveRDS(pieces[[i]],
            paste0(tmpfile, i, ".rds"))
  }

  rm(pieces) # Save memory

  # Doparallel
  registerDoParallel(cores = 12)
  pieces <- foreach(i = seq_len(lenp)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # Regular transformations
    piece <- tm_map(piece, content_transformer(removePunctuation), preserve_intra_word_dashes = TRUE)
    piece <- tm_map(piece, content_transformer(function(x, ...)
      qdap::rm_stopwords(x, stopwords = tm::stopwords("english"), separate = FALSE)))
    piece <- tm_map(piece, removeNumbers)
    saveRDS(piece, paste0(tmpfile, i, ".rds"))
    return(1) # Hack to get dopar to forget the piece to save memory since now saved to rds
  }

  stopImplicitCluster()

  # Combine the pieces back into one corpus
  corpus <- list()
  corpus <- foreach(i = seq_len(lenp)) %do% {
    corpus[[i]] <- readRDS(paste0(tmpfile, i, ".rds"))
  }
  corpus_done <- do.call(function(...) c(..., recursive = TRUE), corpus)
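The chunking expression used above can be looked at in isolation. This is a small sketch on a plain character vector standing in for the corpus; the 7 toy documents and the chunk size of 3 are assumptions chosen only to keep the output short.

```r
# Toy stand-in for the corpus: 7 "documents", chunk size 3 (both hypothetical).
docs <- paste0("doc", 1:7)
n <- 3
nr <- length(docs)

# Same grouping expression as in the code block above.
groups <- rep(1:ceiling(nr / n), each = n, length.out = nr)
pieces <- split(docs, groups)

print(groups)           # chunk index assigned to each document
print(lengths(pieces))  # number of documents per chunk
```

The last chunk is simply shorter when the corpus size is not a multiple of n, so every document still lands in exactly one piece.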

And here is the link to sample data. I need to paste a sufficiently large sample of 2k documents to recreate and this won't let me paste that much, so please see the linked document for data.

corpus <- VCorpus(VectorSource([paste the chr vector from link above]))

If I run the code block above with n set to 200 and look at the results, I can see that numbers remain where tm::removeNumbers() should have removed them:

> lapply(1:10, function(i) print(corpus_done[[i]]$content)) %>% unlist
[1] "n417"
[1] "disturbance"
[1] "grand theft auto"

However, if I change the chunk size (the value of "n" variable) to 100:

> lapply(1:10, function(i) print(corpus_done[[i]]$content)) %>% unlist
[1] "n"
[1] "disturbance"
[1] "grand theft auto"

The numbers have been removed.

But this is inconsistent. I tried to narrow it down by testing on 150, then 125, and so on, and found that the switch between working and not working happened somewhere between chunk sizes 120 and 125. Then, after iterating the function over 120:125, it would sometimes work and sometimes not for the same chunk size.
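When iterating over chunk sizes like this, a programmatic check is less error-prone than eyeballing the first few documents. A minimal sketch, assuming the document texts have been pulled out into a character vector; `still_has_digits` is a hypothetical helper, and the two vectors are toy stand-ins for a failed and a successful run.

```r
# Hypothetical helper: TRUE for each text that still contains a digit.
still_has_digits <- function(texts) grepl("[[:digit:]]", texts)

# Toy stand-ins for the results of a failed and a successful run.
failed_run <- c("n417", "disturbance", "grand theft auto")
good_run   <- c("n", "disturbance", "grand theft auto")

n_failed <- sum(still_has_digits(failed_run))
n_good   <- sum(still_has_digits(good_run))
print(c(failed = n_failed, good = n_good))
```

With a tm corpus, the texts could be extracted with something like sapply(corpus_done, content), assuming each document holds a single string.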

I think there may be a relationship between this issue and three variables: the size of the corpus, the chunk size, and the number of cores in registerDoParallel(). I just don't know what it is.

What is the solution? Can this problem be reproduced with the linked sample corpus? I'm concerned because I can reproduce the error sometimes, but other times I cannot. Changing the chunk size sometimes exposes the error with removeNumbers, but not always.


Update

Today I resumed my session and could not replicate the error. I created a Google Docs document and experimented with differing values for corpus size, number of cores, and chunk sizes. In each case, everything was a success. So, I tried running on full data and everything worked. However, for my sanity, I tried running again on full data and it failed. Now, I'm back to where I was yesterday.

It appears as though having run the function on a larger dataset has changed something ... I don't know what! Perhaps a session variable of some sort?

So, the new information is that this bug only happens after running the function on a very large dataset. Restarting my session did not solve the problem, but resuming the session after being away for several hours did.


New information:

It might be easier to reproduce the issue on a larger corpus, since that is what seems to trigger it. corpus <- do.call(c, replicate(250, corpus, simplify = F)) will create a 500k-document corpus based on the sample I provided. The function may work the first time you call it, but for me it seems to fail the second time.

This issue is hard to debug because I cannot reproduce it reliably; if I could, I would likely be able to identify and fix it.


New information:

As there are several things happening in this function, it was hard to know where to focus debugging efforts. I was looking both at the fact that I'm using multiple temporary RDS files to save memory and at the fact that I'm doing parallel processing. I wrote two alternative versions of the script: one that still uses the RDS files and breaks the corpus up but does not do parallel processing (replaced %dopar% with just %do% and removed the registerDoParallel line), and one that uses parallel processing but does not use RDS temp files to split the small sample corpus up.

I was not able to produce the error with the single-core version of the script; only the version that uses %dopar% reproduced the issue (though it is intermittent and does not always fail with %dopar%).

So, this issue only appears when using %dopar%. The fact I'm using temp RDS files does not appear to be part of the problem.
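Since the failure shows up only in the parallel version, one way to localize it might be to have each worker report a check result instead of the constant 1. The following is only a sketch under stated assumptions: it uses the base parallel package with an explicit 2-worker cluster rather than foreach/doParallel, approximates digit removal with gsub() instead of tm_map(), and `process_chunk` and the toy chunks are hypothetical.

```r
library(parallel)

# Toy chunks standing in for the RDS pieces (hypothetical data).
chunks <- list(c("n417", "disturbance"), c("grand theft auto", "unit 12"))

# Hypothetical per-chunk job: strip digits (base-R approximation, not tm),
# then report how many texts in the chunk still contain a digit.
process_chunk <- function(texts) {
  cleaned <- gsub("[[:digit:]]+", "", texts)
  sum(grepl("[[:digit:]]", cleaned))
}

cl <- makeCluster(2)  # explicit cluster with 2 workers
leftover <- tryCatch(
  parLapply(cl, chunks, process_chunk),
  finally = stopCluster(cl)  # always release the workers
)
leftover <- unlist(leftover)
print(leftover)  # one count per chunk; nonzero marks a chunk that failed
```

In the loop from the question, the analogous change would be to return such a count from the %dopar% body, which would show which chunks were left untransformed on a failing run.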

Amplexicaul answered 25/8, 2017 at 6:21 Comment(16)
I don't understand what you call a corpus. You give us only a vector of characters. - Diffident
See this block in my post: docs <- (copy from link above); corpus <- VCorpus(VectorSource(docs)). This takes the vector and turns it into a corpus. So just wrap everything in the linked doc inside VCorpus(VectorSource([character vector goes here])). - Amplexicaul
@DougFir with all my respect to tm, I would just recommend that you either 1) create your own preprocessing function (learn some regex) or 2) jump to another package - text2vec or quanteda - which will be much faster and easier. - Incorporated
@DmitriySelivanov I just looked at the documentation for quanteda; it looks really interesting and I might give it a try. This looks like a doParallel issue when used with tm. If I were to use quanteda and it does process faster, I might not have to use parallel processing. - Amplexicaul
@DougFir give text2vec a chance as well (I'm the author :-) ). Check the tutorials on text2vec.org. - Incorporated
@DmitriySelivanov OK, thanks for the tip! I'll take a look there too. - Amplexicaul
I'd give this a try to help, but I am not really sure what end result you want from the input character vector in your link above. Is it simply to remove the numbers from the character data, but in a way that is parallelized? - Brachial
@KenBenoit the input to the code block above is meant to be a tm corpus. The link in my Gdoc is just a character vector: corpus <- VCorpus(VectorSource([paste the chr vector from link above])). Remove numbers and the other transformations. The issue in a nutshell is that everything seems to work fine when using a single core. The transformations using tm_map on the corpus work. However, when using multiple cores the tm_map transformations SOMETIMES work. It's hard to recreate since it appears random. I have noticed that the issue seems to show up after running the code block on a very l... - Amplexicaul
(cont) ... on a very large corpus. So if you join the example corpus provided onto itself to make it e.g. 500k or even 1M documents large, you might find it works the first time. However, if you try running the code block a second time, it might not work and it will look as if none of the transformations took place. This issue is particularly tricky since reproduction of it is inconsistent. It only sometimes does not work. However, this seems to only be an issue when using multiple cores; otherwise everything works fine (just slowly). - Amplexicaul
I asked because I think there are much easier, faster, and more scalable methods to achieve what you want than the approach you are taking. Please state simply what you seek to do: Remove numerals from the text, got it. Anything else? - Brachial
Hi @Ken. OK, I would like to remove numbers, punctuation and stopwords from my corpus. Actually, since posting this I have started using quanteda, which appears to use parallel processing (I watched the terminal when running), and everything worked beautifully with no issues. So in actual fact my immediate problem is solved thanks to quanteda. However, I would have loved to understand what was happening above, but I appreciate it's likely very tricky to debug since the issue appeared somewhat sporadically and inconsistently. Out of curiosity, which solution would you have suggested? - Amplexicaul
Agree with the above comments that tm_map is very difficult to debug, has many poorly documented idiosyncrasies, and that using an apply approach with your own custom function or another package is probably much better in both the short and long term. - Tedi
I'm voting to close this question because it is over 3 years old with no answers, and the author has been given clear alternatives in the comments and is clearly no longer interested in an answer. - Asdic
I have observed inconsistent results from R using Apple Accelerate BLAS, and also when using parallel at the same time as some older versions of OpenBLAS. Updating OpenBLAS fixed this issue for me. It's plausible that tm uses some BLAS functions, so this is a possible cause here as well. - Circularize
@Circularize it's the second-highest voted unanswered question. There are plenty of answered questions voted higher. - Riflery
I'm voting to close this question because it is over 3 years old with no answers, and the author has been given clear alternatives in the comments and is clearly no longer interested in an answer. Also, it's the second-most upvoted unanswered R question, so it gets more attention than it should. - Circularize

If you try to overwrite your memory with a program that uses parallel processing, you should first verify that it's worth it.

For instance, check if your disk is at 80%-100% writing speed; if that is the case, then your program could also just use a single core, because it is blocked by disk writing speed anyway.
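A rough way to compare disk time against compute time for one chunk is to time both with system.time(). This is a sketch only: the 10,000-string chunk is hypothetical, digit removal is approximated with gsub() rather than tm, and a single timing says nothing about sustained disk throughput under 12 parallel writers.

```r
# Hypothetical chunk: 10,000 short strings.
piece <- rep(c("n417", "disturbance", "grand theft auto"), length.out = 10000)
path <- tempfile(fileext = ".rds")

# Elapsed seconds for writing the chunk versus transforming it.
write_time <- system.time(saveRDS(piece, path))["elapsed"]
transform_time <- system.time(gsub("[[:digit:]]+", "", piece))["elapsed"]

print(c(write = unname(write_time), transform = unname(transform_time)))
unlink(path)  # clean up the temporary file
```

If the write time dominated on real chunks, adding cores would not be expected to help much.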

If this is not the case, I recommend using the debugger or adding console/GUI output to your program to verify that everything gets executed in the right order.

If this does not help, then I recommend that you verify that you did not mess up the program (for example one arrow points in the wrong direction).

Complementary answered 24/11, 2021 at 17:49 Comment(0)
