Using parLapply and clusterExport inside a function
I asked a related question here and the response worked well: using parallel's parLapply: unable to access variables within parallel code

The problem is that when I try to use the answer inside of a function it won't work, which I think has to do with the default environment used by clusterExport. I've read the vignette and looked at the help file, but I'm approaching this with a very limited knowledge base. I expected parLapply to behave similarly to lapply, but it doesn't appear to.

Here is my attempt:

par.test <- function(text.var, gc.rate=10){ 
    ntv <- length(text.var)
    require(parallel)
    pos <-  function(i) {
        paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
    }
    cl <- makeCluster(mc <- getOption("cl.cores", 4))
    clusterExport(cl=cl, varlist=c("text.var", "ntv", "gc.rate", "pos"))
    parLapply(cl, seq_len(ntv), function(i) {
            x <- pos(text.var[i])
            if (i%%gc.rate==0) gc()
            return(x)
        }
    )
}

par.test(rep("I like cake and ice cream so much!", 20))

# gives this error message:
> par.test(rep("I like cake and ice cream so much!", 20))
Error in get(name, envir = envir) : object 'text.var' not found
Urdar asked 19/8, 2012 at 0:56 Comment(3)
Looks like you need to use the envir argument to clusterExport, as varlist is exported from the .GlobalEnv by default. Does envir=environment() work? – Bayne
@GSee I've monkeyed around reading and searching for 3 hours. I'm not really that good with environment stuff, but that works perfectly. Can you add it as an answer and I'll mark it as correct? – Urdar
I made a blog post on my learning with this for future searchers: trinkerrstuff.wordpress.com/2012/08/19/… – Urdar
By default clusterExport looks in the .GlobalEnv for objects to export that are named in varlist. If your objects are not in the .GlobalEnv, you must tell clusterExport in which environment it can find those objects.

You can change your clusterExport call to the following (which I didn't test, but which you said in the comments works):

clusterExport(cl=cl, varlist=c("text.var", "ntv", "gc.rate", "pos"), envir=environment())

This way, it will look in the function's environment for the objects to export.
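For reference, here is a sketch of the asker's full function with just that one change applied (plus an on.exit() cleanup of the cluster, which the original omitted):

```r
library(parallel)

par.test <- function(text.var, gc.rate = 10) {
    ntv <- length(text.var)
    pos <- function(i) {
        paste(sapply(strsplit(tolower(i), " "), nchar), collapse = " | ")
    }
    cl <- makeCluster(getOption("cl.cores", 4))
    on.exit(stopCluster(cl))  # shut the workers down even on error
    # envir = environment() tells clusterExport to look up the varlist
    # names in this function's frame instead of the .GlobalEnv
    clusterExport(cl = cl,
                  varlist = c("text.var", "ntv", "gc.rate", "pos"),
                  envir = environment())
    parLapply(cl, seq_len(ntv), function(i) {
        x <- pos(text.var[i])
        if (i %% gc.rate == 0) gc()
        x
    })
}
```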

Bayne answered 19/8, 2012 at 1:9 Comment(0)
Another solution is to include the additional variables as arguments to your function; parLapply exports them, too. If text.var is the big data, it pays to make it the argument that parLapply iterates over, rather than an index, because then only the portion of text.var relevant to each worker is exported to that worker, rather than the whole object to every worker.

par.test <- function(text.var, gc.rate=10){ 
    require(parallel)
    pos <-  function(i) {
        paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
    }
    cl <- makeCluster(mc <- getOption("cl.cores", 4))
    on.exit(stopCluster(cl))
    parLapply(cl, text.var, function(text.vari, gc.rate, pos) {
        x <- pos(text.vari)
        # the original had `if (i %% gc.rate == 0) gc()`, but no index
        # `i` exists in this form; explicit gc() is rarely needed anyway
        x
    }, gc.rate, pos)
}

This is also conceptually pleasing. (It's rarely necessary to invoke the garbage collector explicitly.)
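To see the argument-passing pattern in isolation, here is a minimal sketch (my own toy example, not from the code above): any objects handed to parLapply after the function are shipped to the workers as arguments, so no clusterExport() call is needed.

```r
library(parallel)

cl <- makeCluster(2)
# `scale` travels to each worker as an ordinary argument,
# not via clusterExport()
f <- function(x, scale) x * scale
res <- parLapply(cl, 1:4, f, scale = 10)
stopCluster(cl)
unlist(res)  # 10 20 30 40
```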

Memory management when source()ing a script causes additional problems. Compare

> stop("oops")
Error: oops
> traceback()
1: stop("oops")

with the same call in a script

> source("foo.R")
Error in eval(ei, envir) : oops
> traceback()
5: stop("oops") at foo.R#1
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("foo.R")

Remember that R's serialize() function, used internally by parLapply() to move data to workers, serializes everything up to the .GlobalEnv. So data objects created in a sourced script are serialized to the workers, whereas if the same code were run interactively they would not be. This may account for @SeldeomSeenSlim's problems when running a script. The solution is probably to separate 'data' from 'algorithm' more clearly, e.g., using the file system or a database or ... to store objects.
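One way to separate 'data' from 'algorithm' along those lines (a sketch under my own assumptions; the temp-file approach is not from the answer): write the large object to disk, pass only the path to the workers, and let each worker read the data itself, so nothing big sits in an environment that serialize() would ship.

```r
library(parallel)

big <- data.frame(x = runif(1e4))  # stand-in for a large object
path <- tempfile(fileext = ".rds")
saveRDS(big, path)
rm(big)                            # nothing large left to serialize

cl <- makeCluster(2)
res <- parLapply(cl, 1:2, function(i, path) {
    d <- readRDS(path)             # each worker loads the data itself
    sum(d$x) + i
}, path)
stopCluster(cl)
```

This keeps the per-task payload down to a short file path, at the cost of one read per worker.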

Grane answered 19/8, 2012 at 5:33 Comment(8)
Very nice answer Martin, this is even closer to lapply's usage. +1 This is likely the answer I'll use. – Urdar
The function I'm working on uses openNLP, and for some reason using gc() is the only way to use the function in an lapply over many cells. I wrote to the authors about this about a year ago; I was informed of a fix but couldn't get it to work at the time. Thread on that: talkstats.com/showthread.php/… – Urdar
Coming in late :-) I discovered (Windows 7, R 3.0.2, i7 processor) that after a clusterApply call, even after calling stopCluster and exiting the parent function, no garbage collection had taken place. Thanks for this sample code. – Conation
Agree with @CarlWitthoft (though I haven't tested it); within the parLapply there's no i. – Crematory
I have data sets that are 3-8 GB each, on an 8-core, 24 GB Windows 7 machine. clusterExport is too slow. Martin's solution is much better in terms of speed, but I run out of memory even if I do garbage collection at every step by placing gc() inside my function. Strangely, apply works the best. Any thoughts? – Timeous
I want to second Davit Sargasyan's issue here. I've got a complex sourced script that initiates a cluster and runs parLapply a few times throughout. The script runs as expected when run in a non-sourced manner; when I source it, some kind of memory leak forms and the script fails. Run non-sourced, it uses maybe 2-3 GB of memory total; sourced, it fills up all 32 GB of RAM. I'm at least 98% convinced it's an environment issue, but I can't track it down. – Childbed
@Childbed I added a paragraph to my response; I'm not sure how accurate or helpful it is, a 'best guess' on my part. – Grane
@MartinMorgan I've asked a question regarding this here: #54430980 . It's enough of an issue that I think it warrants its own thread. – Childbed

© 2022 - 2024 — McMap. All rights reserved.