I wanted to be able to package DataFrames in a Scala jar file and access them in R. The end goal is to create a way to access specific and often-used database tables in Python, R, and Scala without writing a different library for each.
To do this, I made a jar file in Scala with functions that use the SparkSQL library to query the database and return the DataFrames I want. I wanted to call these functions from R without creating another JVM, since SparkR already runs Spark in a JVM. However, that JVM is not exposed in the SparkR API. To make it accessible and to make Java methods callable, I modified "backend.R", "generics.R", "DataFrame.R", and "NAMESPACE" in the SparkR package and rebuilt the package:
In "backend.R" I made "callJMethod" and "createJObject" formal methods:
setMethod("callJMethod", signature(objId="jobj", methodName="character"), function(objId, methodName, ...) {
stopifnot(class(objId) == "jobj")
if (!isValidJobj(objId)) {
stop("Invalid jobj ", objId$id,
". If SparkR was restarted, Spark operations need to be re-executed.")
}
invokeJava(isStatic = FALSE, objId$id, methodName, ...)
})
setMethod("newJObject", signature(className="character"), function(className, ...) {
invokeJava(isStatic = TRUE, className, methodName = "<init>", ...)
})
I modified "generics.R" to also contain these functions:
#' @rdname callJMethod
#' @export
setGeneric("callJMethod", function(objId, methodName, ...) { standardGeneric("callJMethod") })

#' @rdname newJObject
#' @export
setGeneric("newJObject", function(className, ...) { standardGeneric("newJObject") })
Then I added exports for these functions to the NAMESPACE file:
export("cacheTable",
"clearCache",
"createDataFrame",
"createExternalTable",
"dropTempTable",
"jsonFile",
"loadDF",
"parquetFile",
"read.df",
"sql",
"table",
"tableNames",
"tables",
"uncacheTable",
"callJMethod",
"newJObject")
This allowed me to call the Scala functions I wrote without starting a new JVM.
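For reference, the call site in R looks roughly like this. The jar path, class name, method name, and column names are placeholders for mine; the sketch assumes the jar exposes a plain Scala class with a no-argument constructor whose method takes a SQLContext and returns a DataFrame:

library(SparkR)

# Start Spark with my jar on the classpath (jar name is a placeholder)
sc <- sparkR.init(sparkJars = "my-tables-assembly-0.1.jar")
sqlContext <- sparkRSQL.init(sc)

# Instantiate my Scala class and call one of its DataFrame-returning methods
loader <- newJObject("com.example.TableLoader")
customersJobj <- callJMethod(loader, "loadCustomers", sqlContext)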
The Scala methods I wrote return DataFrames, which arrive in R as "jobj" references, but a SparkR DataFrame is an S4 object wrapping an environment plus a jobj. To turn these jobj DataFrames into SparkR DataFrames, I used the dataFrame() function in "DataFrame.R", which I also made accessible following the steps above.
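Concretely, continuing the placeholder names from the sketch above, the conversion is just:

# Wrap the raw jobj in SparkR's S4 DataFrame class
customersDF <- dataFrame(customersJobj)

# From here the usual SparkR API works ("name" is a placeholder column)
printSchema(customersDF)
head(select(customersDF, "name"))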
I was then able to access the DataFrame I "built" in Scala from R and use all of SparkR's functions on it. Is there a better way to build such a cross-language library, or is there any reason the Spark JVM should not be exposed publicly in the SparkR API?