I wanted to be able to package DataFrames in a Scala jar file and access them in R. The end goal is to create a way to access specific and often-used database tables in Python, R, and Scala without writing a different library for each.
To do this, I made a jar file in Scala with functions that use the SparkSQL library to query the database and return the DataFrames I want. I wanted to call these functions from R without creating another JVM, since SparkR already runs Spark in a JVM. However, that JVM is not exposed in the SparkR API. To make it accessible and to make Java methods callable, I modified "backend.R", "generics.R", "DataFrame.R", and "NAMESPACE" in the SparkR package and rebuilt the package:
In "backend.R" I made "callJMethod" and "createJObject" formal methods:
setMethod("callJMethod", signature(objId="jobj", methodName="character"), function(objId, methodName, ...) {
stopifnot(class(objId) == "jobj")
if (!isValidJobj(objId)) {
stop("Invalid jobj ", objId$id,
". If SparkR was restarted, Spark operations need to be re-executed.")
}
invokeJava(isStatic = FALSE, objId$id, methodName, ...)
})
setMethod("newJObject", signature(className="character"), function(className, ...) {
invokeJava(isStatic = TRUE, className, methodName = "<init>", ...)
})
I modified "generics.R" to also contain these functions:
#' @rdname callJMethod
#' @export
setGeneric("callJMethod", function(objId, methodName, ...) { standardGeneric("callJMethod") })

#' @rdname newJObject
#' @export
setGeneric("newJObject", function(className, ...) { standardGeneric("newJObject") })
Then I added exports for these functions to the NAMESPACE file:
export("cacheTable",
"clearCache",
"createDataFrame",
"createExternalTable",
"dropTempTable",
"jsonFile",
"loadDF",
"parquetFile",
"read.df",
"sql",
"table",
"tableNames",
"tables",
"uncacheTable",
"callJMethod",
"newJObject")
This allowed me to call the Scala functions I wrote without starting a new JVM.
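For reference, the call site in R looks roughly like this. The jar path, class name, method name, and column names are placeholders for mine; the sketch assumes the jar exposes a plain Scala class with a no-argument constructor whose method takes a SQLContext and returns a DataFrame:

library(SparkR)

# Start Spark with my jar on the classpath (jar name is a placeholder)
sc <- sparkR.init(sparkJars = "my-tables-assembly-0.1.jar")
sqlContext <- sparkRSQL.init(sc)

# Instantiate my Scala class and call one of its DataFrame-returning methods
loader <- newJObject("com.example.TableLoader")
customersJobj <- callJMethod(loader, "loadCustomers", sqlContext)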
The Scala methods I wrote return DataFrames, which arrive in R as "jobj" references, but a SparkR DataFrame is an S4 object wrapping an environment plus a jobj. To turn these jobj DataFrames into SparkR DataFrames, I used the dataFrame() function in "DataFrame.R", which I also made accessible following the steps above.
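Concretely, continuing the placeholder names from the sketch above, the conversion is just:

# Wrap the raw jobj in SparkR's S4 DataFrame class
customersDF <- dataFrame(customersJobj)

# From here the usual SparkR API works ("name" is a placeholder column)
printSchema(customersDF)
head(select(customersDF, "name"))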
I was then able to access the DataFrame I "built" in Scala from R and use all of SparkR's functions on it. Is there a better way to build such a cross-language library, or is there any reason the Spark JVM should not be exposed publicly in the SparkR API?