Matrix Math With Sparklyr
Asked Answered
S

1

14

Looking to convert some R code to Sparklyr, functions such as lmtest::coeftest() and sandwich::sandwich(). Trying to get started with Sparklyr extensions but pretty new to the Spark API and having issues :(

Running Spark 2.1.1 and sparklyr 0.5.5-9002

Feel the first step would be to make a DenseMatrix object using the linalg library:

library(sparklyr)
library(dplyr)
sc <- spark_connect("local")

rows <- as.integer(2)
cols <- as.integer(2)
array <- c(1,2,3,4)

mat <- invoke_new(sc, "org.apache.spark.mllib.linalg.DenseMatrix", 
                  rows, cols, array)

This results in the error:

Error: java.lang.Exception: No matched constructor found for class org.apache.spark.mllib.linalg.DenseMatrix

Okay, so I got a java lang exception, I'm pretty sure the rows and cols args were fine in the constructor, but not sure sure about the last one, which is supposed to be a java Array. So I tried a few permutations of:

array <- invoke_new(sc, "java.util.Arrays", c(1,2,3,4))

but end up with a similar error message...

Error: java.lang.Exception: No matched constructor found for class java.util.Arrays

I feel like I'm missing something pretty basic. Anyone know what's up?

Sabelle answered 17/6, 2017 at 6:52 Comment(0)
B
13

R counterpart of the Java Array is list:

invoke_new(
  sc, "org.apache.spark.ml.linalg.DenseMatrix",
  2L, 2L, list(1, 2, 3, 4))

## <jobj[17]>
##   class org.apache.spark.ml.linalg.DenseMatrix
##   1.0  3.0  
## 2.0  4.0  

or

invoke_static(
  sc, "org.apache.spark.ml.linalg.Matrices", "dense",
  2L, 2L, list(1, 2, 3, 4))

## <jobj[19]>
##   class org.apache.spark.ml.linalg.DenseMatrix
##   1.0  3.0  
## 2.0  4.0 

Please note I am using o.a.s.ml.linalg instead of o.a.s.mllib.linalg. While mllib would work in isolation, as of Spark 2.x o.a.s.ml algorithms no longer accept local o.a.s.mllib.

At the same time R vector types (numeric, integer, character) are used as scalars.

Note:

Personally I believe this is not the way to go. Spark linalg packages are quite limited, and internally depend on the libraries, which won't be usable via sparklyr. Moreover sparklyr API is not suitable for complex logic.

In practice it makes more sense to implement Java or Scala extension, with a thin, R friendly wrapper.

Boarhound answered 25/6, 2017 at 13:15 Comment(5)
regarding your note, do you know of any resources for making these extensions? Also any guides showing how to call custom extensions from R?Sabelle
Sorry, I am not. There is of course the official sparklyr guide, but I don't think it is that useful. Overall I think this more about API design. SparkR API is a nice example - with heavy logic implemented in Scala and thin, R-friendly adapters.Boarhound
I very much appreciate your comments. Looks like we'll be doing some Scala programming. I know that we'll need the rank linear algebra method and linalg does not have it.Sabelle
Yeah, if you need proper linear algebra libraries I would go directly to Breeze. Design should be familiar enough if you worked with NumPy or Matlab, and it is already a dependency for Spark.Boarhound
Update many years later. We ended up going the suggested route and made a Spark package written in Scala. Then we just made an R package that calls that Spark library.Sabelle

© 2022 - 2024 — McMap. All rights reserved.