Using R in Apache Spark

There are a few options for accessing R libraries from Spark:

- using SparkR directly
- using language bindings such as rpy2 or rscala
- running a standalone service such as OpenCPU

It looks like SparkR is quite limited, OpenCPU requires keeping an additional service running, and bindings can have stability issues. Is there something specific to the Spark architecture that makes using any of these solutions hard?

Do you have any experience with integrating R and Spark that you can share?

Viperish answered 6/3, 2016 at 9:11 Comment(0)

The main language for the project seems like an important factor.

If pyspark is a good way for you to use Spark (meaning that you are accessing Spark from Python), accessing R through rpy2 should not be much different from using any other Python library with a C extension.
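As a rough illustration of that pattern, here is a minimal sketch of calling an R function from a pyspark job via rpy2. It assumes the workers have both R and the rpy2 package installed; the data and the per-partition mean computation are purely illustrative.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rpy2-sketch")

def r_mean_per_partition(partition):
    # Import rpy2 inside the function so the embedded R runtime is
    # started in each worker process rather than on the driver.
    import rpy2.robjects as robjects
    values = list(partition)
    if not values:
        return []
    # Copy the partition into R as a numeric vector and call R's mean().
    r_vector = robjects.FloatVector(values)
    return [robjects.r["mean"](r_vector)[0]]

rdd = sc.parallelize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2)
# One R-computed mean per partition: [2.0, 5.0]
print(rdd.mapPartitions(r_mean_per_partition).collect())
```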

There are reports of users combining pyspark and rpy2 in this way (although with occasional questions, such as "How can I partition pyspark RDDs holding R functions" or "Can I connect an external (R) process to each pyspark worker during setup").
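On the second of those questions: one common way to hand data from pyspark to an external R process is RDD.pipe(), which streams each partition through a shell command line by line. A hedged sketch, where summary.R is a hypothetical script that reads lines from stdin and prints one result per line to stdout:

```python
from pyspark import SparkContext

sc = SparkContext(appName="pipe-to-r-sketch")

# Each element is written as one line to the external process's stdin;
# every line the process prints becomes one element of the output RDD.
rdd = sc.parallelize(["1.5", "2.5", "3.5"])
print(rdd.pipe("Rscript summary.R").collect())
```

Note that pipe() launches one external process per partition rather than per element, so partition sizing matters for the startup overhead.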

If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations would be the way to go.

If your main language is Scala, rscala should be your first try.

While the combo pyspark + rpy2 seems the most "established" (as in "uses the oldest and probably most-tried codebase"), that does not necessarily make it the best solution (and young packages can evolve quickly). I would first assess which language is preferred for the project and try the options from there.

Lamee answered 6/3, 2016 at 16:6 Comment(1)
@CafeFeed I do not have experience with it. – Lamee
