How to use Jupyter + SparkR and custom R install

I am using a Dockerized image with Jupyter Notebook and a SparkR kernel. When I create a SparkR notebook, it uses an installation of Microsoft R (3.3.2) instead of the vanilla CRAN R installation (3.2.3).

The Docker image I'm using installs some custom R libraries and Python packages, but I don't explicitly install Microsoft R. Regardless of whether I can remove Microsoft R or run it side by side, how can I get my SparkR kernel to use a custom installation of R?

Expulsion answered 18/9, 2017 at 18:33 Comment(0)

Docker-related issues aside, the settings for Jupyter kernels are stored in files named kernel.json, residing in specific directories (one per kernel) that can be listed with the command jupyter kernelspec list; for example, here is the situation on my (Linux) machine:

$ jupyter kernelspec list
Available kernels:
  python2       /usr/lib/python2.7/site-packages/ipykernel/resources
  caffe         /usr/local/share/jupyter/kernels/caffe
  ir            /usr/local/share/jupyter/kernels/ir
  pyspark       /usr/local/share/jupyter/kernels/pyspark
  pyspark2      /usr/local/share/jupyter/kernels/pyspark2
  tensorflow    /usr/local/share/jupyter/kernels/tensorflow

Again, as an example, here are the contents of the kernel.json for my R kernel (ir):

{
  "argv": ["/usr/lib64/R/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
  "display_name": "R 3.3.2",
  "language": "R"
}

And here is the respective file for my pyspark2 kernel:

{
 "display_name": "PySpark (Spark 2.0)",
 "language": "python",
 "argv": [
  "/opt/intel/intelpython27/bin/python2",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
  "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
  "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
  "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
 }
}

As you can see, in both cases the first element of argv is the executable for the respective language - in my case, GNU R for my ir kernel and Intel Python 2.7 for my pyspark2 kernel. Changing this first element so that it points to your custom (CRAN) R executable should resolve your issue.
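
If your goal is specifically a SparkR kernel backed by a custom R, the two patterns above can be combined: point the first element of argv at the custom R executable and use env to tell it where Spark lives. Below is a minimal sketch of such a kernel.json; the paths /usr/lib/R/bin/R and /opt/spark are placeholders for illustration, and it assumes the IRkernel package is installed in that R and that SparkR is loadable from $SPARK_HOME/R/lib (kernel.json is plain JSON, so it cannot carry comments):

{
  "display_name": "SparkR (custom R)",
  "language": "R",
  "argv": ["/usr/lib/R/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/opt/spark",
    "R_LIBS_USER": "/opt/spark/R/lib"
  }
}

After editing (or adding) the kernelspec, restart Jupyter; inside the notebook you should then be able to load SparkR with library(SparkR) and start a session with sparkR.session().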

Holland answered 21/9, 2017 at 14:30 Comment(3)
Brilliant!! Thanks! – Schleiermacher
Yup, looks like I'm the only upvote 😁. But I like to focus on quality and not quantity. – Schleiermacher
Nice job explaining the details. – Dayton

To use a custom R environment, I believe you need to set the following Spark application properties when you start Spark:

    "spark.r.command": "/custom/path/bin/R",
    "spark.r.driver.command": "/custom/path/bin/Rscript",
    "spark.r.shell.command" : "/custom/path/bin/R"

This is more completely documented here: https://spark.apache.org/docs/latest/configuration.html#sparkr
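
One place to set them is $SPARK_HOME/conf/spark-defaults.conf, so every SparkR application picks them up; this is a minimal sketch, with /custom/path standing in for wherever the desired R installation actually lives:

# R executables used by SparkR (the /custom/path below is a placeholder)
spark.r.command           /custom/path/bin/R
spark.r.driver.command    /custom/path/bin/Rscript
spark.r.shell.command     /custom/path/bin/R

The same properties can also be passed at launch time with --conf, e.g. sparkR --conf spark.r.shell.command=/custom/path/bin/R. According to the documentation linked above, spark.r.shell.command takes precedence over the SPARKR_DRIVER_R environment variable.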

Shriek answered 6/3, 2018 at 14:34 Comment(0)
