Several people (1, 2, 3) have discussed using a Scala UDF in a PySpark application, usually for performance reasons. I am interested in the opposite - using a python UDF in a Scala Spark project.
I am particularly interested in building a model with sklearn (and MLflow) and then efficiently applying it to records in a Spark streaming job. I know I could also host the python model behind a REST API and call that API from mapPartitions in the Spark streaming application, but managing concurrency for that and standing up an API for the hosted model isn't something I'm super excited about.
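For context, a minimal sketch of that REST-call variant might look like the following. This assumes a hypothetical endpoint at http://model-service:8080/predict serving the sklearn model, an input Dataset[String] of JSON feature rows named featureJsonDs, and spark.implicits._ in scope; none of these names come from the question itself.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val predictions = featureJsonDs.mapPartitions { rows =>
  // One HTTP client per partition, reused for every record in it.
  val client = HttpClient.newHttpClient()
  rows.map { json =>
    val request = HttpRequest
      .newBuilder(URI.create("http://model-service:8080/predict")) // hypothetical endpoint
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(json))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }
}
```

Even in this form you would likely want batching or asynchronous requests, which is exactly the concurrency management referred to above.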
Is this possible without too much custom development with something like Py4J? Is this just a bad idea?
Thanks!
private[spark], but you can still access it by putting a wrapper object inside org.apache.spark. There is a lot of fanciness inside that function that you may not need. If you don't need UDFs, a simple mapPartitions (spinning up a python process per partition) may suffice to call your code. – Periostitis
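If the per-partition subprocess route is enough, a minimal sketch might look like this. It assumes the input is a Dataset[String] of JSON records named featureJsonDs, that a hypothetical score.py reads one record per line on stdin and writes one prediction per line on stdout (flushing after each line), and that spark.implicits._ is in scope.

```scala
import java.io.{BufferedReader, BufferedWriter, InputStreamReader, OutputStreamWriter}

val scored = featureJsonDs.mapPartitions { records =>
  // One python process per partition; score.py (hypothetical) is assumed to
  // emit one prediction line per input line and flush after each write.
  val proc   = new ProcessBuilder("python", "score.py").start()
  val stdin  = new BufferedWriter(new OutputStreamWriter(proc.getOutputStream))
  val stdout = new BufferedReader(new InputStreamReader(proc.getInputStream))

  records.map { record =>
    stdin.write(record)
    stdin.newLine()
    stdin.flush()
    stdout.readLine() // the prediction for this record
  } ++ {
    // Evaluated lazily after the partition is exhausted: shut the process down.
    stdin.close()
    proc.waitFor()
    Iterator.empty
  }
}
```

Note that Spark's built-in RDD.pipe does essentially this (one external process per partition with line-oriented stdin/stdout), so featureJsonDs.rdd.pipe("python score.py") may be simpler if plain text lines are all you need.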