Options for deploying R models in production

There don't seem to be many options for deploying predictive models in production, which is surprising given the explosion in Big Data.

I understand that the open-source PMML standard can be used to export models as an XML specification, which can then be used for in-database scoring/prediction. However, it seems that to make this work you need to use the PMML plugin by Zementis, which means the solution is not truly open source. Is there an easier, open way to map PMML to SQL for scoring?

Another option would be to use JSON instead of XML to output model predictions. But in this case, where would the R model sit? I'm assuming it would always need to be mapped to SQL, unless the R model could sit on the same server as the data and run against the incoming data using an R script?

Any other options out there?

Gnosticism answered 10/3, 2014 at 19:15 Comment(3)
Because "big data" is just data warehousing 2.0 - people don't really do anything fancy like classification on really large data. Then you wouldn't be using R, because it's too slow.Lighting
Look at yhathq.com.Flatulent
gist.github.com/shanebutler/5456942 for r gbm to SQL gist.github.com/shanebutler/96f0e78a02c84cdcf558 for r random forest to SQLSurefire
R
16

The answer really depends on what your production environment is.

If your "big data" are on Hadoop, you can try this relatively new open source PMML "scoring engine" called Pattern.

Otherwise you have no choice (short of writing custom model-specific code) but to run R on your server. You would use save to store your fitted models in .RData files, then load them and run the corresponding predict on the server. (That is bound to be slow, but you can always try to throw more hardware at it.)
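
A minimal sketch of that save/load/predict workflow (the model formula, column names, and file paths are illustrative assumptions):

```r
# --- At training time ---
churn_model <- glm(churned ~ tenure + monthly_charges,
                   data = training_data, family = binomial)
save(churn_model, file = "churn_model.RData")

# --- On the scoring server ---
load("churn_model.RData")                        # restores the object `churn_model`
new_customers <- read.csv("incoming_batch.csv")  # must contain the predictor columns
scores <- predict(churn_model, newdata = new_customers, type = "response")
write.csv(data.frame(id = new_customers$id, score = scores),
          "scores.csv", row.names = FALSE)
```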

How you do that really depends on your platform. Usually there is a way to add "custom" functions written in R; the usual term is UDF (user-defined function). In Hadoop you can add such functions to Pig (e.g. https://github.com/cd-wood/pigaddons), or you can use RHadoop to write simple map-reduce code that loads the model and calls predict in R. If your data are in Hive, you can use Hive TRANSFORM to call an external R script.
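
For instance, the R end of a Hive TRANSFORM (or Hadoop Streaming) step is simply a script that reads tab-separated rows from stdin and writes scored rows to stdout. A hedged sketch, reusing the hypothetical churn_model from above:

```r
#!/usr/bin/env Rscript
# Streaming scorer: Hive pipes tab-separated rows in and reads scored rows out.
# The model file, column order, and field types are assumptions for illustration.
load("churn_model.RData")  # restores `churn_model`; ship this file to the nodes

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  row <- data.frame(tenure          = as.numeric(fields[2]),
                    monthly_charges = as.numeric(fields[3]))
  score <- predict(churn_model, newdata = row, type = "response")
  cat(fields[1], score, sep = "\t")  # emit id and score
  cat("\n")
}
close(con)
```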

There are also vendor-specific ways to add functions written in R to various SQL databases. Again look for UDF in the documentation. For instance, PostgreSQL has PL/R.

Riverhead answered 11/3, 2014 at 1:9 Comment(0)
T
24

The following is a list of the alternatives that I have found so far for deploying an R model in production. Note that the workflows for these products vary significantly from one another, but they are all in some way oriented toward making it easier to expose a trained R model as a service:

Topmast answered 23/11, 2014 at 19:59 Comment(1)
You have to be aware that AzureML does not let you analyze data unless it is in a table of some sort. It is a very frustrating tool to use, and very limited unless you have beautiful CSV files and only need very basic packages. Installing many useful packages is very hard because its R is version 3.1 and only 400 packages are pre-installed. Other packages have to be installed by downgrading your local R, installing compatible packages, and exporting them as doubly zipped files with special names, and even then it only works sometimes. If you can avoid AzureML, I would. – Bordelaise
O
8

You can create RESTful APIs for your R scripts using plumber (https://github.com/trestletech/plumber).

I wrote a blog post about it (http://www.knowru.com/blog/how-create-restful-api-for-machine-learning-credit-model-in-r/) using credit-model deployment as an example.
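
For a flavour of what that looks like, here is a minimal plumber sketch (the model file, endpoint path, and input fields are illustrative assumptions, not from the post):

```r
# plumber.R -- minimal scoring API sketch
library(plumber)

model <- readRDS("churn_model.rds")  # fitted model saved earlier with saveRDS()

#* Score one observation
#* @param tenure
#* @param monthly_charges
#* @post /predict
function(tenure, monthly_charges) {
  newdata <- data.frame(tenure          = as.numeric(tenure),
                        monthly_charges = as.numeric(monthly_charges))
  list(score = unname(predict(model, newdata = newdata, type = "response")))
}
```

Serving it with plumber::plumb("plumber.R")$run(port = 8000) then exposes the model at POST /predict.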

In general, I do not recommend PMML because the packages you used might not support translation to PMML.

Olivares answered 2/3, 2017 at 21:25 Comment(0)
P
2

A common practice is scoring a new/updated dataset in R and moving only the results (IDs, scores, probabilities, other necessary fields) into the production environment/data warehouse.
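
A hedged sketch of that pattern, assuming a DBI/odbc connection and illustrative model and table names:

```r
# Score in R, push only the results table to the warehouse.
library(DBI)

con   <- dbConnect(odbc::odbc(), dsn = "warehouse")  # hypothetical DSN
model <- readRDS("churn_model.rds")

customers <- dbGetQuery(con, "SELECT id, tenure, monthly_charges FROM customers")

results <- data.frame(
  id        = customers$id,
  score     = predict(model, newdata = customers, type = "response"),
  scored_at = Sys.time()
)

# Only IDs and scores leave R; the model itself never touches the warehouse.
dbWriteTable(con, "customer_scores", results, overwrite = TRUE)
dbDisconnect(con)
```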

I know this has its limitations (infrequent refreshes, reliance upon IT, dataset size/computing power restrictions) and may not be the cutting-edge answer many (of your bosses) are looking for; but for many use cases this works well (and is cost friendly!).

Phenomenon answered 16/11, 2016 at 14:57 Comment(0)
G
2

It’s been a few years since the question was originally asked.

For rapid prototyping I would argue the easiest approach currently is to use the Jupyter Kernel Gateway. This allows you to add REST endpoints to any cell in your Jupyter notebook. This works for both R and Python, depending on the kernel you’re using.

This means you can easily call any R or Python code through a web interface. When used in conjunction with Docker it lends itself to a microservices approach to deploying and scaling your application.
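
As a rough sketch of how that looks with an R kernel: in the gateway's notebook-http mode, a cell whose first line is a comment such as # POST /score becomes an HTTP endpoint, the incoming request is injected as a JSON string named REQUEST, and whatever the cell prints is returned as the response body. The endpoint path and fields below are assumptions:

```r
# POST /score
# Notebook cell exposed via Jupyter Kernel Gateway (notebook-http mode).
req  <- jsonlite::fromJSON(REQUEST)   # REQUEST is injected by the gateway
body <- req$body
if (is.character(body)) body <- jsonlite::fromJSON(body)  # body may arrive pre-parsed or as a string
newdata <- data.frame(tenure          = as.numeric(body$tenure),
                      monthly_charges = as.numeric(body$monthly_charges))
score <- predict(model, newdata = newdata, type = "response")  # `model` loaded in an earlier cell
cat(jsonlite::toJSON(list(score = unname(score)), auto_unbox = TRUE))
```

The notebook is then served with something along the lines of jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.notebook_http --KernelGatewayApp.seed_uri=score.ipynb.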

Here’s an article that takes you from start to finish to quickly set up your Jupyter Notebook with the Jupyter Kernel Gateway.

Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users

For moving solutions to production, the leading approach in 2019 is to use Kubeflow. Kubeflow was created and is maintained by Google, and makes "scaling machine learning (ML) models and deploying them to production as simple as possible."

From their website:

You adapt the configuration to choose the platforms and services that you want to use for each stage of the ML workflow: data preparation, model training, prediction serving, and service management. You can choose to deploy your workloads locally or to a cloud environment.

Gnosticism answered 16/10, 2018 at 19:39 Comment(0)
H
1

Elise from Yhat here.

Like @Ramnath and @leo9r mentioned, our software allows you to put any R (or Python, for that matter) model directly into production via REST API endpoints.

We handle real-time or batch, as well as all of the model testing and versioning + systems management associated with the process.

This case study we co-authored with VIA SMS might be useful if you're thinking about how to get R models into production (their data sci team was recoding into PHP prior to using Yhat).

Cheers!

Healion answered 6/10, 2016 at 17:8 Comment(0)
