How to use XGboost in PySpark Pipeline
I want to port my code to PySpark. In PySpark, the base model must be wrapped in a Pipeline; the official Pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the Pipeline API. I would like to use PySpark like this:

from pyspark.ml import Pipeline
from xgboost import XGBClassifier
...
model = XGBClassifier()
model.fit(X_train, y_train)
pipeline = Pipeline(stages=[..., model, ...])
...

The Pipeline API is convenient to use, so can anybody give some advice? Thanks.

Loughlin answered 30/5, 2018 at 10:26 Comment(0)
There is no XGBoost classifier in Apache Spark ML (as of version 2.3). The available models are listed here: https://spark.apache.org/docs/2.3.0/ml-classification-regression.html

If you want to use XGBoost, you should do it without PySpark (convert your Spark DataFrame to a pandas DataFrame with .toPandas()) or use another algorithm (https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#module-pyspark.ml.classification).

But if you really want to use XGBoost with PySpark, you'll have to dive into PySpark and implement a distributed XGBoost yourself. Here is an article where the DMLC team does so: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html

Manille answered 11/6, 2018 at 8:32 Comment(0)

There is a maintained distributed XGBoost library (used in production by several companies), as mentioned above (https://github.com/dmlc/xgboost). However, using it from PySpark is a bit tricky: someone made a working PySpark wrapper for version 0.72 of the library, with 0.8 support in progress.

See https://medium.com/@bogdan.cojocar/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb, and https://github.com/dmlc/xgboost/issues/1698 for the full discussion.

Make sure the XGBoost jars are on your PySpark jar path.
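For example, the jars and the Python wrapper can be passed when launching PySpark. The jar file names and the `sparkxgb.zip` archive below are assumptions (adjust paths and versions to whatever build of the wrapper you actually use):

```shell
# Hypothetical file names -- match them to your xgboost4j build (e.g. 0.72).
pyspark \
  --jars xgboost4j-0.72.jar,xgboost4j-spark-0.72.jar \
  --py-files sparkxgb.zip
```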

Beehive answered 14/11, 2018 at 13:57 Comment(0)

There is an XGBoost implementation for Spark 2.4 and above here:

https://xgboost.readthedocs.io

Note that this is an external library, but it should work smoothly with Spark.

Hyder answered 17/9, 2019 at 20:7 Comment(0)
