How to train an ML model in sparklyr and predict new values on another dataframe?

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

dtrain_spark %>%
  ft_tokenizer(input.col = "text", output.col = "tokens") %>%
  ft_count_vectorizer(input_col = 'tokens', output_col = 'myvocab') %>%
  select(myvocab, class) %>%
  ml_naive_bayes(
    label_col = "class",
    features_col = "myvocab",
    prediction_col = "pcol",
    probability_col = "prcol",
    raw_prediction_col = "rpcol",
    model_type = "multinomial",
    smoothing = 0.6,
    thresholds = c(0.2, 0.4))

NaiveBayesModel (Transformer)
<naive_bayes_5e946aec597e>
 (Parameters -- Column Names)
  features_col: myvocab
  label_col: class
  prediction_col: pcol
  probability_col: prcol
  raw_prediction_col: rpcol
 (Transformer Info)
  num_classes: int 2
  num_features: int 6
  pi: num [1:2] -1.179 -0.368
  theta: num [1:2, 1:6] -1.417 -0.728 -2.398 -1.981 -2.398 ...
  thresholds: num [1:2] 0.2 0.4

dtest <- data_frame(text = c("Chinese Chinese Chinese Tokyo Japan",
                             "random stuff"))

dtest_spark <- copy_to(sc, dtest, overwrite = TRUE)

> dtest_spark
# Source:   table<dtest> [?? x 1]
# Database: spark_connection
  text
  <chr>
1 Chinese Chinese Chinese Tokyo Japan
2 random stuff

How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?

In general, evaluation on the training dataset is a separate step in Apache Spark, although some models do provide some form of summary. This fits nicely with the native Pipeline API.

Background:

Spark ML Pipelines are primarily built from two types of objects:

Transformers - objects which provide a transform method, which maps a DataFrame to an updated DataFrame.

You can apply a Transformer in sparklyr using the ml_transform method.
Estimators - objects which provide a fit method, which maps a DataFrame to a Transformer. By convention, corresponding Estimator / Transformer pairs are called Foo / FooModel.

You can fit an Estimator in sparklyr using the ml_fit method.

Additionally, ML Pipelines can be combined with Evaluators (see ml_*_evaluator and ml_*_eval methods), which can be used to compute different metrics on the transformed data, based on columns generated by a model (usually the probability column or the raw prediction).

You can apply an Evaluator using the ml_evaluate method.

Other related components include cross validators and train-validation splits, which can be used for parameter tuning.
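The Transformer / Estimator distinction described above can be sketched as follows. This is a minimal illustration, not part of the original answer: it assumes a local Spark installation, and the table name docs and the variable names are chosen here only for the example.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Hypothetical two-row table, used only for this illustration.
docs <- copy_to(
  sc,
  tibble::tibble(text = c("Chinese Beijing Chinese", "Tokyo Japan Chinese")),
  name = "docs", overwrite = TRUE
)

# Transformer: needs no fitting, it maps a DataFrame to an updated DataFrame.
tokenizer <- ft_tokenizer(sc, input_col = "text", output_col = "tokens")
tokenized <- ml_transform(tokenizer, docs)

# Estimator: has to be fitted first; fitting returns a Transformer
# (a CountVectorizer yields a CountVectorizerModel).
vectorizer <- ft_count_vectorizer(sc, input_col = "tokens", output_col = "features")
vectorizer_model <- ml_fit(vectorizer, tokenized)

# The fitted model is itself a Transformer and is applied with ml_transform.
vectorized <- ml_transform(vectorizer_model, tokenized)
colnames(vectorized)
```

Passing the connection sc instead of a table is what makes each stage lazy: nothing is computed until ml_fit or ml_transform is called with data.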

Examples:

sparklyr PipelineStages can be evaluated eagerly (as in your own code), by passing data directly, or lazily, by passing a spark_connection instance and then calling the aforementioned methods (ml_fit, ml_transform, etc.).

It means you can define a Pipeline as follows:

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab"),
  ml_naive_bayes(sc, label_col = "class",
                 features_col = "myvocab",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0.6,
                 thresholds = c(0.2, 0.4),
                 uid = "nb")  # explicit UID, used later to address this stage
)

Fit the PipelineModel:

model <- ml_fit(pipeline, dtrain_spark)

Transform, and apply one of the available Evaluators:

ml_transform(model, dtrain_spark) %>% 
  ml_binary_classification_evaluator(
    label_col="class", raw_prediction_col= "rpcol", 
    metric_name = "areaUnderROC")

[1] 1

evaluator <- ml_multiclass_classification_evaluator(
    sc,
    label_col="class", prediction_col= "pcol", 
    metric_name = "f1")

ml_evaluate(evaluator, ml_transform(model, dtrain_spark))

[1] 1

Even more importantly, how can I use this trained model to predict new values, say, in the following spark test dataframe?

Use either ml_transform or ml_predict (the latter is a convenience wrapper, which applies further transformations to the output):

ml_transform(model, dtest_spark)

# Source:   table<sparklyr_tmp_cc651477ec7> [?? x 6]
# Database: spark_connection
  text                                tokens     myvocab   rpcol   prcol   pcol
  <chr>                               <list>     <list>    <list>  <list> <dbl>
1 Chinese Chinese Chinese Tokyo Japan <list [5]> <dbl [6]> <dbl [… <dbl …     0
2 random stuff                        <list [2]> <dbl [6]> <dbl [… <dbl …     1

Cross validation:

There is not enough data in the example, but you can cross-validate and tune hyperparameters as shown below:

# dontrun
ml_cross_validator(
  dtrain_spark,
  pipeline, 
  list(nb=list(smoothing=list(0.8, 1.0))),  # Note that name matches UID
  evaluator=evaluator)

Notes:

Please keep in mind that Spark's multinomial Naive Bayes implementation considers only binary features (0 or not 0).
If you use Pipelines with Vector columns (not formula-based calls), I strongly recommend using standardized (default) column names:
- label for dependent variable.
- features for assembled independent variables.
- rawPrediction, prediction, probability for raw prediction, prediction and probability columns respectively.
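
As a rough sketch of what relying on the default column names looks like, the pipeline above can be rewritten so that no column arguments are passed to the classifier or the evaluator. This assumes a local Spark (2.0 or later) installation; the table name train_default_names is made up for the example, and the dependent variable is renamed from class to label.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark (2.0 or later) installation is available.
sc <- spark_connect(master = "local")

# Same toy data as in the question, with the dependent variable named `label`.
train <- copy_to(
  sc,
  tibble::tibble(
    text = c("Chinese Beijing Chinese", "Chinese Chinese Shanghai",
             "Chinese Macao", "Tokyo Japan Chinese"),
    label = c(1, 1, 1, 0)
  ),
  name = "train_default_names", overwrite = TRUE
)

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  # Writing to `features` lets the classifier pick the column up by default.
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "features"),
  # Defaults: reads `features` and `label`, writes `rawPrediction`,
  # `probability` and `prediction`.
  ml_naive_bayes(sc)
)

model <- ml_fit(pipeline, train)
predictions <- ml_transform(model, train)

# The evaluator also defaults to the `label` and `prediction` columns.
evaluator <- ml_multiclass_classification_evaluator(sc, metric_name = "accuracy")
accuracy <- ml_evaluate(evaluator, predictions)
```

With default names, evaluators and downstream stages can be swapped in without repeating the column mapping at every step.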
