How to train an ML model in sparklyr and predict new values on another dataframe?

Consider the following example:

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

Here I have the classic Naive Bayes example where class identifies documents falling into the China category.

I am able to run a Naive Bayes classifier in sparklyr by doing the following:

dtrain_spark %>% 
  ft_tokenizer(input_col = "text", output_col = "tokens") %>% 
ft_count_vectorizer(input_col = 'tokens', output_col = 'myvocab') %>% 
  select(myvocab, class) %>%  
  ml_naive_bayes( label_col = "class", 
                  features_col = "myvocab", 
                  prediction_col = "pcol",
                  probability_col = "prcol", 
                  raw_prediction_col = "rpcol",
                  model_type = "multinomial", 
                  smoothing = 0.6, 
                  thresholds = c(0.2, 0.4))

which outputs:

NaiveBayesModel (Transformer)
<naive_bayes_5e946aec597e> 
 (Parameters -- Column Names)
  features_col: myvocab
  label_col: class
  prediction_col: pcol
  probability_col: prcol
  raw_prediction_col: rpcol
 (Transformer Info)
  num_classes:  int 2 
  num_features:  int 6 
  pi:  num [1:2] -1.179 -0.368 
  theta:  num [1:2, 1:6] -1.417 -0.728 -2.398 -1.981 -2.398 ... 
  thresholds:  num [1:2] 0.2 0.4 

However, I have two major questions:

  1. How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?

  2. Even more importantly, how can I use this trained model to predict new values, say, in the following spark test dataframe?

Test data:

dtest <- data_frame(text = c("Chinese Chinese Chinese Tokyo Japan",
                             "random stuff"))

dtest_spark <- copy_to(sc, dtest, overwrite = TRUE)

> dtest_spark
# Source:   table<dtest> [?? x 1]
# Database: spark_connection
  text                               
  <chr>                              
1 Chinese Chinese Chinese Tokyo Japan
2 random stuff 

Thanks!

Avionics answered 25/5, 2018 at 17:26 Comment(0)

How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?

In general, evaluation on the training dataset is a separate step in Apache Spark (although some models do provide some form of summary). This fits nicely into the native Pipeline API.

Background:

Spark ML Pipelines are primarily built from two types of objects:

  • Transformers - objects which provide a transform method, which maps a DataFrame to an updated DataFrame.

    You can apply a Transformer in sparklyr using the ml_transform method.

  • Estimators - objects which provide a fit method, which maps a DataFrame to a Transformer. By convention, corresponding Estimator / Transformer pairs are called Foo / FooModel.

    You can fit an Estimator in sparklyr using the ml_fit method.

Additionally, ML Pipelines can be combined with Evaluators (see the ml_*_evaluator and ml_*_eval methods), which compute different metrics on the transformed data, based on columns generated by a model (usually the probability column or the raw prediction).

You can apply an Evaluator using the ml_evaluate method.

Other related components include cross validators and train-validation splits, which can be used for parameter tuning.
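To make the Estimator / Transformer distinction concrete, here is a minimal sketch that applies the two feature stages one at a time instead of inside a Pipeline. It assumes the Spark connection sc and the dtrain_spark table from the question already exist; the variable names (tokenizer, vectorizer, and so on) are only illustrative.

```r
library(sparklyr)

# Assumes `sc` (a spark_connection) and `dtrain_spark` from the question exist.

# Tokenizer is a Transformer: it needs no fitting and can be applied directly.
tokenizer <- ft_tokenizer(sc, input_col = "text", output_col = "tokens")
tokenized <- ml_transform(tokenizer, dtrain_spark)

# CountVectorizer is an Estimator: fitting it learns the vocabulary and
# returns a CountVectorizerModel, which is itself a Transformer.
vectorizer <- ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab")
vectorizer_model <- ml_fit(vectorizer, tokenized)

# The fitted model can now be applied to any table with a "tokens" column.
vectorized <- ml_transform(vectorizer_model, tokenized)
```

A Pipeline simply chains these fit and transform calls, so that a single ml_fit on the Pipeline fits every Estimator stage in order.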

Examples:

sparklyr PipelineStages can be evaluated eagerly (as in your own code) by passing data directly, or lazily by passing a spark_connection instance and then calling the aforementioned methods (ml_fit, ml_transform, etc.).

It means you can define a Pipeline as follows:

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = 'tokens', output_col = 'myvocab'),
  ml_naive_bayes(sc, label_col = "class", 
              features_col = "myvocab", 
              prediction_col = "pcol",
              probability_col = "prcol", 
              raw_prediction_col = "rpcol",
              model_type = "multinomial", 
              smoothing = 0.6, 
              thresholds = c(0.2, 0.4),
              uid = "nb")
)

Fit the PipelineModel:

model <- ml_fit(pipeline, dtrain_spark)

Transform, and apply one of the available Evaluators:

ml_transform(model, dtrain_spark) %>% 
  ml_binary_classification_evaluator(
    label_col="class", raw_prediction_col= "rpcol", 
    metric_name = "areaUnderROC")
[1] 1

or

evaluator <- ml_multiclass_classification_evaluator(
    sc,
    label_col="class", prediction_col= "pcol", 
    metric_name = "f1")

ml_evaluate(evaluator, ml_transform(model, dtrain_spark))
[1] 1

Even more importantly, how can I use this trained model to predict new values, say, in the following spark test dataframe?

Use either ml_transform or ml_predict (the latter is a convenience wrapper, which applies further transformations to the output):

ml_transform(model, dtest_spark)
# Source:   table<sparklyr_tmp_cc651477ec7> [?? x 6]
# Database: spark_connection
  text                                tokens     myvocab   rpcol   prcol   pcol
  <chr>                               <list>     <list>    <list>  <list> <dbl>
1 Chinese Chinese Chinese Tokyo Japan <list [5]> <dbl [6]> <dbl [… <dbl …     0
2 random stuff                        <list [2]> <dbl [6]> <dbl [… <dbl …     1
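If you only need the predicted class back in R, one option is to keep the scalar columns and collect them. This is a minimal sketch, assuming the fitted model and dtest_spark defined above; the name predictions is only illustrative. Dropping the vector-valued columns (myvocab, rpcol, prcol) before collecting avoids bringing list columns into the local session.

```r
library(sparklyr)
library(dplyr)

# Assumes `model` (the fitted PipelineModel) and `dtest_spark` exist.
predictions <- ml_transform(model, dtest_spark) %>%
  select(text, pcol) %>%  # keep only the input text and predicted class
  collect()               # bring the result into R as a local tibble

predictions
```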

Cross validation:

There is not enough data in the example, but you can cross-validate and tune hyperparameters as shown below:

# dontrun
ml_cross_validator(
  dtrain_spark,
  pipeline, 
  list(nb=list(smoothing=list(0.8, 1.0))),  # Note that name matches UID
  evaluator=evaluator)

Notes:

  • Please keep in mind that Spark's multinomial Naive Bayes implementation considers only binary features (0 or not 0).
  • If you use Pipelines with Vector columns (not formula-based calls), I strongly recommend using standardized (default) column names:

    • label for dependent variable.
    • features for assembled independent variables.
    • rawPrediction, prediction, probability for raw prediction, prediction and probability columns respectively.
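
As an illustration of the last note, here is a sketch of the same pipeline using the default column names. It assumes sc and dtrain_spark from the question; the *_default names are made up for this example. Because the label column is renamed to label and the vectorizer writes to features, neither the classifier nor the evaluator needs any column arguments.

```r
library(sparklyr)
library(dplyr)

# Assumes `sc` and `dtrain_spark` from the question exist.
# Rename the label column so that every stage can rely on its defaults.
dtrain_default <- dtrain_spark %>% rename(label = class)

pipeline_default <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  # Writing straight to "features" matches the default features_col.
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "features"),
  # No column arguments needed: label, features, rawPrediction,
  # probability and prediction are the defaults.
  ml_naive_bayes(sc)
)

model_default <- ml_fit(pipeline_default, dtrain_default)

# The evaluator defaults (label, rawPrediction) line up as well.
evaluator_default <- ml_binary_classification_evaluator(sc)
auc <- ml_evaluate(evaluator_default,
                   ml_transform(model_default, dtrain_default))
```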
Teide answered 28/5, 2018 at 17:32 Comment(9)
no worries. thanks for the precision! do you have some ideas about the parameter tuning thing that you were mentioning? thanks! - Endurable
really amazing answer! well deserved bounty my friend :) - Endurable
Thanks @kevinykuo ℕʘʘḆḽḘ - Teide
@user6910411 can I ask just a follow up question? something that does not work very well in the current code is when I need to use some mutate() clause between ft_tokenizer() and ft_count_vectorizer() in the pipeline above. What is the right syntax then? I tried to use the ft_dplyr_transformer() but somehow I must be missing something here. Thanks! - Endurable
@Avionics Do you have anything specific in mind? Does it work on the output of ft_tokenizer? And what specifically does not work well? Spark complex types are not exactly R friendly... - Teide
@user6910411 I simply have a mutate(myflag = text %rlike% 'user69') between the two. Simple dplyr verbs essentially. Thanks again! - Endurable
I mean here in the example that could be just something like mutate(myflag = text %rlike% 'China') - Endurable
Huh, I actually don't see how you can integrate these two. Maybe @kevinykuo does? But a SQL transformer will work just fine: ft_sql_transformer(sc, "SELECT *, text RLIKE 'China' AS myflag FROM __THIS__") - Teide
You can do something like my_sql_transformer <- ft_dplyr_transformer(sc, dtrain_spark %>% mutate(myflag = text %rlike% 'China')), which generates a SQLTransformer for you, and you can put my_sql_transformer as a stage in your pipeline. You'd have to put it as the first stage though, since dplyr is explicit about column selection (i.e. it doesn't generate SELECT *). Feel free to open an issue on the repo if you feel like this behavior blocks a lot of use cases. You can also explicitly define the SQL transformer as @user6910411 suggests. - Castra
