apache-spark-mllib Questions

1

I'm wondering what the best way is to evaluate a fitted binary classification model using Apache Spark 2.4.5 and PySpark (Python). I want to consider different metrics such as accuracy, precision, ...
Insidious asked 20/3, 2020 at 10:23

1

I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, but I'm seeing some strange behaviour. After the (inner) join the dataset is a bit skewed; however, every time one o...
Silken asked 18/7, 2018 at 13:47

2

I have a DataFrame with the following columns. scala> show_times.printSchema root |-- account: string (nullable = true) |-- channel: string (nullable = true) |-- show_name: string (nullable ...
Biparty asked 2/5, 2017 at 16:27

4

Solved

I am curious if there is something similar to sklearn's http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html for apache-spark in the latest 2.0.1 rel...
Towhead asked 12/10, 2016 at 9:2

11

I want to find the parameters of ParamGridBuilder that make the best model in CrossValidator in Spark 1.4.x. In the Pipeline example in the Spark documentation, they add different parameters (numFeatures,...

1

I am trying to use the MLlib library (Java), but one of my dependencies uses Jackson 2.9.9. I noticed that a pull request was made such that the master branch's dependency is upgraded to this partic...
Spermatozoon asked 16/8, 2019 at 6:26

4

Solved

I am using Apache Spark MLlib to handle categorical features using one-hot encoding. After writing the code below, I am getting a vector c_idx_vec as the output of one-hot encoding. I do understand how...

2

Solved

I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame depicted below, as training data: |--------------|--------------|-------------...

1

I'm building a Random Forest model using Spark and I want to save it to use again later. I'm running this on PySpark (Spark 2.0.1) without HDFS, so the files are saved to the local file system. I'...
Criss asked 26/1, 2017 at 19:15

3

When training a model, say linear regression, we may apply a normalization, such as MinMaxScaler, to the train and test datasets. After we have a trained model and use it to make predictions, and scale back...

3

Let's say I have a DataFrame (that I read in from a csv on HDFS) and I want to train some algorithms on it via MLlib. How do I convert the rows into LabeledPoints or otherwise utilize MLlib on this...
Sidero asked 31/3, 2015 at 20:17

4

In Java, I use RowFactory.create() to create a Row: Row row = RowFactory.create(record.getLong(1), record.getInt(2), record.getString(3)); where "record" is a record from a database, but I canno...
Crosley asked 26/9, 2016 at 6:52

3

I want to update my PySpark code. In PySpark, the base model must be put in a pipeline; the official pipeline demo uses LogisticRegression as a base model. However, it seems not to be abl...

3

Solved

I want to produce libsvm format, so I shaped a DataFrame into the desired layout, but I do not know how to convert it to libsvm format. The format is as shown in the figure. I hope that the desired libsvm type...

2

Solved

I'm trying to use Gaussian Mixture models on a sample of a dataset. I used both MLlib (with PySpark) and scikit-learn and get very different results, the scikit-learn one looking more realistic. f...

1

I am trying to apply StringIndexer to multiple columns. Here is my code: val stringIndexers = Categorical_Model.map { colName => new StringIndexer().setInputCol(colName).setOutputCol(colName + "...
Newspaperwoman asked 22/7, 2019 at 9:40

4

Solved

Trying to do document classification in Spark. I am not sure what the hashing does in HashingTF; does it sacrifice any accuracy? I doubt it, but I don't know. The Spark doc says it uses the "hashing tri...
Piccalilli asked 4/2, 2016 at 16:6

2

I am new to both Spark and PySpark DataFrames and ML. How can I create a custom cross-validation for the ML library? I want, for example, to change the way the training folds are formed, e.g. stratifie...
Quenelle asked 4/11, 2015 at 0:12

3

I have a PMML file which encodes a logistic regression model that was NOT exported from MLlib. How can I import the model from PMML using MLlib in Java for evaluation/prediction? (I know that MLl...
Matteson asked 29/1, 2017 at 11:58

5

How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers e.g. RandomForestClassifier, LogisticRegression, have a ...

1

I would like to use my own loss function instead of the squared loss for the linear regression model in Spark MLlib. So far I can't find any part of the documentation that mentions if it is even poss...

2

Solved

I have two array fields in a DataFrame. I need to compare these two arrays and get the difference as an array (new column) in the same DataFrame. Expected output is: Column B ...
Enrika asked 27/10, 2017 at 11:15

2

After a good amount of searching on the net for this topic, I am ending up here to see if I can get some pointers. Please read further. After analyzing Spark 2.0 I concluded polynomial regression is n...
Renown asked 10/8, 2016 at 13:58

2

Solved

I'm using Apache Spark 2.3.0. When I upload a CSV file and then call df.show, it shows me the table with all null values, and I would like to know why, because everything looks fine in the CSV. val d...
Conformance asked 11/10, 2018 at 16:36

4

Solved

I'm evaluating tools for production ML-based applications and one of our options is Spark MLlib, but I have some questions about how to serve a model once it's trained. For example in Azure ML, o...
Joliejoliet asked 10/11, 2016 at 17:24
