apache-spark-mllib Questions

1

Solved

How do I get the mapping out of a trained Spark MLlib StringIndexerModel? val stringIndexer = new StringIndexer() .setInputCol("myCol") .setOutputCol("myColIdx") val stringIndexerModel = stringI...
Infraction asked 23/4, 2017 at 19:9

3

I need suggestions for building a good recommendation model using Spark's collaborative filtering. There is sample code on the official website, which I paste below: from pyspar...

1

Solved

What is the difference between the pyspark mllib and pyspark ml packages? https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html https://spark.apache.org/docs/latest/api/python/pyspark.ml....

3

Solved

I have encountered the "all-pairs similarity" problem in my recommendation system. Thanks to this Databricks blog, it seems RowMatrix may help. However, RowMatrix is a matrix type without ...
Tesstessa asked 25/4, 2015 at 2:55

2

Solved

Is there a way to train an LDA model in an online-learning fashion, i.e. loading a previously trained model and updating it with new documents?

1

Solved

I have trained a model in Python using sklearn. How can we load the same model in Spark and generate predictions on a Spark RDD?
Siobhansion asked 19/3, 2017 at 14:15

1

I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same...
Largehearted asked 10/3, 2017 at 23:28

1

I have a DataFrame with a column named a.b. When I specify a.b as the input column name to a StringIndexer, I get an AnalysisException with the message "cannot resolve 'a.b' given input columns a.b". I'm us...
Cotswolds asked 22/1, 2016 at 18:22

2

Solved

I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program: Object Mllib is not a member of package org.apache.spark  T...
Laundry asked 12/12, 2014 at 6:50

2

Solved

I have several categorical features and would like to transform them all using OneHotEncoder. However, when I try to apply the StringIndexer, I get an error: stringIndexer = StringIndexer(...

2

Solved

I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same or do I have to perform cross validation manually?

1

I am new to Spark 2. I tried the Spark TF-IDF example: sentenceData = spark.createDataFrame([ (0.0, "Hi I heard about Spark") ], ["label", "sentence"])  tokenizer = Tokenizer(inputCol="sentence", outp...

0

I'm trying to obtain the ROC curve for a GBTClassifier. One way is to reuse BinaryClassificationMetrics; however, the path given in the documentation (https://spark.apache.org/docs/latest/mllib-evaluati...
Maxinemaxiskirt asked 16/2, 2017 at 15:7

2

Solved

I'm using RandomForest.featureImportances but I don't understand the output result. I have 12 features, and this is the output I get. I get that this might not be an apache-spark specific que...

1

I was trying to build a logistic regression model on sample data. The output we can get from the model is the weights of the features used to build it. I could not find a Spark API for standard...

2

Solved

Given my pyspark Row object: >>> row Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})) >>> row.clicked 0 >>> row.features SparseVector(7, {0: 1.0, 3: ...

1

Solved

I'm predicting ratings in between processes that batch train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v? ! rm -rf ml-1m.zip ml-1m ! wget --quiet ...
Endometrium asked 10/1, 2017 at 12:32

1

Solved

I'd like to make sure I'm training on a stratified sample of my data. It seems this is supported by Spark 2.1 and earlier versions via JavaPairRDD.sampleByKey(...) and JavaPairRDD.sampleByKeyExact...
Frightfully asked 16/1, 2017 at 9:10

2

Solved

I am preparing a DataFrame with an id and a vector of my features to be used later for doing predictions. I do a groupBy on my DataFrame, and in my groupBy I am merging a couple of columns as lists i...

1

I'm using MLlib's matrix factorization to recommend items to users. I have a big implicit interaction matrix of about M=20 million users and N=50k items. After training the model I want to get a sh...
Horacehoracio asked 23/8, 2016 at 15:5

2

Solved

KMeans has several parameters for its training, with the initialization mode defaulting to kmeans||. The problem is that it marches quickly (in less than 10 min) through the first 13 stages, but then hangs compl...

2

Solved

I have a DenseVector RDD like this >>> frequencyDenseVectors.collect() [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1....

1

Solved

I have a pyspark data frame which has a column containing strings. I want to split this column into words. Code: >>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.d...

1

I am using a Spark 2.0 cluster and I would like to convert a vector from org.apache.spark.mllib.linalg.VectorUDT to org.apache.spark.ml.linalg.VectorUDT. # Import LinearRegression class from pyspark...

2

Solved

I need to add two matrices that are stored in two files. latest1.txt and latest2.txt each have the following content: 1 2 3 4 5 6 7 8 9 I am reading those files as follows: scala> val...
Anton asked 30/1, 2015 at 9:29
