apache-spark-mllib Questions

1

Are there any pre-built outlier detection algorithms or interquartile-range identification methods available in Spark 2.0.0? I found some code here, but I don't think this is available yet in Spark 2.0.0...
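Spark 2.0.0 ships no built-in outlier detector, but the IQR rule itself is simple and can be applied to quantiles computed with `DataFrame.approxQuantile` (available since Spark 2.0). A minimal sketch of the rule in plain Python, with a naive linear-interpolation quantile; the function names here are illustrative, not a Spark API:

```python
def iqr_outliers(values):
    """Return the values falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    xs = sorted(values)

    def quantile(q):
        # Linear interpolation between the two closest ranks.
        pos = q * (len(xs) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(xs) - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]
```

On a cluster, the quantiles would come from `approxQuantile` and the final filter would stay a distributed `DataFrame.filter`, so only two scalars ever reach the driver.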

1

Solved

According to Combining Spark Streaming + MLlib, it is possible to make a prediction over a stream of input in Spark. The issue with the given example (which works on my cluster) is that the testDat...
Petras asked 17/2, 2018 at 23:18

1

Solved

I want to pretty-print the result of a correlation in a Zeppelin notebook: val Row(coeff: Matrix) = Correlation.corr(data, "features").head One of the ways to achieve this is to convert the resu...
Subjectify asked 25/2, 2018 at 18:50
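Once the `Matrix` values are pulled into local rows, pretty-printing is independent of Spark: it is just fixed-width alignment with row and column labels. A sketch of such a helper in plain Python (the function name is made up for illustration):

```python
def pretty_matrix(rows, names, width=8, prec=4):
    """Format a square matrix (list of rows) with row/column labels."""
    header = " " * width + "".join(n.rjust(width) for n in names)
    lines = [header]
    for name, row in zip(names, rows):
        cells = "".join(f"{v:{width}.{prec}f}" for v in row)
        lines.append(name.ljust(width) + cells)
    return "\n".join(lines)
```

Calling it with the correlation values and the feature column names yields an aligned table that renders cleanly in a notebook's plain-text output.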

3

Solved

I am trying to create an LDA model on a JSON file. Creating a Spark context with the JSON file: import org.apache.spark.sql.SparkSession val sparkSession = SparkSession.builder .master("loc...

1

Solved

My Python version is 3.6.3 and my Spark version is 2.2.1. Here is my code: from pyspark.ml.linalg import Vectors from pyspark.ml.feature import VectorAssembler from pyspark import SparkContext,...
Heartstrings asked 6/2, 2018 at 9:1

3

I have a set of data based on which I want to create a classification model. Each row has the following form: user1,class1,product1 user1,class1,product2 user1,class1,product5 user2,class1,product...
Cinerarium asked 7/8, 2015 at 7:53

2

Solved

Given a MatrixFactorizationModel what would be the most efficient way to return the full matrix of user-product predictions (in practice, filtered by some threshold to maintain sparsity)? Via the ...
Sharyl asked 12/10, 2014 at 15:21
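Whatever API is used to collect the factors, the underlying computation is one dot product per (user, product) pair followed by a threshold filter. A plain-Python sketch over locally collected factor maps (the dict layout is illustrative; at cluster scale this would be a blocked matrix multiply over the factor RDDs rather than a driver-side loop):

```python
def predict_all(user_factors, product_factors, threshold):
    """Dot-product score for every (user, product) pair, keeping only
    scores at or above the threshold to preserve sparsity."""
    preds = {}
    for u, uf in user_factors.items():
        for p, pf in product_factors.items():
            score = sum(a * b for a, b in zip(uf, pf))
            if score >= threshold:
                preds[(u, p)] = score
    return preds
```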

4

I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (in Scala) on a dataset which contains categorical variables. I discovered Spark was not able to work with...
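The usual workaround is to one-hot encode the categorical columns before training (Spark's ml pipeline provides OneHotEncoder for this). A minimal plain-Python sketch of the encoding, dropping one level as the reference so a model with an intercept is not over-parameterised; the helper name is made up:

```python
def one_hot(rows, column):
    """Expand one categorical column into dummy indicator vectors,
    dropping the first (alphabetical) level as the reference."""
    levels = sorted({row[column] for row in rows})
    kept = levels[1:]  # levels[0] is the dropped reference level
    encoded = [[1.0 if row[column] == lvl else 0.0 for lvl in kept]
               for row in rows]
    return kept, encoded
```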

2

Solved

I have a Spark dataframe 'mydataframe' with many columns. I am trying to run kmeans on only two columns: lat and long (latitude & longitude), using them as simple values. I want to extract 7 cl...

1

Solved

It's my very first time trying to run a KMeans cluster analysis in Spark, so I am sorry for a stupid question. I have a Spark dataframe mydataframe with many columns. I want to run kmeans on only t...
Makalu asked 1/12, 2017 at 1:14
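Both of the kmeans questions above reduce to running Lloyd's algorithm on (lat, long) pairs after the two columns are assembled into a feature vector. The iteration itself is small enough to sketch in plain Python; note this toy version seeds the centers with the first k points, whereas Spark's KMeans uses the smarter k-means|| initialisation:

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on (lat, long) tuples."""
    # Naive seeding with the first k points; Spark uses k-means|| instead.
    centers = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties out
                centers[i] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return centers
```

With k=7 the structure is identical, only the number of centers changes.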

2

Solved

I'm trying to run a self-contained application using Scala on Apache Spark, based on the example here: http://spark.apache.org/docs/latest/ml-pipeline.html Here's my complete code: import org.apache.spa...
Fetiparous asked 27/10, 2016 at 10:7

2

Solved

In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline: val cv = new CrossValidator() .setEstimator(pipeline) .setEstimato...
Underproof asked 8/1, 2016 at 13:59

3

Solved

I have an RDD of (String, SparseVector) tuples and I want to create a DataFrame from it, to get a (label: string, features: vector) DataFrame, which is the schema required by most o...

1

Solved

I am using Spark with Scala to calculate cosine similarity between DataFrame rows. The DataFrame schema is below: root |-- SKU: double (nullable = true) |-- Features: vector (nullable = true) Samp...
Illfated asked 30/10, 2017 at 7:38
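The similarity itself is just the dot product over the product of the norms; the only Spark-specific part of the question is extracting the vectors from the rows. A plain-Python sketch of the per-pair computation:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||); 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

At scale, computing this for all row pairs is quadratic, which is why MLlib offers approximate column-similarity routines instead of a pairwise row loop.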

2

Solved

I am working with Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML Pipeline, consisting of some Transformers and a Classifier. Let's assume for the sake of simplicity th...
Lilithe asked 11/5, 2017 at 9:35

3

Solved

It is my first time with PySpark (Spark 2), and I'm trying to create a toy dataframe for a logit model. I ran the tutorial successfully and would like to pass my own data into it. I've tried thi...

1

Solved

In the MLlib version of Random Forest it was possible to specify the columns with nominal features (numerical but still categorical variables) via the parameter categoricalFeaturesInfo. What's...

4

I'm working with Spark 1.3.0 using PySpark and MLlib and I need to save and load my models. I use code like this (taken from the official documentation): from pyspark.mllib.recommendation import A...
Finegan asked 25/3, 2015 at 12:3

1

Solved

I can extract the vocabulary from a CountVectorizerModel in the following way: fl = StopWordsRemover(inputCol="words", outputCol="filtered") df = fl.transform(df) cv = CountVectorizer(inputCol="filtered",...
Pug asked 12/10, 2017 at 17:27
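The vocabulary a CountVectorizerModel exposes is essentially the top terms kept during the fit step. A simplified plain-Python sketch of that step (for brevity this version ranks by document frequency and supports a minDF cutoff, whereas Spark's CountVectorizer ranks the vocabSize kept terms by corpus-wide term counts):

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size=None, min_df=1):
    """Collect terms by document frequency and keep the top terms.

    docs is a list of token lists, i.e. the output of a tokenizer /
    stop-word-removal stage, as in the pipeline above.
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    terms = [t for t, c in df.items() if c >= min_df]
    terms.sort(key=lambda t: (-df[t], t))  # frequent first, ties alphabetical
    return terms[:vocab_size] if vocab_size else terms
```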

1

Solved

According to LinearRegressionSummary (Spark 2.1.0 JavaDoc), p-values are only available for the "normal" solver. This value is only available when using the "normal" solver. What the hell is t...
Ambrosio asked 11/10, 2017 at 19:49
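"Normal" here refers to solving the normal equations XᵀXβ = Xᵀy in closed form (a direct least-squares solve), which also yields the coefficient covariance estimates that p-values are derived from; the iterative l-bfgs solver never forms them, which is why the summary withholds p-values there. For a single feature the closed form collapses to the familiar slope/intercept formulas, sketched in plain Python:

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression via the normal equations:
    slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx
```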

1

Solved

I am trying to build a demo project in Java 9 with Maven that uses the dependency: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-mllib_2.10<...
Telescopic asked 11/10, 2017 at 8:36

2

What is the optimal vector size to set in the word2vec algorithm if the total number of unique words is greater than 1 billion? I am using Apache Spark MLlib 1.6.0 for word2vec. Sample ...
Eyeopener asked 4/10, 2017 at 8:58

4

Hi, I am a beginner in Spark. When I try to run a job on this folder it throws an ExecutorLostFailure every time. I am trying to run the job on Spark 1.4.1 with 8 slave nodes with 11.7 GB memory...
Godred asked 21/7, 2015 at 2:51

2

Solved

How can you evaluate the implicit feedback collaborative filtering algorithm of Apache Spark, given that the implicit "ratings" can vary from zero to anything, so a simple MSE or RMSE does not have...
Clinometer asked 28/9, 2017 at 6:36
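One common alternative for implicit feedback is a rank-based metric such as the mean percentile rank from Hu, Koren & Volinsky's implicit-ALS paper: lower is better, and around 0.5 indicates random ordering. A minimal sketch, assuming each held-out item appears somewhere in the ranked recommendation list:

```python
def expected_percentile_rank(actual, recommended):
    """Mean percentile rank of held-out items within a ranked list.

    0.0 means the held-out items were always ranked first; ~0.5 is no
    better than random ordering. Assumes every item in `actual` occurs
    in `recommended`.
    """
    n = len(recommended)
    ranks = []
    for item in actual:
        pos = recommended.index(item)
        ranks.append(pos / (n - 1) if n > 1 else 0.0)
    return sum(ranks) / len(ranks)
```

Averaging this over all users (weighted by their observed interaction counts, as the paper does) gives a single score that is meaningful even though the implicit "ratings" are unbounded.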

1

Error: ERROR TaskSetManager: Total size of serialized results of XXXX tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB). Goal: Obtain recommendations for all users using the mo...
Snappish asked 2/12, 2015 at 5:25
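The error above is the driver-side cap on collected results. If the full recommendation set genuinely must reach the driver, the cap can be raised via a standard Spark configuration property; a sketch (the script name is hypothetical, and at scale writing the recommendations out from the executors instead of collecting them avoids the limit entirely):

```shell
# Raise the cap on total serialized results returned to the driver.
# Setting it to 0 disables the check; use with care, the driver can
# still run out of memory holding the collected results.
spark-submit --conf spark.driver.maxResultSize=8g my_recommender.py
```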
