apache-spark-mllib Questions
1
Is there any pre-built outlier detection algorithm or interquartile range identification method available in Spark 2.0.0?
I found some code here but I don't think this is available yet in Spark 2.0.0...
Trader asked 8/10, 2016 at 7:13
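For context, the interquartile-range rule the question asks about is simple to sketch. This plain-Python version illustrates the statistic only; it is not a Spark API, and the quantile interpolation scheme is one common choice among several:

```python
# Interquartile-range (IQR) outlier rule, sketched in plain Python.
# Illustration of the statistic only -- not a Spark 2.0.0 API.

def quartiles(values):
    """Return (Q1, Q3) using linear interpolation between ranks."""
    s = sorted(values)
    def q(p):
        idx = p * (len(s) - 1)
        lo, hi = int(idx), min(int(idx) + 1, len(s) - 1)
        frac = idx - lo
        return s[lo] * (1 - frac) + s[hi] * frac
    return q(0.25), q(0.75)

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the usual fence."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 95]   # 95 is an obvious outlier
print(iqr_outliers(data))             # -> [95]
```

In Spark the same per-column quantiles could be obtained distributedly and the fence applied with a filter, but the fence arithmetic is exactly the above.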
1
Solved
According to Combining Spark Streaming + MLlib it is possible to make a prediction over a stream of input in spark.
The issue with the given example (which works on my cluster) is that the testDat...
Petras asked 17/2, 2018 at 23:18
1
Solved
I want to pretty print the result of a correlation in a zeppelin notebook:
val Row(coeff: Matrix) = Correlation.corr(data, "features").head
One of the ways to achieve this is to convert the resu...
Subjectify asked 25/2, 2018 at 18:50
3
Solved
I am trying to create a LDA model on a JSON file.
Creating a spark context with the JSON file :
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder
.master("loc...
Badtempered asked 7/8, 2016 at 21:48
1
Solved
My python version is 3.6.3 and spark version is 2.2.1. Here is my code:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark import SparkContext,...
Heartstrings asked 6/2, 2018 at 9:1
3
I have a set of data based on which I want to create a classification model. Each row has the following form:
user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product...
Cinerarium asked 7/8, 2015 at 7:53
2
Solved
Given a MatrixFactorizationModel what would be the most efficient way to return the full matrix of user-product predictions (in practice, filtered by some threshold to maintain sparsity)?
Via the ...
Sharyl asked 12/10, 2014 at 15:21
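The computation behind a full prediction matrix is a cross-product of the user and product latent-factor matrices, with a threshold applied to keep the result sparse. A plain-Python sketch with hypothetical toy factors (not the MLlib MatrixFactorizationModel API):

```python
# Sketch: full user x product prediction matrix from latent factors,
# filtered by a threshold to maintain sparsity. Toy data, not the MLlib API.

user_factors = {            # user id -> latent vector (hypothetical)
    1: [1.0, 0.5],
    2: [0.2, 1.5],
}
product_factors = {         # product id -> latent vector (hypothetical)
    10: [0.8, 0.1],
    20: [0.1, 1.0],
}

def predict_all(users, products, threshold):
    """Yield (user, product, score) for every pair with score >= threshold."""
    for u, uf in users.items():
        for p, pf in products.items():
            score = sum(a * b for a, b in zip(uf, pf))  # dot product
            if score >= threshold:
                yield (u, p, score)

preds = list(predict_all(user_factors, product_factors, threshold=1.0))
print(preds)  # only pairs scoring at least 1.0 survive
```

The distributed version would block the two factor matrices and join the blocks, but the per-pair arithmetic is this same dot product.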
4
I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (with Scala) on a dataset which contains categorical variables. I discovered Spark was not able to work with...
Sicard asked 7/5, 2015 at 14:56
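Linear models expect numeric features, so categorical columns are typically one-hot encoded before training. A plain-Python sketch of the idea, with hypothetical column values (not Spark's encoder API):

```python
# Sketch: one-hot encoding a categorical column into 0/1 indicator vectors,
# the usual prerequisite before feeding categories to a linear model.
# Hypothetical data; illustration only, not a Spark API.

def one_hot_encode(column):
    """Map each distinct category to a 0/1 indicator vector."""
    categories = sorted(set(column))
    index = {c: i for i, c in enumerate(categories)}
    return [[1.0 if index[v] == i else 0.0 for i in range(len(categories))]
            for v in column]

colors = ["red", "blue", "red", "green"]
print(one_hot_encode(colors))
# categories are indexed in sorted order: ['blue', 'green', 'red']
```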
2
Solved
I have a Spark dataframe 'mydataframe' with many columns. I am trying to run kmeans on only two columns, lat and long (latitude & longitude), using them as simple values. I want to extract 7 cl...
Bonaparte asked 1/12, 2017 at 2:22
1
Solved
It's my very first time trying to run a KMeans cluster analysis in Spark, so I am sorry for a stupid question.
I have a spark dataframe mydataframe with many columns. I want to run kmeans on only t...
Makalu asked 1/12, 2017 at 1:14
2
Solved
I'm trying to run a self-contained application using Scala on Apache Spark, based on the example here:
http://spark.apache.org/docs/latest/ml-pipeline.html
Here's my complete code:
import org.apache.spa...
Fetiparous asked 27/10, 2016 at 10:7
2
Solved
In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEstimato...
Underproof asked 8/1, 2016 at 13:59
3
Solved
I have an RDD with a tuple of values (String, SparseVector) and I want to create a DataFrame using the RDD. To get a (label:string, features:vector) DataFrame which is the Schema required by most o...
Pagas asked 23/9, 2015 at 16:47
1
Solved
I am using Spark Scala to calculate cosine similarity between DataFrame rows.
The DataFrame format is below:
root
|-- SKU: double (nullable = true)
|-- Features: vector (nullable = true)
Samp...
Illfated asked 30/10, 2017 at 7:38
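Row-to-row cosine similarity reduces to a dot product over the vector norms. A minimal plain-Python sketch of the formula with toy vectors (not the Spark DataFrame code):

```python
# Sketch: cosine similarity between two feature vectors.
# cos(a, b) = a.b / (|a| * |b|); toy vectors, not the Spark code.
import math

def cosine_similarity(a, b):
    """1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # -> 0.0
```

Applied pairwise over the SKU/Features rows above, this is the quantity the question computes; normalizing each vector up front turns it into a plain dot product.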
2
Solved
I am working with Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML Pipeline, consisting of some Transformers and a Classifier.
Let's assume for the sake of simplicity th...
Lilithe asked 11/5, 2017 at 9:35
3
Solved
It is my first time with PySpark (Spark 2), and I'm trying to create a toy dataframe for a Logit model. I ran the tutorial successfully and would like to pass my own data into it.
I've tried thi...
Toothache asked 12/7, 2017 at 16:55
1
Solved
In the MLlib version of Random Forest there was a possibility to specify the columns with nominal features (numerical but still categorical variables) with the parameter categoricalFeaturesInfo.
What's...
Tm asked 15/10, 2017 at 20:42
4
I'm working with Spark 1.3.0 using PySpark and MLlib and I need to save and load my models. I use code like this (taken from the official documentation)
from pyspark.mllib.recommendation import A...
Finegan asked 25/3, 2015 at 12:3
1
Solved
I can extract the vocabulary from a CountVectorizerModel in the following way
fl = StopWordsRemover(inputCol="words", outputCol="filtered")
df = fl.transform(df)
cv = CountVectorizer(inputCol="filtered",...
Pug asked 12/10, 2017 at 17:27
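Conceptually, a count-vectorizer's vocabulary is just the set of distinct tokens that survive a document-frequency cutoff, ordered by frequency. A plain-Python sketch of that idea (illustration only, not the Spark CountVectorizerModel API):

```python
# Sketch: what a count-vectorizer vocabulary conceptually is -- distinct
# tokens surviving a document-frequency cutoff, most frequent first.
# Plain Python for illustration; not the Spark CountVectorizerModel API.
from collections import Counter

def build_vocabulary(docs, min_df=1):
    """Return tokens appearing in at least min_df documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each token once per document
    return [tok for tok, cnt in df.most_common() if cnt >= min_df]

docs = [["spark", "mllib"], ["spark", "ml"], ["spark"]]
print(build_vocabulary(docs, min_df=2))  # only 'spark' meets min_df=2
```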
1
Solved
According to LinearRegressionSummary (Spark 2.1.0 JavaDoc), p-values are only available for the "normal" solver.
This value is only available when using the "normal" solver.
What the hell is t...
Ambrosio asked 11/10, 2017 at 19:49
1
Solved
I am trying to build a demo project in java 9 with maven that uses the dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.10<...
Telescopic asked 11/10, 2017 at 8:36
2
What is the optimum number of vector size to be set in word2vec algorithm if the total number of unique words is greater than 1 billion?
I am using Apache Spark MLlib 1.6.0 for word2vec.
Sample ...
Eyeopener asked 4/10, 2017 at 8:58
4
When I try to run it on this folder it throws ExecutorLostFailure every time.
Hi, I am a beginner in Spark. I am trying to run a job on Spark 1.4.1 with 8 slave nodes with 11.7 GB memory...
Godred asked 21/7, 2015 at 2:51
2
Solved
How can you evaluate the implicit feedback collaborative filtering algorithm of Apache Spark, given that the implicit "ratings" can vary from zero to anything, so a simple MSE or RMSE does not have...
Clinometer asked 28/9, 2017 at 6:36
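One common alternative to MSE/RMSE for implicit feedback is a ranking metric such as mean percentile rank (MPR), which asks where each held-out item lands in the user's ranked recommendations. A plain-Python sketch with hypothetical toy scores (not the Spark API):

```python
# Sketch: mean percentile rank (MPR) for implicit-feedback recommenders.
# For each held-out (user, item) pair, measure where the item lands in the
# user's ranked recommendation list: 0.0 = top, 1.0 = bottom.
# Lower MPR is better; a random ranking gives roughly 0.5. Toy data only.

def mean_percentile_rank(scores, held_out):
    """scores: user -> {item: predicted score}; held_out: (user, item) pairs."""
    ranks = []
    for user, item in held_out:
        ranked = sorted(scores[user], key=scores[user].get, reverse=True)
        denom = max(len(ranked) - 1, 1)
        ranks.append(ranked.index(item) / denom)
    return sum(ranks) / len(ranks)

scores = {"u1": {"a": 0.9, "b": 0.5, "c": 0.1}}   # hypothetical predictions
print(mean_percentile_rank(scores, [("u1", "a")]))  # top-ranked item -> 0.0
```

Because it only uses the ordering of scores, MPR is insensitive to the unbounded scale of implicit "ratings" that makes raw MSE/RMSE misleading here.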
1
Error:
ERROR TaskSetManager: Total size of serialized results of XXXX tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)
Goal: Obtain recommendation for all the users using the mo...
Snappish asked 2/12, 2015 at 5:25
© 2022 - 2024 — McMap. All rights reserved.