apache-spark-mllib Questions

3

I have a dataframe gi_man_df where group can be n:
+------------------+-----------------+--------+--------------+
|             group|           number|rand_int|   rand_double|
+------------------+-----------------+----...

2

Looking over the source code for Bisecting K-means, it seems that it builds an internal tree representation of the cluster assignments at each level as it progresses. Is it possible to get access to th...
Patron asked 20/1, 2017 at 21:2
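For context, a minimal sketch of what the public DataFrame-based API does expose (assuming a DataFrame with a "features" vector column; the internal tree itself is private, so only leaf-level centers and assignments are reachable):

from pyspark.ml.clustering import BisectingKMeans

# Fit bisecting k-means; the "features" column name is an assumption.
bkm = BisectingKMeans(k=4, featuresCol="features", seed=1)
model = bkm.fit(df)

# Public surface only: leaf cluster centers and per-row assignments.
centers = model.clusterCenters()
assigned = model.transform(df)   # adds a "prediction" column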

6

I found the same discussion in the comments section of Create a custom Transformer in PySpark ML, but there is no clear answer. There is also an unresolved JIRA corresponding to it: https://issues.ap...
Downdraft asked 30/12, 2016 at 16:25
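For reference, a minimal sketch of the custom-Transformer pattern (assuming Spark >= 2.3, where DefaultParamsReadable/Writable make it persistable; the class and its behavior are hypothetical examples):

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F

class UpperCaser(Transformer, HasInputCol, HasOutputCol,
                 DefaultParamsReadable, DefaultParamsWritable):
    # Hypothetical example: copies inputCol to outputCol, upper-cased.

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        self._set(**self._input_kwargs)

    def _transform(self, df):
        return df.withColumn(self.getOutputCol(),
                             F.upper(F.col(self.getInputCol())))

Because it only subclasses Transformer and shared param mixins, an instance drops straight into a Pipeline like any built-in stage.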

8

Solved

I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:
from pyspark.ml.classification import LogisticRegression...
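A minimal sketch of pulling the winner out of a fitted CrossValidatorModel (assuming the estimator is a plain classifier rather than a Pipeline; for a Pipeline, index into bestModel.stages first):

cv_model = crossval.fit(training)   # crossval: a configured CrossValidator
best = cv_model.bestModel           # the refit winning model
print(best.extractParamMap())       # all params, including the tuned ones
print(cv_model.avgMetrics)          # mean metric per param-grid combination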

3

Solved

How to create SparseVector and dense Vector representations if the DenseVector is:
denseV = np.array([0., 3., 0., 4.])
What will the SparseVector representation be?
Suffix asked 20/7, 2015 at 17:37
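For reference, a sketch of both representations for exactly this array (the non-zeros sit at indices 1 and 3):

from pyspark.mllib.linalg import Vectors

denseV = Vectors.dense([0., 3., 0., 4.])
# Vectors.sparse takes (size, {index: value}) or (size, indices, values).
sparseV = Vectors.sparse(4, {1: 3.0, 3: 4.0})
# sparseV prints as SparseVector(4, {1: 3.0, 3: 4.0})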

3

I'm new to Spark. I'm coding a machine learning algorithm in Spark standalone (v3.0.0) with these configurations set:
SparkConf conf = new SparkConf();
conf.setMaster("local[*]");
conf.set...
False asked 2/9, 2020 at 10:52
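For comparison, the equivalent setup in PySpark, as a sketch (the memory setting is an arbitrary example, not taken from the question):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")                    # use all local cores
         .config("spark.driver.memory", "4g")   # example value only
         .getOrCreate())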

2

Solved

I used the Java API of Apache Spark 1.2.0 and created two sparse vectors as follows.
Vector v1 = Vectors.sparse(3, new int[]{0, 2}, new double[]{1.0, 3.0});
Vector v2 = Vectors.sparse(2, new ...
Cobelligerent asked 7/4, 2015 at 7:2

2

Solved

If I increase the size of my word2vec model, I start to see this kind of exception in my log:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6 ...
Barron asked 23/4, 2016 at 19:38
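MetadataFetchFailedException generally means an executor that held shuffle output was lost, often from memory pressure. A sketch of the knobs commonly tried first (the values are placeholders, not a known fix for this case):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "8g")           # more headroom per executor
         .config("spark.sql.shuffle.partitions", "400")   # smaller shuffle blocks
         .getOrCreate())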

2

I am trying to build a simple custom Estimator in PySpark MLlib. I have read here that it is possible to write a custom Transformer, but I am not sure how to do it for an Estimator. I also don't understan...
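A minimal sketch of the Estimator pattern (all names here are hypothetical): _fit learns a statistic from the data and returns a Model whose _transform applies it.

from pyspark import keyword_only
from pyspark.ml import Estimator, Model
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F

class MeanFillModel(Model, HasInputCol, HasOutputCol):
    # Hypothetical model produced by MeanFill: fills nulls with the mean.
    def __init__(self, mean, inputCol, outputCol):
        super().__init__()
        self.mean = mean
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, df):
        return df.withColumn(
            self.getOutputCol(),
            F.coalesce(F.col(self.getInputCol()), F.lit(self.mean)))

class MeanFill(Estimator, HasInputCol, HasOutputCol):
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        self._set(**self._input_kwargs)

    def _fit(self, df):
        # Learn the column mean; this is the "training" step.
        mean = df.agg(F.avg(self.getInputCol())).first()[0]
        return MeanFillModel(mean=mean,
                             inputCol=self.getInputCol(),
                             outputCol=self.getOutputCol())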

4

I'm following the instructions of PMML model export - spark.mllib to create a K-means model.
val numClusters = 10
val numIterations = 10
val clusters = KMeans.train(data, numClusters, numIteration...
Lodgings asked 15/6, 2016 at 14:35

4

Solved

I want to use the pyspark.mllib.stat.Statistics.corr function to compute the correlation between two columns of a pyspark.sql.dataframe.DataFrame object. The corr function expects to take an RDD of Vectors objec...
Nicholasnichole asked 3/6, 2016 at 16:6
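For reference, two routes, as a sketch (column names are hypothetical): the DataFrame API skips the RDD detour entirely, and Statistics.corr also accepts two RDDs of plain floats instead of one RDD of Vectors:

from pyspark.mllib.stat import Statistics

# Route 1: DataFrame API directly; Pearson by default.
r1 = df.stat.corr("colA", "colB")

# Route 2: mllib Statistics over two RDDs of floats.
x = df.select("colA").rdd.map(lambda row: float(row[0]))
y = df.select("colB").rdd.map(lambda row: float(row[0]))
r2 = Statistics.corr(x, y, method="pearson")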

3

Solved

from pyspark.ml.regression import RandomForestRegressor, RandomForestRegressionModel
rf = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)
rf_m...
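The excerpt cuts off at rf_m..., so the intent is a guess; if the goal is persisting and reloading the fitted model, a sketch (the path is a placeholder):

rf_model.write().overwrite().save("/tmp/rf_model")

from pyspark.ml.regression import RandomForestRegressionModel
loaded = RandomForestRegressionModel.load("/tmp/rf_model")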

3

Solved

I have a dataframe resulting from a SQL query:
df1 = sqlContext.sql("select * from table_test")
I need to convert this dataframe to libsvm format so that it can be provided as an input for pysp...
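A sketch of one route (assuming df1 has a numeric label column plus feature columns; the feature-column names below are hypothetical): assemble a vector column, then write with the built-in libsvm data source, which expects exactly a "label" and a "features" column.

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
libsvm_df = assembler.transform(df1).select("label", "features")
libsvm_df.write.format("libsvm").save("/tmp/table_test_libsvm")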

1

I am looking for a way to run the spark.ml.feature.PCA function over grouped data returned from a groupBy() call on a dataframe, but I'm not sure whether this is possible or how to achieve it. This is ...
Radiotelegraph asked 21/7, 2017 at 14:44
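spark.ml PCA fits one model per DataFrame, so there is no direct grouped variant; one workable sketch (assuming a "group" column and an already-assembled "features" vector column) is simply to fit one model per group:

from pyspark.ml.feature import PCA
from pyspark.sql import functions as F

groups = [r[0] for r in df.select("group").distinct().collect()]
models = {g: PCA(k=2, inputCol="features", outputCol="pca")
               .fit(df.where(F.col("group") == g))
          for g in groups}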

3

Solved

I have a sparse vector like this:
>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), Sparse...

3

Solved

The following ran successfully on a Cloudera CDSW cluster gateway.
import pyspark
from pyspark.sql import SparkSession
spark = (SparkSession
    .builder
    .config("spark.jars.packages", "JohnSnowLabs:...

2

I have a Spark Dataframe as below:
predictions.show(5)
+------+----+------+-----------+
|  user|item|rating| prediction|
+------+----+------+-----------+
|379433|  31|     1| 0.08203495|
|  1834|  31|     1| 0...
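The excerpt cuts off before the actual question; if the goal is to score these ALS-style predictions, a sketch using exactly the column names shown:

from pyspark.ml.evaluation import RegressionEvaluator

rmse = RegressionEvaluator(labelCol="rating",
                           predictionCol="prediction",
                           metricName="rmse").evaluate(predictions)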

3

Solved

I'm trying to use Spark 2.3.1 with Java. I followed the examples in the documentation but keep getting a poorly described exception when calling .fit(trainingData):
Exception in thread "main" java.lang...
Asphyxiant asked 15/7, 2018 at 22:11

5

I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a metho...
Choreographer asked 10/3, 2015 at 19:1
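When this was asked, the RDD-based API had no such accessor; the DataFrame-based API later exposed one directly on the fitted model. A sketch (column names are assumptions):

from pyspark.ml.classification import RandomForestClassifier

model = RandomForestClassifier(labelCol="label",
                               featuresCol="features").fit(train_df)
print(model.featureImportances)   # SparseVector of per-feature weights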

3

Solved

I have an MLlib model saved in a folder on S3, say bucket-name/test-model. Now I have a Spark cluster (let's say on a single machine for now). I am running the following commands to load the model...
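The excerpt does not say which model class was saved, so the class below is an assumption; the load pattern is the same across pyspark.mllib models, and the s3a:// scheme needs the Hadoop AWS jars plus credentials configured on the cluster:

from pyspark.mllib.tree import RandomForestModel

# Model class is a placeholder; substitute whatever was actually saved.
model = RandomForestModel.load(sc, "s3a://bucket-name/test-model")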

2

Solved

I wanted to convert the Spark data frame to an RDD using the code below:
from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: V...
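In Spark 2.x a DataFrame no longer has .map; the usual fix (a sketch, assuming all columns are numeric) is to go through .rdd first:

from pyspark.mllib.linalg import Vectors

rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))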

3

Solved

This might be a very simple question, but is there any simple way to measure the execution time of a Spark job (submitted using spark-submit)? It would help us in profiling Spark jobs based on...
Fluorene asked 30/4, 2016 at 0:28
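One low-tech sketch: wall-clock the driver-side action, and use the Spark UI / history server for the per-stage breakdown (the count() below is a placeholder action):

import time

start = time.time()
result = df.count()   # any action that actually triggers the job
print(f"job took {time.time() - start:.1f}s")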

3

I am trying to take columns from a DataFrame and convert them to an RDD[Vector]. The problem is that I have columns with a "dot" in their names, as in the following dataset:
"col0.1","col1.2","col2.3"...
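Dots are parsed as struct-field accessors, so such names must be backtick-quoted; a sketch using the column names from the question:

from pyspark.mllib.linalg import Vectors

cols = ["`col0.1`", "`col1.2`", "`col2.3`"]
rdd = (df.select(*cols)
         .rdd.map(lambda row: Vectors.dense([float(x) for x in row])))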

3

I am using Spark MLlib 1.4.1 to create a DecisionTree model. Now I want to extract rules from the decision tree. How can I extract the rules?
Longdrawnout asked 3/8, 2015 at 8:4
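The closest built-in for the RDD-based API is toDebugString, which prints every split as nested if/else text that can then be parsed into rules; a sketch:

# model is the fitted pyspark.mllib.tree.DecisionTreeModel
print(model.toDebugString())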

3

I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark. Since I had textual categorical variables and numeric ones too, I had to use a pipe...
Instable asked 19/6, 2018 at 22:8
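One known recipe, as a sketch: after VectorAssembler, the features column carries ml_attr metadata mapping each vector slot back to its source column, which can be joined with featureImportances (transformed_df, "features", and model are assumptions standing in for the pipeline output):

# Slot-index -> column-name map from the assembler's metadata.
attrs = transformed_df.schema["features"].metadata["ml_attr"]["attrs"]
names = {a["idx"]: a["name"]
         for group in attrs.values()   # e.g. the 'numeric' and 'binary' groups
         for a in group}

# Pair names with importances, largest first, ready for plotting.
weights = model.featureImportances.toArray()
pairs = sorted(((names[i], w) for i, w in enumerate(weights) if w > 0),
               key=lambda t: -t[1])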
