apache-spark-mllib Questions
3
I have a dataframe gi_man_df where group can be n:
+------------------+-----------------+--------+--------------+
| group | number|rand_int| rand_double|
+------------------+-----------------+----...
Krems asked 8/2, 2017 at 14:42
2
Looking over the source code for Bisecting K-means it seems that it builds an internal tree representation of the cluster assignments at each level it progresses. Is it possible to get access to th...
Patron asked 20/1, 2017 at 21:2
6
I found the same discussion in the comments section of Create a custom Transformer in PySpark ML, but there is no clear answer. There is also an unresolved JIRA corresponding to that: https://issues.ap...
Downdraft asked 30/12, 2016 at 16:25
8
Solved
I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:
from pyspark.ml.classification import LogisticRegression...
Limey asked 18/4, 2016 at 14:46
3
Solved
How do I create SparseVector and DenseVector representations?
If the dense vector is:
denseV = np.array([0., 3., 0., 4.])
What will the SparseVector representation be?
Suffix asked 20/7, 2015 at 17:37
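As a rough illustration of the question above: a sparse representation stores only the vector length plus the positions and values of the nonzero entries. The sketch below uses plain Python and a hypothetical helper named to_sparse (not part of Spark); the three pieces it returns are the kind of arguments that pyspark.mllib.linalg.Vectors.sparse(size, indices, values) accepts.

```python
def to_sparse(dense):
    """Return (size, indices, values) for the nonzero entries of a dense sequence.

    Hypothetical helper for illustration only; it is not a Spark API.
    """
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [float(dense[i]) for i in indices]
    return len(dense), indices, values


# Same numbers as denseV in the question, written as a plain list.
dense_v = [0.0, 3.0, 0.0, 4.0]
size, indices, values = to_sparse(dense_v)
print(size, indices, values)  # 4 [1, 3] [3.0, 4.0]
```

With PySpark available, the same triple could presumably be passed on as Vectors.sparse(size, indices, values).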
3
I'm new to Spark. I'm coding a machine learning algorithm in Spark standalone (v3.0.0) with these configurations set:
SparkConf conf = new SparkConf();
conf.setMaster("local[*]");
conf.set...
False asked 2/9, 2020 at 10:52
2
Solved
I used the Java API, i.e. Apache Spark 1.2.0, and created two sparse vectors as follows.
Vector v1 = Vectors.sparse(3, new int[]{0, 2}, new double[]{1.0, 3.0});
Vector v2 = Vectors.sparse(2, new ...
Cobelligerent asked 7/4, 2015 at 7:2
2
Solved
If I increase the model size of my word2vec model I start to get this kind of exception in my log:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6
...
Barron asked 23/4, 2016 at 19:38
2
I am trying to build a simple custom Estimator in PySpark MLlib. I have seen here that it is possible to write a custom Transformer, but I am not sure how to do it for an Estimator. I also don't understan...
Gloucester asked 17/5, 2016 at 8:4
4
I'm following the instructions of PMML model export - spark.mllib to create a K-means model.
val numClusters = 10
val numIterations = 10
val clusters = KMeans.train(data, numClusters, numIteration...
Lodgings asked 15/6, 2016 at 14:35
4
Solved
I want to use the pyspark.mllib.stat.Statistics.corr function to compute the correlation between two columns of a pyspark.sql.dataframe.DataFrame object. The corr function expects to take an rdd of Vectors objec...
Nicholasnichole asked 3/6, 2016 at 16:6
3
Solved
from pyspark.ml.regression import RandomForestRegressionModel, RandomForestRegressor
rf = RandomForestRegressor(labelCol="label",featuresCol="features", numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)
rf_m...
Curtal asked 17/2, 2017 at 17:12
3
Solved
I have a dataframe resulting from a sql query
df1 = sqlContext.sql("select * from table_test")
I need to convert this dataframe to libsvm format so that it can be provided as an input for
pysp...
Furry asked 11/5, 2017 at 15:44
1
I am looking for a way to run the spark.ml.feature.PCA function over grouped data returned from a groupBy() call on a dataframe. But I'm not sure if this is possible, or how to achieve it. This is ...
Radiotelegraph asked 21/7, 2017 at 14:44
3
Solved
I have a sparse vector like this
>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), Sparse...
Set asked 26/12, 2016 at 8:39
3
Solved
The following ran successfully on a Cloudera CDSW cluster gateway.
import pyspark
from pyspark.sql import SparkSession
spark = (SparkSession
.builder
.config("spark.jars.packages","JohnSnowLabs:...
Cadenza asked 7/12, 2017 at 22:52
2
I have a Spark Dataframe as below:
predictions.show(5)
+------+----+------+-----------+
| user|item|rating| prediction|
+------+----+------+-----------+
|379433| 31| 1| 0.08203495|
| 1834| 31| 1| 0...
Regelate asked 1/11, 2016 at 18:45
3
Solved
I'm trying to use Spark 2.3.1 with Java.
I followed the examples in the documentation but keep getting a poorly described exception when calling .fit(trainingData).
Exception in thread "main" java.lang...
Asphyxiant asked 15/7, 2018 at 22:11
5
I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a metho...
Choreographer asked 10/3, 2015 at 19:1
3
Solved
I have an MLLib model saved in a folder on S3, say bucket-name/test-model. Now, I have a spark cluster (let's say on a single machine for now). I am running the following commands to load the model...
Denman asked 28/9, 2019 at 6:34
2
Solved
I wanted to convert the spark data frame to add using the code below:
from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: V...
Tuberous asked 16/9, 2016 at 15:44
3
Solved
This might be a very simple question, but is there any simple way to measure the execution time of a Spark job (submitted using spark-submit)?
It would help us in profiling the Spark jobs based on...
Fluorene asked 30/4, 2016 at 0:28
3
I am trying to take columns from a DataFrame and convert them to an RDD[Vector].
The problem is that I have columns with a "dot" in their names, as in the following dataset:
"col0.1","col1.2","col2.3"...
Unpractical asked 5/6, 2017 at 10:33
3
I am using Spark MLlib 1.4.1 to create a decision tree model. Now I want to extract rules from the decision tree.
How can I extract the rules?
Longdrawnout asked 3/8, 2015 at 8:4
3
I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark.
Since I had textual categorical variables and numeric ones too, I had to use a pipe...
Instable asked 19/6, 2018 at 22:8