This is my first time running KMeans cluster analysis in Spark, so apologies if this is a basic question.
I have a Spark DataFrame, mydataframe, with many columns. I want to run KMeans on only two of them, lat and long (latitude and longitude), treating them as plain numeric values, and extract 7 clusters based on just those two columns. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
# Build the model (cluster the data)
clusters = KMeans.train(data, 7, maxIterations=15, initializationMode="random")
But I am getting an error:
'DataFrame' object has no attribute 'map'
What kind of object should be passed to KMeans.train? Clearly, it doesn't accept a DataFrame.
How should I prepare my data frame for the analysis?
Thank you very much!
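One possible direction, as a sketch only: KMeans.train in pyspark.mllib operates on an RDD rather than a DataFrame, so the two columns could be pulled out through the DataFrame's rdd attribute and each row turned into a plain list of floats. The helper names row_to_point and cluster_lat_long below are hypothetical, made up for illustration, and the sketch assumes that lat and long contain no nulls or non-numeric values.

```python
def row_to_point(row):
    """Turn one (lat, long) row into a plain list of two floats."""
    return [float(row[0]), float(row[1])]


def cluster_lat_long(df, k=7):
    """Hypothetical helper: train MLlib KMeans on the 'lat' and 'long' columns of df."""
    # Imported inside the function so row_to_point stays usable without Spark.
    from pyspark.mllib.clustering import KMeans

    # Go through df.rdd, because KMeans.train expects an RDD of points,
    # not a DataFrame. Assumes neither column contains nulls.
    points = df.select('lat', 'long').rdd.map(row_to_point)
    return KMeans.train(points, k, maxIterations=15, initializationMode="random")
```

With this in place, something like model = cluster_lat_long(mydataframe) should return a KMeansModel, whose clusterCenters attribute holds the 7 centers and whose predict method assigns a single [lat, long] point to a cluster. Note that this treats latitude and longitude as flat Euclidean coordinates, which matches the "simple values" requirement in the question.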