I have built a KMeansModel. My results are stored in a PySpark DataFrame called
transformed
.
(a) How do I interpret the contents of transformed
?
(b) How do I create one or more Pandas DataFrame from transformed
that would show summary statistics for each of the 13 features for each of the 14 clusters?
from pyspark.ml.clustering import KMeans
# Trains a k-means model.
kmeans = KMeans().setK(14).setSeed(1)
model = kmeans.fit(X_spark_scaled) # Fits a model to the input dataset with optional parameters.
transformed = model.transform(X_spark_scaled).select("features", "prediction") # X_spark_scaled is my PySpark DataFrame consisting of 13 features
transformed.show(5, truncate = False)
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|features |prediction|
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|(14,[4,5,7,8,9,13],[1.0,1.0,485014.0,0.25,2.0,1.0]) |12 |
|(14,[2,7,8,9,12,13],[1.0,2401233.0,1.0,1.0,1.0,1.0]) |2 |
|(14,[2,4,5,7,8,9,13],[0.3333333333333333,0.6666666666666666,0.6666666666666666,2429111.0,0.9166666666666666,1.3333333333333333,3.0])|2 |
|(14,[4,5,7,8,9,12,13],[1.0,1.0,2054748.0,0.15384615384615385,11.0,1.0,1.0]) |11 |
|(14,[2,7,8,9,13],[1.0,43921.0,1.0,1.0,1.0]) |1 |
+------------------------------------------------------------------------------------------------------------------------------------+----------+
only showing top 5 rows
As an aside, I found from another SO post that I can map the features to their names like below. It would be nice to have summary statistics (mean, median, std, min, max) for each feature of each cluster in one or more Pandas dataframes.
attr_list = [attr for attr in chain(*transformed.schema['features'].metadata['ml_attr']['attrs'].values())]
attr_list
Per request in the comments, here is a snapshot consisting of 2 records of the data (don't want to provide too many records -- proprietary information here)
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
|device_type_robot_pct|device_type_smart_tv_pct|device_type_desktop_pct|device_type_tablet_pct|device_type_mobile_pct|device_type_mobile_persist_pct|visitors_seen_with_anonymiser_pct|ip_time_span| ip_weight|mean_ips_per_visitor|visitors_seen_with_multi_country_pct|international_visitors_pct|visitors_seen_with_multi_ua_pct|count_tuids_on_ip| features| scaledFeatures|
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
| 0.0| 0.0| 0.0| 0.0| 1.0| 1.0| 0.0| 485014.0| 0.25| 2.0| 0.0| 0.0| 0.0| 1.0|(14,[4,5,7,8,9,13...|(14,[4,5,7,8,9,13...|
| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0| 0.0| 2401233.0| 1.0| 1.0| 0.0| 0.0| 1.0| 1.0|(14,[2,7,8,9,12,1...|(14,[2,7,8,9,12,1...|
X_spark_scaled
, please? – Scansion