Python pyspark KMeans Usage and Code Examples

This article briefly introduces the usage of pyspark.ml.clustering.KMeans.

Usage: class pyspark.ml.clustering.KMeans(*, featuresCol='features', predictionCol='prediction', k=2, initMode='k-means||', initSteps=2, tol=0.0001, maxIter=20, seed=None, distanceMeasure='euclidean', weightCol=None)
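
To make the roles of k, maxIter and tol more concrete, here is a minimal single-machine sketch of Lloyd-style k-means iterations in plain Python. It is not Spark's implementation: the function name lloyd, its arguments, and the stopping rule (stop once no center moves farther than tol) are assumptions made for illustration, and Spark's exact convergence test may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=20, tol=1e-4):
    """Run plain Lloyd iterations; return (centers, iterations_run)."""
    centers = [tuple(c) for c in centers]
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for old, group in zip(centers, groups):
            if group:
                new_centers.append(tuple(sum(xs) / len(group) for xs in zip(*group)))
            else:
                new_centers.append(old)  # leave the center of an empty cluster in place
        # Assumed stopping rule: largest center movement is at most tol.
        moved = max(squared_distance(o, n) ** 0.5
                    for o, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= tol:
            break
    return centers, iterations


points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
final_centers, iterations = lloyd(points, centers=[(0.0, 0.0), (9.0, 8.0)])
print(final_centers, iterations)
```

The number of initial centers passed in plays the role of k, max_iter caps the number of passes, and tol ends the loop early once the centers have stopped moving.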

K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al.).
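
As a rough illustration of what "k-means++ like" seeding means, the sketch below picks initial centers sequentially, sampling each new center with probability proportional to its squared distance from the nearest center chosen so far. This is the classic single-machine k-means++ idea, not the parallel k-means|| procedure that Spark runs; the helper name kmeans_pp_seed and its parameters are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seed(points, k, seed=None):
    """Pick k initial centers using k-means++ style D^2 sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
            continue
        # Far-away points are more likely to become the next center.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers = kmeans_pp_seed(points, k=2, seed=1)
print(centers)
```

Because a point that is already a center has zero weight, the second center is always a different point as long as the input points are distinct.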

New in version 1.5.0.

Examples:

>>> from pyspark.ml.clustering import KMeans, KMeansModel
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.dense([0.0, 0.0]), 2.0), (Vectors.dense([1.0, 1.0]), 2.0),
...         (Vectors.dense([9.0, 8.0]), 2.0), (Vectors.dense([8.0, 9.0]), 2.0)]
>>> df = spark.createDataFrame(data, ["features", "weightCol"])
>>> kmeans = KMeans(k=2)
>>> kmeans.setSeed(1)
KMeans...
>>> kmeans.setWeightCol("weightCol")
KMeans...
>>> kmeans.setMaxIter(10)
KMeans...
>>> kmeans.getMaxIter()
10
>>> kmeans.clear(kmeans.maxIter)
>>> model = kmeans.fit(df)
>>> model.getDistanceMeasure()
'euclidean'
>>> model.setPredictionCol("newPrediction")
KMeansModel...
>>> model.predict(df.head().features)
0
>>> centers = model.clusterCenters()
>>> len(centers)
2
>>> transformed = model.transform(df).select("features", "newPrediction")
>>> rows = transformed.collect()
>>> rows[0].newPrediction == rows[1].newPrediction
True
>>> rows[2].newPrediction == rows[3].newPrediction
True
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
2
>>> summary.clusterSizes
[2, 2]
>>> summary.trainingCost
4.0
>>> kmeans_path = temp_path + "/kmeans"
>>> kmeans.save(kmeans_path)
>>> kmeans2 = KMeans.load(kmeans_path)
>>> kmeans2.getK()
2
>>> model_path = temp_path + "/kmeans_model"
>>> model.save(model_path)
>>> model2 = KMeansModel.load(model_path)
>>> model2.hasSummary
False
>>> model.clusterCenters()[0] == model2.clusterCenters()[0]
array([ True,  True], dtype=bool)
>>> model.clusterCenters()[1] == model2.clusterCenters()[1]
array([ True,  True], dtype=bool)
>>> model.transform(df).take(1) == model2.transform(df).take(1)
True
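
The trainingCost of 4.0 reported above can be checked by hand. The sketch below assumes that the cost is the weighted sum of squared Euclidean distances from each point to its nearest center, which is consistent with the value shown. The centers (0.5, 0.5) and (8.5, 8.5) are the means of the two obvious groups in the example data; they are an assumption for illustration, not values read from the fitted model.

```python
# Points and weights copied from the example data above.
points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
weights = [2.0, 2.0, 2.0, 2.0]
# Assumed centers: the mean of each pair of nearby points.
centers = [(0.5, 0.5), (8.5, 8.5)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_cost(points, weights, centers):
    """Weighted sum of squared distances to the nearest center."""
    return sum(
        w * min(squared_distance(p, c) for c in centers)
        for p, w in zip(points, weights)
    )


cost = weighted_cost(points, weights, centers)
print(cost)  # 4.0
```

Each point lies at squared distance 0.5 from its center, so the unweighted cost would be 2.0; the weight of 2.0 on every row doubles it to 4.0.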

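Note: This article was selected and compiled by 纯净天空 from the original English documentation for pyspark.ml.clustering.KMeans on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original author. Please do not reproduce or copy this translation without permission or authorization.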