Python pyspark BisectingKMeans用法及代碼示例

本文簡要介紹 pyspark.ml.clustering.BisectingKMeans 的用法。

用法: class pyspark.ml.clustering.BisectingKMeans(*, featuresCol='features', predictionCol='prediction', maxIter=20, seed=None, k=4, minDivisibleClusterSize=1.0, distanceMeasure='euclidean', weightCol=None)

一種基於 Steinbach、Karypis 和 Kumar 的論文“文檔聚類技術比較”的二等分 k-means 算法，經過修改以適應 Spark。該算法從包含所有點的單個集群開始。它迭代地在底層找到可分割的簇，並使用k-means將它們一分為二，直到總共有k葉簇或沒有葉簇是可分割的。同一級別的集群的二等分步驟被組合在一起以增加並行度。如果在底層平分所有可分割的集群會產生超過k 葉集群，更大的集群獲得更高的優先級。

2.0.0 版中的新函數。

例子：

>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.dense([0.0, 0.0]), 2.0), (Vectors.dense([1.0, 1.0]), 2.0),
...         (Vectors.dense([9.0, 8.0]), 2.0), (Vectors.dense([8.0, 9.0]), 2.0)]
>>> df = spark.createDataFrame(data, ["features", "weighCol"])
>>> bkm = BisectingKMeans(k=2, minDivisibleClusterSize=1.0)
>>> bkm.setMaxIter(10)
BisectingKMeans...
>>> bkm.getMaxIter()
10
>>> bkm.clear(bkm.maxIter)
>>> bkm.setSeed(1)
BisectingKMeans...
>>> bkm.setWeightCol("weighCol")
BisectingKMeans...
>>> bkm.getSeed()
1
>>> bkm.clear(bkm.seed)
>>> model = bkm.fit(df)
>>> model.getMaxIter()
20
>>> model.setPredictionCol("newPrediction")
BisectingKMeansModel...
>>> model.predict(df.head().features)
0
>>> centers = model.clusterCenters()
>>> len(centers)
2
>>> model.computeCost(df)
2.0
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
2
>>> summary.clusterSizes
[2, 2]
>>> summary.trainingCost
4.000...
>>> transformed = model.transform(df).select("features", "newPrediction")
>>> rows = transformed.collect()
>>> rows[0].newPrediction == rows[1].newPrediction
True
>>> rows[2].newPrediction == rows[3].newPrediction
True
>>> bkm_path = temp_path + "/bkm"
>>> bkm.save(bkm_path)
>>> bkm2 = BisectingKMeans.load(bkm_path)
>>> bkm2.getK()
2
>>> bkm2.getDistanceMeasure()
'euclidean'
>>> model_path = temp_path + "/bkm_model"
>>> model.save(model_path)
>>> model2 = BisectingKMeansModel.load(model_path)
>>> model2.hasSummary
False
>>> model.clusterCenters()[0] == model2.clusterCenters()[0]
array([ True,  True], dtype=bool)
>>> model.clusterCenters()[1] == model2.clusterCenters()[1]
array([ True,  True], dtype=bool)
>>> model.transform(df).take(1) == model2.transform(df).take(1)
True

相關用法

注：本文由純淨天空篩選整理自spark.apache.org大神的英文原創作品 pyspark.ml.clustering.BisectingKMeans。非經特殊聲明，原始代碼版權歸原作者所有，本譯文未經允許或授權，請勿轉載或複製。