当前位置: 首页>>代码示例 >>用法及示例精选 >>正文


Python pyspark BisectingKMeans用法及代码示例


本文简要介绍 pyspark.ml.clustering.BisectingKMeans 的用法。

用法:

class pyspark.ml.clustering.BisectingKMeans(*, featuresCol='features', predictionCol='prediction', maxIter=20, seed=None, k=4, minDivisibleClusterSize=1.0, distanceMeasure='euclidean', weightCol=None)

一种基于 Steinbach、Karypis 和 Kumar 的论文“文档聚类技术比较”的二等分 k-means 算法,经过修改以适应 Spark。该算法从包含所有点的单个集群开始。它迭代地在底层找到可分割的簇,并使用k-means将它们一分为二,直到总共有k叶簇或没有叶簇是可分割的。同一级别的集群的二等分步骤被组合在一起以增加并行度。如果在底层平分所有可分割的集群会产生超过k 叶集群,更大的集群获得更高的优先级。

2.0.0 版中的新函数。

例子

>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.dense([0.0, 0.0]), 2.0), (Vectors.dense([1.0, 1.0]), 2.0),
...         (Vectors.dense([9.0, 8.0]), 2.0), (Vectors.dense([8.0, 9.0]), 2.0)]
>>> df = spark.createDataFrame(data, ["features", "weighCol"])
>>> bkm = BisectingKMeans(k=2, minDivisibleClusterSize=1.0)
>>> bkm.setMaxIter(10)
BisectingKMeans...
>>> bkm.getMaxIter()
10
>>> bkm.clear(bkm.maxIter)
>>> bkm.setSeed(1)
BisectingKMeans...
>>> bkm.setWeightCol("weighCol")
BisectingKMeans...
>>> bkm.getSeed()
1
>>> bkm.clear(bkm.seed)
>>> model = bkm.fit(df)
>>> model.getMaxIter()
20
>>> model.setPredictionCol("newPrediction")
BisectingKMeansModel...
>>> model.predict(df.head().features)
0
>>> centers = model.clusterCenters()
>>> len(centers)
2
>>> model.computeCost(df)
2.0
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
2
>>> summary.clusterSizes
[2, 2]
>>> summary.trainingCost
4.000...
>>> transformed = model.transform(df).select("features", "newPrediction")
>>> rows = transformed.collect()
>>> rows[0].newPrediction == rows[1].newPrediction
True
>>> rows[2].newPrediction == rows[3].newPrediction
True
>>> bkm_path = temp_path + "/bkm"
>>> bkm.save(bkm_path)
>>> bkm2 = BisectingKMeans.load(bkm_path)
>>> bkm2.getK()
2
>>> bkm2.getDistanceMeasure()
'euclidean'
>>> model_path = temp_path + "/bkm_model"
>>> model.save(model_path)
>>> model2 = BisectingKMeansModel.load(model_path)
>>> model2.hasSummary
False
>>> model.clusterCenters()[0] == model2.clusterCenters()[0]
array([ True,  True], dtype=bool)
>>> model.clusterCenters()[1] == model2.clusterCenters()[1]
array([ True,  True], dtype=bool)
>>> model.transform(df).take(1) == model2.transform(df).take(1)
True

相关用法


注:本文由纯净天空筛选整理自spark.apache.org大神的英文原创作品 pyspark.ml.clustering.BisectingKMeans。非经特殊声明,原始代码版权归原作者所有,本译文未经允许或授权,请勿转载或复制。