This article briefly introduces the usage of
pyspark.ml.clustering.BisectingKMeans.

Usage:
class pyspark.ml.clustering.BisectingKMeans(*, featuresCol='features', predictionCol='prediction', maxIter=20, seed=None, k=4, minDivisibleClusterSize=1.0, distanceMeasure='euclidean', weightCol=None)
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts with a single cluster containing all points. It iteratively finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.

New in version 2.0.0.
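The bisecting strategy described above can be sketched in plain Python. This is an illustrative, single-machine sketch only, not Spark's distributed implementation; the helper names `kmeans2` and `bisecting_kmeans` are my own, and ties and empty clusters are handled in a simplified way:

```python
import random

def kmeans2(points, iters=10, seed=0):
    """Split one cluster into two with plain Lloyd's k-means (k=2)."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            # Assign each point to the nearest of the two centers.
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        # Recompute centroids; keep the old center if a group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return [g for g in groups if g]

def bisecting_kmeans(points, k):
    """Start with one cluster; repeatedly bisect until there are k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()          # the largest cluster gets priority
        if len(target) < 2:              # no cluster is divisible any more
            clusters.append(target)
            break
        clusters.extend(kmeans2(target))
    return clusters

data = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
parts = bisecting_kmeans(data, k=2)
print(sorted(len(c) for c in parts))  # the two natural pairs: [2, 2]
```

Unlike plain k-means, each split only ever runs k=2 on one cluster's points, which is why Spark can batch the splits of all same-level clusters together for parallelism.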
Example:
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.dense([0.0, 0.0]), 2.0), (Vectors.dense([1.0, 1.0]), 2.0),
...         (Vectors.dense([9.0, 8.0]), 2.0), (Vectors.dense([8.0, 9.0]), 2.0)]
>>> df = spark.createDataFrame(data, ["features", "weighCol"])
>>> bkm = BisectingKMeans(k=2, minDivisibleClusterSize=1.0)
>>> bkm.setMaxIter(10)
BisectingKMeans...
>>> bkm.getMaxIter()
10
>>> bkm.clear(bkm.maxIter)
>>> bkm.setSeed(1)
BisectingKMeans...
>>> bkm.setWeightCol("weighCol")
BisectingKMeans...
>>> bkm.getSeed()
1
>>> bkm.clear(bkm.seed)
>>> model = bkm.fit(df)
>>> model.getMaxIter()
20
>>> model.setPredictionCol("newPrediction")
BisectingKMeansModel...
>>> model.predict(df.head().features)
0
>>> centers = model.clusterCenters()
>>> len(centers)
2
>>> model.computeCost(df)
2.0
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
2
>>> summary.clusterSizes
[2, 2]
>>> summary.trainingCost
4.000...
>>> transformed = model.transform(df).select("features", "newPrediction")
>>> rows = transformed.collect()
>>> rows[0].newPrediction == rows[1].newPrediction
True
>>> rows[2].newPrediction == rows[3].newPrediction
True
>>> bkm_path = temp_path + "/bkm"
>>> bkm.save(bkm_path)
>>> bkm2 = BisectingKMeans.load(bkm_path)
>>> bkm2.getK()
2
>>> bkm2.getDistanceMeasure()
'euclidean'
>>> model_path = temp_path + "/bkm_model"
>>> model.save(model_path)
>>> model2 = BisectingKMeansModel.load(model_path)
>>> model2.hasSummary
False
>>> model.clusterCenters()[0] == model2.clusterCenters()[0]
array([ True,  True], dtype=bool)
>>> model.clusterCenters()[1] == model2.clusterCenters()[1]
array([ True,  True], dtype=bool)
>>> model.transform(df).take(1) == model2.transform(df).take(1)
True
Related:
- Python pyspark BisectingKMeansModel usage and examples
- Python pyspark BinaryClassificationEvaluator usage and examples
- Python pyspark BinaryClassificationMetrics usage and examples
- Python pyspark Binarizer usage and examples
- Python pyspark BlockMatrix.add usage and examples
- Python pyspark BlockMatrix.colsPerBlock usage and examples
- Python pyspark BlockMatrix.subtract usage and examples
- Python pyspark BlockMatrix.toLocalMatrix usage and examples
- Python pyspark BlockMatrix.toIndexedRowMatrix usage and examples
- Python pyspark BlockMatrix.rowsPerBlock usage and examples
- Python pyspark BlockMatrix.numCols usage and examples
- Python pyspark BlockMatrix.numColBlocks usage and examples
- Python pyspark BlockMatrix.numRowBlocks usage and examples
- Python pyspark Bucketizer usage and examples
- Python pyspark BlockMatrix.toCoordinateMatrix usage and examples
- Python pyspark BucketedRandomProjectionLSH usage and examples
- Python pyspark BlockMatrix.transpose usage and examples
- Python pyspark BlockMatrix.numRows usage and examples
- Python pyspark BlockMatrix.multiply usage and examples
- Python pyspark Broadcast usage and examples
- Python pyspark BlockMatrix.blocks usage and examples
- Python pyspark create_map usage and examples
- Python pyspark date_add usage and examples
- Python pyspark DataFrame.to_latex usage and examples
- Python pyspark DataStreamReader.schema usage and examples
Note: this article is adapted from the original English documentation for pyspark.ml.clustering.BisectingKMeans on spark.apache.org. Unless otherwise stated, copyright in the original code remains with its authors; do not reproduce or copy this translation without permission.