This article briefly describes the usage of pyspark.ml.clustering.BisectingKMeans.

Usage:
class pyspark.ml.clustering.BisectingKMeans(*, featuresCol='features', predictionCol='prediction', maxIter=20, seed=None, k=4, minDivisibleClusterSize=1.0, distanceMeasure='euclidean', weightCol=None)
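For orientation, here is a sketch that instantiates the estimator with every parameter written out, with comments paraphrasing Spark's parameter documentation (the values shown are simply the defaults):

from pyspark.ml.clustering import BisectingKMeans

# Sketch: the default settings, spelled out, with what each parameter controls.
bkm = BisectingKMeans(
    featuresCol="features",        # input column holding the feature vectors
    predictionCol="prediction",    # output column for predicted cluster indices
    maxIter=20,                    # max k-means iterations per bisection step
    seed=None,                     # random seed (None lets Spark derive one)
    k=4,                           # desired number of leaf clusters
    minDivisibleClusterSize=1.0,   # >= 1.0: min number of points; < 1.0: min proportion
    distanceMeasure="euclidean",   # "euclidean" or "cosine"
    weightCol=None,                # optional column of instance weights
)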
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.

New in version 2.0.0.
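To make the splitting strategy concrete, below is a minimal pure-NumPy sketch of the idea. It is illustrative only, not Spark's implementation: it runs sequentially, skips the level-wise grouping Spark uses for parallelism, and kmeans_2 and bisecting_kmeans are hypothetical helpers written for this sketch.

import numpy as np

def kmeans_2(points, n_iter=20, seed=0):
    """Plain 2-means, used as the bisection step (simplified sketch)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=2, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every point to the nearer of the two centers.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in (0, 1):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def bisecting_kmeans(points, k):
    """Start from one cluster holding all points and repeatedly bisect
    until there are k leaf clusters or no leaf cluster is divisible."""
    clusters = [points]
    while len(clusters) < k:
        divisible = [i for i, c in enumerate(clusters) if len(c) >= 2]
        if not divisible:
            break  # no leaf cluster is divisible
        # Spark gives larger clusters higher priority; here, split the largest.
        i = max(divisible, key=lambda j: len(clusters[j]))
        target = clusters.pop(i)
        labels = kmeans_2(target)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters

# Tiny demo on four 2-D points forming two obvious groups.
data = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
for cluster in bisecting_kmeans(data, k=2):
    print(cluster.tolist())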
Examples:
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.dense([0.0, 0.0]), 2.0), (Vectors.dense([1.0, 1.0]), 2.0),
...         (Vectors.dense([9.0, 8.0]), 2.0), (Vectors.dense([8.0, 9.0]), 2.0)]
>>> df = spark.createDataFrame(data, ["features", "weighCol"])
>>> bkm = BisectingKMeans(k=2, minDivisibleClusterSize=1.0)
>>> bkm.setMaxIter(10)
BisectingKMeans...
>>> bkm.getMaxIter()
10
>>> bkm.clear(bkm.maxIter)
>>> bkm.setSeed(1)
BisectingKMeans...
>>> bkm.setWeightCol("weighCol")
BisectingKMeans...
>>> bkm.getSeed()
1
>>> bkm.clear(bkm.seed)
>>> model = bkm.fit(df)
>>> model.getMaxIter()
20
>>> model.setPredictionCol("newPrediction")
BisectingKMeansModel...
>>> model.predict(df.head().features)
0
>>> centers = model.clusterCenters()
>>> len(centers)
2
>>> model.computeCost(df)
2.0
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
2
>>> summary.clusterSizes
[2, 2]
>>> summary.trainingCost
4.000...
>>> transformed = model.transform(df).select("features", "newPrediction")
>>> rows = transformed.collect()
>>> rows[0].newPrediction == rows[1].newPrediction
True
>>> rows[2].newPrediction == rows[3].newPrediction
True
>>> bkm_path = temp_path + "/bkm"
>>> bkm.save(bkm_path)
>>> bkm2 = BisectingKMeans.load(bkm_path)
>>> bkm2.getK()
2
>>> bkm2.getDistanceMeasure()
'euclidean'
>>> model_path = temp_path + "/bkm_model"
>>> model.save(model_path)
>>> model2 = BisectingKMeansModel.load(model_path)
>>> model2.hasSummary
False
>>> model.clusterCenters()[0] == model2.clusterCenters()[0]
array([ True,  True], dtype=bool)
>>> model.clusterCenters()[1] == model2.clusterCenters()[1]
array([ True,  True], dtype=bool)
>>> model.transform(df).take(1) == model2.transform(df).take(1)
True
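The doctest above relies on two fixtures provided by Spark's test harness: an active SparkSession bound to the name spark and a writable directory in temp_path. Making those assumptions explicit (a local master and a tempfile directory), the same flow could be run as a self-contained script, sketched below:

import tempfile

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import BisectingKMeans, BisectingKMeansModel

# Explicitly create the session and temp directory the doctest takes for granted.
spark = SparkSession.builder.master("local[2]").appName("bkm-demo").getOrCreate()
temp_path = tempfile.mkdtemp()

data = [(Vectors.dense([0.0, 0.0]), 2.0), (Vectors.dense([1.0, 1.0]), 2.0),
        (Vectors.dense([9.0, 8.0]), 2.0), (Vectors.dense([8.0, 9.0]), 2.0)]
df = spark.createDataFrame(data, ["features", "weighCol"])

bkm = BisectingKMeans(k=2, minDivisibleClusterSize=1.0, seed=1, weightCol="weighCol")
model = bkm.fit(df)
print(model.clusterCenters())        # two centers, one per group of points
print(model.summary.clusterSizes)    # e.g. [2, 2]

# Round-trip the fitted model through disk.
model_path = temp_path + "/bkm_model"
model.save(model_path)
model2 = BisectingKMeansModel.load(model_path)
print(model2.transform(df).select("features", "prediction").collect())

spark.stop()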
Note: This article was compiled by 純淨天空 from the original English work pyspark.ml.clustering.BisectingKMeans on spark.apache.org. Unless otherwise stated, copyright in the original code belongs to its authors; please do not reproduce or copy this translation without permission.