Python pyspark GaussianMixture用法及代碼示例

本文簡要介紹 pyspark.ml.clustering.GaussianMixture 的用法。

用法: class pyspark.ml.clustering.GaussianMixture(*, featuresCol='features', predictionCol='prediction', k=2, probabilityCol='probability', tol=0.01, maxIter=100, seed=None, aggregationDepth=2, weightCol=None)

GaussianMixture聚類。此類執行多元高斯混合模型 (GMM) 的期望最大化。 GMM 表示獨立高斯分布的複合分布，以及關聯的 “mixing” 權重，指定每個分布對複合分布的貢獻。

給定一組樣本點，此類將最大化 k 個高斯混合的對數似然，迭代直到對數似然變化小於收斂Tol，或者直到達到最大迭代次數。雖然這個過程通常保證收斂，但不能保證找到全局最優值。

2.0.0 版中的新函數。

注意：

對於高維數據(具有許多特征)，該算法可能表現不佳。這是由於高維數據(a)使其難以聚類(基於統計/理論參數)和(b)高斯分布的數值問題。

例子：

>>> from pyspark.ml.linalg import Vectors

>>> data = [(Vectors.dense([-0.1, -0.05 ]),),
...         (Vectors.dense([-0.01, -0.1]),),
...         (Vectors.dense([0.9, 0.8]),),
...         (Vectors.dense([0.75, 0.935]),),
...         (Vectors.dense([-0.83, -0.68]),),
...         (Vectors.dense([-0.91, -0.76]),)]
>>> df = spark.createDataFrame(data, ["features"])
>>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
>>> gm.getMaxIter()
100
>>> gm.setMaxIter(30)
GaussianMixture...
>>> gm.getMaxIter()
30
>>> model = gm.fit(df)
>>> model.getAggregationDepth()
2
>>> model.getFeaturesCol()
'features'
>>> model.setPredictionCol("newPrediction")
GaussianMixtureModel...
>>> model.predict(df.head().features)
2
>>> model.predictProbability(df.head().features)
DenseVector([0.0, 0.0, 1.0])
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
3
>>> summary.clusterSizes
[2, 2, 2]
>>> weights = model.weights
>>> len(weights)
3
>>> gaussians = model.gaussians
>>> len(gaussians)
3
>>> gaussians[0].mean
DenseVector([0.825, 0.8675])
>>> gaussians[0].cov
DenseMatrix(2, 2, [0.0056, -0.0051, -0.0051, 0.0046], 0)
>>> gaussians[1].mean
DenseVector([-0.87, -0.72])
>>> gaussians[1].cov
DenseMatrix(2, 2, [0.0016, 0.0016, 0.0016, 0.0016], 0)
>>> gaussians[2].mean
DenseVector([-0.055, -0.075])
>>> gaussians[2].cov
DenseMatrix(2, 2, [0.002, -0.0011, -0.0011, 0.0006], 0)
>>> model.gaussiansDF.select("mean").head()
Row(mean=DenseVector([0.825, 0.8675]))
>>> model.gaussiansDF.select("cov").head()
Row(cov=DenseMatrix(2, 2, [0.0056, -0.0051, -0.0051, 0.0046], False))
>>> transformed = model.transform(df).select("features", "newPrediction")
>>> rows = transformed.collect()
>>> rows[4].newPrediction == rows[5].newPrediction
True
>>> rows[2].newPrediction == rows[3].newPrediction
True
>>> gmm_path = temp_path + "/gmm"
>>> gm.save(gmm_path)
>>> gm2 = GaussianMixture.load(gmm_path)
>>> gm2.getK()
3
>>> model_path = temp_path + "/gmm_model"
>>> model.save(model_path)
>>> model2 = GaussianMixtureModel.load(model_path)
>>> model2.hasSummary
False
>>> model2.weights == model.weights
True
>>> model2.gaussians[0].mean == model.gaussians[0].mean
True
>>> model2.gaussians[0].cov == model.gaussians[0].cov
True
>>> model2.gaussians[1].mean == model.gaussians[1].mean
True
>>> model2.gaussians[1].cov == model.gaussians[1].cov
True
>>> model2.gaussians[2].mean == model.gaussians[2].mean
True
>>> model2.gaussians[2].cov == model.gaussians[2].cov
True
>>> model2.gaussiansDF.select("mean").head()
Row(mean=DenseVector([0.825, 0.8675]))
>>> model2.gaussiansDF.select("cov").head()
Row(cov=DenseMatrix(2, 2, [0.0056, -0.0051, -0.0051, 0.0046], False))
>>> model.transform(df).take(1) == model2.transform(df).take(1)
True
>>> gm2.setWeightCol("weight")
GaussianMixture...

相關用法

注：本文由純淨天空篩選整理自spark.apache.org大神的英文原創作品 pyspark.ml.clustering.GaussianMixture。非經特殊聲明，原始代碼版權歸原作者所有，本譯文未經允許或授權，請勿轉載或複製。