Python pyspark GaussianMixture用法及代码示例

本文简要介绍 pyspark.ml.clustering.GaussianMixture 的用法。

用法: class pyspark.ml.clustering.GaussianMixture(*, featuresCol='features', predictionCol='prediction', k=2, probabilityCol='probability', tol=0.01, maxIter=100, seed=None, aggregationDepth=2, weightCol=None)

GaussianMixture聚类。此类执行多元高斯混合模型 (GMM) 的期望最大化。 GMM 表示独立高斯分布的复合分布，以及关联的 “mixing” 权重，指定每个分布对复合分布的贡献。

给定一组样本点，此类将最大化 k 个高斯混合的对数似然，迭代直到对数似然变化小于收敛Tol，或者直到达到最大迭代次数。虽然这个过程通常保证收敛，但不能保证找到全局最优值。

2.0.0 版中的新函数。

注意：

对于高维数据(具有许多特征)，该算法可能表现不佳。这是由于高维数据(a)使其难以聚类(基于统计/理论参数)和(b)高斯分布的数值问题。

例子：

>>> from pyspark.ml.linalg import Vectors

>>> data = [(Vectors.dense([-0.1, -0.05 ]),),
...         (Vectors.dense([-0.01, -0.1]),),
...         (Vectors.dense([0.9, 0.8]),),
...         (Vectors.dense([0.75, 0.935]),),
...         (Vectors.dense([-0.83, -0.68]),),
...         (Vectors.dense([-0.91, -0.76]),)]
>>> df = spark.createDataFrame(data, ["features"])
>>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
>>> gm.getMaxIter()
100
>>> gm.setMaxIter(30)
GaussianMixture...
>>> gm.getMaxIter()
30
>>> model = gm.fit(df)
>>> model.getAggregationDepth()
2
>>> model.getFeaturesCol()
'features'
>>> model.setPredictionCol("newPrediction")
GaussianMixtureModel...
>>> model.predict(df.head().features)
2
>>> model.predictProbability(df.head().features)
DenseVector([0.0, 0.0, 1.0])
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
3
>>> summary.clusterSizes
[2, 2, 2]
>>> weights = model.weights
>>> len(weights)
3
>>> gaussians = model.gaussians
>>> len(gaussians)
3
>>> gaussians[0].mean
DenseVector([0.825, 0.8675])
>>> gaussians[0].cov
DenseMatrix(2, 2, [0.0056, -0.0051, -0.0051, 0.0046], 0)
>>> gaussians[1].mean
DenseVector([-0.87, -0.72])
>>> gaussians[1].cov
DenseMatrix(2, 2, [0.0016, 0.0016, 0.0016, 0.0016], 0)
>>> gaussians[2].mean
DenseVector([-0.055, -0.075])
>>> gaussians[2].cov
DenseMatrix(2, 2, [0.002, -0.0011, -0.0011, 0.0006], 0)
>>> model.gaussiansDF.select("mean").head()
Row(mean=DenseVector([0.825, 0.8675]))
>>> model.gaussiansDF.select("cov").head()
Row(cov=DenseMatrix(2, 2, [0.0056, -0.0051, -0.0051, 0.0046], False))
>>> transformed = model.transform(df).select("features", "newPrediction")
>>> rows = transformed.collect()
>>> rows[4].newPrediction == rows[5].newPrediction
True
>>> rows[2].newPrediction == rows[3].newPrediction
True
>>> gmm_path = temp_path + "/gmm"
>>> gm.save(gmm_path)
>>> gm2 = GaussianMixture.load(gmm_path)
>>> gm2.getK()
3
>>> model_path = temp_path + "/gmm_model"
>>> model.save(model_path)
>>> model2 = GaussianMixtureModel.load(model_path)
>>> model2.hasSummary
False
>>> model2.weights == model.weights
True
>>> model2.gaussians[0].mean == model.gaussians[0].mean
True
>>> model2.gaussians[0].cov == model.gaussians[0].cov
True
>>> model2.gaussians[1].mean == model.gaussians[1].mean
True
>>> model2.gaussians[1].cov == model.gaussians[1].cov
True
>>> model2.gaussians[2].mean == model.gaussians[2].mean
True
>>> model2.gaussians[2].cov == model.gaussians[2].cov
True
>>> model2.gaussiansDF.select("mean").head()
Row(mean=DenseVector([0.825, 0.8675]))
>>> model2.gaussiansDF.select("cov").head()
Row(cov=DenseMatrix(2, 2, [0.0056, -0.0051, -0.0051, 0.0046], False))
>>> model.transform(df).take(1) == model2.transform(df).take(1)
True
>>> gm2.setWeightCol("weight")
GaussianMixture...

相关用法

注：本文由纯净天空筛选整理自spark.apache.org大神的英文原创作品 pyspark.ml.clustering.GaussianMixture。非经特殊声明，原始代码版权归原作者所有，本译文未经允许或授权，请勿转载或复制。