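Python pyspark StreamingKMeansModel usage and code examples

This article briefly introduces the usage of pyspark.mllib.clustering.StreamingKMeansModel.

Usage: class pyspark.mllib.clustering.StreamingKMeansModel(clusterCenters, clusterWeights)

A clustering model that can update its centroids online as new data arrives.

The update rule for each centroid is: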

c_t+1 = ((c_t * n_t * a) + (x_t * m_t)) / (n_t * a + m_t)
n_t+1 = n_t * a + m_t

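where

c_t: the centroid at iteration t.
n_t: the number of samples (or the total weight) associated with the centroid at iteration t.
x_t: the centroid of the new data closest to c_t.
m_t: the number of samples (or the total weight) of the new data closest to c_t.
c_t+1: the new centroid.
n_t+1: the new weight.
a: the decay factor, which controls how quickly old data is forgotten.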
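
The update rule can be sketched for a single centroid in plain NumPy, without Spark. This is a minimal illustration rather than the library's implementation: the function name `update_centroid` is hypothetical, and the denominator is assumed to be the decayed weight n_t * a + m_t, so that it agrees with the definition of n_t+1 and with the outputs in the example on this page.

```python
import numpy as np

def update_centroid(c_t, n_t, x_t, m_t, a):
    """Hypothetical helper: apply one decayed update to a single centroid.

    c_t, x_t : old centroid and centroid of the new points assigned to it
    n_t, m_t : old weight and weight (count) of the new points
    a        : decay factor between 0 and 1
    """
    c_t = np.asarray(c_t, dtype=float)
    x_t = np.asarray(x_t, dtype=float)
    n_next = n_t * a + m_t  # decayed old weight plus new weight
    if n_next == 0:
        # No old weight left and no new points: keep the centroid unchanged.
        return c_t, 0.0
    c_next = (c_t * n_t * a + x_t * m_t) / n_next
    return c_next, n_next

# Old centroid (1, 1) with weight 3; one new point at (2, 2); half of the old weight decays.
c_next, n_next = update_centroid([1.0, 1.0], 3.0, [2.0, 2.0], 1.0, a=0.5)
print(c_next, n_next)
```

Here the old weight shrinks from 3 to 1.5 before the new point is mixed in, so the single new point pulls the centroid further than it would with a = 1.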

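New in version 1.5.0.

Parameters:

clusterCenters: list of pyspark.mllib.linalg.Vector or convertible: the initial cluster centers.
clusterWeights: pyspark.mllib.linalg.Vector or convertible: the list of weights assigned to each cluster.

Notes:

If a is set to 1, the new centroid is the weighted mean of the previous data and the new data. If it is set to zero, the old centroids are completely forgotten.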
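
The two limiting cases in this note can be checked numerically. The sketch below assumes the decayed-weight form of the update, with n_t * a + m_t as the denominator; `decayed_update` is a hypothetical helper, not part of PySpark.

```python
import numpy as np

def decayed_update(c_t, n_t, x_t, m_t, a):
    # Hypothetical helper: one decayed centroid update (assumes n_t * a + m_t > 0).
    c_t = np.asarray(c_t, dtype=float)
    x_t = np.asarray(x_t, dtype=float)
    n_next = n_t * a + m_t
    return (c_t * n_t * a + x_t * m_t) / n_next, n_next

old_center, old_weight = [0.0, 0.0], 3.0
batch_mean, batch_weight = [4.0, 4.0], 1.0

# a = 1: plain weighted mean of old and new data, (0 * 3 + 4 * 1) / 4 = 1.
keep_all, w_keep = decayed_update(old_center, old_weight, batch_mean, batch_weight, a=1.0)

# a = 0: the old centroid gets zero weight, so the result is the batch mean itself.
forget_all, w_forget = decayed_update(old_center, old_weight, batch_mean, batch_weight, a=0.0)

print(keep_all, w_keep)
print(forget_all, w_forget)
```

Values of a between 0 and 1 interpolate between these two behaviors.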

Examples:

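>>> from pyspark.mllib.clustering import StreamingKMeansModel
>>> from pyspark.mllib.linalg import DenseVector
>>> # `sc` is assumed to be an existing SparkContext, as in the PySpark shell.
>>> initCenters = [[0.0, 0.0], [1.0, 1.0]]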
>>> initWeights = [1.0, 1.0]
>>> stkm = StreamingKMeansModel(initCenters, initWeights)
>>> data = sc.parallelize([[-0.1, -0.1], [0.1, 0.1],
...                        [0.9, 0.9], [1.1, 1.1]])
>>> stkm = stkm.update(data, 1.0, "batches")
>>> stkm.centers
array([[ 0.,  0.],
       [ 1.,  1.]])
>>> stkm.predict([-0.1, -0.1])
0
>>> stkm.predict([0.9, 0.9])
1
>>> stkm.clusterWeights
[3.0, 3.0]
>>> decayFactor = 0.0
>>> data = sc.parallelize([DenseVector([1.5, 1.5]), DenseVector([0.2, 0.2])])
>>> stkm = stkm.update(data, decayFactor, "batches")
>>> stkm.centers
array([[ 0.2,  0.2],
       [ 1.5,  1.5]])
>>> stkm.clusterWeights
[1.0, 1.0]
>>> stkm.predict([0.2, 0.2])
0
>>> stkm.predict([1.5, 1.5])
1

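Note: This article was selected and compiled by 純淨天空 from the original English documentation for pyspark.mllib.clustering.StreamingKMeansModel at spark.apache.org. Unless otherwise stated, copyright in the original code belongs to its original authors. Please do not reproduce or copy this translation without permission.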