R kmeans: K-Means Clustering

The R function kmeans is in the stats package.

Description

Perform k-means clustering on a data matrix.

Usage

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                     "MacQueen"), trace = FALSE)
## S3 method for class 'kmeans'
fitted(object, method = c("centers", "classes"), ...)

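Arguments

`x`	A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
`centers`	Either the number of clusters, say `k`, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows of `x` is chosen as the initial centres.
`iter.max`	The maximum number of iterations allowed.
`nstart`	If `centers` is a number, how many random sets should be chosen?
`algorithm`	Character: may be abbreviated. Note that `"Lloyd"` and `"Forgy"` are alternative names for one algorithm.
`object`	An R object of class `"kmeans"`, typically the result `ob` of `ob <- kmeans(..)`.
`method`	Character: may be abbreviated. `"centers"` causes `fitted` to return cluster centres (one for each input point) and `"classes"` causes `fitted` to return a vector of class assignments.
`trace`	Logical or integer, currently only used in the default method (`"Hartigan-Wong"`): if positive (or true), tracing information on the progress of the algorithm is produced. Higher values may produce more tracing information.
`...`	Not used.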
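
The sketch below illustrates the two ways of supplying `centers` and the two `fitted` methods described above. The simulated data, the seed, and the names `init`, `cl_k` and `cl_init` are illustrative assumptions, not part of the original page.

```r
set.seed(1)  # illustrative seed, only so that the sketch is reproducible
# Two simulated groups of 50 points each in two dimensions
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# centers as a number: random distinct rows of x serve as initial centres,
# and nstart such random sets are tried
cl_k <- kmeans(x, centers = 2, nstart = 5)

# centers as a matrix: one row per initial centre (here chosen by hand)
init <- rbind(c(0, 0), c(1, 1))
cl_init <- kmeans(x, centers = init)

# fitted() returns one centre per input point, or the class vector
dim(fitted(cl_k, method = "centers"))    # same dimensions as x
head(fitted(cl_k, method = "classes"))   # cluster assignments
```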

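Details

The data given by x are clustered by the k-means method, which aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centres is minimized. At the minimum, all cluster centres are at the mean of their Voronoi sets (the set of data points which are nearest to the cluster centre).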
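
As a minimal sketch of the two statements above, the objective can be recomputed by hand and compared with `tot.withinss`, and each centre can be compared with the mean of the points assigned to it. The simulated data and the names `obj` and `means` are assumptions made for illustration.

```r
set.seed(1)  # illustrative seed
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)

# Objective: sum of squared distances from each point to its assigned centre
obj <- sum((x - cl$centers[cl$cluster, ])^2)
all.equal(obj, cl$tot.withinss)

# After convergence each centre is the mean of the points assigned to it
means <- rbind(colMeans(x[cl$cluster == 1, , drop = FALSE]),
               colMeans(x[cl$cluster == 2, , drop = FALSE]))
all.equal(means, cl$centers, check.attributes = FALSE)
```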

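The algorithm of Hartigan and Wong (1979) is used by default. Note that some authors use k-means to refer to a specific algorithm rather than the general method: most commonly the algorithm given by MacQueen (1967), but sometimes that given by Lloyd (1957) and Forgy (1965). The Hartigan-Wong algorithm generally does a better job than either of those, but trying several random starts (nstart > 1) is often recommended. In rare cases, when some of the points (rows of x) are extremely close, the algorithm may not converge in the "Quick-Transfer" stage, signalling a warning (and returning ifault = 4). Slight rounding of the data may be advisable in that case.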
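
One way to explore these recommendations is to fit the same data with each algorithm and several random starts, then compare the objective reached. This is only a sketch: the data, the choice of three clusters, and the names `algs` and `fits` are assumptions for illustration.

```r
set.seed(1)  # illustrative seed
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# One fit per algorithm, each with several random starts
algs <- c("Hartigan-Wong", "Lloyd", "MacQueen")
fits <- lapply(algs, function(a)
  kmeans(x, centers = 3, nstart = 10, iter.max = 50, algorithm = a))
names(fits) <- algs

# Compare the within-cluster sum of squares reached by each algorithm
sapply(fits, function(f) f$tot.withinss)
# Inspect ifault, the indicator of a possible algorithm problem
sapply(fits, function(f) f$ifault)
```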

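For ease of programmatic exploration, k = 1 is allowed, notably returning the centre and withinss.

Except for the Lloyd-Forgy method, k clusters will always be returned if a number is specified. If an initial matrix of centres is supplied, it is possible that no point will be closest to one or more centres, which is currently an error for the Hartigan-Wong method.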
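
The empty-cluster case can be sketched by supplying an initial centre that lies far from every data point and catching the resulting condition. The centre matrix `init` is hypothetical and chosen only to provoke the situation described above.

```r
set.seed(1)  # illustrative seed
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# Hypothetical initial centres: the third lies far from every data point,
# so no point is closest to it
init <- rbind(c(0, 0), c(1, 1), c(100, 100))

# Catch the condition instead of stopping the script
res <- tryCatch(kmeans(x, centers = init, algorithm = "Hartigan-Wong"),
                error = function(e) e)
inherits(res, "error")
if (inherits(res, "error")) conditionMessage(res)
```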

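Value

kmeans returns an object of class "kmeans" which has a print and a fitted method. It is a list with at least the following components:

`cluster`	A vector of integers (from `1:k`) indicating the cluster to which each point is allocated.
`centers`	A matrix of cluster centres.
`totss`	The total sum of squares.
`withinss`	Vector of within-cluster sums of squares, one component per cluster.
`tot.withinss`	Total within-cluster sum of squares, i.e. `sum(withinss)`.
`betweenss`	The between-cluster sum of squares, i.e. `totss-tot.withinss`.
`size`	The number of points in each cluster.
`iter`	The number of (outer) iterations.
`ifault`	Integer: indicator of a possible algorithm problem (for experts).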
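
The relationships between these components can be checked directly on a fitted object. The following is a small sketch on simulated data; the data and the choice of three clusters are assumptions for illustration.

```r
set.seed(1)  # illustrative seed
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 3, nstart = 5)

# tot.withinss is the sum of the per-cluster within sums of squares
all.equal(cl$tot.withinss, sum(cl$withinss))
# betweenss is what remains of the total sum of squares
all.equal(cl$betweenss, cl$totss - cl$tot.withinss)
# size counts the points assigned to each cluster
all(cl$size == tabulate(cl$cluster, nbins = 3))
# Share of the total sum of squares accounted for by the clustering
cl$betweenss / cl$totss
```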

Note

The clusters are numbered in the returned object, but they are a set and no ordering is implied. (Their apparent ordering may differ by platform.)
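
When a stable labelling is needed, for instance to compare runs, the clusters can be renumbered by some chosen rule. The sketch below orders them by the first coordinate of their centres; this rule and the names `ord`, `relabelled` and `centers_sorted` are assumptions, not something kmeans does itself.

```r
set.seed(1)  # illustrative seed
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)

# Order clusters by the first coordinate of their centres (an arbitrary rule)
ord <- order(cl$centers[, 1])
# Map each old label to its position in that ordering
relabelled <- match(cl$cluster, ord)
centers_sorted <- cl$centers[ord, , drop = FALSE]

# Cross-tabulate old and new labels to confirm a one-to-one relabelling
table(old = cl$cluster, new = relabelled)
```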

Examples

require(graphics)

# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)

# sum of squares
ss <- function(x) sum(scale(x, scale = FALSE)^2)

## cluster centers "fitted" to each obs.:
fitted.x <- fitted(cl);  head(fitted.x)
resid.x <- x - fitted(cl)

## Equalities : ----------------------------------
cbind(cl[c("betweenss", "tot.withinss", "totss")], # the same two columns
         c(ss(fitted.x), ss(resid.x),    ss(x)))
stopifnot(all.equal(cl$ totss,        ss(x)),
	  all.equal(cl$ tot.withinss, ss(resid.x)),
	  ## these three are the same:
	  all.equal(cl$ betweenss,    ss(fitted.x)),
	  all.equal(cl$ betweenss, cl$totss - cl$tot.withinss),
	  ## and hence also
	  all.equal(ss(x), ss(fitted.x) + ss(resid.x))
	  )

kmeans(x,1)$withinss # trivial one-cluster, (its W.SS == ss(x))

## random starts do help here with too many clusters
## (and are often recommended anyway!):
## The ordering of the clusters may be platform-dependent.
## IGNORE_RDIFF_BEGIN
(cl <- kmeans(x, 5, nstart = 25))
## IGNORE_RDIFF_END
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8)

References

Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 21, 768-769.

Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100-108. doi:10.2307/2346830.

Lloyd, S. P. (1957, 1982). Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory, 28, 128-137.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281-297. Berkeley, CA: University of California Press.

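Note: This page was compiled by 純淨天空 from the original English documentation "K-Means Clustering" by R-devel. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.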