R clusGap 用於估計聚類數量的間隙統計

R語言 clusGap 位於 cluster 包(package)。

說明

clusGap() 計算聚類度量的優度，即 “gap” 統計量。對於每個數量的簇 k ，它將 \log(W(k)) 與 E^*[\log(W(k))] 進行比較，其中後者是通過引導定義的，即從參考 (H_0 ) 分布進行模擬，超立方體上的均勻分布由以下範圍確定x ，首先居中，然後 svd (又名“PCA”) - 在(默認情況下) spaceH0 = "scaledPCA" 時旋轉它們。

maxSE(f, SE.f) 確定 f 最大值的位置，並考慮 *SE* 方法的“1-SE 規則”。默認方法 "firstSEmax" 查找最小的 k，使其值 f(k) 與第一個局部最大值相差不超過 1 個標準誤差。這與 "Tibs2001SEmax" 類似但不相同，Tibshirani 等人建議根據間隙統計數據及其標準差確定簇的數量。

用法

clusGap(x, FUNcluster, K.max, B = 100, d.power = 1,
        spaceH0 = c("scaledPCA", "original"),
        verbose = interactive(), ...)

maxSE(f, SE.f,
      method = c("firstSEmax", "Tibs2001SEmax", "globalSEmax",
                 "firstmax", "globalmax"),
      SE.factor = 1)

## S3 method for class 'clusGap'
print(x, method = "firstSEmax", SE.factor = 1, ...)

## S3 method for class 'clusGap'
plot(x, type = "b", xlab = "k", ylab = expression(Gap[k]),
     main = NULL, do.arrows = TRUE,
     arrowArgs = list(col="red3", length=1/16, angle=90, code=3), ...)

參數

`x`	數字矩陣或`data.frame`。
`FUNcluster`	`function` 接受像 `x` 這樣的(數據)矩陣作為第一個參數，第二個參數，比如 `k, k\geq 2` ，所需的簇數，並返回一個 `list` ，其組件名為(或縮寫為) `cluster` 這是 `1:k` 中整數的長度 `n = nrow(x)` 的向量，確定 `n` 觀測值的聚類或分組。
`K.max`	要考慮的最大簇數必須至少為兩個。
`B`	整數，蒙特卡羅 (“bootstrap”) 樣本數。
`d.power`	指定冪的正整數`p`應用於歐氏距離(`dist`)在它們被總結出來之前`W(k)`。默認情況下，`d.power = 1`，對應“historical”R實施，而`d.power = 2`符合 Tibshirani 等人的提議。這是胡安·岡薩雷斯 (Juan Gonzalez) 在 2016 年 2 月發現的。
`spaceH0`	`character` 字符串，指定 `H_0` 分布(無簇)的空間。 `"scaledPCA"` 和 `"original"` 都在超立方體中使用均勻分布，並且已在參考文獻中提到； `"original"` 是在 Juan Gonzalez 提出提案(包括代碼)後添加的。
`verbose`	整數或邏輯，確定是否應打印 “progress” 輸出。默認情況下，每個引導樣本打印一位。
`...`	(對於 `clusGap()` :) `FUNcluster()` 可選更多參數，請參閱下麵的 `kmeans` 示例。
`f`	“函數值”的數字向量，長度為 `K` ，我們想要其最大值(“1 SE尊重”)。
`SE.f`	長度為 `K` 的標準錯誤的數值向量 `f` 。
`method`	字符串，指示如何根據間隙統計數據(及其標準差)計算 “optimal” 簇數 `\hat k` ，或者更一般地說，應如何確定 `f_k` 最大值的位置 `\hat k` 。 `"globalmax"`：簡單地對應於全局最大值，即`which.max(f)` `"firstmax"`：給出第一個局部最大值的位置. `"Tibs2001SEmax"`：使用Tibshirani等人(2001)提出的標準：“最小的`k`使得`f(k) \ge f(k+1) - s_{k+1}`”。請注意，當所有標準差都大於差異 `f(k+1) - f(k)` 時，將選擇 `k = 1` 。 `"firstSEmax"`：第一個`f()`值不小於第一個局部最大值減去`SE.factor * SE.f[]`的位置，即在“f S.E.”內該最大值的範圍(另請參見 `SE.factor` )。這是默認值，由 Martin Maechler 在 2012 年提出，當時在看到 `"globalSEmax"` 提案(代碼)並閱讀了 `"Tibs2001SEmax"` 提案後，將 `clusGap()` 添加到 `cluster` 包中。 `"globalSEmax"`： (在 Dudoit 和 Fridlyand (2002) 中使用，據說遵循 Tibshirani 的命題)：第一個 `f()` 值的位置，該值不小於全局最大值減去 `SE.factor * SE.f[]` ，即在“f S.E.”內該最大值的範圍(另請參見 `SE.factor` )。請參閱示例以進行簡單情況的比較。
`SE.factor`	[當 `method` 包含 `"SE"` 時] 確定最佳簇數，Tibshirani 等人。提出了“1 S.E.”規則。使用 `SE.factor` `f` 時，更普遍地使用“f S.E.”規則。
`type` , `xlab` , `ylab` , `main`	參數與 `plot.default()` 中的含義相同，但默認值不同。
`do.arrows`	邏輯指示是否應通過 `arrows()` 繪製 (1 SE -)“error bars” 。
`arrowArgs`	傳遞給 `arrows()` 的參數列表；默認值，尤其是 `angle` 和 `code` ，提供與常用誤差線匹配的樣式。

細節

主要結果<res>$Tab[,"gap"] 當然是來自引導(又稱為蒙特卡羅模擬)，因此是隨機的，或者等效地，取決於初始隨機種子(請參閱set.seed() )。另一方麵，根據我們的經驗，使用 B = 500 給出了相當精確的結果，使得間隙圖在另一次運行後本質上沒有變化。

值

clusGap(..) 返回 S3 類 "clusGap" 的對象，本質上是一個包含組件的列表

`Tab`	具有 `K.max` 行和 4 列的矩陣，名為 "logW"、"E.logW"、"gap" 和 "SE.sim"，其中 `gap = E.logW - logW` 和 `SE.sim` 對應於 `gap` 的標準誤差，並且 `SE.sim[k]=s_k` ，其中 `s_k := \sqrt{1 + 1/B} sd^(gap_j)` 和 `sd^()` 是模擬 (“bootstrapped”) 間隙值的標準偏差。
`call`	`clusGap(..)` `call` 。
`spaceH0`	`spaceH0` 參數(`match.arg()` 編輯)。
`n`	觀察數，即 `nrow(x)` 。
`B`	輸入`B`
`FUNcluster`	輸入函數`FUNcluster`

例子

### --- maxSE() methods -------------------------------------------
(mets <- eval(formals(maxSE)$method))
fk <- c(2,3,5,4,7,8,5,4)
sk <- c(1,1,2,1,1,3,1,1)/2
## use plot.clusGap():
plot(structure(class="clusGap", list(Tab = cbind(gap=fk, SE.sim=sk))))
## Note that 'firstmax' and 'globalmax' are always at 3 and 6 :
sapply(c(1/4, 1,2,4), function(SEf)
        sapply(mets, function(M) maxSE(fk, sk, method = M, SE.factor = SEf)))

### --- clusGap() -------------------------------------------------
## ridiculously nicely separated clusters in 3 D :
x <- rbind(matrix(rnorm(150,           sd = 0.1), ncol = 3),
           matrix(rnorm(150, mean = 1, sd = 0.1), ncol = 3),
           matrix(rnorm(150, mean = 2, sd = 0.1), ncol = 3),
           matrix(rnorm(150, mean = 3, sd = 0.1), ncol = 3))

## Slightly faster way to use pam (see below)
pam1 <- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE))

## We do not recommend using hier.clustering here, but if you want,
## there is  factoextra::hcut () or a cheap version of it
hclusCut <- function(x, k, d.meth = "euclidean", ...)
   list(cluster = cutree(hclust(dist(x, method=d.meth), ...), k=k))

## You can manually set it before running this :    doExtras <- TRUE  # or  FALSE
if(!(exists("doExtras") && is.logical(doExtras)))
  doExtras <- cluster:::doExtras()

if(doExtras) {
  ## Note we use  B = 60 in the following examples to keep them "speedy".
  ## ---- rather keep the default B = 500 for your analysis!

  ## note we can  pass 'nstart = 20' to kmeans() :
  gskmn <- clusGap(x, FUN = kmeans, nstart = 20, K.max = 8, B = 60)
  gskmn #-> its print() method
  plot(gskmn, main = "clusGap(., FUN = kmeans, n.start=20, B= 60)")
  set.seed(12); system.time(
    gsPam0 <- clusGap(x, FUN = pam, K.max = 8, B = 60)
  )
  set.seed(12); system.time(
    gsPam1 <- clusGap(x, FUN = pam1, K.max = 8, B = 60)
  )
  ## and show that it gives the "same":
  not.eq <- c("call", "FUNcluster"); n <- names(gsPam0)
  eq <- n[!(n %in% not.eq)]
  stopifnot(identical(gsPam1[eq], gsPam0[eq]))
  print(gsPam1, method="globalSEmax")
  print(gsPam1, method="globalmax")

  print(gsHc <- clusGap(x, FUN = hclusCut, K.max = 8, B = 60))

}# end {doExtras}

gs.pam.RU <- clusGap(ruspini, FUN = pam1, K.max = 8, B = 60)
gs.pam.RU
plot(gs.pam.RU, main = "Gap statistic for the 'ruspini' data")
mtext("k = 4 is best .. and  k = 5  pretty close")

## This takes a minute..
## No clustering ==> k = 1 ("one cluster") should be optimal:
Z <- matrix(rnorm(256*3), 256,3)
gsP.Z <- clusGap(Z, FUN = pam1, K.max = 8, B = 200)
plot(gsP.Z, main = "clusGap(<iid_rnorm_p=3>)  ==> k = 1  cluster is optimal")
gsP.Z

作者

This function is originally based on the functions gap of former (Bioconductor) package SAGx by Per Broberg, gapStat() from former package SLmisc by Matthias Kohl and ideas from gap() and its methods of package lga by Justin Harrington.

The current implementation is by Martin Maechler.

The implementation of spaceH0 = "original" is based on code proposed by Juan Gonzalez.

參考

Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411-423.

Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.

Dudoit, S. and Fridlyand, J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7). doi:10.1186/gb-2002-3-7-research0036

Per Broberg (2006). SAGx: Statistical Analysis of the GeneChip. R package version 1.9.7. http://bioconductor.org/packages/3.12/bioc/html/SAGx.html

也可以看看

silhouette 用於更簡單、不太複雜的聚類度量。

fpc 包中的 cluster.stats() 用於替代措施。

相關用法

注：本文由純淨天空篩選整理自R-devel大神的英文原創作品 Gap Statistic for Estimating the Number of Clusters。非經特殊聲明，原始代碼版權歸原作者所有，本譯文未經允許或授權，請勿轉載或複製。