R clusGap 用于估计聚类数量的间隙统计

R语言 clusGap 位于 cluster 包(package)。

说明

clusGap() 计算聚类度量的优度，即 “gap” 统计量。对于每个数量的簇 k ，它将 \log(W(k)) 与 E^*[\log(W(k))] 进行比较，其中后者是通过引导定义的，即从参考 (H_0 ) 分布进行模拟，超立方体上的均匀分布由以下范围确定x ，首先居中，然后 svd (又名“PCA”) - 在(默认情况下) spaceH0 = "scaledPCA" 时旋转它们。

maxSE(f, SE.f) 确定 f 最大值的位置，并考虑 *SE* 方法的“1-SE 规则”。默认方法 "firstSEmax" 查找最小的 k，使其值 f(k) 与第一个局部最大值相差不超过 1 个标准误差。这与 "Tibs2001SEmax" 类似但不相同，Tibshirani 等人建议根据间隙统计数据及其标准差确定簇的数量。

用法

clusGap(x, FUNcluster, K.max, B = 100, d.power = 1,
        spaceH0 = c("scaledPCA", "original"),
        verbose = interactive(), ...)

maxSE(f, SE.f,
      method = c("firstSEmax", "Tibs2001SEmax", "globalSEmax",
                 "firstmax", "globalmax"),
      SE.factor = 1)

## S3 method for class 'clusGap'
print(x, method = "firstSEmax", SE.factor = 1, ...)

## S3 method for class 'clusGap'
plot(x, type = "b", xlab = "k", ylab = expression(Gap[k]),
     main = NULL, do.arrows = TRUE,
     arrowArgs = list(col="red3", length=1/16, angle=90, code=3), ...)

参数

`x`	数字矩阵或`data.frame`。
`FUNcluster`	`function` 接受像 `x` 这样的(数据)矩阵作为第一个参数，第二个参数，比如 `k, k\geq 2` ，所需的簇数，并返回一个 `list` ，其组件名为(或缩写为) `cluster` 这是 `1:k` 中整数的长度 `n = nrow(x)` 的向量，确定 `n` 观测值的聚类或分组。
`K.max`	要考虑的最大簇数必须至少为两个。
`B`	整数，蒙特卡罗 (“bootstrap”) 样本数。
`d.power`	指定幂的正整数`p`应用于欧氏距离(`dist`)在它们被总结出来之前`W(k)`。默认情况下，`d.power = 1`，对应“historical”R实施，而`d.power = 2`符合 Tibshirani 等人的提议。这是胡安·冈萨雷斯 (Juan Gonzalez) 在 2016 年 2 月发现的。
`spaceH0`	`character` 字符串，指定 `H_0` 分布(无簇)的空间。 `"scaledPCA"` 和 `"original"` 都在超立方体中使用均匀分布，并且已在参考文献中提到； `"original"` 是在 Juan Gonzalez 提出提案(包括代码)后添加的。
`verbose`	整数或逻辑，确定是否应打印 “progress” 输出。默认情况下，每个引导样本打印一位。
`...`	(对于 `clusGap()` :) `FUNcluster()` 可选更多参数，请参阅下面的 `kmeans` 示例。
`f`	“函数值”的数字向量，长度为 `K` ，我们想要其最大值(“1 SE尊重”)。
`SE.f`	长度为 `K` 的标准错误的数值向量 `f` 。
`method`	字符串，指示如何根据间隙统计数据(及其标准差)计算 “optimal” 簇数 `\hat k` ，或者更一般地说，应如何确定 `f_k` 最大值的位置 `\hat k` 。 `"globalmax"`：简单地对应于全局最大值，即`which.max(f)` `"firstmax"`：给出第一个局部最大值的位置. `"Tibs2001SEmax"`：使用Tibshirani等人(2001)提出的标准：“最小的`k`使得`f(k) \ge f(k+1) - s_{k+1}`”。请注意，当所有标准差都大于差异 `f(k+1) - f(k)` 时，将选择 `k = 1` 。 `"firstSEmax"`：第一个`f()`值不小于第一个局部最大值减去`SE.factor * SE.f[]`的位置，即在“f S.E.”内该最大值的范围(另请参见 `SE.factor` )。这是默认值，由 Martin Maechler 在 2012 年提出，当时在看到 `"globalSEmax"` 提案(代码)并阅读了 `"Tibs2001SEmax"` 提案后，将 `clusGap()` 添加到 `cluster` 包中。 `"globalSEmax"`： (在 Dudoit 和 Fridlyand (2002) 中使用，据说遵循 Tibshirani 的命题)：第一个 `f()` 值的位置，该值不小于全局最大值减去 `SE.factor * SE.f[]` ，即在“f S.E.”内该最大值的范围(另请参见 `SE.factor` )。请参阅示例以进行简单情况的比较。
`SE.factor`	[当 `method` 包含 `"SE"` 时] 确定最佳簇数，Tibshirani 等人。提出了“1 S.E.”规则。使用 `SE.factor` `f` 时，更普遍地使用“f S.E.”规则。
`type` , `xlab` , `ylab` , `main`	参数与 `plot.default()` 中的含义相同，但默认值不同。
`do.arrows`	逻辑指示是否应通过 `arrows()` 绘制 (1 SE -)“error bars” 。
`arrowArgs`	传递给 `arrows()` 的参数列表；默认值，尤其是 `angle` 和 `code` ，提供与常用误差线匹配的样式。

细节

主要结果<res>$Tab[,"gap"] 当然是来自引导(又称为蒙特卡罗模拟)，因此是随机的，或者等效地，取决于初始随机种子(请参阅set.seed() )。另一方面，根据我们的经验，使用 B = 500 给出了相当精确的结果，使得间隙图在另一次运行后本质上没有变化。

值

clusGap(..) 返回 S3 类 "clusGap" 的对象，本质上是一个包含组件的列表

`Tab`	具有 `K.max` 行和 4 列的矩阵，名为 "logW"、"E.logW"、"gap" 和 "SE.sim"，其中 `gap = E.logW - logW` 和 `SE.sim` 对应于 `gap` 的标准误差，并且 `SE.sim[k]=s_k` ，其中 `s_k := \sqrt{1 + 1/B} sd^(gap_j)` 和 `sd^()` 是模拟 (“bootstrapped”) 间隙值的标准偏差。
`call`	`clusGap(..)` `call` 。
`spaceH0`	`spaceH0` 参数(`match.arg()` 编辑)。
`n`	观察数，即 `nrow(x)` 。
`B`	输入`B`
`FUNcluster`	输入函数`FUNcluster`

例子

### --- maxSE() methods -------------------------------------------
(mets <- eval(formals(maxSE)$method))
fk <- c(2,3,5,4,7,8,5,4)
sk <- c(1,1,2,1,1,3,1,1)/2
## use plot.clusGap():
plot(structure(class="clusGap", list(Tab = cbind(gap=fk, SE.sim=sk))))
## Note that 'firstmax' and 'globalmax' are always at 3 and 6 :
sapply(c(1/4, 1,2,4), function(SEf)
        sapply(mets, function(M) maxSE(fk, sk, method = M, SE.factor = SEf)))

### --- clusGap() -------------------------------------------------
## ridiculously nicely separated clusters in 3 D :
x <- rbind(matrix(rnorm(150,           sd = 0.1), ncol = 3),
           matrix(rnorm(150, mean = 1, sd = 0.1), ncol = 3),
           matrix(rnorm(150, mean = 2, sd = 0.1), ncol = 3),
           matrix(rnorm(150, mean = 3, sd = 0.1), ncol = 3))

## Slightly faster way to use pam (see below)
pam1 <- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE))

## We do not recommend using hier.clustering here, but if you want,
## there is  factoextra::hcut () or a cheap version of it
hclusCut <- function(x, k, d.meth = "euclidean", ...)
   list(cluster = cutree(hclust(dist(x, method=d.meth), ...), k=k))

## You can manually set it before running this :    doExtras <- TRUE  # or  FALSE
if(!(exists("doExtras") && is.logical(doExtras)))
  doExtras <- cluster:::doExtras()

if(doExtras) {
  ## Note we use  B = 60 in the following examples to keep them "speedy".
  ## ---- rather keep the default B = 500 for your analysis!

  ## note we can  pass 'nstart = 20' to kmeans() :
  gskmn <- clusGap(x, FUN = kmeans, nstart = 20, K.max = 8, B = 60)
  gskmn #-> its print() method
  plot(gskmn, main = "clusGap(., FUN = kmeans, n.start=20, B= 60)")
  set.seed(12); system.time(
    gsPam0 <- clusGap(x, FUN = pam, K.max = 8, B = 60)
  )
  set.seed(12); system.time(
    gsPam1 <- clusGap(x, FUN = pam1, K.max = 8, B = 60)
  )
  ## and show that it gives the "same":
  not.eq <- c("call", "FUNcluster"); n <- names(gsPam0)
  eq <- n[!(n %in% not.eq)]
  stopifnot(identical(gsPam1[eq], gsPam0[eq]))
  print(gsPam1, method="globalSEmax")
  print(gsPam1, method="globalmax")

  print(gsHc <- clusGap(x, FUN = hclusCut, K.max = 8, B = 60))

}# end {doExtras}

gs.pam.RU <- clusGap(ruspini, FUN = pam1, K.max = 8, B = 60)
gs.pam.RU
plot(gs.pam.RU, main = "Gap statistic for the 'ruspini' data")
mtext("k = 4 is best .. and  k = 5  pretty close")

## This takes a minute..
## No clustering ==> k = 1 ("one cluster") should be optimal:
Z <- matrix(rnorm(256*3), 256,3)
gsP.Z <- clusGap(Z, FUN = pam1, K.max = 8, B = 200)
plot(gsP.Z, main = "clusGap(<iid_rnorm_p=3>)  ==> k = 1  cluster is optimal")
gsP.Z

作者

This function is originally based on the functions gap of former (Bioconductor) package SAGx by Per Broberg, gapStat() from former package SLmisc by Matthias Kohl and ideas from gap() and its methods of package lga by Justin Harrington.

The current implementation is by Martin Maechler.

The implementation of spaceH0 = "original" is based on code proposed by Juan Gonzalez.

参考

Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411-423.

Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.

Dudoit, S. and Fridlyand, J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7). doi:10.1186/gb-2002-3-7-research0036

Per Broberg (2006). SAGx: Statistical Analysis of the GeneChip. R package version 1.9.7. http://bioconductor.org/packages/3.12/bioc/html/SAGx.html

也可以看看

silhouette 用于更简单、不太复杂的聚类度量。

fpc 包中的 cluster.stats() 用于替代措施。

相关用法

注：本文由纯净天空筛选整理自R-devel大神的英文原创作品 Gap Statistic for Estimating the Number of Clusters。非经特殊声明，原始代码版权归原作者所有，本译文未经允许或授权，请勿转载或复制。