clusGap
位于 cluster
包(package)。 说明
clusGap()
计算聚类度量的优度,即 “gap” 统计量。对于每个数量的簇 ,它将 与 进行比较,其中后者是通过引导定义的,即从参考 ( ) 分布进行模拟,超立方体上的均匀分布由以下范围确定x
,首先居中,然后 svd
(又名“PCA”) - 在(默认情况下) spaceH0 = "scaledPCA"
时旋转它们。
maxSE(f, SE.f)
确定 f
最大值的位置,并考虑 *SE*
方法的“1-SE 规则”。默认方法 "firstSEmax"
查找最小的 ,使其值 与第一个局部最大值相差不超过 1 个标准误差。这与 "Tibs2001SEmax"
类似但不相同,Tibshirani 等人建议根据间隙统计数据及其标准差确定簇的数量。
用法
clusGap(x, FUNcluster, K.max, B = 100, d.power = 1,
spaceH0 = c("scaledPCA", "original"),
verbose = interactive(), ...)
maxSE(f, SE.f,
method = c("firstSEmax", "Tibs2001SEmax", "globalSEmax",
"firstmax", "globalmax"),
SE.factor = 1)
## S3 method for class 'clusGap'
print(x, method = "firstSEmax", SE.factor = 1, ...)
## S3 method for class 'clusGap'
plot(x, type = "b", xlab = "k", ylab = expression(Gap[k]),
main = NULL, do.arrows = TRUE,
arrowArgs = list(col="red3", length=1/16, angle=90, code=3), ...)
参数
x |
数字矩阵或 |
FUNcluster |
|
K.max |
要考虑的最大簇数必须至少为两个。 |
B |
整数,蒙特卡罗 (“bootstrap”) 样本数。 |
d.power |
指定幂的正整数 |
spaceH0 |
|
verbose |
整数或逻辑,确定是否应打印 “progress” 输出。默认情况下,每个引导样本打印一位。 |
... |
(对于 |
f |
“函数值”的数字向量,长度为 ,我们想要其最大值(“1 SE尊重”)。 |
SE.f |
长度为 |
method |
字符串,指示如何根据间隙统计数据(及其标准差)计算 “optimal” 簇数 ,或者更一般地说,应如何确定 最大值的位置 。
请参阅示例以进行简单情况的比较。 |
SE.factor |
[当 |
type , xlab , ylab , main |
参数与 |
do.arrows |
逻辑指示是否应通过 |
arrowArgs |
传递给 |
细节
主要结果<res>$Tab[,"gap"]
当然是来自引导(又称为蒙特卡罗模拟),因此是随机的,或者等效地,取决于初始随机种子(请参阅set.seed()
)。另一方面,根据我们的经验,使用 B = 500
给出了相当精确的结果,使得间隙图在另一次运行后本质上没有变化。
值
clusGap(..)
返回 S3 类 "clusGap"
的对象,本质上是一个包含组件的列表
Tab |
具有 |
call |
|
spaceH0 |
|
n |
观察数,即 |
B |
输入 |
FUNcluster |
输入函数 |
例子
### --- maxSE() methods -------------------------------------------
(mets <- eval(formals(maxSE)$method))
fk <- c(2,3,5,4,7,8,5,4)
sk <- c(1,1,2,1,1,3,1,1)/2
## use plot.clusGap():
plot(structure(class="clusGap", list(Tab = cbind(gap=fk, SE.sim=sk))))
## Note that 'firstmax' and 'globalmax' are always at 3 and 6 :
sapply(c(1/4, 1,2,4), function(SEf)
sapply(mets, function(M) maxSE(fk, sk, method = M, SE.factor = SEf)))
### --- clusGap() -------------------------------------------------
## ridiculously nicely separated clusters in 3 D :
x <- rbind(matrix(rnorm(150, sd = 0.1), ncol = 3),
matrix(rnorm(150, mean = 1, sd = 0.1), ncol = 3),
matrix(rnorm(150, mean = 2, sd = 0.1), ncol = 3),
matrix(rnorm(150, mean = 3, sd = 0.1), ncol = 3))
## Slightly faster way to use pam (see below)
pam1 <- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE))
## We do not recommend using hier.clustering here, but if you want,
## there is factoextra::hcut () or a cheap version of it
hclusCut <- function(x, k, d.meth = "euclidean", ...)
list(cluster = cutree(hclust(dist(x, method=d.meth), ...), k=k))
## You can manually set it before running this : doExtras <- TRUE # or FALSE
if(!(exists("doExtras") && is.logical(doExtras)))
doExtras <- cluster:::doExtras()
if(doExtras) {
## Note we use B = 60 in the following examples to keep them "speedy".
## ---- rather keep the default B = 500 for your analysis!
## note we can pass 'nstart = 20' to kmeans() :
gskmn <- clusGap(x, FUN = kmeans, nstart = 20, K.max = 8, B = 60)
gskmn #-> its print() method
plot(gskmn, main = "clusGap(., FUN = kmeans, n.start=20, B= 60)")
set.seed(12); system.time(
gsPam0 <- clusGap(x, FUN = pam, K.max = 8, B = 60)
)
set.seed(12); system.time(
gsPam1 <- clusGap(x, FUN = pam1, K.max = 8, B = 60)
)
## and show that it gives the "same":
not.eq <- c("call", "FUNcluster"); n <- names(gsPam0)
eq <- n[!(n %in% not.eq)]
stopifnot(identical(gsPam1[eq], gsPam0[eq]))
print(gsPam1, method="globalSEmax")
print(gsPam1, method="globalmax")
print(gsHc <- clusGap(x, FUN = hclusCut, K.max = 8, B = 60))
}# end {doExtras}
gs.pam.RU <- clusGap(ruspini, FUN = pam1, K.max = 8, B = 60)
gs.pam.RU
plot(gs.pam.RU, main = "Gap statistic for the 'ruspini' data")
mtext("k = 4 is best .. and k = 5 pretty close")
## This takes a minute..
## No clustering ==> k = 1 ("one cluster") should be optimal:
Z <- matrix(rnorm(256*3), 256,3)
gsP.Z <- clusGap(Z, FUN = pam1, K.max = 8, B = 200)
plot(gsP.Z, main = "clusGap(<iid_rnorm_p=3>) ==> k = 1 cluster is optimal")
gsP.Z
作者
This function is originally based on the functions gap
of
former (Bioconductor) package SAGx
by Per Broberg,
gapStat()
from former package SLmisc
by Matthias Kohl
and ideas from gap()
and its methods of package lga
by
Justin Harrington.
The current implementation is by Martin Maechler.
The implementation of spaceH0 = "original"
is based on code
proposed by Juan Gonzalez.
参考
Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411-423.
Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.
Dudoit, S. and Fridlyand, J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7). doi:10.1186/gb-2002-3-7-research0036
Per Broberg (2006). SAGx: Statistical Analysis of the GeneChip. R package version 1.9.7. http://bioconductor.org/packages/3.12/bioc/html/SAGx.html
也可以看看
silhouette
用于更简单、不太复杂的聚类度量。
fpc
包中的 cluster.stats()
用于替代措施。
相关用法
- R clusplot (分区对象的)双变量聚类图
- R clusplot.default 双变量聚类图 (clusplot) 默认方法
- R clara 集群大型应用程序
- R chorSub Kola 数据 C 范围的子集
- R coef.hclust “hclust”对象的凝聚/分裂系数
- R summary.clara “clara”对象的摘要方法
- R diana 分裂分析聚类
- R pluton 钚同位素成分批次
- R votes.repub 总统选举中共和党候选人的投票
- R agnes 凝聚嵌套(层次聚类)
- R print.mona MONA 对象的打印方法
- R print.clara CLARA 对象的打印方法
- R mona 二元变量的单论分析聚类
- R plot.diana 分裂层次聚类图
- R plot.mona 一元分裂层次聚类的旗帜
- R bannerplot 绘图横幅(层次聚类)
- R plot.partition 数据集分区图
- R summary.agnes “agnes”对象的摘要方法
- R pltree 绘制层次聚类的聚类树
- R summary.mona “mona”对象的摘要方法
- R plantTraits 植物物种性状数据
- R plot.agnes 凝聚层次聚类图
- R print.agnes AGNES 对象的打印方法
- R pam 围绕 Medoid 进行分区
- R volume.ellipsoid 计算(椭球体的)体积
注:本文由纯净天空筛选整理自R-devel大神的英文原创作品 Gap Statistic for Estimating the Number of Clusters。非经特殊声明,原始代码版权归原作者所有,本译文未经允许或授权,请勿转载或复制。