clusGap
位於 cluster
包(package)。 說明
clusGap()
計算聚類度量的優度,即 “gap” 統計量。對於每個數量的簇 ,它將 與 進行比較,其中後者是通過引導定義的,即從參考 ( ) 分布進行模擬,超立方體上的均勻分布由以下範圍確定x
,首先居中,然後 svd
(又名“PCA”) - 在(默認情況下) spaceH0 = "scaledPCA"
時旋轉它們。
maxSE(f, SE.f)
確定 f
最大值的位置,並考慮 *SE*
方法的“1-SE 規則”。默認方法 "firstSEmax"
查找最小的 ,使其值 與第一個局部最大值相差不超過 1 個標準誤差。這與 "Tibs2001SEmax"
類似但不相同,Tibshirani 等人建議根據間隙統計數據及其標準差確定簇的數量。
用法
clusGap(x, FUNcluster, K.max, B = 100, d.power = 1,
spaceH0 = c("scaledPCA", "original"),
verbose = interactive(), ...)
maxSE(f, SE.f,
method = c("firstSEmax", "Tibs2001SEmax", "globalSEmax",
"firstmax", "globalmax"),
SE.factor = 1)
## S3 method for class 'clusGap'
print(x, method = "firstSEmax", SE.factor = 1, ...)
## S3 method for class 'clusGap'
plot(x, type = "b", xlab = "k", ylab = expression(Gap[k]),
main = NULL, do.arrows = TRUE,
arrowArgs = list(col="red3", length=1/16, angle=90, code=3), ...)
參數
x |
數字矩陣或 |
FUNcluster |
|
K.max |
要考慮的最大簇數必須至少為兩個。 |
B |
整數,蒙特卡羅 (“bootstrap”) 樣本數。 |
d.power |
指定冪的正整數 |
spaceH0 |
|
verbose |
整數或邏輯,確定是否應打印 “progress” 輸出。默認情況下,每個引導樣本打印一位。 |
... |
(對於 |
f |
“函數值”的數字向量,長度為 ,我們想要其最大值(“1 SE尊重”)。 |
SE.f |
長度為 |
method |
字符串,指示如何根據間隙統計數據(及其標準差)計算 “optimal” 簇數 ,或者更一般地說,應如何確定 最大值的位置 。
請參閱示例以進行簡單情況的比較。 |
SE.factor |
[當 |
type , xlab , ylab , main |
參數與 |
do.arrows |
邏輯指示是否應通過 |
arrowArgs |
傳遞給 |
細節
主要結果<res>$Tab[,"gap"]
當然是來自引導(又稱為蒙特卡羅模擬),因此是隨機的,或者等效地,取決於初始隨機種子(請參閱set.seed()
)。另一方麵,根據我們的經驗,使用 B = 500
給出了相當精確的結果,使得間隙圖在另一次運行後本質上沒有變化。
值
clusGap(..)
返回 S3 類 "clusGap"
的對象,本質上是一個包含組件的列表
Tab |
具有 |
call |
|
spaceH0 |
|
n |
觀察數,即 |
B |
輸入 |
FUNcluster |
輸入函數 |
例子
### --- maxSE() methods -------------------------------------------
(mets <- eval(formals(maxSE)$method))
fk <- c(2,3,5,4,7,8,5,4)
sk <- c(1,1,2,1,1,3,1,1)/2
## use plot.clusGap():
plot(structure(class="clusGap", list(Tab = cbind(gap=fk, SE.sim=sk))))
## Note that 'firstmax' and 'globalmax' are always at 3 and 6 :
sapply(c(1/4, 1,2,4), function(SEf)
sapply(mets, function(M) maxSE(fk, sk, method = M, SE.factor = SEf)))
### --- clusGap() -------------------------------------------------
## ridiculously nicely separated clusters in 3 D :
x <- rbind(matrix(rnorm(150, sd = 0.1), ncol = 3),
matrix(rnorm(150, mean = 1, sd = 0.1), ncol = 3),
matrix(rnorm(150, mean = 2, sd = 0.1), ncol = 3),
matrix(rnorm(150, mean = 3, sd = 0.1), ncol = 3))
## Slightly faster way to use pam (see below)
pam1 <- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE))
## We do not recommend using hier.clustering here, but if you want,
## there is factoextra::hcut () or a cheap version of it
hclusCut <- function(x, k, d.meth = "euclidean", ...)
list(cluster = cutree(hclust(dist(x, method=d.meth), ...), k=k))
## You can manually set it before running this : doExtras <- TRUE # or FALSE
if(!(exists("doExtras") && is.logical(doExtras)))
doExtras <- cluster:::doExtras()
if(doExtras) {
## Note we use B = 60 in the following examples to keep them "speedy".
## ---- rather keep the default B = 500 for your analysis!
## note we can pass 'nstart = 20' to kmeans() :
gskmn <- clusGap(x, FUN = kmeans, nstart = 20, K.max = 8, B = 60)
gskmn #-> its print() method
plot(gskmn, main = "clusGap(., FUN = kmeans, n.start=20, B= 60)")
set.seed(12); system.time(
gsPam0 <- clusGap(x, FUN = pam, K.max = 8, B = 60)
)
set.seed(12); system.time(
gsPam1 <- clusGap(x, FUN = pam1, K.max = 8, B = 60)
)
## and show that it gives the "same":
not.eq <- c("call", "FUNcluster"); n <- names(gsPam0)
eq <- n[!(n %in% not.eq)]
stopifnot(identical(gsPam1[eq], gsPam0[eq]))
print(gsPam1, method="globalSEmax")
print(gsPam1, method="globalmax")
print(gsHc <- clusGap(x, FUN = hclusCut, K.max = 8, B = 60))
}# end {doExtras}
gs.pam.RU <- clusGap(ruspini, FUN = pam1, K.max = 8, B = 60)
gs.pam.RU
plot(gs.pam.RU, main = "Gap statistic for the 'ruspini' data")
mtext("k = 4 is best .. and k = 5 pretty close")
## This takes a minute..
## No clustering ==> k = 1 ("one cluster") should be optimal:
Z <- matrix(rnorm(256*3), 256,3)
gsP.Z <- clusGap(Z, FUN = pam1, K.max = 8, B = 200)
plot(gsP.Z, main = "clusGap(<iid_rnorm_p=3>) ==> k = 1 cluster is optimal")
gsP.Z
作者
This function is originally based on the functions gap
of
former (Bioconductor) package SAGx
by Per Broberg,
gapStat()
from former package SLmisc
by Matthias Kohl
and ideas from gap()
and its methods of package lga
by
Justin Harrington.
The current implementation is by Martin Maechler.
The implementation of spaceH0 = "original"
is based on code
proposed by Juan Gonzalez.
參考
Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411-423.
Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.
Dudoit, S. and Fridlyand, J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7). doi:10.1186/gb-2002-3-7-research0036
Per Broberg (2006). SAGx: Statistical Analysis of the GeneChip. R package version 1.9.7. http://bioconductor.org/packages/3.12/bioc/html/SAGx.html
也可以看看
silhouette
用於更簡單、不太複雜的聚類度量。
fpc
包中的 cluster.stats()
用於替代措施。
相關用法
- R clusplot (分區對象的)雙變量聚類圖
- R clusplot.default 雙變量聚類圖 (clusplot) 默認方法
- R clara 集群大型應用程序
- R chorSub Kola 數據 C 範圍的子集
- R coef.hclust “hclust”對象的凝聚/分裂係數
- R summary.clara “clara”對象的摘要方法
- R diana 分裂分析聚類
- R pluton 鈈同位素成分批次
- R votes.repub 總統選舉中共和黨候選人的投票
- R agnes 凝聚嵌套(層次聚類)
- R print.mona MONA 對象的打印方法
- R print.clara CLARA 對象的打印方法
- R mona 二元變量的單論分析聚類
- R plot.diana 分裂層次聚類圖
- R plot.mona 一元分裂層次聚類的旗幟
- R bannerplot 繪圖橫幅(層次聚類)
- R plot.partition 數據集分區圖
- R summary.agnes “agnes”對象的摘要方法
- R pltree 繪製層次聚類的聚類樹
- R summary.mona “mona”對象的摘要方法
- R plantTraits 植物物種性狀數據
- R plot.agnes 凝聚層次聚類圖
- R print.agnes AGNES 對象的打印方法
- R pam 圍繞 Medoid 進行分區
- R volume.ellipsoid 計算(橢球體的)體積
注:本文由純淨天空篩選整理自R-devel大神的英文原創作品 Gap Statistic for Estimating the Number of Clusters。非經特殊聲明,原始代碼版權歸原作者所有,本譯文未經允許或授權,請勿轉載或複製。