R silhouette: Compute or Extract Silhouette Information from Clustering

The R function silhouette is provided by the cluster package.

Description

Compute silhouette information according to a given clustering in k clusters.

Usage

silhouette(x, ...)
## Default S3 method:
silhouette(x, dist, dmatrix, ...)
## S3 method for class 'partition'
silhouette(x, ...)
## S3 method for class 'clara'
silhouette(x, full = FALSE, subset = NULL, ...)

sortSilhouette(object, ...)
## S3 method for class 'silhouette'
summary(object, FUN = mean, ...)
## S3 method for class 'silhouette'
plot(x, nmax.lab = 40, max.strlen = 5,
     main = NULL, sub = NULL, xlab = expression("Silhouette width "* s[i]),
     col = "gray",  do.col.sort = length(col) > 1, border = 0,
     cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)

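Arguments

`x`	an object of appropriate class; for the `default` method, an integer vector with `k` different integer cluster codes, or a list with such an `x$clustering` component. Note that silhouette statistics are only defined if `2 \le k \le n-1`.
`dist`	a dissimilarity object inheriting from class `dist` or coercible to one. If not specified, `dmatrix` must be.
`dmatrix`	a symmetric dissimilarity matrix (`n \times n`), specified instead of `dist`, which can be more efficient.
`full`	logical, or a number in `[0,1]`, specifying whether a full silhouette should be computed for a `clara` object. When it is a number, say `f`, the silhouette values are computed for a random `sample.int(n, size = f*n)` of the data. This requires `O((f*n)^2)` memory, since the full dissimilarity of the (sub)sample (see `daisy`) is needed internally.
`subset`	a subset of `1:n`, specified instead of `full`, giving the indices of the observations to be used for the silhouette computations.
`object`	an object of class `silhouette`.
`...`	further arguments passed to and from methods.
`FUN`	function used to summarize silhouette widths.
`nmax.lab`	integer giving the number of labels above which the silhouette plot is considered too large for labeling each observation by name.
`max.strlen`	positive integer giving the length to which strings are truncated in silhouette plot labels.
`main` , `sub` , `xlab`	arguments to `title`; they have a sensible non-NULL default here.
`col` , `border` , `cex.names`	arguments passed to `barplot()`; note that the default used to be `col = heat.colors(n), border = par("fg")` instead. `col` can also be a color vector of length `k` for clusterwise coloring; see also `do.col.sort`:
`do.col.sort`	logical indicating whether the colors `col` should be sorted "along" the silhouette; this is useful for casewise or clusterwise coloring.
`do.n.k`	logical indicating whether `n` and `k` "title text" should be written.
`do.clus.stat`	logical indicating whether cluster sizes and averages should be written right next to the silhouettes.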
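
The relation between the `dist` and `dmatrix` arguments can be illustrated with a minimal sketch. It assumes the `ruspini` data and a `pam()` clustering with 4 clusters, chosen here only for illustration; the variable names are arbitrary.

```r
library(cluster)  # provides silhouette(), pam() and the ruspini data

data(ruspini)
cl <- pam(ruspini, 4)$clustering   # integer cluster codes, one per observation
d  <- dist(ruspini)                # Euclidean dissimilarities as a "dist" object

# Same dissimilarities supplied in two ways: as a "dist" object or as a full matrix
s_dist <- silhouette(cl, dist = d)
s_dmat <- silhouette(cl, dmatrix = as.matrix(d))

# Both calls should give the same silhouette widths
all.equal(s_dist[, "sil_width"], s_dmat[, "sil_width"])
```

Passing `dmatrix` may be convenient when a full dissimilarity matrix already exists, since it avoids converting it back to a `dist` object.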

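Details

For each observation i, the silhouette width s(i) is defined as follows:
Put a(i) = average dissimilarity between i and all other points of the cluster to which i belongs (if i is the only observation in its cluster, s(i) := 0 without further calculations). For all other clusters C, put d(i,C) = average dissimilarity of i to all observations of C. The smallest of these d(i,C) is b(i) := \min_C d(i,C), and can be seen as the dissimilarity between i and its "neighbor" cluster, i.e., the nearest cluster to which it does not belong. Finally, s(i) := (b(i) - a(i)) / \max(a(i), b(i)).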

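silhouette.default() is now based on C code donated by Romain Francois (the R version is still available as cluster:::silhouette.default.R).

Observations with a large s(i) (close to 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster.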
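
The definition of s(i) can be checked with a short sketch that computes a(i), b(i) and s(i) directly from a dissimilarity matrix. The helper `sil_by_hand()` is hypothetical (it is not part of the cluster package) and is written for clarity rather than speed; the `ruspini` data and `k = 4` are assumptions made only for the illustration.

```r
library(cluster)  # provides silhouette(), pam() and the ruspini data

# Hypothetical helper, not part of cluster: silhouette widths from the definition
sil_by_hand <- function(clustering, dmat) {
  dmat <- as.matrix(dmat)
  ids  <- sort(unique(clustering))
  vapply(seq_along(clustering), function(i) {
    own <- clustering == clustering[i]
    if (sum(own) == 1L) return(0)            # singleton cluster: s(i) := 0
    # a(i): mean dissimilarity to the other members of i's own cluster;
    # d(i,i) = 0, so summing over the whole cluster and dividing by (size - 1) works
    a <- sum(dmat[i, own]) / (sum(own) - 1L)
    # b(i): smallest mean dissimilarity to any cluster not containing i
    others <- ids[ids != clustering[i]]
    b <- min(vapply(others, function(C) mean(dmat[i, clustering == C]),
                    numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
}

data(ruspini)
cl     <- pam(ruspini, 4)$clustering
s_hand <- sil_by_hand(cl, dist(ruspini))
s_pkg  <- as.numeric(silhouette(cl, dist(ruspini))[, "sil_width"])

all.equal(s_hand, s_pkg)   # the two computations should agree
```

The comparison uses `silhouette(cl, dist(ruspini))` rather than `silhouette(pam(...))` so that both results are in the original observation order.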

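Value

silhouette() returns an object, sil, of class silhouette, which is an n \times 3 matrix with attributes. For each observation i, sil[i,] contains the cluster to which i belongs, the neighbor cluster of i (the cluster not containing i for which the average dissimilarity between its observations and i is minimal), and the silhouette width s(i) of the observation. The colnames correspondingly are c("cluster", "neighbor", "sil_width").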

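summary(sil) returns an object of class summary.silhouette, a list with components

si.summary: numerical summary of the individual silhouette widths s(i).
clus.avg.widths: numeric (rank 1) array of clusterwise means of silhouette widths, where mean = FUN is used.
avg.width: the total mean FUN(s), where s are the individual silhouette widths.
clus.sizes: table of the k cluster sizes.
call: if available, the call creating sil.
Ordered: logical identical to attr(sil, "Ordered"), see below.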

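sortSilhouette(sil) orders the rows of sil as in the silhouette plot, by cluster (increasingly) and by decreasing silhouette width s(i).
attr(sil, "Ordered") is a logical indicating whether sil is ordered as by sortSilhouette(). In that case, rownames(sil) will contain case labels or numbers, and
attr(sil, "iOrd") the ordering index vector.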
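
The components and attributes described above can be inspected with a brief sketch. It assumes the `ruspini` data and a `pam()` clustering with 4 clusters, and uses the default `FUN = mean`; the variable names are arbitrary.

```r
library(cluster)  # provides silhouette(), sortSilhouette(), pam(), ruspini

data(ruspini)
cl  <- pam(ruspini, 4)$clustering
sil <- silhouette(cl, dist(ruspini))   # rows in original observation order

# Summary components: overall and clusterwise averages, cluster sizes
ssum <- summary(sil)
ssum$avg.width         # FUN(s) over all widths, here mean(sil[, "sil_width"])
ssum$clus.avg.widths   # one average per cluster
ssum$clus.sizes        # table of the k cluster sizes

# Sorted version, as used by the silhouette plot
ssil <- sortSilhouette(sil)
attr(ssil, "Ordered")          # TRUE after sorting
iOrd <- attr(ssil, "iOrd")     # index vector mapping sorted rows to original rows
head(rownames(ssil))           # case labels or numbers
```

Indexing the unsorted object with `iOrd`, as in `sil[iOrd, ]`, should reproduce the row order of `ssil`.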

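Note

While silhouette() is intrinsic to the partition clusterings, and hence has a (trivial) method for these, it is straightforward to get silhouettes from hierarchical clusterings via silhouette.default(), with cutree() and a distance as input.

By default, for clara() partitions, the silhouette is just for the best random subset used. Use full = TRUE to compute (and later possibly plot) the full silhouette.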

Examples

library(cluster)  # needed when running these examples outside the package help
data(ruspini)
pr4 <- pam(ruspini, 4)
str(si <- silhouette(pr4))
(ssi <- summary(si))
plot(si) # silhouette plot
plot(si, col = c("red", "green", "blue", "purple"))# with cluster-wise coloring

si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
summary(si2) # has small values: "canberra"'s fault
plot(si2, nmax= 80, cex.names=0.6)

op <- par(mfrow= c(3,2), oma= c(0,0, 3, 0),
          mgp= c(1.6,.8,0), mar= .1+c(4,2,2,2))
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
      outer = TRUE, font = par("font.main"), cex = par("cex.main")); frame()

## the same with cluster-wise colours:
c6 <- c("tomato", "forest green", "dark blue", "purple2", "goldenrod4", "gray20")
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE,
        col = c6[1:k])
par(op)

## clara(): standard silhouette is just for the best random subset
data(xclara)
set.seed(7)
str(xc1k <- xclara[ sample(nrow(xclara), size = 1000) ,]) # rownames == indices
cl3 <- clara(xc1k, 3)
plot(silhouette(cl3))# only of the "best" subset of 46
## The full silhouette: internally needs large (36 MB) dist object:
sf <- silhouette(cl3, full = TRUE) ## this is the same as
s.full <- silhouette(cl3$clustering, daisy(xc1k))
stopifnot(all.equal(sf, s.full, check.attributes = FALSE, tolerance = 0))
## color dependent on original "3 groups of each 1000": % __FIXME ??__
plot(sf, col = 2+ as.integer(names(cl3$clustering) ) %/% 1000,
     main ="plot(silhouette(clara(.), full = TRUE))")

## Silhouette for a hierarchical clustering:
ar <- agnes(ruspini)
si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
                  daisy(ruspini))
stopifnot(is.data.frame(di3 <- as.data.frame(si3)))
plot(si3, nmax = 80, cex.names = 0.5)
## 2 groups: Agnes() wasn't too good:
si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
plot(si4, nmax = 80, cex.names = 0.5)

References

Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53-65.

chapter 2 of Kaufman and Rousseeuw (1990), see the references in plot.agnes.

See Also

partition.object, plot.partition.

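Note: This page was compiled by 纯净天空 from the original English R-devel documentation, "Compute or Extract Silhouette Information from Clustering". Unless otherwise stated, copyright of the original code belongs to the original authors; do not reproduce or copy this translation without permission.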