R hclust: Hierarchical Clustering

In R, hclust is part of the stats package.

Description

Hierarchical cluster analysis on a set of dissimilarities, and methods for analyzing it.

Usage

hclust(d, method = "complete", members = NULL)

## S3 method for class 'hclust'
plot(x, labels = NULL, hang = 0.1, check = TRUE,
     axes = TRUE, frame.plot = FALSE, ann = TRUE,
     main = "Cluster Dendrogram",
     sub = NULL, xlab = NULL, ylab = "Height", ...)

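Arguments

`d`	a dissimilarity structure as produced by `dist`.
`method`	the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of `"ward.D"`, `"ward.D2"`, `"single"`, `"complete"`, `"average"` (= UPGMA), `"mcquitty"` (= WPGMA), `"median"` (= WPGMC) or `"centroid"` (= UPGMC).
`members`	`NULL` or a vector whose length is the size of `d`. See the "Details" section.
`x`	an object of the type produced by `hclust`.
`hang`	the fraction of the plot height by which labels should hang below the rest of the plot. A negative value causes the labels to hang down from 0.
`check`	logical indicating whether the `x` object should be checked for validity. The check is unnecessary when `x` is known to be valid, for example when it is the direct result of `hclust()`. The default is `check=TRUE`, because invalid input may crash R through a memory violation in the internal C plotting code.
`labels`	a character vector of labels for the leaves of the tree. By default the row names or row numbers of the original data are used. If `labels = FALSE`, no labels are plotted at all.
`axes, frame.plot, ann`	logical flags as in `plot.default`.
`main, sub, xlab, ylab`	character strings for `title`. `sub` and `xlab` have a non-NULL default when `tree$call` exists.
`...`	further graphical arguments. For example, `cex` controls the size of the labels (if plotted) in the same way as for `text`.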

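Details

This function performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster. The algorithm then proceeds iteratively, at each stage joining the two most similar clusters, and continues until only a single cluster remains. At each stage, the distances between clusters are recomputed by the Lance-Williams dissimilarity update formula according to the particular clustering method in use.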
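The iteration described above can be sketched in a few lines of R. This is a naive O(n^3) illustration under stated assumptions, not the compiled routine that `hclust` actually calls: `naive_agglomerate` is a hypothetical helper name, and only single, complete and average linkage are covered, using the forms that the Lance-Williams update takes for those three methods.

```r
# Hypothetical helper (not part of stats): naive agglomerative clustering
# that returns the sequence of merge heights.
naive_agglomerate <- function(d, method = c("single", "complete", "average")) {
  method <- match.arg(method)
  D <- as.matrix(d)
  diag(D) <- Inf                       # a cluster is never merged with itself
  n <- nrow(D)
  size <- rep(1, n)                    # number of observations per cluster
  active <- rep(TRUE, n)               # clusters that still exist
  heights <- numeric(n - 1)
  for (step in seq_len(n - 1)) {
    idx <- which(active)
    sub <- D[idx, idx, drop = FALSE]
    pos <- arrayInd(which.min(sub), dim(sub))   # closest pair of clusters
    i <- idx[pos[1]]
    j <- idx[pos[2]]
    heights[step] <- D[i, j]
    # Lance-Williams update: dissimilarity from the merged cluster (kept in
    # slot i) to every other live cluster k. For single and complete linkage
    # the formula reduces to min() and max() respectively.
    for (k in setdiff(idx, c(i, j))) {
      dik <- D[i, k]
      djk <- D[j, k]
      D[i, k] <- D[k, i] <- switch(method,
        single   = min(dik, djk),
        complete = max(dik, djk),
        average  = (size[i] * dik + size[j] * djk) / (size[i] + size[j]))
    }
    size[i] <- size[i] + size[j]
    active[j] <- FALSE
  }
  heights
}

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)       # 20 random points, so no tied distances
d <- dist(x)
for (m in c("single", "complete", "average")) {
  print(all.equal(naive_agglomerate(d, m), hclust(d, m)$height))
}
```

On tie-free data such as these random points, the merge heights should agree with those of `hclust` up to floating-point tolerance. With tied dissimilarities the two procedures may break ties differently.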

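A number of different clustering methods are provided. Ward's minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters. The single linkage method (which is closely related to the minimal spanning tree) adopts a "friends of friends" clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between those of the single and complete linkage methods. Note, however, that methods "median" and "centroid" do not lead to a monotone distance measure; equivalently, the resulting dendrograms can have so-called inversions or reversals, which are hard to interpret. See also the trichotomies in Legendre and Legendre (2012).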
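A small, hand-checkable illustration of such an inversion follows. The three points are made up for this sketch: they form a nearly equilateral triangle, so the centroid of the first merged pair ends up closer to the third point than the two merged points were to each other.

```r
# Three points forming a nearly equilateral triangle (illustrative data).
x <- rbind(A = c(0, 0), B = c(2, 0), C = c(1, 1.8))
d2 <- dist(x)^2                        # squared Euclidean distances

hc_cen <- hclust(d2, method = "centroid")
hc_com <- hclust(d2, method = "complete")

hc_cen$height                          # the second merge is lower than the first
is.unsorted(hc_cen$height)             # TRUE: an inversion
is.unsorted(hc_com$height)             # FALSE: complete linkage is monotone
```

Here A and B merge first at squared distance 4; their centroid (1, 0) lies at squared distance 1.8^2 = 3.24 from C, which is below the previous merge height.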

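Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option, "ward", in R versions <= 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option "ward.D2" does implement that criterion (Murtagh and Legendre 2014). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(*, method="ward") corresponds to hclust(*, "ward.D2").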
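The relationship between the two options can be checked with a short sketch. It assumes, following the description above and Murtagh and Legendre (2014), that "ward.D2" amounts to running "ward.D" on squared dissimilarities and taking the square root of the resulting heights; the object names are illustrative.

```r
d <- dist(USArrests)

hc_D2   <- hclust(d,   method = "ward.D2")  # Ward's criterion on distances
hc_D_sq <- hclust(d^2, method = "ward.D")   # "ward.D" on squared distances

# Heights should agree after a square root, and the merge order should match.
all.equal(hc_D2$height, sqrt(hc_D_sq$height))
all.equal(hc_D2$merge, hc_D_sq$merge)
```

Applying "ward.D" to unsquared distances, by contrast, generally yields a different tree, which is why the Examples label that call as "wrong".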

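If members != NULL, then d is taken to be a dissimilarity matrix between clusters rather than between singletons, and members gives the number of observations per cluster. In this way the hierarchical clustering algorithm can be "started in the middle of the dendrogram", for example to reconstruct the part of the tree above a cut (see the examples). Dissimilarities between clusters can be computed efficiently (that is, without hclust itself) only for a limited number of distance/linkage combinations, the simplest being squared Euclidean distance with centroid linkage. In that case the dissimilarities between clusters are the squared Euclidean distances between the cluster means.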
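The centroid case can be verified by hand on a tiny data set. The three points and the object names below are made up for this sketch; the second part restarts the clustering from two clusters and passes their sizes through `members`.

```r
x <- rbind(A = c(0, 0), B = c(2, 0), C = c(1, 3))
hc <- hclust(dist(x)^2, method = "centroid")
hc$height                              # A and B merge first, then {A, B} with C

# Squared distance between the mean of {A, B} and C, computed by hand.
cent <- rbind(AB = colMeans(x[c("A", "B"), ]), C = x["C", ])
by_hand <- sum((cent["AB", ] - cent["C", ])^2)
all.equal(hc$height[2], by_hand)

# Restart from the two clusters, telling hclust their sizes via `members`.
hc_restart <- hclust(dist(cent)^2, method = "centroid", members = c(2, 1))
all.equal(hc_restart$height, by_hand)
```

The mean of A and B is (1, 0), so its squared distance to C = (1, 3) is 9, which is also the height of the final merge in both trees.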

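In hierarchical cluster displays, a decision is needed at each merge to specify which subtree should go on the left and which on the right. Since there are n-1 merges for n observations, there are 2^(n-1) possible orderings of the leaves in a cluster tree, or dendrogram. The algorithm used in hclust orders the subtrees so that the tighter cluster is on the left (the last, that is, most recent, merge of the left subtree is at a lower value than the last merge of the right subtree). Single observations are the tightest clusters possible, and merges involving two observations place them in order of their observation sequence number.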

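Value

An object of class hclust which describes the tree produced by the clustering process. The object is a list with components:

`merge`	an `n-1` by 2 matrix. Row `i` of `merge` describes the merging of clusters at step `i` of the clustering. If an element `j` in the row is negative, then observation `-j` was merged at this stage. If `j` is positive, then the merge was with the cluster formed at the (earlier) stage `j` of the algorithm. Thus negative entries in `merge` indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.
`height`	a set of `n-1` real values (non-decreasing for ultrametric trees). The clustering height: that is, the value of the criterion associated with the clustering `method` for the particular agglomeration.
`order`	a vector giving a permutation of the original observations suitable for plotting, in the sense that a cluster plot using this ordering and the matrix `merge` will not have crossing branches.
`labels`	labels for each of the objects being clustered.
`call`	the call which produced the result.
`method`	the cluster method that has been used.
`dist.method`	the distance that was used to create `d` (only returned if the distance object has a `"method"` attribute).

There are print, plot and identify (see identify.hclust) methods and the rect.hclust() function for hclust objects.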
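To make the components concrete, the following sketch clusters four made-up points on a line with single linkage and inspects the returned list. The data and object names are illustrative only.

```r
x <- c(a = 0, b = 1, c = 4, d = 9)     # four points on a line
hc <- hclust(dist(x), method = "single")

hc$merge    # negative entries: observations; positive: earlier merge steps
hc$height   # single-linkage merge heights
hc$order    # leaf order used by plot(hc)
hc$labels   # taken from the names of x

cutree(hc, k = 2)                      # cluster membership for two groups
```

With these points, a and b merge first at distance 1, c joins them at distance 3 (its distance to b), and d joins last at distance 5 (its distance to c), so cutting into two groups separates d from the rest.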

Note

Method "centroid" is typically meant to be used with squared Euclidean distances.

Examples

require(graphics)

### Example 1: Violent crime rates by US state

hc <- hclust(dist(USArrests), "ave")
plot(hc)
plot(hc, hang = -1)

## Do the same with centroid clustering and *squared* Euclidean distance,
## cut the tree into ten clusters and reconstruct the upper part of the
## tree from the cluster centers.
hc <- hclust(dist(USArrests)^2, "cen")
memb <- cutree(hc, k = 10)
cent <- NULL
for(k in 1:10){
  cent <- rbind(cent, colMeans(USArrests[memb == k, , drop = FALSE]))
}
hc1 <- hclust(dist(cent)^2, method = "cen", members = table(memb))
opar <- par(mfrow = c(1, 2))
plot(hc,  labels = FALSE, hang = -1, main = "Original Tree")
plot(hc1, labels = FALSE, hang = -1, main = "Re-start from 10 clusters")
par(opar)

### Example 2: Straight-line distances among 10 US cities
##  Compare the results of algorithms "ward.D" and "ward.D2"

mds2 <- -cmdscale(UScitiesD)
plot(mds2, type="n", axes=FALSE, ann=FALSE)
text(mds2, labels=rownames(mds2), xpd = NA)

hcity.D  <- hclust(UScitiesD, "ward.D") # "wrong"
hcity.D2 <- hclust(UScitiesD, "ward.D2")
opar <- par(mfrow = c(1, 2))
plot(hcity.D,  hang=-1)
plot(hcity.D2, hang=-1)
par(opar)

Author(s)

The hclust function is based on Fortran code contributed to STATLIB by F. Murtagh.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole. (S version.)

Everitt, B. (1974). Cluster Analysis. London: Heinemann Educ. Books.

Hartigan, J.A. (1975). Clustering Algorithms. New York: Wiley.

Sneath, P. H. A. and R. R. Sokal (1973). Numerical Taxonomy. San Francisco: Freeman.

Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press: New York.

Gordon, A. D. (1999). Classification. Second Edition. London: Chapman and Hall / CRC.

Murtagh, F. (1985). “Multidimensional Clustering Algorithms”, in COMPSTAT Lectures 4. Wuerzburg: Physica-Verlag (for algorithmic details of algorithms used).

McQuitty, L.L. (1966). Similarity Analysis by Reciprocal Pairs for Discrete and Continuous Data. Educational and Psychological Measurement, 26, 825-831. doi:10.1177/001316446602600402.

Legendre, P. and L. Legendre (2012). Numerical Ecology, 3rd English ed. Amsterdam: Elsevier Science BV.

Murtagh, Fionn and Legendre, Pierre (2014). Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification, 31, 274-295. doi:10.1007/s00357-014-9161-z.

See Also

identify.hclust, rect.hclust, cutree, dendrogram, kmeans.

For the Lance-Williams formula and methods that apply it generally, see agnes from package cluster.

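Note: This page was selected and compiled by 純淨天空 from the original English documentation "Hierarchical Clustering" by R-devel. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.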