Python SciPy hierarchy.linkage用法及代碼示例

本文簡要介紹 python 語言中 scipy.cluster.hierarchy.linkage 的用法。

用法: scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean', optimal_ordering=False)#

執行分層/凝聚聚類。

輸入 y 可以是一維壓縮距離矩陣或二維觀察向量數組。

如果 y 是一維壓縮距離矩陣，則 y 必須是 \(\binom{n}{2}\) 大小的向量，其中 n 是在距離矩陣中配對的原始觀測值的數量。此函數的行為與 MATLAB 鏈接函數非常相似。

返回一個\((n-1)\) x 4 矩陣Z。在 \(i\) -th 迭代中，具有索引 Z[i, 0] 和 Z[i, 1] 的集群組合形成集群 \(n + i\) 。索引小於 \(n\) 的集群對應於 \(n\) 原始觀測值之一。簇 Z[i, 0] 和 Z[i, 1] 之間的距離由 Z[i, 2] 給出。第四個值Z[i, 3] 表示新形成的聚類中原始觀測值的數量。

以下鏈接方法用於計算兩個簇 \(s\) 和 \(t\) 之間的距離 \(d(s, t)\)。該算法從尚未在正在形成的層次結構中使用的集群森林開始。當來自該林的兩個集群 \(s\) 和 \(t\) 組合成一個集群時， \(u\) 、 \(s\) 和 \(t\) 將從森林中刪除，並將 \(u\) 添加到森林中。當森林中隻剩下一個簇時，算法停止，這個簇成為根。

每次迭代都維護一個距離矩陣。 d[i,j] 條目對應於原始森林中簇\(i\) 和\(j\) 之間的距離。

在每次迭代中，算法必須更新距離矩陣以反映新形成的集群 u 與森林中剩餘集群的距離。

假設在集群 \(u\) 中有 \(|u|\) 原始觀測值 \(u[0], \ldots, u[|u|-1]\) ，在集群 \(v\) 中有 \(|v|\) 原始對象 \(v[0], \ldots, v[|v|-1]\) 。回想一下，\(s\) 和 \(t\) 組合在一起形成集群 \(u\) 。讓 \(v\) 是林中不是 \(u\) 的任何剩餘集群。

以下是計算新形成的簇 \(u\) 和每個 \(v\) 之間距離的方法。

method=’single’ assigns

\[d(u,v) = \min(dist(u[i],v[j]))\]

for all points \(i\) in cluster \(u\) and \(j\) in cluster \(v\). This is also known as the Nearest Point Algorithm.

method=’complete’ assigns

\[d(u, v) = \max(dist(u[i],v[j]))\]

for all points \(i\) in cluster u and \(j\) in cluster \(v\). This is also known by the Farthest Point Algorithm or Voor Hees Algorithm.

method=’average’ assigns

\[d(u,v) = \sum_{ij} \frac{d(u[i], v[j])} {(|u|*|v|)}\]

for all points \(i\) and \(j\) where \(|u|\) and \(|v|\) are the cardinalities of clusters \(u\) and \(v\), respectively. This is also called the UPGMA algorithm.

method=’weighted’ assigns

\[d(u,v) = (dist(s,v) + dist(t,v))/2\]

where cluster u was formed with cluster s and t and v is a remaining cluster in the forest (also called WPGMA).

method=’centroid’ assigns

\[dist(s,t) = ||c_s-c_t||_2\]

where \(c_s\) and \(c_t\) are the centroids of clusters \(s\) and \(t\), respectively. When two clusters \(s\) and \(t\) are combined into a new cluster \(u\), the new centroid is computed over all the original objects in clusters \(s\) and \(t\). The distance then becomes the Euclidean distance between the centroid of \(u\) and the centroid of a remaining cluster \(v\) in the forest. This is also known as the UPGMC algorithm.

method=’median’ assigns \(d(s,t)\) like the centroid method. When two clusters \(s\) and \(t\) are combined into a new cluster \(u\), the average of centroids s and t give the new centroid \(u\). This is also known as the WPGMC algorithm.

method=’ward’ uses the Ward variance minimization algorithm. The new entry \(d(u,v)\) is computed as follows,

\[d(u,v) = \sqrt{\frac{|v|+|s|} {T}d(v,s)^2 + \frac{|v|+|t|} {T}d(v,t)^2 - \frac{|v|} {T}d(s,t)^2}\]

where \(u\) is the newly joined cluster consisting of clusters \(s\) and \(t\), \(v\) is an unused cluster in the forest, \(T=|v|+|s|+|t|\), and \(|*|\) is the cardinality of its argument. This is also known as the incremental algorithm.

警告：選擇森林中的最小距離對時，可能有兩個或更多對具有相同的最小距離。此實現可以選擇與 MATLAB 版本不同的最小值。

參數：：

y： ndarray: 一個壓縮的距離矩陣。壓縮距離矩陣是包含距離矩陣的上三角形的平麵陣列。這是pdist 返回的形式。或者，\(n\) 維度中的\(m\) 觀察向量的集合可以作為\(m\) 由\(n\) 數組傳遞。壓縮距離矩陣的所有元素都必須是有限的，即沒有 NaNs 或 infs。
method： str，可選: 要使用的鏈接算法。有關完整說明，請參閱下麵的 Linkage Methods 部分。
metric： str 或函數，可選: 在 y 是觀察向量集合的情況下使用的距離度量；否則忽略。有關有效距離度量的列表，請參閱pdist 函數。也可以使用自定義距離函數。
optimal_ordering：布爾型，可選: 如果為 True，則鏈接矩陣將重新排序，以便連續葉子之間的距離最小。當數據可視化時，這會產生更直觀的樹結構。默認為 False，因為該算法可能很慢，特別是在大型數據集上 [2]。另請參見 optimal_leaf_ordering 函數。

Z： ndarray: 層次聚類編碼為鏈接矩陣。

注意：

對於方法‘single’，實現了基於最小生成樹的優化算法。它的時間複雜度為\(O(n^2)\)。對於方法‘complete’, ‘average’, ‘weighted’和‘ward’，實現了一種稱為nearest-neighbors鏈的算法。它還具有時間複雜度\(O(n^2)\)。對於其他方法，以 \(O(n^3)\) 時間複雜度實現樸素算法。所有算法都使用\(O(n^2)\) 內存。有關算法的詳細信息，請參閱[1]。
僅當使用歐幾裏德成對度量時，方法 ‘centroid’, ‘median’ 和 ‘ward’ 才能正確定義。如果 y 作為預先計算的成對距離傳遞，則用戶有責任確保這些距離實際上是歐幾裏得距離，否則生成的結果將不正確。

參考：

[1]

Daniel Mullner，“現代層次、凝聚聚類算法”，arXiv:1109.2378v1。

[2]

Ziv Bar-Joseph、David K. Gifford、Tommi S. Jaakkola，“層次聚類的快速最優葉排序”，2001 年。生物信息學 DOI:10.1093/bioinformatics/17.suppl_1.S22

例子：

>>> from scipy.cluster.hierarchy import dendrogram, linkage
>>> from matplotlib import pyplot as plt
>>> X = [[i] for i in [2, 8, 0, 4, 1, 9, 9, 0]]

>>> Z = linkage(X, 'ward')
>>> fig = plt.figure(figsize=(25, 10))
>>> dn = dendrogram(Z)

>>> Z = linkage(X, 'single')
>>> fig = plt.figure(figsize=(25, 10))
>>> dn = dendrogram(Z)
>>> plt.show()

相關用法

注：本文由純淨天空篩選整理自scipy.org大神的英文原創作品 scipy.cluster.hierarchy.linkage。非經特殊聲明，原始代碼版權歸原作者所有，本譯文未經允許或授權，請勿轉載或複製。

用法:

參數 ：：

返回 ：：

注意：

參考：

例子：

參數：：

返回：：