Python SciPy hierarchy.linkage: usage and code examples

This article briefly describes how to use scipy.cluster.hierarchy.linkage in Python.

Usage: scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean', optimal_ordering=False)

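Perform hierarchical/agglomerative clustering.

The input y may be either a 1-D condensed distance matrix or a 2-D array of observation vectors.

If y is a 1-D condensed distance matrix, then y must be a vector of size \(\binom{n}{2}\), where \(n\) is the number of original observations paired in the distance matrix. The behavior of this function is very similar to that of the MATLAB linkage function.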
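As a minimal sketch of the two accepted input forms, the snippet below builds a condensed distance matrix with pdist and checks that it has \(\binom{n}{2}\) entries and produces the same linkage as the raw observations. The random sample points and the variable names are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Five arbitrary 2-D observations (illustrative data only).
rng = np.random.default_rng(0)
X = rng.random((5, 2))
n = X.shape[0]

# pdist returns the condensed (1-D) form of the pairwise distance matrix.
y = pdist(X, metric='euclidean')
print(y.shape)  # n * (n - 1) / 2 entries

# Clustering the observations and clustering their condensed distances
# should give the same linkage matrix.
Z_from_obs = linkage(X, method='single', metric='euclidean')
Z_from_dist = linkage(y, method='single')
same = np.allclose(Z_from_obs, Z_from_dist)
print(same)
```

When y is already a condensed distance matrix, the metric argument is ignored, which is why it is omitted in the second call.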

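An \((n-1)\) by 4 matrix Z is returned. At the \(i\)-th iteration, the clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster \(n + i\). A cluster with an index less than \(n\) corresponds to one of the \(n\) original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] is the number of original observations in the newly formed cluster.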
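To make the layout of Z concrete, here is a small sketch that clusters three one-dimensional points (chosen only for illustration) and prints each row in words.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three 1-D observations with indices 0, 1, 2 (illustrative data only).
X = np.array([[0.0], [1.0], [5.0]])
n = len(X)

Z = linkage(X, method='single')

# Row i merges clusters Z[i, 0] and Z[i, 1] into new cluster n + i.
for i, (a, b, dist, count) in enumerate(Z):
    print(f"step {i}: merge clusters {int(a)} and {int(b)} "
          f"at distance {dist:g} -> cluster {n + i} "
          f"with {int(count)} observations")
```

In this example the second row refers to index 3, which is not an original observation but the cluster created in the first row (n + 0).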

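The following linkage methods are used to compute the distance \(d(s, t)\) between two clusters \(s\) and \(t\). The algorithm begins with a forest of clusters that have yet to be used in the hierarchy being formed. When two clusters \(s\) and \(t\) from this forest are combined into a single cluster \(u\), \(s\) and \(t\) are removed from the forest, and \(u\) is added to the forest. When only one cluster remains in the forest, the algorithm stops, and this cluster becomes the root.

A distance matrix is maintained at each iteration. The d[i,j] entry corresponds to the distance between clusters \(i\) and \(j\) in the original forest.

At each iteration, the algorithm must update the distance matrix to reflect the distance of the newly formed cluster \(u\) from the remaining clusters in the forest.

Suppose there are \(|u|\) original observations \(u[0], \ldots, u[|u|-1]\) in cluster \(u\) and \(|v|\) original objects \(v[0], \ldots, v[|v|-1]\) in cluster \(v\). Recall that \(s\) and \(t\) are combined to form cluster \(u\). Let \(v\) be any remaining cluster in the forest that is not \(u\).

The following are methods for calculating the distance between the newly formed cluster \(u\) and each \(v\).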

method='single' assigns

\[d(u,v) = \min(dist(u[i],v[j]))\]

for all points \(i\) in cluster \(u\) and \(j\) in cluster \(v\). This is also known as the Nearest Point Algorithm.

method='complete' assigns

\[d(u, v) = \max(dist(u[i],v[j]))\]

for all points \(i\) in cluster \(u\) and \(j\) in cluster \(v\). This is also known as the Farthest Point Algorithm or Voor Hees Algorithm.

method='average' assigns

\[d(u,v) = \sum_{ij} \frac{d(u[i], v[j])} {(|u|*|v|)}\]

for all points \(i\) and \(j\) where \(|u|\) and \(|v|\) are the cardinalities of clusters \(u\) and \(v\), respectively. This is also called the UPGMA algorithm.
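The three definitions above can be checked numerically. This is only a sketch under stated assumptions: the two well-separated groups u and v are invented so that the final merge joins exactly these two clusters, which lets the last row of Z be compared with the minimum, maximum, and mean of the cross-cluster distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist

# Two well-separated groups of 1-D points (illustrative data only).
u = np.array([[0.0], [1.0]])
v = np.array([[10.0], [12.0]])
X = np.vstack([u, v])

# All pairwise distances dist(u[i], v[j]) between the two groups.
cross = cdist(u, v)
expected = {
    'single': cross.min(),    # nearest pair
    'complete': cross.max(),  # farthest pair
    'average': cross.mean(),  # sum over all pairs divided by |u| * |v|
}

# Height of the final merge (u with v) for each method.
final_heights = {m: linkage(X, method=m)[-1, 2] for m in expected}

for m in expected:
    print(m, expected[m], final_heights[m])
```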

method='weighted' assigns

\[d(u,v) = (dist(s,v) + dist(t,v))/2\]

where cluster \(u\) was formed from clusters \(s\) and \(t\), and \(v\) is a remaining cluster in the forest (also called WPGMA).
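The difference between 'weighted' and 'average' only shows up when the merged clusters have unequal sizes. The sketch below works through the update by hand on four made-up one-dimensional points and compares the result with what linkage reports. The merge order stated in the comments follows from the distances in this particular data set.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four 1-D observations (illustrative data only).
X = np.array([[0.0], [1.0], [3.0], [10.0]])

# Step 1: {0} and {1} merge into u1 (distance 1).
# WPGMA update: average of the two old cluster-to-cluster distances.
d_u1_3 = (3.0 + 2.0) / 2     # (dist(0, 3) + dist(1, 3)) / 2 = 2.5
d_u1_10 = (10.0 + 9.0) / 2   # (dist(0, 10) + dist(1, 10)) / 2 = 9.5

# Step 2: u1 and {3} merge into u2 (distance 2.5 < dist(3, 10) = 7).
# WPGMA again averages the two cluster distances, ignoring cluster sizes.
weighted_by_hand = (d_u1_10 + 7.0) / 2

# UPGMA ('average') instead averages over all three original points in u2.
average_by_hand = (10.0 + 9.0 + 7.0) / 3

Z_weighted = linkage(X, method='weighted')
Z_average = linkage(X, method='average')
print(weighted_by_hand, Z_weighted[-1, 2])
print(average_by_hand, Z_average[-1, 2])
```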

method='centroid' assigns

\[dist(s,t) = ||c_s-c_t||_2\]

where \(c_s\) and \(c_t\) are the centroids of clusters \(s\) and \(t\), respectively. When two clusters \(s\) and \(t\) are combined into a new cluster \(u\), the new centroid is computed over all the original objects in clusters \(s\) and \(t\). The distance then becomes the Euclidean distance between the centroid of \(u\) and the centroid of a remaining cluster \(v\) in the forest. This is also known as the UPGMC algorithm.

method='median' assigns \(d(s,t)\) like the centroid method. When two clusters \(s\) and \(t\) are combined into a new cluster \(u\), the average of the centroids of \(s\) and \(t\) gives the new centroid of \(u\). This is also known as the WPGMC algorithm.
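A short sketch of how 'centroid' and 'median' diverge, again on four made-up one-dimensional points. The centroids are tracked by hand here purely for illustration; this mirrors the definitions above and is not a description of how SciPy computes the result internally.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four 1-D observations (illustrative data only).
X = np.array([[0.0], [1.0], [3.0], [10.0]])

# Both methods first merge {0} and {1}, then merge the result with {3}.
# 'centroid': the new centroid is the mean of all original points.
centroid_u2 = np.mean([0.0, 1.0, 3.0])
centroid_by_hand = abs(10.0 - centroid_u2)

# 'median': the new centroid is the midpoint of the two merged centroids.
median_u1 = (0.0 + 1.0) / 2
median_u2 = (median_u1 + 3.0) / 2
median_by_hand = abs(10.0 - median_u2)

Z_centroid = linkage(X, method='centroid')
Z_median = linkage(X, method='median')
print(centroid_by_hand, Z_centroid[-1, 2])
print(median_by_hand, Z_median[-1, 2])
```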

method='ward' uses the Ward variance minimization algorithm. The new entry \(d(u,v)\) is computed as follows,

\[d(u,v) = \sqrt{\frac{|v|+|s|} {T}d(v,s)^2 + \frac{|v|+|t|} {T}d(v,t)^2 - \frac{|v|} {T}d(s,t)^2}\]

where \(u\) is the newly joined cluster consisting of clusters \(s\) and \(t\), \(v\) is an unused cluster in the forest, \(T=|v|+|s|+|t|\), and \(|*|\) is the cardinality of its argument. This is also known as the incremental algorithm.
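As a worked instance of the Ward update, the sketch below evaluates the formula by hand for three made-up one-dimensional points and compares it with the second merge height returned by linkage. The names size_s, d_vs, and so on are illustrative stand-ins for \(|s|\), \(d(v,s)\), etc.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three 1-D observations (illustrative data only).
X = np.array([[0.0], [1.0], [5.0]])
Z = linkage(X, method='ward')

# First merge: s = {0} and t = {1} at distance 1; remaining cluster v = {5}.
size_s, size_t, size_v = 1, 1, 1
d_vs, d_vt, d_st = 5.0, 4.0, 1.0
T = size_v + size_s + size_t

# Ward update for the distance between u = s + t and v.
d_uv = np.sqrt((size_v + size_s) / T * d_vs**2
               + (size_v + size_t) / T * d_vt**2
               - size_v / T * d_st**2)

print(d_uv, Z[1, 2])  # both should equal sqrt(27)
```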

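Warning: When the minimum-distance pair in the forest is chosen, there may be two or more pairs with the same minimum distance. This implementation may choose a different minimum than the MATLAB version.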

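Parameters:

y: ndarray: A condensed distance matrix. A condensed distance matrix is a flat array containing the upper triangle of the distance matrix. This is the form that pdist returns. Alternatively, a collection of \(m\) observation vectors in \(n\) dimensions may be passed as an \(m\) by \(n\) array. All elements of the condensed distance matrix must be finite, i.e., no NaNs or infs.
method: str, optional: The linkage algorithm to use. See the linkage method descriptions above for full details.
metric: str or function, optional: The distance metric to use when y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.
optimal_ordering: bool, optional: If True, the linkage matrix will be reordered so that the distance between successive leaves is minimal. This results in a more intuitive tree structure when the data are visualized. Defaults to False, because this algorithm can be slow, particularly on large datasets [2]. See also the optimal_leaf_ordering function.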
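The parameters can be combined as in the following sketch, which clusters made-up random observations with a non-default method and metric, then repeats the call with optimal_ordering=True. leaves_list is used here only to inspect the resulting leaf order.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Eight arbitrary 3-D observations (illustrative data only).
rng = np.random.default_rng(42)
X = rng.random((8, 3))

# Average linkage on city-block (Manhattan) distances.
Z = linkage(X, method='average', metric='cityblock')

# Same clustering, with leaves reordered so neighbouring leaves are close.
Z_ordered = linkage(X, method='average', metric='cityblock',
                    optimal_ordering=True)

leaves_default = leaves_list(Z)
leaves_optimal = leaves_list(Z_ordered)
print(leaves_default)
print(leaves_optimal)
```

Reordering changes how the tree is drawn, not which observations are merged, so the merge heights are the same in both results.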

Returns:

Z: ndarray: The hierarchical clustering encoded as a linkage matrix.

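Notes:

For method 'single', an optimized algorithm based on the minimum spanning tree is implemented. It has time complexity \(O(n^2)\). For methods 'complete', 'average', 'weighted' and 'ward', an algorithm called nearest-neighbors chain is implemented. It also has time complexity \(O(n^2)\). For the other methods, a naive algorithm with \(O(n^3)\) time complexity is implemented. All algorithms use \(O(n^2)\) memory. Refer to [1] for details about the algorithms.
Methods 'centroid', 'median', and 'ward' are correctly defined only if the Euclidean pairwise metric is used. If y is passed as precomputed pairwise distances, it is the user's responsibility to ensure that these distances are in fact Euclidean; otherwise the result will be incorrect.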
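The note about precomputed distances can be illustrated with a brief sketch on made-up random data. Passing Euclidean distances from pdist reproduces the result obtained from the raw observations, whereas a condensed matrix of city-block distances is accepted without complaint even though the 'ward' result then has no sound interpretation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Six arbitrary 2-D observations (illustrative data only).
rng = np.random.default_rng(1)
X = rng.random((6, 2))

# Ward linkage from observations and from precomputed Euclidean distances.
Z_obs = linkage(X, method='ward')
Z_euclid = linkage(pdist(X, metric='euclidean'), method='ward')
print(np.allclose(Z_obs, Z_euclid))

# Non-Euclidean condensed distances are not rejected here;
# checking that they are Euclidean is left to the caller.
Z_cityblock = linkage(pdist(X, metric='cityblock'), method='ward')
print(Z_cityblock.shape)
```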

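References:

[1] Daniel Müllner, "Modern hierarchical, agglomerative clustering algorithms", arXiv:1109.2378v1.

[2] Ziv Bar-Joseph, David K. Gifford, Tommi S. Jaakkola, "Fast optimal leaf ordering for hierarchical clustering", 2001. Bioinformatics. DOI:10.1093/bioinformatics/17.suppl_1.S22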

Examples:

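>>> from scipy.cluster.hierarchy import dendrogram, linkage
>>> from matplotlib import pyplot as plt
>>> # Eight one-dimensional observations, one per row.
>>> X = [[i] for i in [2, 8, 0, 4, 1, 9, 9, 0]]

>>> # Cluster with Ward linkage and draw the dendrogram.
>>> Z = linkage(X, 'ward')
>>> fig = plt.figure(figsize=(25, 10))
>>> dn = dendrogram(Z)

>>> # Repeat with single linkage in a second figure, then display both.
>>> Z = linkage(X, 'single')
>>> fig = plt.figure(figsize=(25, 10))
>>> dn = dendrogram(Z)
>>> plt.show()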

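Note: This article was selected and compiled by 纯净天空 from the original English documentation for scipy.cluster.hierarchy.linkage on scipy.org. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.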
