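R dist: Distance Matrix Computation

The R function dist is in the stats package.

Description

This function computes and returns the distance matrix obtained by applying the specified distance measure to the rows of a data matrix.

Usage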

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

as.dist(m, diag = FALSE, upper = FALSE)
## Default S3 method:
as.dist(m, diag = FALSE, upper = FALSE)

## S3 method for class 'dist'
print(x, diag = NULL, upper = NULL,
      digits = getOption("digits"), justify = "none",
      right = TRUE, ...)

## S3 method for class 'dist'
as.matrix(x, ...)

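Arguments

`x`	A numeric matrix, data frame or `"dist"` object.
`method`	The distance measure to be used. This must be one of `"euclidean"`, `"maximum"`, `"manhattan"`, `"canberra"`, `"binary"` or `"minkowski"`. Any unambiguous substring can be given.
`diag`	Logical value indicating whether the diagonal of the distance matrix should be printed by `print.dist`.
`upper`	Logical value indicating whether the upper triangle of the distance matrix should be printed by `print.dist`.
`p`	The power of the Minkowski distance.
`m`	An object with distance information to be converted to a `"dist"` object. For the default method, a `"dist"` object, or a matrix (of distances), or an object that can be coerced to such a matrix using `as.matrix()`. (Only the lower triangle of the matrix is used; the rest is ignored.)
`digits, justify`	Passed to `format` inside of `print()`.
`right, ...`	Further arguments, passed to other methods.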
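As a minimal sketch of how `method` and `p` behave (the matrix `pts` is made up for illustration): an unambiguous prefix such as "man" selects "manhattan", and the Minkowski distance with p = 1 or p = 2 should agree with the Manhattan and Euclidean measures.

```r
# Hypothetical 3 x 2 data matrix; each row is one observation.
pts <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

d_full   <- dist(pts, method = "manhattan")
d_prefix <- dist(pts, method = "man")   # unambiguous prefix of "manhattan"

# Compare the stored distances only; the "call" attribute differs.
same_by_prefix <- all(as.vector(d_full) == as.vector(d_prefix))

# Minkowski with p = 1 and p = 2 reduces to Manhattan and Euclidean.
mink1_is_manhattan <- isTRUE(all.equal(
  as.vector(dist(pts, method = "minkowski", p = 1)), as.vector(d_full)))
mink2_is_euclidean <- isTRUE(all.equal(
  as.vector(dist(pts, method = "minkowski", p = 2)), as.vector(dist(pts))))
```

A prefix such as "ma" matches both "maximum" and "manhattan", so it is rejected with an error rather than resolved silently.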

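Details

Available distance measures are (written for two vectors x and y):

euclidean:

Usual distance between the two vectors (2 norm, aka L_2), \sqrt{\sum_i (x_i - y_i)^2}.

maximum:

Maximum distance between two components of x and y (supremum norm).

manhattan:

Absolute distance between the two vectors (1 norm, aka L_1).

canberra:

\sum_i |x_i - y_i| / (|x_i| + |y_i|). Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.

This is intended for non-negative values (e.g., counts), in which case the denominator can be written in various equivalent ways. Originally, R used x_i + y_i, then from 1998 to 2017 |x_i + y_i|, and then the correct |x_i| + |y_i|.

binary:

(aka asymmetric binary): The vectors are regarded as binary bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The distance is the proportion of bits in which only one is on amongst those in which at least one is on. This is also called “Jaccard” distance in some contexts. Here, two all-zero observations have distance 0, whereas in traditional Jaccard definitions the distance would be undefined for that case and give NaN numerically.

minkowski:

The p norm, the pth root of the sum of the pth powers of the absolute differences of the components.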
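The definitions above can be checked against hand-written formulas. The following is a sketch with two made-up vectors; the choice p = 3 for the Minkowski case is arbitrary and only for illustration.

```r
x <- c(1, 2, 3, 4)
y <- c(2, 4, 1, 8)
xy <- rbind(x, y)

# Hand-written versions of the formulas above.
by_hand <- c(
  euclidean = sqrt(sum((x - y)^2)),                 # 5
  maximum   = max(abs(x - y)),                      # 4
  manhattan = sum(abs(x - y)),                      # 9
  canberra  = sum(abs(x - y) / (abs(x) + abs(y))),  # 1.5
  minkowski = sum(abs(x - y)^3)^(1/3)               # p = 3
)

# The same quantities from dist(); p only matters for "minkowski".
from_dist <- sapply(names(by_hand), function(m)
  as.numeric(dist(xy, method = m, p = 3)))

agree <- isTRUE(all.equal(by_hand, from_dist))
```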

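Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA. If some columns are excluded in calculating a Euclidean, Manhattan, Canberra or Minkowski distance, the sum is scaled up proportionally to the number of columns used. If all pairs are excluded when calculating a particular distance, the value is NA.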
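To illustrate the rescaling, here is a small sketch with one missing value in a made-up pair of vectors. It assumes the scaling factor is (total columns) / (columns used) and that it is applied to the sum before any root is taken.

```r
x <- c(1, 2, NA, 4)
y <- c(2, 4, 1, 8)
xy <- rbind(x, y)

used  <- !is.na(x) & !is.na(y)     # 3 of the 4 columns are usable
scale <- length(x) / sum(used)     # 4/3

# Sums over the usable columns, rescaled to the full column count.
euclid_expected <- sqrt(sum((x[used] - y[used])^2) * scale)   # sqrt(28)
manhat_expected <- sum(abs(x[used] - y[used])) * scale        # 28/3

euclid_ok <- isTRUE(all.equal(as.numeric(dist(xy)), euclid_expected))
manhat_ok <- isTRUE(all.equal(
  as.numeric(dist(xy, method = "manhattan")), manhat_expected))

# If every pair is excluded, the distance is NA.
all_missing <- dist(rbind(c(NA_real_, NA_real_), c(1, 2)))
```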

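The "dist" method of as.matrix() and as.dist() can be used for conversion between objects of class "dist" and conventional distance matrices.

as.dist() is a generic function. Its default method handles objects inheriting from class "dist", or coercible to matrices using as.matrix(). Support for classes representing distances (also known as dissimilarities) can be added by providing an as.matrix() or, more directly, an as.dist method for such a class.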
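A short sketch of the conversion in both directions, using a deliberately asymmetric made-up matrix to show that the default as.dist() method reads only the lower triangle:

```r
# Hypothetical asymmetric matrix: the upper triangle differs from the lower.
m <- matrix(1:9, nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

d <- as.dist(m)          # keeps m[2,1], m[3,1], m[3,2] only
lower <- as.vector(d)    # 2 3 6

# Converting back gives a symmetric matrix with a zero diagonal.
m2 <- as.matrix(d)
```

Because the upper triangle is discarded, `m2` is not equal to `m` here; the round trip is only lossless for matrices that are already symmetric with a zero diagonal.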

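Value

dist returns an object of class "dist".

The lower triangle of the distance matrix is stored by columns in a vector, say do. If n is the number of observations, i.e., n <- attr(do, "Size"), then for i < j \le n, the dissimilarity between (row) i and j is do[n*(i-1) - i*(i-1)/2 + j-i]. The length of the vector is n*(n-1)/2, i.e., of order n^2.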
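The index formula can be verified against the full matrix. This is a sketch on random data; `dist_index` is a helper name introduced here for illustration and is not part of the stats package.

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)
do <- dist(x)
n <- attr(do, "Size")
m <- as.matrix(do)

# Position of the (i, j) dissimilarity, i < j <= n, in the stored vector.
dist_index <- function(i, j, n) n * (i - 1) - i * (i - 1) / 2 + j - i

# Check every pair i < j against the full symmetric matrix.
index_ok <- TRUE
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    index_ok <- index_ok && (do[dist_index(i, j, n)] == m[i, j])
  }
}

length_ok <- length(do) == n * (n - 1) / 2
```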

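The object has the following attributes (besides "class" equal to "dist"):

`Size`	Integer, the number of observations in the dataset.
`Labels`	Optionally, contains the labels, if any, of the observations of the dataset.
`Diag, Upper`	Logicals corresponding to the arguments `diag` and `upper` above, specifying how the object should be printed.
`call`	Optionally, the `call` used to create the object.
`method`	Optionally, the distance method used; resulting from `dist()`, the (`match.arg()`ed) `method` argument.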
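These attributes can be inspected with attr(). A brief sketch on a made-up matrix with row names:

```r
obs <- rbind(r1 = c(1, 2), r2 = c(4, 6), r3 = c(7, 10))
d <- dist(obs, method = "max")

size   <- attr(d, "Size")     # number of observations
labels <- attr(d, "Labels")   # row names of obs
method <- attr(d, "method")   # fully expanded method name
diag_flag  <- attr(d, "Diag")
upper_flag <- attr(d, "Upper")
```

Note that `method` holds the full name "maximum" even though the abbreviation "max" was passed.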

Examples

require(graphics)

x <- matrix(rnorm(100), nrow = 5)
dist(x)
dist(x, diag = TRUE)
dist(x, upper = TRUE)
m <- as.matrix(dist(x))
d <- as.dist(m)
stopifnot(d == dist(x))

## Use correlations between variables "as distance"
dd <- as.dist((1 - cor(USJudgeRatings))/2)
round(1000 * dd) # (prints more nicely)
plot(hclust(dd)) # to see a dendrogram of clustered variables

## example of binary and canberra distances.
x <- c(0, 0, 1, 1, 1, 1)
y <- c(1, 0, 1, 1, 0, 1)
dist(rbind(x, y), method = "binary")
## answer 0.4 = 2/5
dist(rbind(x, y), method = "canberra")
## answer 2 * (6/5)

## To find the names
labels(eurodist)

## Examples involving "Inf" :
## 1)
x[6] <- Inf
(m2 <- rbind(x, y))
dist(m2, method = "binary")   # warning, answer 0.5 = 2/4
## These all give "Inf":
stopifnot(Inf == dist(m2, method =  "euclidean"),
          Inf == dist(m2, method =  "maximum"),
          Inf == dist(m2, method =  "manhattan"))
##  "Inf" is same as very large number:
x1 <- x; x1[6] <- 1e100
stopifnot(dist(cbind(x, y), method = "canberra") ==
    print(dist(cbind(x1, y), method = "canberra")))

## 2)
y[6] <- Inf #-> 6-th pair is excluded
dist(rbind(x, y), method = "binary"  )   # warning; 0.5
dist(rbind(x, y), method = "canberra"  ) # 3
dist(rbind(x, y), method = "maximum")    # 1
dist(rbind(x, y), method = "manhattan")  # 2.4

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis. Academic Press.

Borg, I. and Groenen, P. (1997) Modern Multidimensional Scaling. Theory and Applications. Springer.

See Also

daisy in the cluster package, which offers more possibilities in the case of mixed (continuous / categorical) variables. hclust.

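Note: This article was selected and compiled by 纯净天空 from the original English work Distance Matrix Computation by R-devel. Unless otherwise stated, copyright of the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.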