R dist: Distance Matrix Computation

The R function dist is in the stats package.

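Description

This function computes and returns the distance matrix obtained by applying the specified distance measure to the rows of a data matrix.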

Usage

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

as.dist(m, diag = FALSE, upper = FALSE)
## Default S3 method:
as.dist(m, diag = FALSE, upper = FALSE)

## S3 method for class 'dist'
print(x, diag = NULL, upper = NULL,
      digits = getOption("digits"), justify = "none",
      right = TRUE, ...)

## S3 method for class 'dist'
as.matrix(x, ...)

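Arguments

`x`	a numeric matrix, data frame or `"dist"` object.
`method`	the distance measure to be used. This must be one of `"euclidean"`, `"maximum"`, `"manhattan"`, `"canberra"`, `"binary"` or `"minkowski"`. Any unambiguous substring can be given.
`diag`	logical value indicating whether the diagonal of the distance matrix should be printed by `print.dist`.
`upper`	logical value indicating whether the upper triangle of the distance matrix should be printed by `print.dist`.
`p`	the power of the Minkowski distance.
`m`	an object with distance information to be converted to a `"dist"` object. For the default method, a `"dist"` object, or a matrix (of distances), or an object which can be coerced to such a matrix using `as.matrix()`. (Only the lower triangle of the matrix is used, the rest is ignored.)
`digits, justify`	passed to `format` inside of `print()`.
`right, ...`	further arguments, passed to other methods.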
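
As a small illustration of the `method`, `diag` and `upper` arguments, the sketch below uses made-up random data. It assumes that `"man"` is an unambiguous abbreviation of `"manhattan"` while `"ma"` is not (it also matches `"maximum"`), and that `diag` and `upper` only affect printing.

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 4)   # 4 observations, 5 variables (made-up data)

# An unambiguous prefix selects the same measure as the full name.
d_full <- dist(x, method = "manhattan")
d_abbr <- dist(x, method = "man")
same_values <- identical(as.vector(d_full), as.vector(d_abbr))

# An ambiguous prefix is rejected; capture the error message instead of stopping.
ambiguous <- tryCatch(dist(x, method = "ma"),
                      error = function(e) conditionMessage(e))

# diag and upper change how the object prints, not the stored distances.
d_plain   <- dist(x)
d_printed <- dist(x, diag = TRUE, upper = TRUE)
same_store <- identical(as.vector(d_plain), as.vector(d_printed))
```

Comparing with `as.vector()` rather than `identical()` on the objects themselves is deliberate: the stored `call` attribute differs between the two calls.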

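Details

Available distance measures are (written for two vectors x and y):

euclidean:

Usual distance between the two vectors (2 norm, aka L_2), \sqrt{\sum_i (x_i - y_i)^2}.

maximum:

Maximum distance between two components of x and y (supremum norm).

manhattan:

Absolute distance between the two vectors (1 norm, aka L_1).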

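canberra:

\sum_i |x_i - y_i| / (|x_i| + |y_i|). Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.

This is intended for non-negative values (e.g., counts), in which case the denominator can be written in various equivalent ways; originally, R used x_i + y_i, then from 1998 to 2017, |x_i + y_i|, and then the correct |x_i| + |y_i|.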

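binary:

(aka asymmetric binary): The vectors are regarded as binary bits, so non-zero elements are 'on' and zero elements are 'off'. The distance is the proportion of bits in which only one is on amongst those in which at least one is on. This is also called "Jaccard" distance in some contexts. Here, two all-zero observations have distance 0, whereas in traditional Jaccard definitions, the distance would be undefined for that case and give NaN numerically.

minkowski:

The p norm, the pth root of the sum of the pth powers of the differences of the components.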
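
The definitions above can be checked by hand. The following is a sketch on two made-up vectors, not part of the original examples; the canberra line assumes the rescaling for omitted 0/0 terms described in this section, and `p = 3` is an arbitrary choice for the Minkowski case.

```r
x <- c(1, 0, 3, 0, 2)
y <- c(2, 0, 1, 4, 2)
m <- rbind(x, y)

# Canberra: 0/0 terms become NaN, are dropped, and the sum is rescaled
# to the full number of columns.
terms <- abs(x - y) / (abs(x) + abs(y))
keep  <- !is.nan(terms)

# Binary: compare only the on/off pattern of the two vectors.
on_x <- x != 0
on_y <- y != 0

by_hand <- c(
  euclidean = sqrt(sum((x - y)^2)),
  maximum   = max(abs(x - y)),
  manhattan = sum(abs(x - y)),
  canberra  = sum(terms[keep]) * length(terms) / sum(keep),
  binary    = sum(xor(on_x, on_y)) / sum(on_x | on_y),
  minkowski = sum(abs(x - y)^3)^(1 / 3)
)

# The same quantities as computed by dist().
measures  <- c("euclidean", "maximum", "manhattan", "canberra", "binary")
from_dist <- sapply(measures,
                    function(mth) as.numeric(dist(m, method = mth)))
from_dist["minkowski"] <- as.numeric(dist(m, method = "minkowski", p = 3))

agreement <- all.equal(by_hand[names(from_dist)], from_dist)
```

Here `agreement` should be `TRUE` if the hand-written formulas match what `dist()` computes for these vectors.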

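Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA. If some columns are excluded in calculating a Euclidean, Manhattan, Canberra or Minkowski distance, the sum is scaled up proportionally to the number of columns used. If all pairs are excluded when calculating a particular distance, the value is NA.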
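
To make the rescaling concrete, here is a sketch with one missing value in made-up data. It assumes that for the Euclidean distance the sum of squares is scaled up before the square root is taken, which is consistent with the statement that the sum (not the final distance) is scaled.

```r
x <- c(1, 2, NA, 4)
y <- c(2, 4, 6, 8)
m <- rbind(x, y)

used  <- !is.na(x) & !is.na(y)     # columns that enter the computation
scale <- length(x) / sum(used)     # 4 columns in total, 3 actually used

manhattan_hand <- sum(abs(x - y)[used]) * scale
euclidean_hand <- sqrt(sum(((x - y)^2)[used]) * scale)

manhattan_dist <- as.numeric(dist(m, method = "manhattan"))
euclidean_dist <- as.numeric(dist(m, method = "euclidean"))

# When no column has both values present, the distance itself is NA.
all_excluded <- as.numeric(dist(rbind(c(NA, 1), c(1, NA))))
```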

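The "dist" method of as.matrix() and as.dist() can be used for conversion between objects of class "dist" and conventional distance matrices.

as.dist() is a generic function. Its default method handles objects inheriting from class "dist", or coercible to matrices using as.matrix(). Support for classes representing distances (also known as dissimilarities) can be added by providing an as.matrix() or, more directly, an as.dist method for such a class.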
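
The following sketch shows one way such support might be added. The class name `"mydiss"` and its `$values` component are hypothetical and exist only for this illustration; the explicit `registerS3method()` call is included so that dispatch also works when the code is not run at top level.

```r
# Hypothetical class "mydiss": a list whose $values component is a
# symmetric matrix of dissimilarities.
as.matrix.mydiss <- function(x, ...) x$values
registerS3method("as.matrix", "mydiss", as.matrix.mydiss)

obj <- structure(
  list(values = matrix(c(0, 1, 2,
                         1, 0, 3,
                         2, 3, 0),
                       nrow = 3,
                       dimnames = list(c("a", "b", "c"), c("a", "b", "c")))),
  class = "mydiss"
)

# The default as.dist() method coerces via as.matrix() and keeps
# only the lower triangle.
d <- as.dist(obj)

# Converting back gives a conventional symmetric matrix.
back <- as.matrix(d)
```

Providing only an `as.matrix()` method is the lighter option; a dedicated `as.dist` method would be worth writing when the class can produce the lower triangle without building the full matrix.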

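Value

dist returns an object of class "dist".

The lower triangle of the distance matrix stored by columns in a vector, say do. If n is the number of observations, i.e., n <- attr(do, "Size"), then for i < j \le n, the dissimilarity between (row) i and j is do[n*(i-1) - i*(i-1)/2 + j-i]. The length of the vector is n*(n-1)/2, i.e., of order n^2.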
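
The index formula can be verified against the full matrix. In this sketch `dist_at()` is a hypothetical helper written for the illustration, and the data are random.

```r
set.seed(42)
x  <- matrix(rnorm(30), nrow = 6)
do <- dist(x)
n  <- attr(do, "Size")

# Hypothetical helper: look up the distance between rows i and j (i < j)
# directly in the stored lower-triangle vector.
dist_at <- function(do, i, j) {
  n <- attr(do, "Size")
  stopifnot(i < j, j <= n)
  do[n * (i - 1) - i * (i - 1) / 2 + j - i]
}

# Compare every pair with the corresponding entry of the full matrix.
full <- as.matrix(do)
all_match <- TRUE
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    all_match <- all_match &&
      isTRUE(all.equal(dist_at(do, i, j), full[i, j]))
  }
}

stored_length <- length(do)   # should equal n * (n - 1) / 2
```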

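The object has the following attributes (besides "class" equal to "dist"):

`Size`	integer, the number of observations in the dataset.
`Labels`	optionally, contains the labels, if any, of the observations of the dataset.
`Diag, Upper`	logicals corresponding to the arguments `diag` and `upper` above, specifying how the object should be printed.
`call`	optionally, the `call` used to create the object.
`method`	optionally, the distance method used; resulting from `dist()`, the (`match.arg()`ed) `method` argument.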
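
These attributes can be inspected with `attr()`. The sketch below uses a small made-up matrix with row names so that `Labels` is present.

```r
x <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))
d <- dist(x, method = "manhattan", upper = TRUE)

size    <- attr(d, "Size")     # number of observations
labs    <- attr(d, "Labels")   # taken from the row names of x
diag_f  <- attr(d, "Diag")     # printing flag, from the diag argument
upper_f <- attr(d, "Upper")    # printing flag, from the upper argument
meth    <- attr(d, "method")   # full name of the distance measure
cl      <- attr(d, "call")     # the call that created the object
```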

Examples

require(graphics)

x <- matrix(rnorm(100), nrow = 5)
dist(x)
dist(x, diag = TRUE)
dist(x, upper = TRUE)
m <- as.matrix(dist(x))
d <- as.dist(m)
stopifnot(d == dist(x))

## Use correlations between variables "as distance"
dd <- as.dist((1 - cor(USJudgeRatings))/2)
round(1000 * dd) # (prints more nicely)
plot(hclust(dd)) # to see a dendrogram of clustered variables

## example of binary and canberra distances.
x <- c(0, 0, 1, 1, 1, 1)
y <- c(1, 0, 1, 1, 0, 1)
dist(rbind(x, y), method = "binary")
## answer 0.4 = 2/5
dist(rbind(x, y), method = "canberra")
## answer 2 * (6/5)

## To find the names
labels(eurodist)

## Examples involving "Inf" :
## 1)
x[6] <- Inf
(m2 <- rbind(x, y))
dist(m2, method = "binary")   # warning, answer 0.5 = 2/4
## These all give "Inf":
stopifnot(Inf == dist(m2, method =  "euclidean"),
          Inf == dist(m2, method =  "maximum"),
          Inf == dist(m2, method =  "manhattan"))
##  "Inf" is same as very large number:
x1 <- x; x1[6] <- 1e100
stopifnot(dist(cbind(x, y), method = "canberra") ==
    print(dist(cbind(x1, y), method = "canberra")))

## 2)
y[6] <- Inf #-> 6-th pair is excluded
dist(rbind(x, y), method = "binary"  )   # warning; 0.5
dist(rbind(x, y), method = "canberra"  ) # 3
dist(rbind(x, y), method = "maximum")    # 1
dist(rbind(x, y), method = "manhattan")  # 2.4

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis. Academic Press.

Borg, I. and Groenen, P. (1997) Modern Multidimensional Scaling. Theory and Applications. Springer.

See Also

daisy in the cluster package, which offers more possibilities in the case of mixed (continuous / categorical) variables. hclust.

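Note: This article was selected and compiled by 純淨天空 from the original English work Distance Matrix Computation by the R-devel authors. Unless otherwise stated, copyright in the original code belongs to the original authors; please do not reproduce or copy this translation without permission or authorization.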