R density 核密度估計

R語言 density 位於 stats 包(package)。

說明

(S3) 通用函數 density 計算核密度估計。它的默認方法使用給定的核和單變量觀測的帶寬來實現這一點。

用法

density(x, ...)
## Default S3 method:
density(x, bw = "nrd0", adjust = 1,
        kernel = c("gaussian", "epanechnikov", "rectangular",
                   "triangular", "biweight",
                   "cosine", "optcosine"),
        weights = NULL, window = kernel, width,
        give.Rkern = FALSE, subdensity = FALSE,
        warnWbw = var(weights) > 0,
        n = 512, from, to, cut = 3, ext = 4,
        old.coords = FALSE,
        na.rm = FALSE, ...)

參數

`x`	用於計算估計值的數據。對於默認方法，數字向量：不支持長向量。
`bw`	要使用的平滑帶寬。對內核進行縮放，使其成為平滑內核的標準差。 (請注意，這與下麵引用的參考書以及S-PLUS不同。) `bw`也可以是給出選擇帶寬的規則的字符串。看`bw.nrd`. 默認情況下，`"nrd0"`，出於曆史和兼容性原因仍然是默認值，而不是作為一般建議，例如，`"SJ"`更適合，另見 Venables 和 Ripley (2002)。 `bw` 的指定(或計算)值乘以 `adjust` 。
`adjust`	使用的帶寬實際上是 `adjust*bw` 。這樣可以輕鬆指定“默認帶寬的一半”等值。
`kernel` , `window`	給出要使用的平滑內核的字符串。這必須部分匹配 `"gaussian"` 、 `"rectangular"` 、 `"triangular"` 、 `"epanechnikov"` 、 `"biweight"` 、 `"cosine"` 或 `"optcosine"` 之一，默認為 `"gaussian"` ，並且可以縮寫為唯一的前綴(單個信)。 `"cosine"` 比 `"optcosine"` 更平滑，這是文獻中常見的 ‘cosine’ 內核，幾乎是MSE-efficient。然而`"cosine"`是S使用的版本。
`weights`	非負觀察權重的數值向量，因此與 `x` 長度相同。默認 `NULL` 相當於 `weights = rep(1/nx, nx)` ，其中 `nx` 是 `x[]` (的有限條目)的長度。如果 `na.rm = TRUE` 和 `x` 中有 `NA` ，則在計算之前刪除它們和相應的權重。在這種情況下，當原始權重總計為 1 時，它們會重新縮放以繼續這樣做。請注意，自動帶寬規則不考慮權重，即當 `bw` 是字符串時。當權重與真實計數 `cn` 成比例時，可以使用 `density(x = rep(x, cn))` 代替 `weights` 。
`width`	這是為了與 S 兼容而存在的；如果給定，而 `bw` 沒有，則將 `bw` 設置為 `width`(如果這是字符串)，或者設置為 `width` 的 kernel-dependent 倍數(如果這是數字)。
`give.Rkern`	邏輯性；如果為 true，則不估計密度，而是返回所選 `kernel` 的“規範帶寬”。
`subdensity`	僅當指定`weights`且其總和不為一時才使用。如果為 true，則表示需要 “sub-density”，並且不應發出任何警告信號。默認情況下，如果為 false，則當權重之和不等於 1 時，會發出 `warning` 信號。
`warnWbw`	`logical` ，僅在指定 `weights` 且 `bw` 為 `character` 時使用，即選擇自動帶寬選擇(默認情況下)。如果為 true(默認情況下)，則會發出 `warning` 信號，提醒用戶自動帶寬選擇不會考慮權重，因此可能不是最佳選擇。
`n`	要估計密度的等距點的數量。當 `n > 512` 時，在計算過程中會向上舍入到 2 的冪(因為使用了 `fft` )，並且最終結果由 `approx` 進行插值。因此，將 `n` 指定為 2 的冪幾乎總是有意義的。
`from` , `to`	要估計密度的網格的最左邊和最右邊的點；默認值是 `range(x)` 之外的 `cut * bw` 。
`cut`	默認情況下，`from` 和`to` 的值是超出數據極值的`cut` 帶寬。這使得估計密度在極端情況下下降到大約為零。
`ext`	正擴展因子，默認為`4`。值`from`和`to`在兩側進一步擴展為`lo <- from - ext * bw`和`up <- to + ext * bw`，然後用於構建用於FFT和插值的網格，請參見上麵的`n`。除非您知道自己在做什麽，否則不要改變！
`old.coords`	`logical` 需要 R 4.4.0 之前的行為，該行為會給出大約 `(1 + 1/(2n-2))` 的過大值。
`na.rm`	邏輯性；如果 `TRUE` ，則從 `x` 中刪除缺失值。如果 `FALSE` 任何缺失值都會導致錯誤。
`...`	(非默認)方法的進一步參數。

細節

density.default 中使用的算法將經驗分布函數的質量分散在至少 512 個點的規則網格上，然後使用快速傅裏葉變換將此近似值與核的離散版本進行卷積，然後使用線性近似來評估指定點處的密度。

內核的統計特性由下式決定：\sigma^2_K = \int t^2 K(t) dt這總是= 1對於我們的內核(因此帶寬bw是核的標準差)和R(K) = \int K^2(t) dt.
MSE-equivalent 帶寬(針對不同內核)與\sigma_K R(K)這是尺度不變的，對於我們的內核來說等於R(K)。當以下情況時返回該值give.Rkern = TRUE。請參閱使用精確等效帶寬的示例。

假定 x 中的無限值對應於 +/-Inf 處的點質量，並且密度估計是 (-Inf, +Inf) 上的 sub-density 的。

值

如果 give.Rkern 為 true，則為數字 R(K) ，否則為類 "density" 的對象，其底層結構是包含以下組件的列表。

`x`	估計密度的點的 `n` 坐標。
`y`	估計的密度值。這些將為非負數，但可以為零。
`bw`	使用的帶寬。
`n`	消除缺失值後的樣本量。
`call`	產生結果的調用。
`data.name`	`x` 參數的解析名稱。
`has.na`	邏輯，為了兼容性(始終 `FALSE` )。

print 方法報告 x 和 y 組件上的 summary 值。

例子

require(graphics)

plot(density(c(-20, rep(0,98), 20)), xlim = c(-4, 4))  # IQR = 0

# The Old Faithful geyser data
d <- density(faithful$eruptions, bw = "sj")
d
plot(d)

plot(d, type = "n")
polygon(d, col = "wheat")

## Missing values:
x <- xx <- faithful$eruptions
x[i.out <- sample(length(x), 10)] <- NA
doR <- density(x, bw = 0.15, na.rm = TRUE)
lines(doR, col = "blue")
points(xx[i.out], rep(0.01, 10))

## Weighted observations:
fe <- sort(faithful$eruptions) # has quite a few non-unique values
## use 'counts / n' as weights:
dw <- density(unique(fe), weights = table(fe)/length(fe), bw = d$bw)
utils::str(dw) ## smaller n: only 126, but identical estimate:
stopifnot(all.equal(d[1:3], dw[1:3]))

## simulation from a density() fit:
# a kernel density fit is an equally-weighted mixture.
fit <- density(xx)
N <- 1e6
x.new <- rnorm(N, sample(xx, size = N, replace = TRUE), fit$bw)
plot(fit)
lines(density(x.new), col = "blue")


## The available kernels:
(kernels <- eval(formals(density.default)$kernel))

## show the kernels in the R parametrization
plot (density(0, bw = 1), xlab = "",
      main = "R's density() kernels with bw = 1")
for(i in 2:length(kernels))
   lines(density(0, bw = 1, kernel =  kernels[i]), col = i)
legend(1.5,.4, legend = kernels, col = seq(kernels),
       lty = 1, cex = .8, y.intersp = 1)

## show the kernels in the S parametrization
plot(density(0, from = -1.2, to = 1.2, width = 2, kernel = "gaussian"),
     type = "l", ylim = c(0, 1), xlab = "",
     main = "R's density() kernels with width = 1")
for(i in 2:length(kernels))
   lines(density(0, width = 2, kernel =  kernels[i]), col = i)
legend(0.6, 1.0, legend = kernels, col = seq(kernels), lty = 1)

##-------- Semi-advanced theoretic from here on -------------


## Explore the old.coords TRUE --> FALSE change:
set.seed(7); x <- runif(2^12) # N = 4096
den  <- density(x) # -> grid of n = 512 points
den0 <- density(x, old.coords = TRUE)
summary(den0$y / den$y) # 1.001 ... 1.011
summary(    den0$y / den$y - 1) # ~= 1/(2n-2)
summary(1/ (den0$y / den$y - 1))# ~=    2n-2 = 1022
corr0 <- 1 - 1/(2*512-2) # 1 - 1/(2n-2)
all.equal(den$y, den0$y * corr0)# ~ 0.0001
plot(den$x, (den0$y - den$y)/den$y, type='o', cex=1/4)
title("relative error of density(runif(2^12), old.coords=TRUE)")
abline(h = 1/1022, v = range(x), lty=2); axis(2, at=1/1022, "1/(2n-2)", las=1)


## The R[K] for our kernels:
(RKs <- cbind(sapply(kernels,
                     function(k) density(kernel = k, give.Rkern = TRUE))))
100*round(RKs["epanechnikov",]/RKs, 4) ## Efficiencies

bw <- bw.SJ(precip) ## sensible automatic choice
plot(density(precip, bw = bw),
     main = "same sd bandwidths, 7 different kernels")
for(i in 2:length(kernels))
   lines(density(precip, bw = bw, kernel = kernels[i]), col = i)

## Bandwidth Adjustment for "Exactly Equivalent Kernels"
h.f <- sapply(kernels, function(k)density(kernel = k, give.Rkern = TRUE))
(h.f <- (h.f["gaussian"] / h.f)^ .2)
## -> 1, 1.01, .995, 1.007,... close to 1 => adjustment barely visible..

plot(density(precip, bw = bw),
     main = "equivalent bandwidths, 7 different kernels")
for(i in 2:length(kernels))
   lines(density(precip, bw = bw, adjust = h.f[i], kernel = kernels[i]),
         col = i)
legend(55, 0.035, legend = kernels, col = seq(kernels), lty = 1)

參考

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole (for S version).

Scott, D. W. (1992). Multivariate Density Estimation. Theory, Practice and Visualization. New York: Wiley.

Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society Series B, 53, 683-690. doi:10.1111/j.2517-6161.1991.tb01857.x.

Silverman, B. W. (1986). Density Estimation. London: Chapman and Hall.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer.

也可以看看

bw.nrd、plot.density、hist； fft 和 convolve 用於使用的計算捷徑。

相關用法

注：本文由純淨天空篩選整理自R-devel大神的英文原創作品 Kernel Density Estimation。非經特殊聲明，原始代碼版權歸原作者所有，本譯文未經允許或授權，請勿轉載或複製。