R boot 引導重采樣 - 純淨天空

R語言 boot 位於 boot 包(package)。

說明

生成應用於數據的統計數據的 R 引導複製。參數和非參數重采樣都是可能的。對於非參數引導，可能的重采樣方法有普通引導、平衡引導、對偶重采樣和排列。對於非參數多樣本問題，使用分層重采樣：這是通過在 boot 調用中包含層向量來指定的。可以指定重要性重采樣權重。

用法

boot(data, statistic, R, sim = "ordinary", stype = c("i", "f", "w"), 
     strata = rep(1,n), L = NULL, m = 0, weights = NULL, 
     ran.gen = function(d, p) d, mle = NULL, simple = FALSE, ...,
     parallel = c("no", "multicore", "snow"),
     ncpus = getOption("boot.ncpus", 1L), cl = NULL)

參數

`data`	數據為向量、矩陣或 DataFrame 。如果它是矩陣或 DataFrame ，則每一行都被視為一個多變量觀察。
`statistic`	應用於數據時返回包含感興趣統計數據的向量的函數。當 `sim = "parametric"` 時，`statistic` 的第一個參數必須是數據。對於每個複製，將傳遞 `ran.gen` 返回的模擬數據集。在所有其他情況下，`statistic` 必須至少采用兩個參數。傳遞的第一個參數始終是原始數據。第二個是定義引導樣本的索引、頻率或權重向量。此外，如果需要預測，則需要第三個參數，它是用於生成引導預測的隨機索引的向量。任何其他參數都可以通過 `...` 參數傳遞給 `statistic`。
`R`	引導程序重複的數量。通常這將是一個正整數。對於重要性重采樣，一些重采樣可能使用一組權重，而其他重采樣則使用一組不同的權重。在這種情況下，`R` 將是一個整數向量，其中每個分量給出每行權重的重新采樣數。
`sim`	指示所需模擬類型的字符串。可能的值為 `"ordinary"` (默認值)、 `"parametric"` 、 `"balanced"` 、 `"permutation"` 或 `"antithetic"` 。重要性重采樣是通過包含重要性權重來指定的；仍然必須指定重要性重采樣的類型，但在這種情況下隻能是 `"ordinary"` 或 `"balanced"`。
`stype`	指示 `statistic` 的第二個參數代表什麽的字符串。 stype 的可能值為 `"i"`(索引 - 默認值)、`"f"`(頻率)或 `"w"`(權重)。不用於`sim = "parametric"`。
`strata`	指定多樣本問題的層的整數向量或因子。這可以為任何模擬指定，但在 `sim = "parametric"` 時被忽略。當為非參數引導程序提供`strata` 時，模擬將在指定的層內完成。
`L`	在觀察中評估的影響值向量。僅當 `sim` 為 `"antithetic"` 時使用。如果未提供，則通過調用 `empinf` 來計算它們。如果 `stype` 是 `"w"` ，這將使用無窮小折刀，否則使用通常的折刀。
`m`	每次引導複製時要進行的預測數量。這對於(廣義)線性模型最有用。僅當 `sim` 為 `"ordinary"` 時才能使用。 `m` 通常是一個整數，但如果有層，它可能是一個長度等於層數的向量，指定有多少預測誤差應來自每個層。實際預測應作為 `statistic` 輸出的最後部分返回，該輸出還應采用一個參數，給出用於預測的錯誤索引向量。
`weights`	重要性權重向量或矩陣。如果是向量，那麽它應該具有與 `data` 中的觀察值一樣多的元素。當需要來自多於一組權重的模擬時，`weights`應該是一個矩陣，其中矩陣的每一行都是一組重要性權重。如果 `weights` 是矩陣，則 `R` 必須是長度為 `nrow(weights)` 的向量。如果 `sim` 不是 `"ordinary"` 或 `"balanced"` ，則忽略此參數。
`ran.gen`	此函數僅在 `sim = "parametric"` 說明如何生成隨機值時使用。它應該是兩個參數的函數。第一個參數應該是觀察到的數據，第二個參數由所需的任何其他信息(例如參數估計)組成。第二個參數可以是一個列表，允許將任意數量的項目傳遞給 `ran.gen` 。返回的值應該是與觀察到的數據具有相同形式的模擬數據集，該數據集將被傳遞到`statistic`以獲取引導複製。返回的值必須與原始數據集具有相同的形狀和類型，這一點很重要。如果未指定 `ran.gen`，則默認為返回原始 `data` 的函數，在這種情況下，所有模擬都應包含在 `statistic` 中。將 `sim = "parametric"` 與合適的 `ran.gen` 一起使用允許用戶實現不直接支持的任何類型的非參數重采樣。
`mle`	要傳遞給 `ran.gen` 的第二個參數。通常，這些將是參數的最大似然估計。為了提高效率，`mle` 通常是一個包含 `ran.gen` 所需的所有對象的列表，可以僅使用原始數據集進行計算。
`simple`	邏輯上，僅允許為 `sim = "ordinary", stype = "i", n = 0` 的 `TRUE`(否則將被忽略並出現警告)。默認情況下，創建 `n` by `R` 索引數組：這可能很大，如果是 `simple = TRUE`，可以通過對每個複製單獨采樣來避免這種情況，這種方式速度較慢，但使用的內存較少。
`...`	`statistic` 的其他命名參數在每次調用時都按原樣傳遞。 `statistic` 的任何此類參數都應遵循模擬所需的 `statistic` 參數。請注意與上麵列出的 `boot` 參數的部分匹配，並且名為 `X` 和 `FUN` 的參數會導致 `boot` 的某些版本(但不是這個)發生衝突。
`parallel`	要使用的並行操作的類型(如果有)。如果缺少，則默認值取自選項 `"boot.parallel"` (如果未設置，則使用 `"no"` )。
`ncpus`	整數：並行操作中要使用的進程數：通常會選擇可用 CPU 的數量。
`cl`	如果 `parallel = "snow"` 則使用可選的 `parallel` 或 `snow` 集群。如果未提供，則在 `boot` 調用期間在本地計算機上創建一個集群。

細節

要引導的統計數據可以根據需要簡單或複雜，隻要其參數對應於數據集和(對於非參數引導)索引、頻率或權重向量。 statistic 被 boot 函數視為黑匣子，不會進行檢查以確保滿足這些條件。

Davison、Hinkley 和 Schechtman (1986) 說明了一階平衡引導。 Hall (1989) 說明了對立引導程序，並且是實驗性的，特別是與分層一起使用時。其他非參數模擬類型是普通引導程序(可能具有不等概率)和返回案例隨機排列的排列。如果提供了該參數，所有這些方法都在層內獨立工作。

對於參數引導程序，用戶必須指定如何進行重采樣。實現此目的的最佳方法是指定函數 ran.gen ，該函數將從觀察到的數據集返回模擬數據集以及 mle 中指定的一組參數估計值。

值

返回的值是類 "boot" 的對象，包含以下組件：

`t0`	`statistic` 的觀測值應用於 `data` 。
`t`	具有 `sum(R)` 行的矩陣，每行都是調用 `statistic` 結果的引導複製。
`R`	`R` 的值傳遞給 `boot` 。
`data`	`data` 傳遞給 `boot` 。
`seed`	`boot`開始工作時`.Random.seed`的值。
`statistic`	函數 `statistic` 傳遞給 `boot` 。
`sim`	使用模擬類型。
`stype`	傳遞給 `boot` 的統計類型。
`call`	對 `boot` 的原始調用。
`strata`	使用的地層。如果提供了的話，這是傳遞給 `boot` 的向量；如果沒有層，則是傳遞給 1 的向量。如果 `sim` 是 `"parametric"` ，則不會返回。
`weights`	如果未指定重要性采樣權重，則傳遞給 `boot` 的重要性采樣權重或經驗分布函數權重。如果 `sim` 不是 `"ordinary"` 或 `"balanced"` 之一，則省略它。
`pred.i`	如果需要預測(`m > 0`)，這是在傳遞給統計數據時計算預測的索引矩陣。如果 `m` 是 `0` 或 `sim` 不是 `"ordinary"` 則省略。
`L`	`sim` 為 `"antithetic"` 時使用的影響力值。如果未指定此類值並且 `stype` 不是 `"w"`，則 `L` 將作為與數據按影響值排序的假設相對應的連續整數返回。當 `sim` 不是 `"antithetic"` 時，該組件被省略。
`ran.gen`	如果 `sim` 為 `"parametric"` 則使用隨機生成器函數。對於 `sim` 的任何其他值，都會忽略此組件。
`mle`	當 `sim` 為 `"parametric"` 時傳遞給 `boot` 的參數估計值。對於 `sim` 的所有其他值，它被省略。

該類有 c 、 plot 和 print 方法。

並聯運行

當使用parallel = "multicore"時(在Windows上不可用)，每個工作進程都會繼承當前會話的環境，包括工作空間和加載的命名空間以及附加的包(但不包括隨機數種子：見下文)。

當需要做更多的工作時parallel = "snow"isused：工作進程是新創建的R流程，以及statistic需要安排設置它需要的環境：通常一個好方法是使用詞法作用域，因為什麽時候statistic被發送到工作進程，其封閉環境也被發送。 (例如，參見示例jack.after.boot其中輔助函數嵌套在statistic函數。)parallel = "snow"主要用於多核 Windows 機器，其中parallel = "multicore"不可用。

對於大多數 boot 方法，重采樣是在主進程中完成的，但如果 simple = TRUE 或 sim = "parametric" 則不會。在這些情況下(或者 statistic 本身使用隨機數)，如果結果需要可重現，則需要更加小心。重采樣是通過 censboot(sim = "wierd") 和 tsboot 中的大多數方案在工作進程中完成的(sim == "fixed" 和 sim == "geom" 除外，默認為 ran.gen )。

當隨機數生成在工作進程中完成時，默認行為是每個工作進程選擇一個單獨的種子，且不可重複。但是，對於使用默認集群的parallel = "multicore" 或parallel = "snow"，如果選擇了RNGkind("L'Ecuyer-CMRG")，則使用第二種方法。在這種方法中，每個工作人員根據生成工作人員時的種子獲得 RNG 流的不同子序列，因此如果 ncpus 未更改，則結果將是可重現的，而對於 parallel = "multicore"，如果調用 parallel::mc.reset.stream()：請參閱 mclapply 的示例。

請注意，加載 parallel 命名空間可能會更改隨機種子，因此為了獲得最大的可重複性，應在調用此函數之前完成此操作。

例子


# Usual bootstrap of the ratio of means using the city data
ratio <- function(d, w) sum(d$x * w)/sum(d$u * w)
boot(city, ratio, R = 999, stype = "w")


# Stratified resampling for the difference of means.  In this
# example we will look at the difference of means between the final
# two series in the gravity data.
diff.means <- function(d, f)
{    n <- nrow(d)
     gp1 <- 1:table(as.numeric(d$series))[1]
     m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1])
     m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1])
     ss1 <- sum(d[gp1,1]^2 * f[gp1]) - (m1 *  m1 * sum(f[gp1]))
     ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) - (m2 *  m2 * sum(f[-gp1]))
     c(m1 - m2, (ss1 + ss2)/(sum(f) - 2))
}
grav1 <- gravity[as.numeric(gravity[,2]) >= 7,]
boot(grav1, diff.means, R = 999, stype = "f", strata = grav1[,2])

# In this example we show the use of boot in a prediction from
# regression based on the nuclear data.  This example is taken
# from Example 6.8 of Davison and Hinkley (1997).  Notice also
# that two extra arguments to 'statistic' are passed through boot.
nuke <- nuclear[, c(1, 2, 5, 7, 8, 10, 11)]
nuke.lm <- glm(log(cost) ~ date+log(cap)+ne+ct+log(cum.n)+pt, data = nuke)
nuke.diag <- glm.diag(nuke.lm)
nuke.res <- nuke.diag$res * nuke.diag$sd
nuke.res <- nuke.res - mean(nuke.res)

# We set up a new data frame with the data, the standardized 
# residuals and the fitted values for use in the bootstrap.
nuke.data <- data.frame(nuke, resid = nuke.res, fit = fitted(nuke.lm))

# Now we want a prediction of plant number 32 but at date 73.00
new.data <- data.frame(cost = 1, date = 73.00, cap = 886, ne = 0,
                       ct = 0, cum.n = 11, pt = 1)
new.fit <- predict(nuke.lm, new.data)

nuke.fun <- function(dat, inds, i.pred, fit.pred, x.pred)
{
     lm.b <- glm(fit+resid[inds] ~ date+log(cap)+ne+ct+log(cum.n)+pt,
                 data = dat)
     pred.b <- predict(lm.b, x.pred)
     c(coef(lm.b), pred.b - (fit.pred + dat$resid[i.pred]))
}

nuke.boot <- boot(nuke.data, nuke.fun, R = 999, m = 1, 
                  fit.pred = new.fit, x.pred = new.data)
# The bootstrap prediction squared error would then be found by
mean(nuke.boot$t[, 8]^2)
# Basic bootstrap prediction limits would be
new.fit - sort(nuke.boot$t[, 8])[c(975, 25)]


# Finally a parametric bootstrap.  For this example we shall look 
# at the air-conditioning data.  In this example our aim is to test 
# the hypothesis that the true value of the index is 1 (i.e. that 
# the data come from an exponential distribution) against the 
# alternative that the data come from a gamma distribution with
# index not equal to 1.
air.fun <- function(data) {
     ybar <- mean(data$hours)
     para <- c(log(ybar), mean(log(data$hours)))
     ll <- function(k) {
          if (k <= 0) 1e200 else lgamma(k)-k*(log(k)-1-para[1]+para[2])
     }
     khat <- nlm(ll, ybar^2/var(data$hours))$estimate
     c(ybar, khat)
}

air.rg <- function(data, mle) {
    # Function to generate random exponential variates.
    # mle will contain the mean of the original data
    out <- data
    out$hours <- rexp(nrow(out), 1/mle)
    out
}

air.boot <- boot(aircondit, air.fun, R = 999, sim = "parametric",
                 ran.gen = air.rg, mle = mean(aircondit$hours))

# The bootstrap p-value can then be approximated by
sum(abs(air.boot$t[,2]-1) > abs(air.boot$t0[2]-1))/(1+air.boot$R)

參考

There are many references explaining the bootstrap and its variations. Among them are :

Booth, J.G., Hall, P. and Wood, A.T.A. (1993) Balanced importance resampling for the bootstrap. Annals of Statistics, 21, 286-298.

Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and Their Application. Cambridge University Press.

Davison, A.C., Hinkley, D.V. and Schechtman, E. (1986) Efficient bootstrap simulation. Biometrika, 73, 555-566.

Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman & Hall.

Gleason, J.R. (1988) Algorithms for balanced bootstrap simulations. American Statistician, 42, 263-266.

Hall, P. (1989) Antithetic resampling for the bootstrap. Biometrika, 73, 713-724.

Hinkley, D.V. (1988) Bootstrap methods (with Discussion). Journal of the Royal Statistical Society, B, 50, 312-337, 355-370.

Hinkley, D.V. and Shi, S. (1989) Importance sampling and the nested bootstrap. Biometrika, 76, 435-446.

Johns M.V. (1988) Importance sampling for bootstrap confidence intervals. Journal of the American Statistical Association, 83, 709-714.

Noreen, E.W. (1989) Computer Intensive Methods for Testing Hypotheses. John Wiley & Sons.

也可以看看

boot.array、boot.ci、censboot、empinf、jack.after.boot、tilt.boot、tsboot。

相關用法

注：本文由純淨天空篩選整理自R-devel大神的英文原創作品 Bootstrap Resampling。非經特殊聲明，原始代碼版權歸原作者所有，本譯文未經允許或授權，請勿轉載或複製。