R boot 引导重采样 - 纯净天空

R语言 boot 位于 boot 包(package)。

说明

生成应用于数据的统计数据的 R 引导复制。参数和非参数重采样都是可能的。对于非参数引导，可能的重采样方法有普通引导、平衡引导、对偶重采样和排列。对于非参数多样本问题，使用分层重采样：这是通过在 boot 调用中包含层向量来指定的。可以指定重要性重采样权重。

用法

boot(data, statistic, R, sim = "ordinary", stype = c("i", "f", "w"), 
     strata = rep(1,n), L = NULL, m = 0, weights = NULL, 
     ran.gen = function(d, p) d, mle = NULL, simple = FALSE, ...,
     parallel = c("no", "multicore", "snow"),
     ncpus = getOption("boot.ncpus", 1L), cl = NULL)

参数

`data`	数据为向量、矩阵或 DataFrame 。如果它是矩阵或 DataFrame ，则每一行都被视为一个多变量观察。
`statistic`	应用于数据时返回包含感兴趣统计数据的向量的函数。当 `sim = "parametric"` 时，`statistic` 的第一个参数必须是数据。对于每个复制，将传递 `ran.gen` 返回的模拟数据集。在所有其他情况下，`statistic` 必须至少采用两个参数。传递的第一个参数始终是原始数据。第二个是定义引导样本的索引、频率或权重向量。此外，如果需要预测，则需要第三个参数，它是用于生成引导预测的随机索引的向量。任何其他参数都可以通过 `...` 参数传递给 `statistic`。
`R`	引导程序重复的数量。通常这将是一个正整数。对于重要性重采样，一些重采样可能使用一组权重，而其他重采样则使用一组不同的权重。在这种情况下，`R` 将是一个整数向量，其中每个分量给出每行权重的重新采样数。
`sim`	指示所需模拟类型的字符串。可能的值为 `"ordinary"` (默认值)、 `"parametric"` 、 `"balanced"` 、 `"permutation"` 或 `"antithetic"` 。重要性重采样是通过包含重要性权重来指定的；仍然必须指定重要性重采样的类型，但在这种情况下只能是 `"ordinary"` 或 `"balanced"`。
`stype`	指示 `statistic` 的第二个参数代表什么的字符串。 stype 的可能值为 `"i"`(索引 - 默认值)、`"f"`(频率)或 `"w"`(权重)。不用于`sim = "parametric"`。
`strata`	指定多样本问题的层的整数向量或因子。这可以为任何模拟指定，但在 `sim = "parametric"` 时被忽略。当为非参数引导程序提供`strata` 时，模拟将在指定的层内完成。
`L`	在观察中评估的影响值向量。仅当 `sim` 为 `"antithetic"` 时使用。如果未提供，则通过调用 `empinf` 来计算它们。如果 `stype` 是 `"w"` ，这将使用无穷小折刀，否则使用通常的折刀。
`m`	每次引导复制时要进行的预测数量。这对于(广义)线性模型最有用。仅当 `sim` 为 `"ordinary"` 时才能使用。 `m` 通常是一个整数，但如果有层，它可能是一个长度等于层数的向量，指定有多少预测误差应来自每个层。实际预测应作为 `statistic` 输出的最后部分返回，该输出还应采用一个参数，给出用于预测的错误索引向量。
`weights`	重要性权重向量或矩阵。如果是向量，那么它应该具有与 `data` 中的观察值一样多的元素。当需要来自多于一组权重的模拟时，`weights`应该是一个矩阵，其中矩阵的每一行都是一组重要性权重。如果 `weights` 是矩阵，则 `R` 必须是长度为 `nrow(weights)` 的向量。如果 `sim` 不是 `"ordinary"` 或 `"balanced"` ，则忽略此参数。
`ran.gen`	此函数仅在 `sim = "parametric"` 说明如何生成随机值时使用。它应该是两个参数的函数。第一个参数应该是观察到的数据，第二个参数由所需的任何其他信息(例如参数估计)组成。第二个参数可以是一个列表，允许将任意数量的项目传递给 `ran.gen` 。返回的值应该是与观察到的数据具有相同形式的模拟数据集，该数据集将被传递到`statistic`以获取引导复制。返回的值必须与原始数据集具有相同的形状和类型，这一点很重要。如果未指定 `ran.gen`，则默认为返回原始 `data` 的函数，在这种情况下，所有模拟都应包含在 `statistic` 中。将 `sim = "parametric"` 与合适的 `ran.gen` 一起使用允许用户实现不直接支持的任何类型的非参数重采样。
`mle`	要传递给 `ran.gen` 的第二个参数。通常，这些将是参数的最大似然估计。为了提高效率，`mle` 通常是一个包含 `ran.gen` 所需的所有对象的列表，可以仅使用原始数据集进行计算。
`simple`	逻辑上，仅允许为 `sim = "ordinary", stype = "i", n = 0` 的 `TRUE`(否则将被忽略并出现警告)。默认情况下，创建 `n` by `R` 索引数组：这可能很大，如果是 `simple = TRUE`，可以通过对每个复制单独采样来避免这种情况，这种方式速度较慢，但使用的内存较少。
`...`	`statistic` 的其他命名参数在每次调用时都按原样传递。 `statistic` 的任何此类参数都应遵循模拟所需的 `statistic` 参数。请注意与上面列出的 `boot` 参数的部分匹配，并且名为 `X` 和 `FUN` 的参数会导致 `boot` 的某些版本(但不是这个)发生冲突。
`parallel`	要使用的并行操作的类型(如果有)。如果缺少，则默认值取自选项 `"boot.parallel"` (如果未设置，则使用 `"no"` )。
`ncpus`	整数：并行操作中要使用的进程数：通常会选择可用 CPU 的数量。
`cl`	如果 `parallel = "snow"` 则使用可选的 `parallel` 或 `snow` 集群。如果未提供，则在 `boot` 调用期间在本地计算机上创建一个集群。

细节

要引导的统计数据可以根据需要简单或复杂，只要其参数对应于数据集和(对于非参数引导)索引、频率或权重向量。 statistic 被 boot 函数视为黑匣子，不会进行检查以确保满足这些条件。

Davison、Hinkley 和 Schechtman (1986) 说明了一阶平衡引导。 Hall (1989) 说明了对立引导程序，并且是实验性的，特别是与分层一起使用时。其他非参数模拟类型是普通引导程序(可能具有不等概率)和返回案例随机排列的排列。如果提供了该参数，所有这些方法都在层内独立工作。

对于参数引导程序，用户必须指定如何进行重采样。实现此目的的最佳方法是指定函数 ran.gen ，该函数将从观察到的数据集返回模拟数据集以及 mle 中指定的一组参数估计值。

值

返回的值是类 "boot" 的对象，包含以下组件：

`t0`	`statistic` 的观测值应用于 `data` 。
`t`	具有 `sum(R)` 行的矩阵，每行都是调用 `statistic` 结果的引导复制。
`R`	`R` 的值传递给 `boot` 。
`data`	`data` 传递给 `boot` 。
`seed`	`boot`开始工作时`.Random.seed`的值。
`statistic`	函数 `statistic` 传递给 `boot` 。
`sim`	使用模拟类型。
`stype`	传递给 `boot` 的统计类型。
`call`	对 `boot` 的原始调用。
`strata`	使用的地层。如果提供了的话，这是传递给 `boot` 的向量；如果没有层，则是传递给 1 的向量。如果 `sim` 是 `"parametric"` ，则不会返回。
`weights`	如果未指定重要性采样权重，则传递给 `boot` 的重要性采样权重或经验分布函数权重。如果 `sim` 不是 `"ordinary"` 或 `"balanced"` 之一，则省略它。
`pred.i`	如果需要预测(`m > 0`)，这是在传递给统计数据时计算预测的索引矩阵。如果 `m` 是 `0` 或 `sim` 不是 `"ordinary"` 则省略。
`L`	`sim` 为 `"antithetic"` 时使用的影响力值。如果未指定此类值并且 `stype` 不是 `"w"`，则 `L` 将作为与数据按影响值排序的假设相对应的连续整数返回。当 `sim` 不是 `"antithetic"` 时，该组件被省略。
`ran.gen`	如果 `sim` 为 `"parametric"` 则使用随机生成器函数。对于 `sim` 的任何其他值，都会忽略此组件。
`mle`	当 `sim` 为 `"parametric"` 时传递给 `boot` 的参数估计值。对于 `sim` 的所有其他值，它被省略。

该类有 c 、 plot 和 print 方法。

并联运行

当使用parallel = "multicore"时(在Windows上不可用)，每个工作进程都会继承当前会话的环境，包括工作空间和加载的命名空间以及附加的包(但不包括随机数种子：见下文)。

当需要做更多的工作时parallel = "snow"isused：工作进程是新创建的R流程，以及statistic需要安排设置它需要的环境：通常一个好方法是使用词法作用域，因为什么时候statistic被发送到工作进程，其封闭环境也被发送。 (例如，参见示例jack.after.boot其中辅助函数嵌套在statistic函数。)parallel = "snow"主要用于多核 Windows 机器，其中parallel = "multicore"不可用。

对于大多数 boot 方法，重采样是在主进程中完成的，但如果 simple = TRUE 或 sim = "parametric" 则不会。在这些情况下(或者 statistic 本身使用随机数)，如果结果需要可重现，则需要更加小心。重采样是通过 censboot(sim = "wierd") 和 tsboot 中的大多数方案在工作进程中完成的(sim == "fixed" 和 sim == "geom" 除外，默认为 ran.gen )。

当随机数生成在工作进程中完成时，默认行为是每个工作进程选择一个单独的种子，且不可重复。但是，对于使用默认集群的parallel = "multicore" 或parallel = "snow"，如果选择了RNGkind("L'Ecuyer-CMRG")，则使用第二种方法。在这种方法中，每个工作人员根据生成工作人员时的种子获得 RNG 流的不同子序列，因此如果 ncpus 未更改，则结果将是可重现的，而对于 parallel = "multicore"，如果调用 parallel::mc.reset.stream()：请参阅 mclapply 的示例。

请注意，加载 parallel 命名空间可能会更改随机种子，因此为了获得最大的可重复性，应在调用此函数之前完成此操作。

例子


# Usual bootstrap of the ratio of means using the city data
ratio <- function(d, w) sum(d$x * w)/sum(d$u * w)
boot(city, ratio, R = 999, stype = "w")


# Stratified resampling for the difference of means.  In this
# example we will look at the difference of means between the final
# two series in the gravity data.
diff.means <- function(d, f)
{    n <- nrow(d)
     gp1 <- 1:table(as.numeric(d$series))[1]
     m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1])
     m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1])
     ss1 <- sum(d[gp1,1]^2 * f[gp1]) - (m1 *  m1 * sum(f[gp1]))
     ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) - (m2 *  m2 * sum(f[-gp1]))
     c(m1 - m2, (ss1 + ss2)/(sum(f) - 2))
}
grav1 <- gravity[as.numeric(gravity[,2]) >= 7,]
boot(grav1, diff.means, R = 999, stype = "f", strata = grav1[,2])

# In this example we show the use of boot in a prediction from
# regression based on the nuclear data.  This example is taken
# from Example 6.8 of Davison and Hinkley (1997).  Notice also
# that two extra arguments to 'statistic' are passed through boot.
nuke <- nuclear[, c(1, 2, 5, 7, 8, 10, 11)]
nuke.lm <- glm(log(cost) ~ date+log(cap)+ne+ct+log(cum.n)+pt, data = nuke)
nuke.diag <- glm.diag(nuke.lm)
nuke.res <- nuke.diag$res * nuke.diag$sd
nuke.res <- nuke.res - mean(nuke.res)

# We set up a new data frame with the data, the standardized 
# residuals and the fitted values for use in the bootstrap.
nuke.data <- data.frame(nuke, resid = nuke.res, fit = fitted(nuke.lm))

# Now we want a prediction of plant number 32 but at date 73.00
new.data <- data.frame(cost = 1, date = 73.00, cap = 886, ne = 0,
                       ct = 0, cum.n = 11, pt = 1)
new.fit <- predict(nuke.lm, new.data)

nuke.fun <- function(dat, inds, i.pred, fit.pred, x.pred)
{
     lm.b <- glm(fit+resid[inds] ~ date+log(cap)+ne+ct+log(cum.n)+pt,
                 data = dat)
     pred.b <- predict(lm.b, x.pred)
     c(coef(lm.b), pred.b - (fit.pred + dat$resid[i.pred]))
}

nuke.boot <- boot(nuke.data, nuke.fun, R = 999, m = 1, 
                  fit.pred = new.fit, x.pred = new.data)
# The bootstrap prediction squared error would then be found by
mean(nuke.boot$t[, 8]^2)
# Basic bootstrap prediction limits would be
new.fit - sort(nuke.boot$t[, 8])[c(975, 25)]


# Finally a parametric bootstrap.  For this example we shall look 
# at the air-conditioning data.  In this example our aim is to test 
# the hypothesis that the true value of the index is 1 (i.e. that 
# the data come from an exponential distribution) against the 
# alternative that the data come from a gamma distribution with
# index not equal to 1.
air.fun <- function(data) {
     ybar <- mean(data$hours)
     para <- c(log(ybar), mean(log(data$hours)))
     ll <- function(k) {
          if (k <= 0) 1e200 else lgamma(k)-k*(log(k)-1-para[1]+para[2])
     }
     khat <- nlm(ll, ybar^2/var(data$hours))$estimate
     c(ybar, khat)
}

air.rg <- function(data, mle) {
    # Function to generate random exponential variates.
    # mle will contain the mean of the original data
    out <- data
    out$hours <- rexp(nrow(out), 1/mle)
    out
}

air.boot <- boot(aircondit, air.fun, R = 999, sim = "parametric",
                 ran.gen = air.rg, mle = mean(aircondit$hours))

# The bootstrap p-value can then be approximated by
sum(abs(air.boot$t[,2]-1) > abs(air.boot$t0[2]-1))/(1+air.boot$R)

参考

There are many references explaining the bootstrap and its variations. Among them are :

Booth, J.G., Hall, P. and Wood, A.T.A. (1993) Balanced importance resampling for the bootstrap. Annals of Statistics, 21, 286-298.

Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and Their Application. Cambridge University Press.

Davison, A.C., Hinkley, D.V. and Schechtman, E. (1986) Efficient bootstrap simulation. Biometrika, 73, 555-566.

Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman & Hall.

Gleason, J.R. (1988) Algorithms for balanced bootstrap simulations. American Statistician, 42, 263-266.

Hall, P. (1989) Antithetic resampling for the bootstrap. Biometrika, 73, 713-724.

Hinkley, D.V. (1988) Bootstrap methods (with Discussion). Journal of the Royal Statistical Society, B, 50, 312-337, 355-370.

Hinkley, D.V. and Shi, S. (1989) Importance sampling and the nested bootstrap. Biometrika, 76, 435-446.

Johns M.V. (1988) Importance sampling for bootstrap confidence intervals. Journal of the American Statistical Association, 83, 709-714.

Noreen, E.W. (1989) Computer Intensive Methods for Testing Hypotheses. John Wiley & Sons.

也可以看看

boot.array、boot.ci、censboot、empinf、jack.after.boot、tilt.boot、tsboot。

相关用法

注：本文由纯净天空筛选整理自R-devel大神的英文原创作品 Bootstrap Resampling。非经特殊声明，原始代码版权归原作者所有，本译文未经允许或授权，请勿转载或复制。