R dplyr join_by: Join specifications

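join_by() constructs a specification that describes how to join two tables using a small domain-specific language. The result can be supplied as the by argument to any of the join functions (such as left_join()).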

Usage

join_by(...)

Arguments

...

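Expressions that specify the join.

Each expression should consist of one of the following:

Equality condition: ==
Inequality conditions: >=, >, <=, or <
Rolling helper: closest()
Overlap helpers: between(), within(), or overlaps()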

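Other expressions are not supported. If you need to perform a join on a computed variable, e.g. join_by(sales_date - 40 >= promo_date), you'll need to precompute it and store it in a separate column.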
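The precompute step described above can be sketched as follows. This is a minimal illustration, assuming dplyr 1.1.0 or later; the tables and the column name sale_date_shifted are made up for the example.

```r
library(dplyr)

# Hypothetical example data
sales <- tibble(
  id = c(1L, 2L),
  sale_date = as.Date(c("2019-03-01", "2019-01-10"))
)
promos <- tibble(
  id = c(1L, 2L),
  promo_date = as.Date(c("2019-01-05", "2019-01-05"))
)

# An expression such as `sale_date - 40 >= promo_date` cannot go inside
# join_by(), so compute the shifted date first and store it in its own column.
sales <- mutate(sales, sale_date_shifted = sale_date - 40)

# The join condition now refers only to plain column names.
by <- join_by(id, sale_date_shifted >= promo_date)
matched <- inner_join(sales, promos, by)
matched
```

Only the sale for id 1 is kept here, because its shifted date (2019-01-20) is on or after the promo date.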

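Column names should be specified as quoted or unquoted names. By default, the name on the left-hand side of a join condition refers to the left-hand table, and the name on the right-hand side refers to the right-hand table, unless overridden by explicitly prefixing the column name with either x$ or y$.

If a single column name is provided without any join condition, it is interpreted as if that column name was duplicated on each side of ==, i.e. x is interpreted as x == x.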

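Join types

The following types of joins are supported by dplyr:

Equality joins
Inequality joins
Rolling joins
Overlap joins
Cross joins

Equality, inequality, rolling, and overlap joins are discussed in more detail below. Cross joins are implemented through cross_join().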

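Equality joins

Equality joins require keys to be equal between one or more pairs of columns, and are the most common type of join. To construct an equality join using join_by(), supply two column names to join with, separated by ==. Alternatively, supplying a single name will be interpreted as an equality join between two columns of the same name. For example, join_by(x) is equivalent to join_by(x == x).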
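As a small sketch of the two spellings described above (assuming dplyr 1.1.0 or later; the tables and column names are invented for illustration):

```r
library(dplyr)

# Hypothetical example data
left <- tibble(id = 1:3, label = c("a", "b", "c"))
right <- tibble(id = c(1L, 3L), score = c(10, 30))
customers <- tibble(customer_id = c(1L, 3L), city = c("Oslo", "Lima"))

# A single name is shorthand for an equality between same-named columns.
short_form <- inner_join(left, right, join_by(id))
long_form <- inner_join(left, right, join_by(id == id))

# Differently named keys are spelled out with `==`:
# left-hand name from `left`, right-hand name from `customers`.
named <- left_join(left, customers, join_by(id == customer_id))
named
```

In the last join, the row with id 2 has no match, so its city is NA.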

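Inequality joins

Inequality joins match on an inequality, such as >, >=, <, or <=, and are common in time series analysis and genomics. To construct an inequality join using join_by(), supply two column names separated by one of the above mentioned inequalities.

Note that inequality joins will match a single row in x to a potentially large number of rows in y. Be extra careful when constructing inequality join specifications!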
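To make the warning about row counts concrete, here is a tiny sketch (assuming dplyr 1.1.0 or later; x_val and y_val are made-up column names). Two three-row tables already produce six matches:

```r
library(dplyr)

x <- tibble(x_val = 1:3)
y <- tibble(y_val = 1:3)

# Each x_val matches every y_val that is less than or equal to it:
# 1 matches 1 row, 2 matches 2 rows, 3 matches 3 rows.
matches <- inner_join(x, y, join_by(x_val >= y_val))
nrow(matches)
#> [1] 6
```

With n rows on each side, a condition like this can return on the order of n^2 rows, so it is worth checking the expected size before running it on large tables.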

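Rolling joins

Rolling joins are a variant of inequality joins that limit the results returned from an inequality join condition. They are useful for "rolling" the closest match forward or backward when there isn't an exact match. To construct a rolling join, wrap an inequality with closest().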

closest(expr)

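expr must be an inequality involving one of: >, >=, <, or <=.

For example, closest(x >= y) is interpreted as: for each value in x, find the closest value in y that is less than or equal to that x value.

closest() will always use the left-hand table (x) as the primary table, and the right-hand table (y) as the one to find the closest match in, regardless of how the inequality is specified. For example, closest(y$a >= x$b) will always be interpreted as closest(x$b <= y$a).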
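The two spellings mentioned above can be compared directly. This is a minimal sketch, assuming dplyr 1.1.0 or later; the columns a and b simply mirror the names used in the text.

```r
library(dplyr)

x <- tibble(b = c(2, 5))
y <- tibble(a = c(1, 3, 6))

# Both specifications ask: for each x$b, find the closest y$a that is
# greater than or equal to it.
flipped <- left_join(x, y, join_by(closest(y$a >= x$b)))
direct <- left_join(x, y, join_by(closest(x$b <= y$a)))

direct
```

Here b = 2 is matched to a = 3 and b = 5 to a = 6, whichever way the inequality is written.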

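Overlap joins

Overlap joins are a special case of inequality joins involving one or two columns from the left-hand table overlapping a range defined by two columns from the right-hand table. There are three helpers that join_by() recognizes to assist with constructing overlap joins, all of which can be constructed from simpler inequalities.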

between(x, y_lower, y_upper, ..., bounds = "[]")

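For each value in x, this finds everywhere that value falls between [y_lower, y_upper]. Equivalent to x >= y_lower, x <= y_upper by default.

bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. This changes whether >= or > and <= or < are used to build the inequalities shown above.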
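The effect of bounds can be seen in a short sketch (assuming dplyr 1.1.0 or later; value, lower, and upper are made-up column names):

```r
library(dplyr)

x <- tibble(value = c(1, 5, 10))
y <- tibble(lower = 1, upper = 10)

# Default bounds = "[]": both endpoints are included.
closed <- inner_join(x, y, join_by(between(value, lower, upper)))

# The same condition written as two plain inequalities.
manual <- inner_join(x, y, join_by(value >= lower, value <= upper))

# bounds = "[)": the upper endpoint is excluded, so value 10 drops out.
right_open <- inner_join(
  x, y,
  join_by(between(value, lower, upper, bounds = "[)"))
)

nrow(closed)
nrow(right_open)
```

The closed version keeps all three values, while the right-open version keeps only 1 and 5.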

The dots are for future extensions and must be empty.

within(x_lower, x_upper, y_lower, y_upper)

For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.
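The stated equivalence can be checked on a toy example. This is only a sketch, assuming dplyr 1.1.0 or later, with invented data that reuses the argument names as column names.

```r
library(dplyr)

x <- tibble(x_lower = c(2, 0), x_upper = c(4, 4))
y <- tibble(y_lower = 1, y_upper = 5)

# Helper form
helper <- inner_join(
  x, y,
  join_by(within(x_lower, x_upper, y_lower, y_upper))
)

# The same condition spelled out as two inequalities
manual <- inner_join(
  x, y,
  join_by(x_lower >= y_lower, x_upper <= y_upper)
)

helper
```

Only the range [2, 4] lies completely inside [1, 5]; the range [0, 4] starts too early and is dropped by both forms.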

The inequalities used to build within() are the same regardless of the inclusiveness of the supplied ranges.

overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")

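For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.

bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. "[]" uses <= and >=, but the 3 other options use < and > and generate the exact same inequalities.

The dots are for future extensions and must be empty.

These conditions assume that the ranges are well-formed and non-empty, i.e. x_lower <= x_upper when bounds are treated as "[]", and x_lower < x_upper otherwise.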
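Since the helpers assume well-formed ranges, it can be useful to check that assumption before joining. The following is one possible pre-check, not something join_by() does for you; the table and its columns are hypothetical.

```r
library(dplyr)

ranges <- tibble(
  id = 1:3,
  lower = c(1, 5, 9),
  upper = c(3, 4, 9)
)

# Well-formed when bounds are treated as "[]": lower <= upper.
# Row 2 (5 > 4) fails this check.
ok_closed <- filter(ranges, lower <= upper)

# Well-formed for "[)", "(]", or "()": lower < upper.
# Row 3 (9 == 9) is an empty range under these bounds and also fails.
ok_open <- filter(ranges, lower < upper)

ok_closed$id
ok_open$id
```

Rows that fail the relevant check could be dropped or corrected before they are passed to between(), within(), or overlaps().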

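Column referencing

When specifying join conditions, join_by() assumes that column names on the left-hand side of the condition refer to the left-hand table (x), and names on the right-hand side of the condition refer to the right-hand table (y). Occasionally, it is clearer to be able to specify a right-hand table name on the left-hand side of the condition, and vice versa. To support this, column names can be prefixed by x$ or y$ to explicitly specify which table they come from.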
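A brief sketch of explicit prefixes, assuming dplyr 1.1.0 or later (value and threshold are made-up column names). The second specification places the y column on the left-hand side of the condition:

```r
library(dplyr)

x <- tibble(value = c(1, 4))
y <- tibble(threshold = c(2, 3))

# Default referencing: left-hand name from x, right-hand name from y.
default <- inner_join(x, y, join_by(value >= threshold))

# Explicit referencing: the y column is written first, so the
# inequality is reversed to express the same condition.
explicit <- inner_join(x, y, join_by(y$threshold <= x$value))

explicit
```

Both joins match value 4 to thresholds 2 and 3, and value 1 to nothing.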

Examples

library(dplyr)

sales <- tibble(
  id = c(1L, 1L, 1L, 2L, 2L),
  sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01"))
)
sales
#> # A tibble: 5 × 2
#>      id sale_date 
#>   <int> <date>    
#> 1     1 2018-12-31
#> 2     1 2019-01-02
#> 3     1 2019-01-05
#> 4     2 2019-01-04
#> 5     2 2019-01-01

promos <- tibble(
  id = c(1L, 1L, 2L),
  promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02"))
)
promos
#> # A tibble: 3 × 2
#>      id promo_date
#>   <int> <date>    
#> 1     1 2019-01-01
#> 2     1 2019-01-05
#> 3     2 2019-01-02

# Match `id` to `id`, and `sale_date` to `promo_date`
by <- join_by(id, sale_date == promo_date)
left_join(sales, promos, by)
#> # A tibble: 5 × 2
#>      id sale_date 
#>   <int> <date>    
#> 1     1 2018-12-31
#> 2     1 2019-01-02
#> 3     1 2019-01-05
#> 4     2 2019-01-04
#> 5     2 2019-01-01

# For each `sale_date` within a particular `id`,
# find all `promo_date`s that occurred before that particular sale
by <- join_by(id, sale_date >= promo_date)
left_join(sales, promos, by)
#> # A tibble: 6 × 3
#>      id sale_date  promo_date
#>   <int> <date>     <date>    
#> 1     1 2018-12-31 NA        
#> 2     1 2019-01-02 2019-01-01
#> 3     1 2019-01-05 2019-01-01
#> 4     1 2019-01-05 2019-01-05
#> 5     2 2019-01-04 2019-01-02
#> 6     2 2019-01-01 NA        

# For each `sale_date` within a particular `id`,
# find only the closest `promo_date` that occurred before that sale
by <- join_by(id, closest(sale_date >= promo_date))
left_join(sales, promos, by)
#> # A tibble: 5 × 3
#>      id sale_date  promo_date
#>   <int> <date>     <date>    
#> 1     1 2018-12-31 NA        
#> 2     1 2019-01-02 2019-01-01
#> 3     1 2019-01-05 2019-01-05
#> 4     2 2019-01-04 2019-01-02
#> 5     2 2019-01-01 NA        

# If you want to disallow exact matching in rolling joins, use `>` rather
# than `>=`. Note that the promo on `2019-01-05` is no longer considered the
# closest match for the sale on the same date.
by <- join_by(id, closest(sale_date > promo_date))
left_join(sales, promos, by)
#> # A tibble: 5 × 3
#>      id sale_date  promo_date
#>   <int> <date>     <date>    
#> 1     1 2018-12-31 NA        
#> 2     1 2019-01-02 2019-01-01
#> 3     1 2019-01-05 2019-01-01
#> 4     2 2019-01-04 2019-01-02
#> 5     2 2019-01-01 NA        

# Same as before, but also require that the promo had to occur at most 1
# day before the sale was made. We'll use a full join to see that id 2's
# promo on `2019-01-02` is no longer matched to the sale on `2019-01-04`.
sales <- mutate(sales, sale_date_lower = sale_date - 1)
by <- join_by(id, closest(sale_date >= promo_date), sale_date_lower <= promo_date)
full_join(sales, promos, by)
#> # A tibble: 6 × 4
#>      id sale_date  sale_date_lower promo_date
#>   <int> <date>     <date>          <date>    
#> 1     1 2018-12-31 2018-12-30      NA        
#> 2     1 2019-01-02 2019-01-01      2019-01-01
#> 3     1 2019-01-05 2019-01-04      2019-01-05
#> 4     2 2019-01-04 2019-01-03      NA        
#> 5     2 2019-01-01 2018-12-31      NA        
#> 6     2 NA         NA              2019-01-02

# ---------------------------------------------------------------------------

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)
segments
#> # A tibble: 4 × 4
#>   segment_id chromosome start   end
#>        <int> <chr>      <dbl> <dbl>
#> 1          1 chr1         140   150
#> 2          2 chr2         210   240
#> 3          3 chr2         380   415
#> 4          4 chr1         230   280

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)
reference
#> # A tibble: 4 × 4
#>   reference_id chromosome start   end
#>          <int> <chr>      <dbl> <dbl>
#> 1            1 chr1         100   150
#> 2            2 chr1         200   250
#> 3            3 chr2         300   399
#> 4            4 chr2         415   450

# Find every time a segment `start` falls between the reference
# `[start, end]` range.
by <- join_by(chromosome, between(start, start, end))
full_join(segments, reference, by)
#> # A tibble: 5 × 7
#>   segment_id chromosome start.x end.x reference_id start.y end.y
#>        <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
#> 1          1 chr1           140   150            1     100   150
#> 2          2 chr2           210   240           NA      NA    NA
#> 3          3 chr2           380   415            3     300   399
#> 4          4 chr1           230   280            2     200   250
#> 5         NA chr2            NA    NA            4     415   450

# If you wanted the reference columns first, supply `reference` as `x`
# and `segments` as `y`, then explicitly refer to their columns using `x$`
# and `y$`.
by <- join_by(chromosome, between(y$start, x$start, x$end))
full_join(reference, segments, by)
#> # A tibble: 5 × 7
#>   reference_id chromosome start.x end.x segment_id start.y end.y
#>          <int> <chr>        <dbl> <dbl>      <int>   <dbl> <dbl>
#> 1            1 chr1           100   150          1     140   150
#> 2            2 chr1           200   250          4     230   280
#> 3            3 chr2           300   399          3     380   415
#> 4            4 chr2           415   450         NA      NA    NA
#> 5           NA chr2            NA    NA          2     210   240

# Find every time a segment falls completely within a reference.
# Sometimes using `x$` and `y$` makes your intentions clearer, even if they
# match the default behavior.
by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)
#> # A tibble: 1 × 7
#>   segment_id chromosome start.x end.x reference_id start.y end.y
#>        <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
#> 1          1 chr1           140   150            1     100   150

# Find every time a segment overlaps a reference in any way.
by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)
#> # A tibble: 5 × 7
#>   segment_id chromosome start.x end.x reference_id start.y end.y
#>        <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
#> 1          1 chr1           140   150            1     100   150
#> 2          2 chr2           210   240           NA      NA    NA
#> 3          3 chr2           380   415            3     300   399
#> 4          3 chr2           380   415            4     415   450
#> 5          4 chr1           230   280            2     200   250

# It is common to have right-open ranges with bounds like `[)`, which would
# mean an end value of `415` would no longer overlap a start value of `415`.
# Setting `bounds` allows you to compute overlaps with those kinds of ranges.
by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "[)"))
full_join(segments, reference, by)
#> # A tibble: 5 × 7
#>   segment_id chromosome start.x end.x reference_id start.y end.y
#>        <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
#> 1          1 chr1           140   150            1     100   150
#> 2          2 chr2           210   240           NA      NA    NA
#> 3          3 chr2           380   415            3     300   399
#> 4          4 chr1           230   280            2     200   250
#> 5         NA chr2            NA    NA            4     415   450

Source code: R/join-by.R

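Note: This article was selected and compiled by 纯净天空 from the original English work "Join specifications" by Hadley Wickham and others. Unless otherwise stated, copyright of the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.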