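R dplyr join_by Join specifications

join_by() constructs a specification that describes how to join two tables using a small domain-specific language. The result can be supplied as the by argument to any of the join functions (such as left_join()).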

Usage

join_by(...)

Arguments

...

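Expressions specifying the join.

Each expression should consist of one of the following:

Equality condition: ==
Inequality conditions: >=, >, <=, or <
Rolling helper: closest()
Overlap helpers: between(), within(), or overlaps()

Other expressions are not supported. If you need to perform a join on a computed variable, e.g. join_by(sales_date - 40 >= promo_date), you'll need to precompute it and store it in a separate column.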
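
The precompute step described above can be sketched as follows. This is a minimal illustration assuming dplyr 1.1.0 or later; the tables and the sales_date_shifted column name are made up for the example and do not come from the dplyr documentation.

```r
library(dplyr)

# Hypothetical data for illustration only
sales <- tibble(
  id = c(1L, 2L),
  sales_date = as.Date(c("2019-03-01", "2019-03-20"))
)
promos <- tibble(
  promo_id = c(10L, 11L),
  promo_date = as.Date(c("2019-01-15", "2019-02-15"))
)

# join_by(sales_date - 40 >= promo_date) is not supported, so compute the
# shifted date first and store it in its own column.
sales <- mutate(sales, sales_date_shifted = sales_date - 40)

# The join condition now only refers to plain column names.
precomputed <- inner_join(
  sales, promos,
  join_by(sales_date_shifted >= promo_date)
)
precomputed
```

The helper column stays in the result, so it can be dropped with select() afterwards if it is not needed.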

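Column names should be specified as quoted or unquoted names. By default, the name on the left-hand side of a join condition refers to the left-hand table and the name on the right-hand side refers to the right-hand table, unless overridden by explicitly prefixing the column name with either x$ or y$.

If a single column name is provided without any join condition, it is interpreted as if that column name were duplicated on each side of ==, i.e. x is interpreted as x == x.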

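Join types

The following types of joins are supported by dplyr:

Equality joins
Inequality joins
Rolling joins
Overlap joins
Cross joins

Equality, inequality, rolling, and overlap joins are discussed in more detail below. Cross joins are implemented through cross_join().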

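Equality joins

Equality joins require keys to be equal between one or more pairs of columns, and are the most common type of join. To construct an equality join using join_by(), supply two column names to join with, separated by ==. Alternatively, supplying a single name will be interpreted as an equality join between two columns of the same name. For example, join_by(x) is equivalent to join_by(x == x).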
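
As a small sketch of the two spellings, assuming dplyr 1.1.0 or later, the following compares the shorthand with the explicit form and then joins two differently named key columns. The tables and column names are invented for illustration.

```r
library(dplyr)

# Hypothetical tables for illustration only
x <- tibble(key = c(1L, 2L, 3L), x_val = c("a", "b", "c"))
y <- tibble(key = c(1L, 3L), y_val = c("A", "C"))

# The shorthand `key` is interpreted as `key == key`.
by_short <- inner_join(x, y, join_by(key))
by_long <- inner_join(x, y, join_by(key == key))
all.equal(by_short, by_long)

# When the key columns have different names, spell out both sides.
y_renamed <- rename(y, code = key)
by_code <- inner_join(x, y_renamed, join_by(key == code))
by_code
```

In the last join the result keeps the key column from the left-hand table, so it has the same columns as the first two results.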

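Inequality joins

Inequality joins match on an inequality, such as >, >=, <, or <=, and are common in time series analysis and genomics. To construct an inequality join using join_by(), supply two column names separated by one of the above mentioned inequalities.

Note that inequality joins will match a single row in x to a potentially large number of rows in y. Be extra careful when constructing inequality join specifications!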
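
To make the warning above concrete, here is a minimal sketch (assuming dplyr 1.1.0 or later, with made-up column names) in which two 100-row tables produce a result with thousands of rows.

```r
library(dplyr)

# Hypothetical tables for illustration only
x <- tibble(x_time = 1:100)
y <- tibble(y_time = 1:100)

# Each x row matches every y row at or before it, so row k of x
# contributes k rows: 1 + 2 + ... + 100 = 5050 rows in total.
matched <- inner_join(x, y, join_by(x_time >= y_time))
nrow(matched)
#> [1] 5050
```

Adding an equality condition on a grouping key, or wrapping the inequality in closest(), are two ways to keep the result size under control.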

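Rolling joins

Rolling joins are a variant of inequality joins that limit the results returned from an inequality join condition. They are useful for "rolling" the closest match forward or backward when there isn't an exact match. To construct a rolling join, wrap an inequality with closest().

closest(expr)

expr must be an inequality involving one of: >, >=, <, or <=.

For example, closest(x >= y) is interpreted as: for each value in x, find the closest value in y that is less than or equal to that x value.

closest() will always use the left-hand table (x) as the primary table, and the right-hand table (y) as the one to find the closest match in, regardless of how the inequality is specified. For example, closest(y$a >= x$b) will always be interpreted as closest(x$b <= y$a).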
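
The following sketch, assuming dplyr 1.1.0 or later and using invented columns a and b, checks that the two ways of writing the condition select the same closest match for each row of x.

```r
library(dplyr)

# Hypothetical tables for illustration only
x <- tibble(b = c(2, 5, 9))
y <- tibble(a = c(1, 4, 6, 10))

# For each x$b, find the closest y$a that is greater than or equal to it.
forward <- left_join(x, y, join_by(closest(x$b <= y$a)))

# The same condition written with the right-hand table on the left.
flipped <- left_join(x, y, join_by(closest(y$a >= x$b)))

forward
all.equal(forward, flipped)
```

In both cases x remains the primary table: every row of x appears once, paired with its nearest qualifying value from y.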

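Overlap joins

Overlap joins are a special case of inequality joins involving one or two columns from the left-hand table overlapping a range defined by two columns from the right-hand table. There are three helpers that join_by() recognizes to assist with constructing overlap joins, all of which can be constructed from simpler inequalities.

between(x, y_lower, y_upper, ..., bounds = "[]")

For each value in x, this finds everywhere that value falls between [y_lower, y_upper]. Equivalent to x >= y_lower, x <= y_upper by default.

bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. This changes whether >= or > and <= or < are used to build the inequalities shown above.

Dots are for future extensions and must be empty.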
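
As a rough check of the equivalence described above, this sketch (assuming dplyr 1.1.0 or later, with hypothetical columns value, lower, and upper) compares between() with hand-written inequalities for two bounds settings.

```r
library(dplyr)

# Hypothetical tables for illustration only
x <- tibble(value = c(5, 10, 15))
y <- tibble(lower = c(5, 10), upper = c(10, 15))

# Default bounds "[]": both ends are inclusive.
closed <- inner_join(x, y, join_by(between(value, lower, upper)))
closed_manual <- inner_join(x, y, join_by(value >= lower, value <= upper))

# bounds = "[)": the upper end becomes exclusive.
half_open <- inner_join(
  x, y,
  join_by(between(value, lower, upper, bounds = "[)"))
)
half_open_manual <- inner_join(x, y, join_by(value >= lower, value < upper))

nrow(closed)     # 4 matches: the value 10 falls in both closed ranges
nrow(half_open)  # 2 matches: 10 only falls in [10, 15), and 15 in neither
```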
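
within(x_lower, x_upper, y_lower, y_upper)

For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.

The inequalities used to build within() are the same regardless of the inclusiveness of the supplied ranges.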
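
A short sketch of within() next to its inequality form, assuming dplyr 1.1.0 or later. The two tables are invented for illustration and reuse the argument names from the signature as column names.

```r
library(dplyr)

# Hypothetical tables for illustration only
x <- tibble(x_lower = c(2, 0), x_upper = c(4, 6))
y <- tibble(y_lower = 1, y_upper = 5)

# [2, 4] lies inside [1, 5]; [0, 6] sticks out on both sides.
within_helper <- inner_join(
  x, y,
  join_by(within(x_lower, x_upper, y_lower, y_upper))
)
within_manual <- inner_join(
  x, y,
  join_by(x_lower >= y_lower, x_upper <= y_upper)
)

within_helper
all.equal(within_helper, within_manual)
```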
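
overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")

For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.

bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. "[]" uses <= and >=, but the 3 other options use < and > and generate the exact same inequalities.

Dots are for future extensions and must be empty.

These conditions assume that the ranges are well-formed and non-empty, i.e. x_lower <= x_upper when bounds are treated as "[]", and x_lower < x_upper otherwise.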
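
The effect of bounds on overlaps() can be sketched as below, assuming dplyr 1.1.0 or later. The ranges are hypothetical and chosen so that one pair touches only at a shared endpoint.

```r
library(dplyr)

# Hypothetical ranges for illustration only
x <- tibble(x_lower = c(1, 5), x_upper = c(5, 8))
y <- tibble(y_lower = 5, y_upper = 6)

# With "[]", ranges that merely touch at an endpoint count as overlapping.
closed <- inner_join(
  x, y,
  join_by(overlaps(x_lower, x_upper, y_lower, y_upper))
)
closed_manual <- inner_join(
  x, y,
  join_by(x_lower <= y_upper, x_upper >= y_lower)
)

# With "[)", strict inequalities are used, so [1, 5) no longer overlaps [5, 6).
half_open <- inner_join(
  x, y,
  join_by(overlaps(x_lower, x_upper, y_lower, y_upper, bounds = "[)"))
)
half_open_manual <- inner_join(
  x, y,
  join_by(x_lower < y_upper, x_upper > y_lower)
)

nrow(closed)     # 2
nrow(half_open)  # 1
```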

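Column referencing

When specifying join conditions, join_by() assumes that column names on the left-hand side of the condition refer to the left-hand table (x), and names on the right-hand side of the condition refer to the right-hand table (y). Occasionally, it is clearer to be able to specify a right-hand table name on the left-hand side of the condition, and vice versa. To support this, column names can be prefixed by x$ or y$ to explicitly specify which table they come from.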
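
A minimal sketch of explicit prefixes, assuming dplyr 1.1.0 or later. The windows and orders tables are made up for the example; the point is that the condition can be written from the viewpoint of the right-hand table while windows stays on the left.

```r
library(dplyr)

# Hypothetical tables for illustration only
windows <- tibble(window_id = 1:2, opens = c(1, 10), closes = c(5, 15))
orders <- tibble(order_id = 1:3, placed = c(3, 7, 12))

# `windows` is x and `orders` is y. Without prefixes, the first argument of
# between() would be looked up in x, so `placed` is marked as coming from y
# and the range columns as coming from x.
by <- join_by(between(y$placed, x$opens, x$closes))
matched <- inner_join(windows, orders, by)
matched
```

Here the order placed at 3 falls in the first window and the order placed at 12 falls in the second, while the order placed at 7 matches no window and is dropped by the inner join.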

Examples

library(dplyr)

sales <- tibble(
  id = c(1L, 1L, 1L, 2L, 2L),
  sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01"))
)
sales
#> # A tibble: 5 × 2
#>      id sale_date 
#>   <int> <date>    
#> 1     1 2018-12-31
#> 2     1 2019-01-02
#> 3     1 2019-01-05
#> 4     2 2019-01-04
#> 5     2 2019-01-01

promos <- tibble(
  id = c(1L, 1L, 2L),
  promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02"))
)
promos
#> # A tibble: 3 × 2
#>      id promo_date
#>   <int> <date>    
#> 1     1 2019-01-01
#> 2     1 2019-01-05
#> 3     2 2019-01-02

# Match `id` to `id`, and `sale_date` to `promo_date`
by <- join_by(id, sale_date == promo_date)
left_join(sales, promos, by)
#> # A tibble: 5 × 2
#>      id sale_date 
#>   <int> <date>    
#> 1     1 2018-12-31
#> 2     1 2019-01-02
#> 3     1 2019-01-05
#> 4     2 2019-01-04
#> 5     2 2019-01-01

# For each `sale_date` within a particular `id`,
# find all `promo_date`s that occurred before that particular sale
by <- join_by(id, sale_date >= promo_date)
left_join(sales, promos, by)
#> # A tibble: 6 × 3
#>      id sale_date  promo_date
#>   <int> <date>     <date>    
#> 1     1 2018-12-31 NA        
#> 2     1 2019-01-02 2019-01-01
#> 3     1 2019-01-05 2019-01-01
#> 4     1 2019-01-05 2019-01-05
#> 5     2 2019-01-04 2019-01-02
#> 6     2 2019-01-01 NA        

# For each `sale_date` within a particular `id`,
# find only the closest `promo_date` that occurred before that sale
by <- join_by(id, closest(sale_date >= promo_date))
left_join(sales, promos, by)
#> # A tibble: 5 × 3
#>      id sale_date  promo_date
#>   <int> <date>     <date>    
#> 1     1 2018-12-31 NA        
#> 2     1 2019-01-02 2019-01-01
#> 3     1 2019-01-05 2019-01-05
#> 4     2 2019-01-04 2019-01-02
#> 5     2 2019-01-01 NA        

# If you want to disallow exact matching in rolling joins, use `>` rather
# than `>=`. Note that the promo on `2019-01-05` is no longer considered the
# closest match for the sale on the same date.
by <- join_by(id, closest(sale_date > promo_date))
left_join(sales, promos, by)
#> # A tibble: 5 × 3
#>      id sale_date  promo_date
#>   <int> <date>     <date>    
#> 1     1 2018-12-31 NA        
#> 2     1 2019-01-02 2019-01-01
#> 3     1 2019-01-05 2019-01-01
#> 4     2 2019-01-04 2019-01-02
#> 5     2 2019-01-01 NA        

# Same as before, but also require that the promo had to occur at most 1
# day before the sale was made. We'll use a full join to see that id 2's
# promo on `2019-01-02` is no longer matched to the sale on `2019-01-04`.
sales <- mutate(sales, sale_date_lower = sale_date - 1)
by <- join_by(id, closest(sale_date >= promo_date), sale_date_lower <= promo_date)
full_join(sales, promos, by)
#> # A tibble: 6 × 4
#>      id sale_date  sale_date_lower promo_date
#>   <int> <date>     <date>          <date>    
#> 1     1 2018-12-31 2018-12-30      NA        
#> 2     1 2019-01-02 2019-01-01      2019-01-01
#> 3     1 2019-01-05 2019-01-04      2019-01-05
#> 4     2 2019-01-04 2019-01-03      NA        
#> 5     2 2019-01-01 2018-12-31      NA        
#> 6     2 NA         NA              2019-01-02

# ---------------------------------------------------------------------------

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)
segments
#> # A tibble: 4 × 4
#>   segment_id chromosome start   end
#>        <int> <chr>      <dbl> <dbl>
#> 1          1 chr1         140   150
#> 2          2 chr2         210   240
#> 3          3 chr2         380   415
#> 4          4 chr1         230   280

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)
reference
#> # A tibble: 4 × 4
#>   reference_id chromosome start   end
#>          <int> <chr>      <dbl> <dbl>
#> 1            1 chr1         100   150
#> 2            2 chr1         200   250
#> 3            3 chr2         300   399
#> 4            4 chr2         415   450

# Find every time a segment `start` falls between the reference
# `[start, end]` range.
by <- join_by(chromosome, between(start, start, end))
full_join(segments, reference, by)
#> # A tibble: 5 × 7
#>   segment_id chromosome start.x end.x reference_id start.y end.y
#>        <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
#> 1          1 chr1           140   150            1     100   150
#> 2          2 chr2           210   240           NA      NA    NA
#> 3          3 chr2           380   415            3     300   399
#> 4          4 chr1           230   280            2     200   250
#> 5         NA chr2            NA    NA            4     415   450

# If you wanted the reference columns first, supply `reference` as `x`
# and `segments` as `y`, then explicitly refer to their columns using `x$`
# and `y$`.
by <- join_by(chromosome, between(y$start, x$start, x$end))
full_join(reference, segments, by)
#> # A tibble: 5 × 7
#>   reference_id chromosome start.x end.x segment_id start.y end.y
#>          <int> <chr>        <dbl> <dbl>      <int>   <dbl> <dbl>
#> 1            1 chr1           100   150          1     140   150
#> 2            2 chr1           200   250          4     230   280
#> 3            3 chr2           300   399          3     380   415
#> 4            4 chr2           415   450         NA      NA    NA
#> 5           NA chr2            NA    NA          2     210   240

# Find every time a segment falls completely within a reference.
# Sometimes using `x$` and `y$` makes your intentions clearer, even if they
# match the default behavior.
by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)
#> # A tibble: 1 × 7
#>   segment_id chromosome start.x end.x reference_id start.y end.y
#>        <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
#> 1          1 chr1           140   150            1     100   150

# Find every time a segment overlaps a reference in any way.
by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)
#> # A tibble: 5 × 7
#>   segment_id chromosome start.x end.x reference_id start.y end.y
#>        <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
#> 1          1 chr1           140   150            1     100   150
#> 2          2 chr2           210   240           NA      NA    NA
#> 3          3 chr2           380   415            3     300   399
#> 4          3 chr2           380   415            4     415   450
#> 5          4 chr1           230   280            2     200   250

# It is common to have right-open ranges with bounds like `[)`, which would
# mean an end value of `415` would no longer overlap a start value of `415`.
# Setting `bounds` allows you to compute overlaps with those kinds of ranges.
by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "[)"))
full_join(segments, reference, by)
#> # A tibble: 5 × 7
#>   segment_id chromosome start.x end.x reference_id start.y end.y
#>        <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
#> 1          1 chr1           140   150            1     100   150
#> 2          2 chr2           210   240           NA      NA    NA
#> 3          3 chr2           380   415            3     300   399
#> 4          4 chr1           230   280            2     200   250
#> 5         NA chr2            NA    NA            4     415   450

Source: R/join-by.R


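Note: This article was selected and compiled by 純淨天空 from Join specifications, an original English work by Hadley Wickham and others. Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reprint or copy this translation without permission or authorization.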