R dplyr mutate-joins: Mutating joins

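Mutating joins add columns from y to x, matching observations based on the keys. There are four mutating joins: the inner join and the three outer joins.

Inner join

inner_join() keeps only the observations from x that have a matching key in y.

The most important property of an inner join is that unmatched rows in either input are not included in the result. Inner joins are therefore usually not appropriate for analysis, because it is too easy to lose observations without noticing.

Outer joins

The three outer joins keep observations that appear in at least one of the data frames:

left_join() keeps all observations in x.
right_join() keeps all observations in y.
full_join() keeps all observations in x and y.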
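
As a minimal sketch of how the four joins differ in the rows they keep, the following compares row counts on two small tables. The tables x and y and their columns (id, x_val, y_val) are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical example data: `id` is the shared key column.
x <- tibble(id = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- tibble(id = c(2, 3, 4), y_val = c("B", "C", "D"))

n_inner <- nrow(inner_join(x, y, by = join_by(id)))  # keys present in both: 2, 3
n_left  <- nrow(left_join(x, y, by = join_by(id)))   # every key of x: 1, 2, 3
n_right <- nrow(right_join(x, y, by = join_by(id)))  # every key of y: 2, 3, 4
n_full  <- nrow(full_join(x, y, by = join_by(id)))   # keys of either: 1, 2, 3, 4

c(inner = n_inner, left = n_left, right = n_right, full = n_full)
```

Only the inner join returns fewer rows than both inputs here, which is the silent loss of observations described above.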

Usage

inner_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL
)

# S3 method for data.frame
inner_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = c("na", "never"),
  multiple = "all",
  unmatched = "drop",
  relationship = NULL
)

left_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL
)

# S3 method for data.frame
left_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = c("na", "never"),
  multiple = "all",
  unmatched = "drop",
  relationship = NULL
)

right_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL
)

# S3 method for data.frame
right_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = c("na", "never"),
  multiple = "all",
  unmatched = "drop",
  relationship = NULL
)

full_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL
)

# S3 method for data.frame
full_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = c("na", "never"),
  multiple = "all",
  relationship = NULL
)

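Arguments

x, y

A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details.

by

A join specification created with join_by(), or a character vector of variables to join by.

If NULL, the default, *_join() will perform a natural join, using all variables in common across x and y. A message lists the variables so that you can check they're correct; suppress the message by supplying by explicitly.

To join on different variables between x and y, use a join_by() specification. For example, join_by(a == b) will match x$a to y$b.

To join by multiple variables, use a join_by() specification with multiple expressions. For example, join_by(a == b, c == d) will match x$a to y$b and x$c to y$d. If the column names are the same between x and y, you can shorten this by listing only the variable names, like join_by(a, c).

join_by() can also be used to perform inequality, rolling, and overlap joins. See the documentation at ?join_by for details on these types of joins.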
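
The sketch below illustrates two of these join types: an inequality join and an overlap join written with the between() helper of join_by(). The events and windows tables and their column names are hypothetical.

```r
library(dplyr)

# Hypothetical data: point events and named time windows.
events  <- tibble(day = c(2, 6, 20))
windows <- tibble(
  win_start = c(1, 5),
  win_end   = c(4, 9),
  window    = c("w1", "w2")
)

# Inequality join: every window that starts on or before the event day.
started <- inner_join(events, windows, by = join_by(day >= win_start))

# Overlap join: only the window whose [win_start, win_end] range contains the day.
containing <- inner_join(
  events, windows,
  by = join_by(between(day, win_start, win_end))
)

nrow(started)
containing
```

The inequality join can return several windows per event, while the between() condition narrows each event to the window that actually contains it; day 20 falls in no window and is dropped by the inner join.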

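For simple equality joins, you can alternatively specify a character vector of variable names to join by. For example, by = c("a", "b") joins x$a to y$a and x$b to y$b. If variable names differ between x and y, use a named character vector like by = c("x_a" = "y_a", "x_b" = "y_b").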
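
To make the two spellings concrete, here is a small sketch showing that a join_by() specification and a named character vector describe the same equality join. The orders and customers tables are hypothetical.

```r
library(dplyr)

# Hypothetical tables whose key columns have different names.
orders    <- tibble(cust_id = c(1, 2, 2), amount = c(10, 20, 30))
customers <- tibble(id = c(1, 2), name = c("Ann", "Bob"))

# join_by() specification: match orders$cust_id to customers$id.
res_join_by <- left_join(orders, customers, by = join_by(cust_id == id))

# Named character vector: the name is the x column, the value is the y column.
res_chr <- left_join(orders, customers, by = c("cust_id" = "id"))

all.equal(res_join_by, res_chr)
```

In both cases the key column keeps its name from x (cust_id) in the output.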

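To perform a cross-join, generating all combinations of x and y, see cross_join().

copy

If x and y are not from the same data source, and copy is TRUE, then y will be copied into the same src as x. This allows you to join tables across srcs, but it is a potentially expensive operation so you must opt into it.

suffix

If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.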
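
A short sketch of how suffix affects the output column names, using two hypothetical tables that share a non-key column called value:

```r
library(dplyr)

# Hypothetical tables: `id` is the key, `value` is duplicated but not joined on.
x <- tibble(id = 1:2, value = c(10, 20))
y <- tibble(id = 1:2, value = c(0.1, 0.2))

# Default suffixes ".x" and ".y".
default_names <- names(inner_join(x, y, by = join_by(id)))

# Custom suffixes; the first applies to x, the second to y.
custom_names <- names(
  inner_join(x, y, by = join_by(id), suffix = c("_left", "_right"))
)

default_names
custom_names
```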

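...

Other parameters passed on to methods.

keep

Should the join keys from both x and y be preserved in the output?

If NULL, the default, joins on equality retain only the keys from x, while joins on inequality retain the keys from both inputs.
If TRUE, all keys from both inputs are retained.
If FALSE, only keys from x are retained. For right and full joins, the data in key columns corresponding to rows that only exist in y are merged into the key columns from x. Can't be used when joining on inequality conditions.

na_matches

Should two NA or two NaN values match?

"na", the default, treats two NA or two NaN values as equal, like %in%, match(), and merge().
"never" treats two NA or two NaN values as different, and will never match them together or to any other values. This is similar to joins for database sources and to base::merge(incomparables = NA).

multiple

Handling of rows in x with multiple matches in y. For each row of x:

"all", the default, returns every match detected in y. This is the same behavior as SQL.
"any" returns one match detected in y, with no guarantee about which match is returned. It is often faster than "first" and "last" if you only need to detect whether there is at least one match.
"first" returns the first match detected in y.
"last" returns the last match detected in y.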
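
The options above can be compared on a small case where one row of x has two matches in y. The tables and the label column are hypothetical.

```r
library(dplyr)

# Hypothetical data: id 1 appears twice in y.
x <- tibble(id = c(1, 2))
y <- tibble(id = c(1, 1, 2), label = c("first", "second", "third"))

all_matches <- left_join(x, y, by = join_by(id), multiple = "all")
first_match <- left_join(x, y, by = join_by(id), multiple = "first")
last_match  <- left_join(x, y, by = join_by(id), multiple = "last")

nrow(all_matches)   # one output row per match
first_match$label   # earliest matching row of y for each row of x
last_match$label    # latest matching row of y for each row of x
```

"any" is left out of the sketch because which match it returns is not guaranteed.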

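unmatched

How should unmatched keys that would result in dropped rows be handled?

"drop" drops unmatched keys from the result.
"error" throws an error if unmatched keys are detected.

unmatched is intended to protect you from accidentally dropping rows during a join. It only checks for unmatched keys in the input that could potentially drop rows.

For left joins, it checks y.
For right joins, it checks x.
For inner joins, it checks both x and y. In this case, unmatched is also allowed to be a character vector of length 2 to specify the behavior for x and y independently.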
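
A minimal sketch of unmatched in an inner join, assuming two hypothetical tables where x contains a key (3) that is missing from y. The error is caught with tryCatch() and replaced by a placeholder string purely for illustration.

```r
library(dplyr)

# Hypothetical data: id 3 exists only in x.
x <- tibble(id = c(1, 2, 3))
y <- tibble(id = c(1, 2), label = c("a", "b"))

# Default "drop": the row with id 3 silently disappears.
dropped <- inner_join(x, y, by = join_by(id))

# "error": the same join now fails instead of dropping the row.
result <- tryCatch(
  inner_join(x, y, by = join_by(id), unmatched = "error"),
  error = function(e) "error: unmatched keys"
)

# Length-2 form: drop unmatched rows of x, but error on unmatched rows of y.
# Every key of y is matched here, so this succeeds.
lenient <- inner_join(x, y, by = join_by(id), unmatched = c("drop", "error"))

nrow(dropped)
result
nrow(lenient)
```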

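relationship

Handling of the expected relationship between the keys of x and y. If the expectation chosen from the list below is violated, an error is thrown.

NULL, the default, doesn't expect any particular relationship between x and y. However, for equality joins it will check for a many-to-many relationship (which is typically unexpected) and will warn if one occurs, encouraging you to either take a closer look at your inputs or make this relationship explicit by specifying "many-to-many".

See the Many-to-many relationships section for more details.
"one-to-one" expects:
- Each row in x matches at most 1 row in y.
- Each row in y matches at most 1 row in x.
"one-to-many" expects:
- Each row in y matches at most 1 row in x.
"many-to-one" expects:
- Each row in x matches at most 1 row in y.
"many-to-many" doesn't perform any relationship checks, but lets you be explicit about this relationship if you know it exists.

relationship doesn't handle cases where there are zero matches. For that, see unmatched.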
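
The following sketch shows one expectation that holds and one that is violated, using hypothetical tables in which a key of x matches two rows of y. The error is caught with tryCatch() and replaced by a placeholder string for illustration only.

```r
library(dplyr)

# Hypothetical data: id 1 appears once in x and twice in y.
x <- tibble(id = c(1, 2))
y <- tibble(id = c(1, 1, 2), label = c("a", "b", "c"))

# "one-to-many" holds: each row of y matches at most one row of x.
ok <- left_join(x, y, by = join_by(id), relationship = "one-to-many")

# "many-to-one" is violated: row 1 of x matches two rows of y.
checked <- tryCatch(
  left_join(x, y, by = join_by(id), relationship = "many-to-one"),
  error = function(e) "relationship violated"
)

nrow(ok)
checked
```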

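Value

An object of the same type as x (including the same groups). The order of the rows and columns of x is preserved as much as possible. The output has the following properties:

The rows are affected by the join type.
- inner_join() returns matched x rows.
- left_join() returns all x rows.
- right_join() returns matched x rows, followed by unmatched y rows.
- full_join() returns all x rows, followed by unmatched y rows.
Output columns include all columns from x, and all non-key columns from y. If keep = TRUE, the key columns from y are included as well.
If non-key columns in x and y have the same name, suffixes are added to disambiguate them. If keep = TRUE and key columns in x and y have the same name, suffixes are added to disambiguate these as well.
If keep = FALSE, output columns included in by are coerced to their common type between x and y.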
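
As a sketch of the last two properties, the example below joins a hypothetical table with an integer key to one with a double key, once with keep = FALSE and once with keep = TRUE.

```r
library(dplyr)

# Hypothetical data: the key is integer in x and double in y.
x <- tibble(id = 1:3, x_val = c("a", "b", "c"))
y <- tibble(id = c(1, 2), y_val = c("A", "B"))

# keep = FALSE: a single key column, coerced to the common type of both keys.
merged <- left_join(x, y, by = join_by(id), keep = FALSE)

# keep = TRUE: both key columns are kept and disambiguated with the suffixes.
both <- left_join(x, y, by = join_by(id), keep = TRUE)

class(merged$id)
names(both)
class(both$id.x)
```

With keep = TRUE each key column retains its own type, since the two columns are no longer merged into one.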

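Many-to-many relationships

By default, dplyr guards against many-to-many relationships in equality joins by throwing a warning. These occur when both of the following are true:

A row in x matches multiple rows in y.
A row in y matches multiple rows in x.

This is typically surprising, as most joins involve a relationship of one-to-one, one-to-many, or many-to-one, and it is often the result of an improperly specified join. Many-to-many relationships are particularly problematic because they can result in a Cartesian explosion of the number of rows returned from the join.

If a many-to-many relationship is expected, silence this warning by explicitly setting relationship = "many-to-many".

In production code, it is best to preemptively set relationship to whatever relationship you expect to exist between the keys of x and y, as this forces an error to occur immediately if the data doesn't align with your expectations.

Inequality joins typically result in many-to-many relationships by nature, so they don't warn on them by default, but you should still take extra care when specifying an inequality join, because they also have the capability to return a large number of rows.

Rolling joins don't warn on many-to-many relationships either, but many rolling joins follow a many-to-one relationship, so it is often useful to set relationship = "many-to-one" to enforce this.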
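
A minimal sketch of a rolling join that enforces a many-to-one relationship, using the closest() helper of join_by(). The sales and prices tables and their columns are hypothetical: each sale is matched to the most recent price that took effect on or before the sale day.

```r
library(dplyr)

# Hypothetical data: sale days, and the days on which new prices took effect.
sales  <- tibble(sale_day = c(3, 7, 12))
prices <- tibble(start_day = c(1, 5, 10), price = c(100, 110, 120))

# closest() keeps, for each sale, only the nearest start_day satisfying
# sale_day >= start_day, so each row of x matches at most one row of y.
rolled <- left_join(
  sales, prices,
  by = join_by(closest(sale_day >= start_day)),
  relationship = "many-to-one"
)

rolled
```

Because the condition is not an equality, both key columns (sale_day and start_day) appear in the result.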

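Note that in SQL, most database providers won't let you specify a many-to-many relationship between two tables, instead requiring that you create a third junction table that results in two one-to-many relationships.

Methods

These functions are generics, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages:

inner_join(): dbplyr (tbl_lazy), dplyr (data.frame).
left_join(): dbplyr (tbl_lazy), dplyr (data.frame).
right_join(): dbplyr (tbl_lazy), dplyr (data.frame).
full_join(): dbplyr (tbl_lazy), dplyr (data.frame).

See also

Other joins: cross_join(), filter-joins, nest_join()

Examples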

library(dplyr)

band_members %>% inner_join(band_instruments)
#> Joining with `by = join_by(name)`
#> # A tibble: 2 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 John  Beatles guitar
#> 2 Paul  Beatles bass  
band_members %>% left_join(band_instruments)
#> Joining with `by = join_by(name)`
#> # A tibble: 3 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 Mick  Stones  NA    
#> 2 John  Beatles guitar
#> 3 Paul  Beatles bass  
band_members %>% right_join(band_instruments)
#> Joining with `by = join_by(name)`
#> # A tibble: 3 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 John  Beatles guitar
#> 2 Paul  Beatles bass  
#> 3 Keith NA      guitar
band_members %>% full_join(band_instruments)
#> Joining with `by = join_by(name)`
#> # A tibble: 4 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 Mick  Stones  NA    
#> 2 John  Beatles guitar
#> 3 Paul  Beatles bass  
#> 4 Keith NA      guitar

# To suppress the message about joining variables, supply `by`
band_members %>% inner_join(band_instruments, by = join_by(name))
#> # A tibble: 2 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 John  Beatles guitar
#> 2 Paul  Beatles bass  
# This is good practice in production code

# Use an equality expression if the join variables have different names
band_members %>% full_join(band_instruments2, by = join_by(name == artist))
#> # A tibble: 4 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 Mick  Stones  NA    
#> 2 John  Beatles guitar
#> 3 Paul  Beatles bass  
#> 4 Keith NA      guitar
# By default, the join keys from `x` and `y` are coalesced in the output; use
# `keep = TRUE` to keep the join keys from both `x` and `y`
band_members %>%
  full_join(band_instruments2, by = join_by(name == artist), keep = TRUE)
#> # A tibble: 4 × 4
#>   name  band    artist plays 
#>   <chr> <chr>   <chr>  <chr> 
#> 1 Mick  Stones  NA     NA    
#> 2 John  Beatles John   guitar
#> 3 Paul  Beatles Paul   bass  
#> 4 NA    NA      Keith  guitar

# If a row in `x` matches multiple rows in `y`, all the rows in `y` will be
# returned once for each matching row in `x`.
df1 <- tibble(x = 1:3)
df2 <- tibble(x = c(1, 1, 2), y = c("first", "second", "third"))
df1 %>% left_join(df2)
#> Joining with `by = join_by(x)`
#> # A tibble: 4 × 2
#>       x y     
#>   <dbl> <chr> 
#> 1     1 first 
#> 2     1 second
#> 3     2 third 
#> 4     3 NA    

# If a row in `y` also matches multiple rows in `x`, this is known as a
# many-to-many relationship, which is typically a result of an improperly
# specified join or some kind of messy data. In this case, a warning is
# thrown by default:
df3 <- tibble(x = c(1, 1, 1, 3))
df3 %>% left_join(df2)
#> Joining with `by = join_by(x)`
#> Warning: Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 1 of `x` matches multiple rows in `y`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#>   "many-to-many"` to silence this warning.
#> # A tibble: 7 × 2
#>       x y     
#>   <dbl> <chr> 
#> 1     1 first 
#> 2     1 second
#> 3     1 first 
#> 4     1 second
#> 5     1 first 
#> 6     1 second
#> 7     3 NA    

# In the rare case where a many-to-many relationship is expected, set
# `relationship = "many-to-many"` to silence this warning
df3 %>% left_join(df2, relationship = "many-to-many")
#> Joining with `by = join_by(x)`
#> # A tibble: 7 × 2
#>       x y     
#>   <dbl> <chr> 
#> 1     1 first 
#> 2     1 second
#> 3     1 first 
#> 4     1 second
#> 5     1 first 
#> 6     1 second
#> 7     3 NA    

# Use `join_by()` with a condition other than `==` to perform an inequality
# join. Here we match on every instance where `df1$x > df2$x`.
df1 %>% left_join(df2, join_by(x > x))
#> # A tibble: 6 × 3
#>     x.x   x.y y     
#>   <int> <dbl> <chr> 
#> 1     1    NA NA    
#> 2     2     1 first 
#> 3     2     1 second
#> 4     3     1 first 
#> 5     3     1 second
#> 6     3     2 third 

# By default, NAs match other NAs so that there are two
# rows in the output of this join:
df1 <- data.frame(x = c(1, NA), y = 2)
df2 <- data.frame(x = c(1, NA), z = 3)
left_join(df1, df2)
#> Joining with `by = join_by(x)`
#>    x y z
#> 1  1 2 3
#> 2 NA 2 3

# You can optionally request that NAs don't match, giving a
# result that more closely resembles SQL joins
left_join(df1, df2, na_matches = "never")
#> Joining with `by = join_by(x)`
#>    x y  z
#> 1  1 2  3
#> 2 NA 2 NA

Source code: R/join.R


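Note: This article was selected and compiled by 純淨天空 from the original English work "Mutating joins" by Hadley Wickham et al. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission.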