Python Pandas DataFrame merge方法用法及代码示例

Pandas DataFrame.merge(~) 方法将源 DataFrame 与另一个 DataFrame 或命名系列合并。

参数

1.right | DataFrame 或 named Series

要合并源 DataFrame 的 DataFrame 或系列。

2. how | string | optional

要执行的合并类型：

值	说明
`"left"`	源 DataFrame 中的所有行都将出现在生成的 DataFrame 中。这相当于 left-join 的 SQL。
`"right"`	右侧 DataFrame 中的所有行都将出现在生成的 DataFrame 中。这相当于 right-join 的 SQL。
`"outer"`	来自源和右侧 DataFrame 的所有行都将出现在生成的 DataFrame 中。这相当于 outer-join 的 SQL。
`"inner"`	在源 DataFrame 中具有匹配值的所有行都将出现在生成的 DataFrame 中。这是相当于 inner-join 的 SQL。

默认情况下，how="inner" 。

这是说明差异的经典维恩图：

3. on | string 或 list | optional

要执行连接的列或 index-levels 的标签。仅当左侧和右侧 DataFrames 具有相同标签时才有效。默认情况下， on=None ，这意味着将执行内部联接。

注意

on 参数只是为了方便起见。如果要连接的列具有不同的标签，则必须使用 left_on 、 right_on 、 left_index 和 right_index 代替。

4. left_on | string 或 array-like | optional

要执行连接的左侧 DataFrame 的列标签或索引级别。

5. right_on | string 或 array-like | optional

要执行连接的右侧 DataFrame 的列标签或索引级别。

6. left_index | boolean | optional

是否对左侧 DataFrame 的索引执行连接。默认情况下，left_index=False 。

7. right_index | boolean | optional

是否对右侧 DataFrame 的索引执行连接。默认情况下，right_index=False 。

注意

许多教科书和文档都使用这些词合并键或者连接键表示执行连接的列。

8. sort | boolean | optional

是否根据连接键对行进行排序。默认情况下，sort=False 。

9. suffixes | (string, string) 的tuple | optional

要附加到生成的 DataFrame 中的重复列标签的后缀名称。默认情况下，suffixes=("_x", "_y") 。

10.copy | boolean | optional

如果 True ，则返回 DataFrame 的新副本。
如果是 False ，则尽可能避免创建新副本。

默认情况下，copy=True 。

11.indicator | boolean 或 string | optional

是否添加一个名为 _merge 的额外列，它告诉我们该行是从哪个 DataFrame 构造的。默认情况下，indicator=False 。

12.validate | string | optional

要运行的验证逻辑：

值	说明
`"one_to_one"` 或 `"1:1"`	检查左右合并键是否唯一。
`"one_to_many"` 或 `"1:m"`	检查左侧的合并键是否唯一。
`"many_to_one"` 或 `"m:1"`	检查右侧的合并键是否唯一。
`"many_to_many"` 或 `"m:m"`	不执行任何检查。

默认情况下，validate=None 。

返回值

合并的 DataFrame。

例子

执行inner-join

假设一个店主有以下数据：

ID	产品	bought_by
A	computer	1
B	smartphone	3
C	headphones	NaN

ID	名字	年龄
1	alex	10
2	bob	20
3	cathy	30

在这里，顶部表格是关于商店拥有的产品，而底部表格是关于注册客户的资料。

假设我们想查看购买了产品的客户的个人资料。为此，我们需要对 products 表的 bought_by 列和 customers 表的索引标签执行内部联接。

首先，让我们创建两个 DataFrame。这是products 数据帧：

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
 "bought_by": [1, 3, pd.np.NaN]},
 index=["A","B","C"])
df_products



   product     bought_by
A  computer      1.0
B  smartphone    3.0
C  headphones    NaN

这是customers 数据帧：

df_customers = pd.DataFrame({"name": ["alex", "bob", "cathy"],
 "age": [10, 20, 30]},
 index=[1,2,3])
df_customers



   name   age
1  alex   10
2  bob    20
3  cathy  30

现在，执行inner-join：

df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True)



   product    bought_by  name   age
A  computer     1.0      alex   10
B  smartphone   3.0      cathy  30

请注意以下事项：

left_on="bought_by" 表示我们要使用左侧 DataFrame 的 bought_by 列进行对齐。
right_index=True 表示我们要使用右侧 DataFrame 的索引进行对齐。我们还可以通过指定 right_on 参数来基于列进行合并。
您可以在此处省略 how 参数，因为默认值为 "inner" 。
请注意列的顺序 - 源 DataFrame 的列(即 df_products )首先出现，然后是正确 DataFrame 的列。

执行left-join

为了演示left-join，我们将使用与之前相同的产品和客户示例：

[df_products]         |     [df_customers]
   product      bought_by     |        name   age
A  computer       1.0         |     1  alex   10
B  smartphone     3.0         |     2  bob    20
C  headphones     NaN         |     3  cathy  30

要查看所有产品以及购买该产品的客户的个人资料，我们需要执行left-join，如下所示：

df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True)



   product    bought_by  name   age
A  computer     1.0      alex   10.0
B  smartphone   3.0      cathy  30.0
C  headphones   NaN      NaN    NaN

left_on="bought_by" 表示我们要使用左侧 DataFrame 的 bought_by 列进行对齐。
right_index=True 表示我们要使用右侧 DataFrame 的索引进行对齐。
请注意列的顺序 - 源 DataFrame 的列(即 df_products )首先出现，然后是正确 DataFrame 的列。

执行right-join

为了演示right-join，我们将使用与之前相同的产品和客户示例：

[df_products]         |     [df_customers]
   product      bought_by     |        name   age
A  computer       1.0         |     1  alex   10
B  smartphone     3.0         |     2  bob    20
C  headphones     NaN         |     3  cathy  30

要查看所有客户的个人资料以及他们购买的产品，我们需要执行right-join，如下所示：

df_products.merge(df_customers, how="right", left_on="bought_by", right_index=True)



     product    bought_by  name   age
A    computer     1.0      alex   10
B    smartphone   3.0      cathy  30
NaN  NaN          2.0      bob    20

执行full-join

为了演示full-join，我们将使用与之前相同的产品和客户示例：

[df_products]         |     [df_customers]
   product      bought_by     |        name   age
A  computer       1.0         |     1  alex   10
B  smartphone     3.0         |     2  bob    20
C  headphones     NaN         |     3  cathy  30

要查看所有客户以及所有产品的资料，我们需要执行full-join，如下所示：

df_products.merge(df_customers, how="outer", left_on="bought_by", right_index=True)



     product    bought_by  name    age
A    computer     1.0      alex    10.0
B    smartphone   3.0      cathy   30.0
C    headphones   NaN      NaN     NaN
NaN  NaN          2.0      bob     20.0

指定排序参数

为了演示 sort 参数的作用，我们将使用与之前相同的产品和客户示例：

[df_products]         |     [df_customers]
   product      bought_by     |        name   age
A  computer       3.0         |     1  alex   10
B  smartphone     1.0         |     2  bob    20
C  headphones     NaN         |     3  cathy  30

附带说明一下，bought_by 列中的值的顺序已交换，因为这将使我们能够看到 sort=True 的行为。

假设我们执行了left-join，如下所示：

df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True, sort=True)



   product     bought_by  name   age
B  smartphone    1.0      alex   10.0
A  computer      3.0      cathy  30.0
C  headphones    NaN      NaN    NaN

请注意生成的 DataFrame 如何根据连接键(bought_by 列)进行排序。如果没有 sort=True ，该列将不会被排序，即，将保留原始顺序。

指定后缀参数

为了演示 suffix 参数的作用，我们将使用与之前相同的产品和客户示例：

[df_products]           |     [df_customers]
   name      bought_by        |        name   age
A  computer       1.0         |     1  alex   10
B  smartphone     3.0         |     2  bob    20
C  headphones     NaN         |     3  cathy  30

请注意，之前我们使用列标签 product 作为产品名称，但在本例中，我们将使用列标签 name。现在，我们的列名称有重叠。

要了解 Pandas 默认情况下如何处理此问题，让我们执行inner-join：

df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True)



   name_x     bought_by  name_y  age
A  computer     1.0      alex    10
B  smartphone   3.0      cathy   30

我们看到 Pandas 添加了后缀 _x 和 _y 来区分列。

我们可以通过指定 suffixes 参数来覆盖此行为，该参数仅接受后缀元组：

df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True, suffixes=("_product", "_customer"))



   name_product  bought_by  name_customer  age
A  computer        1.0          alex       10
B  smartphone      3.0          cathy      30

观察列名现在如何反映我们指定的后缀。

指定指标参数

为了演示 indicator 参数的作用，我们将使用与之前相同的产品和客户示例：

[df_products]         |     [df_customers]
   product      bought_by     |        name   age
A  computer       1.0         |     1  alex   10
B  smartphone     3.0         |     2  bob    20
C  headphones     NaN         |     3  cathy  30

假设我们使用 indicator=True 执行左连接：

df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True, indicator=True)



   product    bought_by  name    age    _merge
A  computer      1.0     alex    10.0   both
B  smartphone    3.0     cathy   30.0   both
C  headphones    NaN     NaN     NaN    left_only

请注意以下事项：

请注意我们如何在末尾添加一个名为 _merge 的附加列。此列告诉我们该行是由哪个 DataFrame 构造的。
前两行的值为 "both" ，这意味着在源 DataFrame 和 right DataFrame 中都找到了匹配项。通过回顾我们的两个 DataFrame，您可以轻松确认这是True。
最后一行的值为 "left_only" ，这意味着该行来自源 DataFrame。

指定验证参数

我们将再次使用有关产品和客户的相同示例：

[df_products]         |     [df_customers]
   product      bought_by     |        name   age
A  computer       1.0         |     1  alex   10
B  smartphone     3.0         |     2  bob    20
C  headphones     3.0         |     3  cathy  30

在此示例中，我们将耳机的 "bought_by" 值从 NaN 更改为 3 。

让我们对客户 ID 执行inner-join：

df_products.merge(df_customers, how="inner", right_index=True, left_on="bought_by")



   product    bought_by  name   age
A  computer       3      cathy  30
C  headphones     3      cathy  30
B  smartphone     1      alex   10

这里，一位客户 (Cathy) 购买了 2 件产品，因此这是一个 1:2 映射，通常写为 1:m，其中 m 只代表大于 1 的数字。

现在让我们调用完全相同的函数，但使用 validate="1:1" ：

df_products.merge(df_customers, how="inner", right_index=True, left_on="bought_by", validate="1:1")



MergeError: Merge keys are not unique in left dataset; not a one-to-one merge

由于这不是 1:1 映射，因此会引发错误。更一般地说，如果左侧 DataFrame 具有重复值，则验证规则 "1:1" 和 "1:m" 将引发错误。

相关用法

注：本文由纯净天空筛选整理自Isshin Inada大神的英文原创作品 Pandas DataFrame | merge method。非经特殊声明，原始代码版权归原作者所有，本译文未经允许或授权，请勿转载或复制。