Python Pandas DataFrame join method: usage and code examples

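Pandas DataFrame.join(~) merges the source DataFrame with another Series or DataFrame.

Note

The join(~) method is a wrapper around the merge(~) method, so if you want more control over the join process, consider using merge(~) instead.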

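Parameters

1. other | Series or DataFrame or list of DataFrames

The other object to join with.

2. on | string or list | optional

The name of the column or index level of the source DataFrame to join on. The index of other is always used as its side of the join. To join on a non-index column of other, use the DataFrame merge(~) method instead, which has a right_on parameter.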

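3. how | string | optional

The type of join to perform:

Value	Description
`"left"`	All rows of the source DataFrame appear in the resulting DataFrame. This is equivalent to an SQL left join.
`"right"`	All rows of the right DataFrame appear in the resulting DataFrame. This is equivalent to an SQL right join.
`"outer"`	All rows of both the source and the right DataFrame appear in the resulting DataFrame. This is equivalent to an SQL full outer join.
`"inner"`	Only rows whose join key has a match in both the source and the right DataFrame appear in the resulting DataFrame. This is equivalent to an SQL inner join.

By default, how="left".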

[Figure: Venn diagram illustrating the differences between the join types]

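4. lsuffix | string | optional

The suffix appended to overlapping column labels of the source DataFrame. This is only relevant when the result would contain duplicate column labels. By default, lsuffix="".

5. rsuffix | string | optional

The suffix appended to overlapping column labels of other. This is only relevant when the result would contain duplicate column labels. By default, rsuffix="".

6. sort | boolean | optional

Whether to sort the rows by the join key. By default, sort=False.

Return value

The merged DataFrame.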
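As a rough illustration of passing a list as other, the sketch below joins three small DataFrames on their indexes in a single call. The frames df_a, df_b and df_c and their column names are made up for this example. When a list is passed, the join is index-on-index, so on and the suffix parameters are left unset here.

```python
import pandas as pd

# Hypothetical frames, invented for this illustration only.
df_a = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
df_b = pd.DataFrame({"b": [3, 4]}, index=["x", "y"])
df_c = pd.DataFrame({"c": [5, 6]}, index=["y", "z"])

# Passing a list joins every frame on its index; how="left" is the default,
# so only the index labels of df_a ("x" and "y") are kept.
joined = df_a.join([df_b, df_c])
print(joined)
```

Label "z" exists only in df_c, so it is dropped by the left join, while label "x" has no match in df_c and gets a missing value in column c.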

Examples

Basic usage

Consider the following DataFrame about some of a shop's products:

import pandas as pd

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
 "bought_by": ["bob", "alex", "bob"]},
 index=["A","B","C"])
df_products



   product      bought_by
A  computer     bob
B  smartphone   alex
C  headphones   bob

Here is a DataFrame about some of the shop's customers:

df_customers = pd.DataFrame({"age": [10, 20, 30]},
 index=["alex","bob","cathy"])
df_customers



       age
alex   10
bob    20
cathy  30

To perform a left join on the bought_by column of df_products:

df_products.join(df_customers, on="bought_by")   # how="left"



   product     bought_by   age
A  computer      bob       20
B  smartphone    alex      10
C  headphones    bob       20

The index of other is used as its join key. If you want more flexibility in choosing which columns to join on, use the merge(~) method instead.
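To make the relationship with merge(~) concrete, here is a minimal sketch that repeats the join above with merge(~), and then joins on an ordinary column of the customers table via right_on. The column name "name" and the frame df_customers_col are assumptions introduced for this example; they do not appear in the original data.

```python
import pandas as pd

df_products = pd.DataFrame(
    {"product": ["computer", "smartphone", "headphones"],
     "bought_by": ["bob", "alex", "bob"]},
    index=["A", "B", "C"],
)
df_customers = pd.DataFrame({"age": [10, 20, 30]}, index=["alex", "bob", "cathy"])

# join(on=...) matches a column of the source against the index of other.
via_join = df_products.join(df_customers, on="bought_by")

# The same left join written with merge.
via_merge = df_products.merge(
    df_customers, left_on="bought_by", right_index=True, how="left"
)
print(via_join.equals(via_merge))

# Hypothetical variant: the customer name lives in a regular column "name",
# which join cannot use as the key but merge can, through right_on.
df_customers_col = df_customers.rename_axis("name").reset_index()
via_right_on = df_products.merge(
    df_customers_col, left_on="bought_by", right_on="name", how="left"
)
print(via_right_on)
```

Note that the column-on-column merge returns a fresh integer index rather than the A/B/C labels of df_products.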

Comparison of different joins

Consider the following DataFrames about products and customers:

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
 "bought_by": ["bob", "alex", "david"]},
 index=["A","B","C"])
df_customers = pd.DataFrame({"age": [10, 20, 30]}, index=["alex","bob","cathy"])



        [df_products]         |   [df_customers]
   product      bought_by     |            age
A  computer     bob           |     alex   10
B  smartphone   alex          |     bob    20
C  headphones   david         |     cathy  30

Left join

df_products.join(df_customers, on="bought_by", how="left")



   product     bought_by  age
A  computer    bob        20.0
B  smartphone  alex       10.0
C  headphones  david      NaN
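The age values above are shown as floats because the unmatched row holds NaN, which an integer column cannot store. As a small sketch of how one might check this and pick out the unmatched rows (the variable names left and unmatched are chosen here for illustration):

```python
import pandas as pd

df_products = pd.DataFrame(
    {"product": ["computer", "smartphone", "headphones"],
     "bought_by": ["bob", "alex", "david"]},
    index=["A", "B", "C"],
)
df_customers = pd.DataFrame({"age": [10, 20, 30]}, index=["alex", "bob", "cathy"])

left = df_products.join(df_customers, on="bought_by", how="left")

# The missing match for "david" forces the age column to a float dtype.
print(left["age"].dtype)

# Rows of the source DataFrame that found no partner in df_customers.
unmatched = left[left["age"].isna()]
print(unmatched)
```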

Right join

df_products.join(df_customers, on="bought_by", how="right")



     product     bought_by  age
B    smartphone  alex       10
A    computer    bob        20
NaN  NaN         cathy      30

Inner join

df_products.join(df_customers, on="bought_by", how="inner")



   product     bought_by  age
A  computer    bob        20
B  smartphone  alex       10

Outer join

df_products.join(df_customers, on="bought_by", how="outer")



     product     bought_by  age
A    computer    bob        20.0
B    smartphone  alex       10.0
C    headphones  david      NaN
NaN  NaN         cathy      30.0

The order of the rows may differ between pandas versions.
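The four join types can also be compared programmatically. The following sketch loops over the how values and records which join keys survive in each result; the dictionary keys_by_how is a helper introduced only for this illustration. Sets are used so that the comparison does not depend on row order.

```python
import pandas as pd

df_products = pd.DataFrame(
    {"product": ["computer", "smartphone", "headphones"],
     "bought_by": ["bob", "alex", "david"]},
    index=["A", "B", "C"],
)
df_customers = pd.DataFrame({"age": [10, 20, 30]}, index=["alex", "bob", "cathy"])

keys_by_how = {}
for how in ("left", "right", "inner", "outer"):
    result = df_products.join(df_customers, on="bought_by", how=how)
    # Collect the join keys present in the result, ignoring row order.
    keys_by_how[how] = set(result["bought_by"])
    print(how, len(result), sorted(keys_by_how[how]))
```

Here "david" appears only in the left and outer results, and "cathy" only in the right and outer results, which matches the tables above.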

Specifying lsuffix and rsuffix

Suppose we want to join the following DataFrames:

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
 "age": [5,6,7],
 "bought_by": ["bob", "alex", "bob"]},
 index=["A","B","C"])
df_customers = pd.DataFrame({"age": [10, 20, 30]},
 index=["alex","bob","cathy"])



   product     age  bought_by   |         age
A  computer     5   bob         |  alex   10
B  smartphone   6   alex        |  bob    20
C  headphones   7   bob         |  cathy  30

Note that both DataFrames have an age column.

Because of this naming overlap, performing the join raises a ValueError:

df_products.join(df_customers, on="bought_by")



ValueError: columns overlap but no suffix specified: Index(['age'], dtype='object')

This error can be resolved by specifying lsuffix or rsuffix:

df_products.join(df_customers, on="bought_by", lsuffix="_product")



   product     age_product  bought_by   age
A  computer        5         bob        20
B  smartphone      6         alex       10
C  headphones      7         bob        20

Note that lsuffix and rsuffix only take effect when there are duplicate column labels.
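Both suffixes can also be given at once, which labels each side of the overlap explicitly. A minimal sketch, where the suffix strings "_product" and "_customer" are arbitrary choices for this example:

```python
import pandas as pd

df_products = pd.DataFrame(
    {"product": ["computer", "smartphone", "headphones"],
     "age": [5, 6, 7],
     "bought_by": ["bob", "alex", "bob"]},
    index=["A", "B", "C"],
)
df_customers = pd.DataFrame({"age": [10, 20, 30]}, index=["alex", "bob", "cathy"])

# Both overlapping "age" columns are renamed; non-overlapping columns are untouched.
result = df_products.join(
    df_customers, on="bought_by", lsuffix="_product", rsuffix="_customer"
)
print(result.columns.tolist())
print(result)
```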

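Specifying sort

Consider the same DataFrames as in the basic usage example:

        [df_products]         |   [df_customers]
   product      bought_by     |            age
A  computer     bob           |     alex   10
B  smartphone   alex          |     bob    20
C  headphones   bob           |     cathy  30

By default, sort=False, which means that the rows of the resulting DataFrame are not sorted by the join key (bought_by):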

df_products.join(df_customers, on="bought_by")   # sort=False



   product     bought_by  age
A  computer    bob        20
B  smartphone  alex       10
C  headphones  bob        20

Setting sort=True sorts the rows by the join key:

df_products.join(df_customers, on="bought_by", sort=True)



   product     bought_by  age
B  smartphone  alex       10
A  computer    bob        20
C  headphones  bob        20


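Note: this article was curated by 纯净天空 from Isshin Inada's original English work, Pandas DataFrame | join method. Unless otherwise stated, the copyright of the original code belongs to the original author. Please do not reproduce or copy this translation without permission or authorization.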