Python pyspark DataFrame.merge: usage and code examples

This article briefly introduces the usage of pyspark.pandas.DataFrame.merge.

Usage: DataFrame.merge(right: pyspark.pandas.frame.DataFrame, how: str = 'inner', on: Union[Any, Tuple[Any, ...], List[Union[Any, Tuple[Any, ...]]], None] = None, left_on: Union[Any, Tuple[Any, ...], List[Union[Any, Tuple[Any, ...]]], None] = None, right_on: Union[Any, Tuple[Any, ...], List[Union[Any, Tuple[Any, ...]]], None] = None, left_index: bool = False, right_index: bool = False, suffixes: Tuple[str, str] = ('_x', '_y')) → pyspark.pandas.frame.DataFrame

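Merge DataFrame objects with a database-style join.

The index of the resulting DataFrame will be one of the following:

0…n if no index is used for merging
Index of the left DataFrame if merged only on the index of the right DataFrame
Index of the right DataFrame if merged only on the index of the left DataFrame
All involved indices if merged using the indices of both DataFrames
    e.g. if left has indices (a, x) and right has indices (b, x), the result will be an index (x, a, b)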
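
The index rules above can be restated as a small decision function. This is only a sketch of the documented behavior, written in plain Python; `result_index_kind` is a hypothetical helper and is not part of pyspark.

```python
def result_index_kind(left_index: bool, right_index: bool) -> str:
    """Describe which index the merged frame carries, per the rules above.

    Hypothetical helper for illustration; not a pyspark function.
    """
    if left_index and right_index:
        # Both frames are merged on their indices.
        return "all involved indices"
    if right_index:
        # Only the right frame's index is a join key: the left index is kept.
        return "index of the left DataFrame"
    if left_index:
        # Only the left frame's index is a join key: the right index is kept.
        return "index of the right DataFrame"
    # No index takes part in the merge: a fresh 0...n index is produced.
    return "0...n"


print(result_index_kind(False, False))
print(result_index_kind(True, True))
```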

Parameters:

right: Object to merge with.

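how: Type of merge to be performed.

{'left', 'right', 'outer', 'inner'}, default 'inner'

left: use only keys from the left frame, similar to a SQL left outer join; unlike pandas, key order is not preserved.
right: use only keys from the right frame, similar to a SQL right outer join; unlike pandas, key order is not preserved.
outer: use the union of keys from both frames, similar to a SQL full outer join; keys are sorted lexicographically.
inner: use the intersection of keys from both frames, similar to a SQL inner join; unlike pandas, the order of the left keys is not preserved.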
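
One way to read the four `how` values is in terms of which join keys survive. The following is a minimal pure-Python sketch of that idea, not how pyspark implements the join; `joined_keys` is a hypothetical helper, and the result is sorted only to make the printout deterministic.

```python
def joined_keys(left_keys, right_keys, how="inner"):
    """Return the distinct keys present in the result for a given `how`.

    Hypothetical helper for illustration; not a pyspark function.
    """
    left, right = set(left_keys), set(right_keys)
    if how == "left":
        keys = left            # only keys from the left frame
    elif how == "right":
        keys = right           # only keys from the right frame
    elif how == "outer":
        keys = left | right    # union of keys from both frames
    elif how == "inner":
        keys = left & right    # intersection of keys from both frames
    else:
        raise ValueError(f"unsupported how: {how!r}")
    # Sorted for a stable printout; a real merge does not guarantee key order.
    return sorted(keys)


left_index = [0, 1]
right_index = [1, 2]
for how in ("inner", "left", "right", "outer"):
    print(how, joined_keys(left_index, right_index, how))
```

With index values [0, 1] on the left and [1, 2] on the right, the surviving keys are [1], [0, 1], [1, 2] and [0, 1, 2] respectively.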

on: Column or index level names to join on. These must be found in both DataFrames. If on is None and the merge is not on indexes, this defaults to the intersection of the columns in both DataFrames.
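
The default for `on` described above can be sketched as a column-name intersection. `default_on` is a hypothetical helper, and returning the shared names in left-frame order is an assumption made for this illustration only.

```python
def default_on(left_columns, right_columns):
    """Column names found in both frames.

    Hypothetical helper; the left-frame ordering is an assumption.
    """
    right = set(right_columns)
    return [name for name in left_columns if name in right]


# Column names of two illustrative frames that share only 'value'.
print(default_on(["lkey", "value"], ["rkey", "value"]))
```

For frames with columns ['lkey', 'value'] and ['rkey', 'value'], the only shared column is 'value', so a merge without `on`, `left_on`, `right_on` or the index flags would join on it.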

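left_on: Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_on: Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.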

left_index: Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.

right_index: Use the index from the right DataFrame as the join key. Same caveats as left_index.

suffixes: Suffix to apply to overlapping column names in the left and right side, respectively.
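
To make the effect of `suffixes` concrete, here is a rough pure-Python sketch of how overlapping, non-key column names could be renamed. `suffixed_columns` and its `keys` argument are hypothetical and only model the naming, not the merge itself.

```python
def suffixed_columns(left_columns, right_columns, keys=(), suffixes=("_x", "_y")):
    """Column names of a merged frame after applying suffixes.

    Hypothetical helper: `keys` lists join columns shared by both frames
    (as with `on=`), which appear once and are never suffixed.
    """
    overlap = (set(left_columns) & set(right_columns)) - set(keys)
    left_out = [c + suffixes[0] if c in overlap else c for c in left_columns]
    right_out = [
        c + suffixes[1] if c in overlap else c
        for c in right_columns
        if c not in keys  # a shared key column is not repeated
    ]
    return left_out + right_out


# Joining on differently named key columns: only 'value' overlaps.
print(suffixed_columns(["lkey", "value"], ["rkey", "value"]))
```

For columns ['lkey', 'value'] and ['rkey', 'value'] with the default suffixes, this yields ['lkey', 'value_x', 'rkey', 'value_y'].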

Returns:

DataFrame: A DataFrame of the two merged objects.

Notes:

As described in #263, joining string columns currently returns None for missing values instead of NaN.
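
Because a merged result may therefore hold None in string columns and NaN in numeric ones, code that inspects raw values has to treat both as missing. The sketch below uses only the standard library; `is_missing` is a hypothetical helper, and the sample values are illustrative.

```python
import math


def is_missing(value):
    """True for both None and float NaN (hypothetical helper)."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


# Illustrative values: a string column with None, a numeric column with NaN.
column_b = [None, "x", "y"]
column_a = [1.0, 2.0, float("nan")]

print([is_missing(v) for v in column_b])
print([is_missing(v) for v in column_a])
```

An equality test is not enough here: NaN never compares equal to itself, while None has to be checked with `is None`.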

Examples:

>>> import pyspark.pandas as ps
>>> df1 = ps.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]},
...                    columns=['lkey', 'value'])
>>> df2 = ps.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]},
...                    columns=['rkey', 'value'])
>>> df1
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
>>> df2
  rkey  value
0  foo      5
1  bar      6
2  baz      7
3  foo      8

Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.

>>> merged = df1.merge(df2, left_on='lkey', right_on='rkey')
>>> merged.sort_values(by=['lkey', 'value_x', 'rkey', 'value_y'])  
  lkey  value_x rkey  value_y
...bar        2  bar        6
...baz        3  baz        7
...foo        1  foo        5
...foo        1  foo        8
...foo        5  foo        5
...foo        5  foo        8
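
The six rows above come from pairing every left row with every right row that shares its key, so the two 'foo' rows on each side produce four combinations. A minimal pure-Python sketch of that inner join follows; `inner_join` is a hypothetical helper and does not reflect how Spark executes the merge.

```python
df1_rows = [("foo", 1), ("bar", 2), ("baz", 3), ("foo", 5)]  # (lkey, value)
df2_rows = [("foo", 5), ("bar", 6), ("baz", 7), ("foo", 8)]  # (rkey, value)


def inner_join(left_rows, right_rows):
    """Pair each left row with every right row that has the same key."""
    # Index the right rows by key so each left row finds its matches directly.
    by_key = {}
    for rkey, value_y in right_rows:
        by_key.setdefault(rkey, []).append(value_y)

    merged = []
    for lkey, value_x in left_rows:
        for value_y in by_key.get(lkey, []):
            # Columns: lkey, value_x, rkey, value_y (rkey equals lkey here).
            merged.append((lkey, value_x, lkey, value_y))
    # Sorted to mirror the sort_values call in the example above.
    return sorted(merged)


for row in inner_join(df1_rows, df2_rows):
    print(row)
```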

>>> left_psdf = ps.DataFrame({'A': [1, 2]})
>>> right_psdf = ps.DataFrame({'B': ['x', 'y']}, index=[1, 2])

>>> left_psdf.merge(right_psdf, left_index=True, right_index=True).sort_index()
   A  B
1  2  x

>>> left_psdf.merge(right_psdf, left_index=True, right_index=True, how='left').sort_index()
   A     B
0  1  None
1  2     x

>>> left_psdf.merge(right_psdf, left_index=True, right_index=True, how='right').sort_index()
     A  B
1  2.0  x
2  NaN  y

>>> left_psdf.merge(right_psdf, left_index=True, right_index=True, how='outer').sort_index()
     A     B
0  1.0  None
1  2.0     x
2  NaN     y

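Note: This article was selected and compiled by 纯净天空 from the original English documentation for pyspark.pandas.DataFrame.merge on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.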
