Python pyspark merge usage and code examples

This article briefly introduces the usage of pyspark.pandas.merge.

Usage: pyspark.pandas.merge(obj: pyspark.pandas.frame.DataFrame, right: pyspark.pandas.frame.DataFrame, how: str = 'inner', on: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]], None] = None, left_on: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]], None] = None, right_on: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]], None] = None, left_index: bool = False, right_index: bool = False, suffixes: Tuple[str, str] = ('_x', '_y')) → pyspark.pandas.frame.DataFrame

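Merge DataFrame objects with a database-style join.

The index of the resulting DataFrame will be one of the following:

0…n if no indexes are used for merging
The index of the left DataFrame, if merged only on the index of the right DataFrame
The index of the right DataFrame, if merged only on the index of the left DataFrame
All involved indexes, if merged using the indexes of both DataFrames
For example, if left has indexes (a, x) and right has indexes (b, x), the result will have the index (x, a, b)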
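
The first two index rules can be sketched as follows. This is a minimal illustration that uses plain pandas as a stand-in, on the assumption that pyspark.pandas.merge follows the same rules for these cases; the frames and column names (key, lval, rval) are made up for the example.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

# Hypothetical frames; the left frame has a non-default index.
left = pd.DataFrame({"key": ["a", "b"], "lval": [1, 2]}, index=[10, 20])
right_cols = pd.DataFrame({"key": ["a", "b"], "rval": [3, 4]})
right_indexed = right_cols.set_index("key")

# No index is used for merging: the result gets a fresh 0...n index.
on_columns = pd.merge(left, right_cols, on="key")

# Only the right index is used: the result keeps the left frame's index labels.
on_right_index = pd.merge(left, right_indexed, left_on="key", right_index=True)

print(list(on_columns.index))
print(sorted(on_right_index.index))
```

To try the same thing on Spark, the import would be swapped for `import pyspark.pandas as ps`; row order is then not guaranteed, so sort before comparing.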

Parameters:

right: Object to merge with.

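how: Type of merge to be performed. One of {'left', 'right', 'outer', 'inner'}, default 'inner'.

left: use only keys from the left frame, similar to a SQL left outer join; preserve key order.
right: use only keys from the right frame, similar to a SQL right outer join; preserve key order.
outer: use the union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
inner: use the intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.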
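
The four join types can be compared by looking at which keys survive each one. The sketch below uses plain pandas as a stand-in, on the assumption that pyspark.pandas.merge selects the same keys; the frames are hypothetical, and the keys are sorted before printing so that the comparison does not depend on row order.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

# Hypothetical frames that share only the key "b".
left = pd.DataFrame({"key": ["a", "b"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "rval": [3, 4]})

# Collect the surviving keys for each join type.
keys_by_how = {
    how: sorted(pd.merge(left, right, on="key", how=how)["key"])
    for how in ("left", "right", "outer", "inner")
}

for how, keys in keys_by_how.items():
    print(how, keys)
```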

on: Column or index level names to join on. These must be found in both DataFrames. If on is None and the merge is not on indexes, it defaults to the intersection of the columns in both DataFrames.
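
A small sketch of that default, written with plain pandas as a stand-in (an assumption; the frames and column names are invented for illustration). Leaving out on should give the same result as passing the shared column names explicitly.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

# Hypothetical frames that share the columns "customer" and "region".
orders = pd.DataFrame(
    {"customer": ["ann", "bob"], "region": ["eu", "us"], "amount": [10, 20]}
)
contacts = pd.DataFrame(
    {"customer": ["ann", "bob"], "region": ["eu", "eu"], "email": ["a@x", "b@x"]}
)

# Columns present in both frames, in the left frame's order.
shared = [col for col in orders.columns if col in contacts.columns]

implicit = pd.merge(orders, contacts)             # on=None: join on shared columns
explicit = pd.merge(orders, contacts, on=shared)  # same join, spelled out

print(shared)
print(implicit)
```

Because both shared columns take part in the join, only the rows that agree on customer and region are kept. Passing on explicitly is often the safer choice when frames may share columns by accident.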

left_on: Column or index level names to join on in the left DataFrame. Can also be an array or a list of arrays of the length of the left DataFrame. These arrays are treated as if they were columns.

right_on: Column or index level names to join on in the right DataFrame. Can also be an array or a list of arrays of the length of the right DataFrame. These arrays are treated as if they were columns.

left_index: Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.

right_index: Use the index from the right DataFrame as the join key. Same caveats as left_index.

suffixes: Suffixes to apply to overlapping column names in the left and right side, respectively.
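
As an illustration of suffixes, the sketch below merges two hypothetical frames that both have a value column, first with the defaults and then with custom suffixes. It uses plain pandas as a stand-in, assuming pyspark.pandas.merge names the columns the same way.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

# Both frames have a "value" column, so the merged names would collide.
df1 = pd.DataFrame({"lkey": ["foo", "bar"], "value": [1, 2]})
df2 = pd.DataFrame({"rkey": ["foo", "bar"], "value": [5, 6]})

# Default suffixes are ('_x', '_y').
default = pd.merge(df1, df2, left_on="lkey", right_on="rkey")

# Custom suffixes make the origin of each column easier to read.
custom = pd.merge(
    df1, df2, left_on="lkey", right_on="rkey", suffixes=("_left", "_right")
)

print(list(default.columns))
print(list(custom.columns))
```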

Returns:

DataFrame: A DataFrame of the two merged objects.

Notes:

As described in #263, joining string columns currently returns None for missing values instead of NaN.

Examples:

>>> import pyspark.pandas as ps
>>> df1 = ps.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]},
...                    columns=['lkey', 'value'])
>>> df2 = ps.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]},
...                    columns=['rkey', 'value'])
>>> df1
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
>>> df2
  rkey  value
0  foo      5
1  bar      6
2  baz      7
3  foo      8

Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.

>>> merged = ps.merge(df1, df2, left_on='lkey', right_on='rkey')
>>> merged.sort_values(by=['lkey', 'value_x', 'rkey', 'value_y'])  
  lkey  value_x rkey  value_y
...bar        2  bar        6
...baz        3  baz        7
...foo        1  foo        5
...foo        1  foo        8
...foo        5  foo        5
...foo        5  foo        8

>>> left_psdf = ps.DataFrame({'A': [1, 2]})
>>> right_psdf = ps.DataFrame({'B': ['x', 'y']}, index=[1, 2])

>>> ps.merge(left_psdf, right_psdf, left_index=True, right_index=True).sort_index()
   A  B
1  2  x

>>> ps.merge(left_psdf, right_psdf, left_index=True, right_index=True, how='left').sort_index()
   A     B
0  1  None
1  2     x

>>> ps.merge(left_psdf, right_psdf, left_index=True, right_index=True, how='right').sort_index()
     A  B
1  2.0  x
2  NaN  y

>>> ps.merge(left_psdf, right_psdf, left_index=True, right_index=True, how='outer').sort_index()
     A     B
0  1.0  None
1  2.0     x
2  NaN     y


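Note: This article was selected and compiled by 純淨天空 from the original English documentation for pyspark.pandas.merge at spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original author. Please do not reprint or copy this translation without permission or authorization.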
