Python pyspark DataFrame.apply usage and code examples

This article briefly introduces the usage of pyspark.pandas.DataFrame.apply.

Usage:
DataFrame.apply(func: Callable, axis: Union[int, str] = 0, args: Sequence[Any] = (), **kwds: Any) → Union[Series, DataFrame, Index]

Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the DataFrame's columns (axis=1).
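The following is a plain-Python analogy of which slice `func` receives for each axis. It does not use pandas-on-Spark; the helper `apply_sketch` and the dict-based table are assumptions made purely for illustration.

```python
# Plain-Python analogy: a table stored as {column name: [values]} plus row labels.
columns = {"A": [4, 4, 4], "B": [9, 9, 9]}
index = [0, 1, 2]


def apply_sketch(func, axis=0):
    """Hypothetical helper mimicking which slices func receives."""
    if axis in (0, "index"):
        # One call per column; each slice is keyed by the row labels.
        return {
            name: func(dict(zip(index, values)))
            for name, values in columns.items()
        }
    if axis in (1, "columns"):
        # One call per row; each slice is keyed by the column names.
        return {
            label: func({name: columns[name][pos] for name in columns})
            for pos, label in enumerate(index)
        }
    raise ValueError(f"unsupported axis: {axis!r}")


# Ask each slice for its keys to see how it is indexed.
keys_per_column = apply_sketch(lambda s: list(s), axis=0)
keys_per_row = apply_sketch(lambda s: list(s), axis=1)
```

With axis=0 every slice is keyed by the row labels 0, 1, 2; with axis=1 every slice is keyed by the column names A and B.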

See also Transform and apply a function.

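Note

When axis is 0 or 'index', func cannot access the whole input series. pandas-on-Spark internally splits the input series into multiple batches and calls func once per batch, so func runs several times. Operations such as global aggregations are therefore impossible. See the example below.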

>>> # This case does not return the length of whole series but of the batch internally
... # used.
... def length(s) -> int:
...     return len(s)
...
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'A': range(1000)})
>>> df.apply(length, axis=0)  
0     83
1     83
2     83
...
10    83
11    83
dtype: int32
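The effect can be reproduced in plain Python. This is a minimal simulation, not the pandas-on-Spark implementation: the helper `apply_in_batches` is hypothetical, and the batch size of 250 is an arbitrary assumption (the real batch size is chosen internally).

```python
def length(s) -> int:
    return len(s)


def apply_in_batches(func, values, batch_size):
    """Hypothetical helper: call func once per fixed-size batch of values."""
    return [
        func(values[start:start + batch_size])
        for start in range(0, len(values), batch_size)
    ]


values = list(range(1000))

# func only ever sees one batch, so it reports batch lengths, not 1000.
batch_lengths = apply_in_batches(length, values, batch_size=250)

# The global length is recoverable only by combining the per-batch results.
total_length = sum(batch_lengths)
```

Each call returns the length of its own batch, which is why a function that needs the whole series cannot be expressed through func when axis is 0.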

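Note

This API executes the function once to infer the return type, which is potentially expensive, for instance when the dataset is created after aggregations or sorting.

To avoid this, specify the return type in func as a Series or a scalar value, for instance as below: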

>>> import numpy as np
>>> def square(s) -> ps.Series[np.int32]:
...     return s ** 2

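In this case, pandas-on-Spark uses the return type hint and does not try to infer the type.

When axis is 1, the type hint must specify a DataFrame or a scalar value, as below: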

>>> def plus_one(x) -> ps.DataFrame[float, float]:
...     return x + 1

If the return type is specified as DataFrame, the output column names become c0, c1, c2 … cn. These names are mapped by position to the DataFrame returned by func.
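As a rough plain-Python sketch of that positional mapping (the helper `default_column_names` and the dict standing in for the returned DataFrame are assumptions for illustration, not library code):

```python
def default_column_names(n):
    """Hypothetical helper: generate the names c0, c1, ... for n columns."""
    return [f"c{i}" for i in range(n)]


# Stand-in for what func returned: original column name -> values.
returned = {"A": [5, 5, 5], "B": [10, 10, 10]}

# Names are assigned by position, so the original names A and B are dropped.
renamed = dict(zip(default_column_names(len(returned)), returned.values()))
```

The first returned column becomes c0 and the second becomes c1, regardless of what they were called inside func.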

To specify the column names, you can assign them in a pandas-friendly style as below:

>>> def plus_one(x) -> ps.DataFrame["a": float, "b": float]:
...     return x + 1

>>> import pandas as pd
>>> pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})
>>> def plus_one(x) -> ps.DataFrame[zip(pdf.dtypes, pdf.columns)]:
...     return x + 1

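However, this approach switches the index type of the output to the default index type, because the type hint cannot express the index type at this moment. As a workaround, use reset_index() to keep the index.

When the given function has its return type annotated, the original index of the DataFrame is lost and a default index is attached to the result. Please be careful when configuring the default index. See also Default Index Type.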

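Parameters:

func: function

Function to apply to each column or row.

axis: {0 or 'index', 1 or 'columns'}, default 0

Axis along which the function is applied:

0 or 'index': apply the function to each column.
1 or 'columns': apply the function to each row.

args: tuple

Positional arguments to pass to func in addition to the array/series.

**kwds:

Additional keyword arguments to pass as keyword arguments to func.

Returns:

Series or DataFrame: Result of applying func along the given axis of the DataFrame.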

Examples:

>>> df = ps.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9

Using a numpy universal function (in this case the same as np.sqrt(df)):

>>> def sqrt(x) -> ps.Series[float]:
...     return np.sqrt(x)
...
>>> df.apply(sqrt, axis=0)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

You can omit the type hint and let pandas-on-Spark infer its type.

>>> df.apply(np.sqrt, axis=0)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

When axis is 1 or 'columns', the function is applied to each row.

>>> def summation(x) -> np.int64:
...     return np.sum(x)
...
>>> df.apply(summation, axis=1)
0    13
1    13
2    13
dtype: int64

Likewise, you can omit the type hint and let pandas-on-Spark infer its type.

>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64

>>> df.apply(max, axis=1)
0    9
1    9
2    9
dtype: int64

Returning a list-like object results in a Series:

>>> df.apply(lambda x: [1, 2], axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

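To specify the types when axis is 1, use a DataFrame[…] annotation. If the annotation gives no column names, they are generated automatically; the example below names the columns explicitly.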

>>> def identify(x) -> ps.DataFrame['A': np.int64, 'B': np.int64]:
...     return x
...
>>> df.apply(identify, axis=1)
   A  B
0  4  9
1  4  9
2  4  9

You can also specify extra arguments.

>>> def plus_two(a, b, c) -> ps.DataFrame[np.int64, np.int64]:
...     return a + b + c
...
>>> df.apply(plus_two, axis=1, args=(1,), c=3)
   c0  c1
0   8  13
1   8  13
2   8  13
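How the extra arguments reach func can be sketched in plain Python. The helper `apply_rows` is hypothetical and only illustrates the forwarding of args and **kwds; it is not how pandas-on-Spark is implemented.

```python
def apply_rows(func, rows, args=(), **kwds):
    """Hypothetical helper: forward args and kwds to func for every row."""
    return [func(row, *args, **kwds) for row in rows]


def plus_two(a, b, c):
    # a is the row itself; b arrives via args, c via the keyword arguments.
    return [value + b + c for value in a]


rows = [[4, 9]] * 3

# Equivalent in spirit to df.apply(plus_two, axis=1, args=(1,), c=3).
result = apply_rows(plus_two, rows, args=(1,), c=3)
```

Every row [4, 9] becomes [8, 13], which matches the c0 and c1 columns in the output above.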

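Note: This article was selected and compiled by 純淨天空 from the original English work pyspark.pandas.DataFrame.apply on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original author. Please do not reproduce or copy this translation without permission or authorization.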
