Python pyspark DataFrame.where: usage and code examples

This article briefly describes how to use pyspark.pandas.DataFrame.where.

Usage: DataFrame.where(cond: Union[DataFrame, Series], other: Union[DataFrame, Series, Any] = nan, axis: Union[int, str] = None) → DataFrame

Replace values where the condition is False.

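Parameters:

cond: boolean DataFrame. Where cond is True, the original value is kept; where it is False, the value is replaced with the corresponding value from other.
other: scalar or DataFrame (default NaN). Entries where cond is False are replaced with the corresponding value from other.
axis: int, default None. Accepted for compatibility with pandas; currently it can only be set to 0.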

Returns:

DataFrame
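
The parameter descriptions above boil down to an element-wise rule: keep the value where cond is True, otherwise take the value from other. The following is a minimal sketch of that rule in plain Python, not the pyspark implementation; the helper name `where_sketch` and the variable names are hypothetical and exist only for illustration.

```python
import math


def where_sketch(values, cond, other=math.nan):
    """Keep values[i] where cond[i] is True, otherwise use `other`.

    `other` may be a scalar or a list aligned with `values`
    (hypothetical helper, for illustration only).
    """
    result = []
    for i, (value, keep) in enumerate(zip(values, cond)):
        if keep:
            result.append(value)
        elif isinstance(other, list):
            result.append(other[i])
        else:
            result.append(other)
    return result


column_a = [0, 1, 2, 3, 4]
mask = [v > 1 for v in column_a]

replaced_scalar = where_sketch(column_a, mask, 10)
replaced_aligned = where_sketch(column_a, mask, [v + 100 for v in column_a])
replaced_default = where_sketch(column_a, mask)  # falls back to NaN

print(replaced_scalar)   # [10, 10, 2, 3, 4]
print(replaced_aligned)  # [100, 101, 2, 3, 4]
```

The two printed lists correspond to column A in the `df1.where(df1 > 1, 10)` and `df1.where(df1 > 1, df1 + 100)` examples on this page.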

Examples:

>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> df1 = ps.DataFrame({'A': [0, 1, 2, 3, 4], 'B':[100, 200, 300, 400, 500]})
>>> df2 = ps.DataFrame({'A': [0, -1, -2, -3, -4], 'B':[-100, -200, -300, -400, -500]})
>>> df1
   A    B
0  0  100
1  1  200
2  2  300
3  3  400
4  4  500
>>> df2
   A    B
0  0 -100
1 -1 -200
2 -2 -300
3 -3 -400
4 -4 -500

>>> df1.where(df1 > 0).sort_index()
     A      B
0  NaN  100.0
1  1.0  200.0
2  2.0  300.0
3  3.0  400.0
4  4.0  500.0

>>> df1.where(df1 > 1, 10).sort_index()
    A    B
0  10  100
1  10  200
2   2  300
3   3  400
4   4  500

>>> df1.where(df1 > 1, df1 + 100).sort_index()
     A    B
0  100  100
1  101  200
2    2  300
3    3  400
4    4  500

>>> df1.where(df1 > 1, df2).sort_index()
   A    B
0  0  100
1 -1  200
2  2  300
3  3  400
4  4  500
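
In the examples above, omitting other produces NaN and float output, while supplying other keeps whole numbers. The sketch below reproduces this for a single column in plain pandas rather than pyspark.pandas, on the assumption that the two behave alike for this case; it runs without a Spark session.

```python
import pandas as pd

# Single-column frame mirroring column A of df1 (plain pandas, not pyspark.pandas).
df = pd.DataFrame({"A": [0, 1, 2, 3, 4]})

# No `other`: the replaced cell becomes NaN, which forces a float dtype.
without_other = df.where(df > 0)

# Scalar `other`: every replaced cell receives the integer 10 instead of NaN.
with_other = df.where(df > 1, 10)

print(without_other["A"].dtype)
print(with_other["A"].tolist())
```

Passing an explicit other is therefore one way to avoid introducing missing values into the result.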

When the column names of cond differ from those of self, all values are treated as False.

>>> cond = ps.DataFrame({'C': [0, -1, -2, -3, -4], 'D':[4, 3, 2, 1, 0]}) % 3 == 0
>>> cond
       C      D
0   True  False
1  False   True
2  False  False
3   True  False
4  False   True

>>> df1.where(cond).sort_index()
    A   B
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
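
One way to picture the all-NaN result above is to align cond to the columns of the frame explicitly and fill the missing columns with False. The following sketch does this in plain pandas; it is an emulation under the assumption that pandas and pyspark.pandas agree here, not a description of pyspark internals.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3, 4], "B": [100, 200, 300, 400, 500]})
cond = pd.DataFrame({"C": [0, -1, -2, -3, -4], "D": [4, 3, 2, 1, 0]}) % 3 == 0

# Align cond to df's columns. Columns A and B do not exist in cond,
# so they are created and filled with False; C and D are dropped.
aligned = cond.reindex(columns=df.columns, fill_value=False).astype(bool)

# Every cell of the mask is False, so every value is replaced by NaN.
result = df.where(aligned)
print(result)
```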

When cond is a Series, only its boolean values are checked; column names are not taken into account.

>>> cond = ps.Series([1, 2]) > 1
>>> cond
0    False
1     True
dtype: bool

>>> df1.where(cond).sort_index()
     A      B
0  NaN    NaN
1  1.0  200.0
2  NaN    NaN
3  NaN    NaN
4  NaN    NaN
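
The Series case above can be read as a per-row mask: each row is kept or replaced as a whole, and rows with no entry in cond count as False. Below is a minimal sketch of that reading in plain pandas, building the mask by hand; the names `row_mask` and `mask` are illustrative, and the equivalence with pyspark.pandas is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3, 4], "B": [100, 200, 300, 400, 500]})
cond = pd.Series([1, 2]) > 1  # index 0 -> False, index 1 -> True

# Rows 2 to 4 have no entry in cond, so they are filled with False.
row_mask = cond.reindex(df.index, fill_value=False).astype(bool)

# Repeat the per-row mask for every column of df, whatever its name.
mask = pd.DataFrame({col: row_mask for col in df.columns})

# Only row 1 is kept; all other rows become NaN.
result = df.where(mask)
print(result)
```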

>>> reset_option("compute.ops_on_diff_frames")

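Note: This article was selected and compiled by 纯净天空 from the original English documentation for pyspark.pandas.DataFrame.where on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original authors; please do not reproduce or copy this translation without permission or authorization.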