Python pyspark Series.value_counts用法及代碼示例

本文簡要介紹 pyspark.pandas.Series.value_counts 的用法。

用法: Series.value_counts(normalize: bool = False, sort: bool = True, ascending: bool = False, bins: None = None, dropna: bool = True) → Series

返回一個包含唯一值計數的係列。結果對象將按降序排列，因此第一個元素是最多的 frequently-occurring 元素。默認情況下排除 NA 值。

參數：

normalize：布爾值，默認為 False: 如果為 True，則返回的對象將包含唯一值的相對頻率。
sort：布爾值，默認 True: 按值排序。
ascending：布爾值，默認為 False: 按升序排列。
bins：尚不支持
dropna：布爾值，默認 True: 不包括 NaN 的計數。

counts：Series

例子：

對於係列

>>> df = ps.DataFrame({'x':[0, 0, 1, 1, 1, np.nan]})
>>> df.x.value_counts()  
1.0    3
0.0    2
Name: x, dtype: int64

將 normalize 設置為 True 時，通過將所有值除以值的總和來返回相對頻率。

>>> df.x.value_counts(normalize=True)  
1.0    0.6
0.0    0.4
Name: x, dtype: float64

dropna和dropna調成False我們還可以看到NaN索引值。

>>> df.x.value_counts(dropna=False)  
1.0    3
0.0    2
NaN    1
Name: x, dtype: int64

對於索引

>>> idx = ps.Index([3, 1, 2, 3, 4, np.nan])
>>> idx
Float64Index([3.0, 1.0, 2.0, 3.0, 4.0, nan], dtype='float64')

>>> idx.value_counts().sort_index()
1.0    1
2.0    1
3.0    2
4.0    1
dtype: int64

種類

將 sort 設置為 False 時，結果不會按計數排序。

>>> idx.value_counts(sort=True).sort_index()
1.0    1
2.0    1
3.0    2
4.0    1
dtype: int64

標準化

將 normalize 設置為 True 時，通過將所有值除以值的總和來返回相對頻率。

>>> idx.value_counts(normalize=True).sort_index()
1.0    0.2
2.0    0.2
3.0    0.4
4.0    0.2
dtype: float64

dropna

將 dropna 設置為 False 後，我們還可以看到 NaN 索引值。

>>> idx.value_counts(dropna=False).sort_index()  
1.0    1
2.0    1
3.0    2
4.0    1
NaN    1
dtype: int64

對於多索引。

>>> midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
...                       ['speed', 'weight', 'length']],
...                      [[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                       [1, 1, 1, 1, 1, 2, 1, 2, 2]])
>>> s = ps.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3], index=midx)
>>> s.index  
MultiIndex([(  'lama', 'weight'),
            (  'lama', 'weight'),
            (  'lama', 'weight'),
            (   'cow', 'weight'),
            (   'cow', 'weight'),
            (   'cow', 'length'),
            ('falcon', 'weight'),
            ('falcon', 'length'),
            ('falcon', 'length')],
           )

>>> s.index.value_counts().sort_index()
(cow, length)       1
(cow, weight)       2
(falcon, length)    2
(falcon, weight)    1
(lama, weight)      3
dtype: int64

>>> s.index.value_counts(normalize=True).sort_index()
(cow, length)       0.111111
(cow, weight)       0.222222
(falcon, length)    0.222222
(falcon, weight)    0.111111
(lama, weight)      0.333333
dtype: float64

如果索引有名稱，請保持名稱。

>>> idx = ps.Index([0, 0, 0, 1, 1, 2, 3], name='pandas-on-Spark')
>>> idx.value_counts().sort_index()
0    3
1    2
2    1
3    1
Name: pandas-on-Spark, dtype: int64

相關用法

注：本文由純淨天空篩選整理自spark.apache.org大神的英文原創作品 pyspark.pandas.Series.value_counts。非經特殊聲明，原始代碼版權歸原作者所有，本譯文未經允許或授權，請勿轉載或複製。

用法:

參數：

返回：

例子：