Python pyspark Accumulator: Usage and Code Examples

This article briefly introduces the usage of pyspark.Accumulator.

Usage: class pyspark.Accumulator(aid, value, accum_param)

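A shared variable that can be accumulated, i.e., one that has a commutative and associative "add" operation. Worker tasks on a Spark cluster can add values to an Accumulator with the += operator, but only the driver program is allowed to read its value, through the value attribute. Updates from the workers are propagated to the driver program automatically.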
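To see why the "add" operation has to be commutative and associative, the following toy model may help. It is a plain-Python sketch, not Spark's implementation: simulate_accumulator is a hypothetical helper in which each task folds its own partition into a local total and the driver then merges those totals in an arbitrary order.

```python
import random
from functools import reduce


def simulate_accumulator(initial, partitions, add, zero):
    """Toy model of accumulator merging (an illustration, not Spark's code).

    Each task folds its partition into a local total starting from `zero`;
    the driver then merges the local totals in whatever order they arrive.
    """
    local_totals = [reduce(add, part, zero) for part in partitions]
    random.shuffle(local_totals)  # task completion order is not deterministic
    return reduce(add, local_totals, initial)


# Driver-side value 7, with the elements 1, 2, 3 spread over three tasks.
partitions = [[1], [2], [3]]
total = simulate_accumulator(7, partitions, lambda x, y: x + y, 0)
print(total)  # 13
```

Because integer addition is commutative and associative, the shuffled merge order does not change the result. An operation such as subtraction would not give a stable answer under this model.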

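While SparkContext supports accumulators for primitive data types such as int and float, users can also define accumulators for custom types by providing a custom AccumulatorParam object. Refer to the doctest of AccumulatorParam for an example.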
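A minimal sketch of such a custom AccumulatorParam, assuming element-wise addition of fixed-length float lists is the desired behavior. VectorAccumulatorParam and sum_vectors are illustrative names, not part of PySpark, and sum_vectors assumes sc is an active SparkContext. The import fallback exists only so that the class can be tried without Spark installed.

```python
try:
    from pyspark.accumulators import AccumulatorParam
except ImportError:
    # Fallback so the class below can be exercised without Spark installed.
    AccumulatorParam = object


class VectorAccumulatorParam(AccumulatorParam):
    """Element-wise addition for fixed-length lists of floats."""

    def zero(self, value):
        # Identity element with the same length as the initial value.
        return [0.0] * len(value)

    def addInPlace(self, value1, value2):
        # Merge value2 into value1 element by element and return the result.
        for i in range(len(value1)):
            value1[i] += value2[i]
        return value1


def sum_vectors(sc, vectors):
    """Hypothetical helper: sum equal-length vectors with an accumulator.

    Assumes `sc` is an active SparkContext.
    """
    acc = sc.accumulator([0.0] * len(vectors[0]), VectorAccumulatorParam())
    sc.parallelize(vectors).foreach(lambda v: acc.add(v))
    return acc.value
```

Passing the parameter object as the second argument of sc.accumulator is what allows a list as the initial value. Without it, a list initial value is rejected with a TypeError.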

Examples:

>>> a = sc.accumulator(1)
>>> a.value
1
>>> a.value = 2
>>> a.value
2
>>> a += 5
>>> a.value
7
>>> sc.accumulator(1.0).value
1.0
>>> sc.accumulator(1j).value
1j
>>> rdd = sc.parallelize([1,2,3])
>>> def f(x):
...     global a
...     a += x
>>> rdd.foreach(f)
>>> a.value
13
>>> b = sc.accumulator(0)
>>> def g(x):
...     b.add(x)
>>> rdd.foreach(g)
>>> b.value
6

>>> rdd.map(lambda x: a.value).collect() 
Traceback (most recent call last):
    ...
Py4JJavaError: ...

>>> def h(x):
...     global a
...     a.value = 7
>>> rdd.foreach(h) 
Traceback (most recent call last):
    ...
Py4JJavaError: ...

>>> sc.accumulator([1.0, 2.0, 3.0]) 
Traceback (most recent call last):
    ...
TypeError: ...

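Note: This article was selected and compiled by 純淨天空 from the original English documentation for pyspark.Accumulator on spark.apache.org. Unless otherwise stated, copyright in the original code belongs to the original authors. Please do not reproduce or copy this translation without permission.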