Python pyspark DataFrame.spark.cache usage and code examples

This article briefly introduces the usage of pyspark.pandas.DataFrame.spark.cache.

Usage: spark.cache() → CachedDataFrame

Yields and caches the current DataFrame.

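The pandas-on-Spark DataFrame is yielded as a protected resource and its corresponding data is cached. When used in a with statement, the data is uncached once execution leaves the context.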
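The "protected resource" behavior follows Python's context-manager protocol. The sketch below is a hypothetical pure-Python stand-in (the class `CachedFrameSketch` and its `is_cached` flag are invented for illustration and are not pyspark's actual implementation). It only shows the pattern: the object counts as cached from creation, and leaving the with-block triggers the uncaching step.

```python
class CachedFrameSketch:
    """Hypothetical stand-in for a cached DataFrame; not pyspark's real class."""

    def __init__(self, data):
        self.data = data
        self.is_cached = True  # assumed: data is cached as soon as the object exists

    def unpersist(self):
        # Release the cached data; calling this more than once is harmless.
        self.is_cached = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Leaving the with-block always uncaches, even after an exception.
        self.unpersist()
        return False  # do not swallow exceptions


# Used as a context manager: uncached automatically on exit.
with CachedFrameSketch([1, 2, 3]) as cached:
    cached_inside = cached.is_cached
cached_after = cached.is_cached

# Used without `with`: stays cached until unpersist() is called by hand.
manual = CachedFrameSketch([4, 5, 6])
manual_before = manual.is_cached
manual.unpersist()
```

Under this pattern, assigning the result directly (without a with statement) leaves the cleanup to the caller, which is why an explicit unpersist call is needed in that case.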

If you want to specify the StorageLevel manually, use DataFrame.spark.persist().

Examples:

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df
   dogs  cats
0   0.2   0.3
1   0.0   0.6
2   0.6   0.0
3   0.2   0.1

>>> with df.spark.cache() as cached_df:
...     print(cached_df.count())
...
dogs    4
cats    4
dtype: int64

>>> df = df.spark.cache()
>>> df.to_pandas().mean(axis=1)
0    0.25
1    0.30
2    0.30
3    0.15
dtype: float64

To uncache the DataFrame, use the unpersist function.

>>> df.spark.unpersist()


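Note: This article was selected and compiled by 純淨天空 from the original English work pyspark.pandas.DataFrame.spark.cache at spark.apache.org. Unless otherwise stated, the copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.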