Python pyspark FPGrowth usage and code examples

This article briefly introduces the usage of pyspark.ml.fpm.FPGrowth.

Usage: class pyspark.ml.fpm.FPGrowth(*, minSupport=0.3, minConfidence=0.8, itemsCol='items', predictionCol='prediction', numPartitions=None)

A parallel FP-growth algorithm for mining frequent itemsets.
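To make the role of minSupport concrete, the following is a brute-force sketch in plain Python, not the FP-growth algorithm itself and not Spark code. It enumerates every itemset in a small transaction list and keeps those that occur in at least ceil(minSupport * N) transactions; the `frequent_itemsets` helper and that rounding rule are assumptions made for illustration. The six transactions are the ones displayed in the example on this page.

```python
import math
from collections import Counter
from itertools import combinations

# The six transactions shown in the example on this page.
transactions = [
    ["r", "z", "h", "k", "p"],
    ["z", "y", "x", "w", "v", "u", "t", "s"],
    ["s", "x", "o", "n", "r"],
    ["x", "z", "y", "m", "t", "s", "q", "e"],
    ["z"],
    ["x", "z", "y", "r", "q", "t", "p"],
]


def frequent_itemsets(transactions, min_support):
    """Hypothetical helper: count every itemset by brute force."""
    # Assumed threshold: an itemset must occur in at least this many rows.
    min_count = math.ceil(min_support * len(transactions))
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        # Enumerate all non-empty subsets of this transaction.
        for size in range(1, len(items) + 1):
            counts.update(frozenset(c) for c in combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


freq = frequent_itemsets(transactions, min_support=0.2)
print(freq[frozenset({"s", "x"})])
```

This enumeration is exponential in the transaction length, which is the cost that FP-growth avoids by mining a compact prefix tree instead of generating candidates.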

New in version 2.2.0.

Notes:

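The algorithm is described in Li et al., PFP: Parallel FP-Growth for Query Recommendation [1]. PFP distributes the computation in such a way that each worker executes an independent group of mining tasks. The FP-Growth algorithm itself is described in Han et al., Mining frequent patterns without candidate generation [2].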
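The idea of independent mining tasks can be sketched in plain Python. This is a simplified illustration of the sharding step described in the PFP paper, not Spark's implementation: the `group_dependent_transactions` helper, the toy transactions, and the rule that assigns an item to a group by `rank % num_groups` are all assumptions chosen for the sketch.

```python
from collections import Counter


def group_dependent_transactions(transaction, item_to_rank, num_groups):
    """Hypothetical helper: split one transaction into per-group shards."""
    # Keep only frequent items, represented by rank (0 = most frequent).
    ranks = sorted(item_to_rank[i] for i in set(transaction) if i in item_to_rank)
    shards = {}
    # Walk from the least frequent item towards the most frequent one.
    for pos in range(len(ranks) - 1, -1, -1):
        group = ranks[pos] % num_groups  # assumed grouping rule
        if group not in shards:
            # The group receives the prefix that ends at its last item.
            shards[group] = ranks[: pos + 1]
    return shards


# Toy data, invented for this sketch.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a"]]
counts = Counter(item for t in transactions for item in set(t))
# Rank items by descending frequency, breaking ties by name.
ordered = sorted(counts, key=lambda item: (-counts[item], item))
item_to_rank = {item: rank for rank, item in enumerate(ordered)}

shards = group_dependent_transactions(["c", "a", "b"], item_to_rank, num_groups=2)
print(shards)
```

Each group can then build a local FP-tree from the shards it receives and mine patterns for its own items without communicating with the other groups.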

NULL values in the feature column are ignored during fit().

Internally, transform collects and broadcasts the association rules.
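As a rough picture of what transform does with those rules, the sketch below applies a rule list to one row in plain Python: every rule whose antecedent is contained in the row contributes its consequent, and items already in the row are left out. The `predict` helper is hypothetical, and the rule list is only the five rules displayed in the example on this page, not the full rule set of a fitted model.

```python
def predict(items, rules):
    """Hypothetical helper: consequents of all rules applicable to items."""
    present = set(items)
    predicted = []
    for antecedent, consequent in rules:
        # A rule applies only if the row contains its whole antecedent.
        if set(antecedent) <= present:
            for item in consequent:
                # Skip items the row already has, and avoid duplicates.
                if item not in present and item not in predicted:
                    predicted.append(item)
    return predicted


# (antecedent, consequent) pairs taken from the rules shown on this page.
rules = [
    (["t", "s"], ["y"]),
    (["t", "s"], ["x"]),
    (["t", "s"], ["z"]),
    (["p"], ["r"]),
    (["p"], ["z"]),
]

prediction = predict(["t", "s"], rules)
print(sorted(prediction))
```

Because every row must be checked against every rule, having the whole rule set available on each executor is what makes the collect-and-broadcast step necessary; a very large rule set makes that step correspondingly expensive.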

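1:

Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, and Edward Y. Chang. 2008. PFP: Parallel FP-Growth for Query Recommendation. In Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys '08). Association for Computing Machinery, New York, NY, USA, 107-114. DOI: https://doi.org/10.1145/1454008.1454027

2:

Jiawei Han, Jian Pei, and Yiwen Yin. 2000. Mining frequent patterns without candidate generation. SIGMOD Rec. 29, 2 (June 2000), 1-12. DOI: https://doi.org/10.1145/335191.335372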

Examples:

>>> from pyspark.ml.fpm import FPGrowth, FPGrowthModel
>>> from pyspark.sql.functions import split
>>> data = (spark.read
...     .text("data/mllib/sample_fpgrowth.txt")
...     .select(split("value", r"\s+").alias("items")))
>>> data.show(truncate=False)
+------------------------+
|items                   |
+------------------------+
|[r, z, h, k, p]         |
|[z, y, x, w, v, u, t, s]|
|[s, x, o, n, r]         |
|[x, z, y, m, t, s, q, e]|
|[z]                     |
|[x, z, y, r, q, t, p]   |
+------------------------+
...
>>> fp = FPGrowth(minSupport=0.2, minConfidence=0.7)
>>> fpm = fp.fit(data)
>>> fpm.setPredictionCol("newPrediction")
FPGrowthModel...
>>> fpm.freqItemsets.show(5)
+---------+----+
|    items|freq|
+---------+----+
|      [s]|   3|
|   [s, x]|   3|
|[s, x, z]|   2|
|   [s, z]|   2|
|      [r]|   3|
+---------+----+
only showing top 5 rows
...
>>> fpm.associationRules.show(5)
+----------+----------+----------+----+------------------+
|antecedent|consequent|confidence|lift|           support|
+----------+----------+----------+----+------------------+
|    [t, s]|       [y]|       1.0| 2.0|0.3333333333333333|
|    [t, s]|       [x]|       1.0| 1.5|0.3333333333333333|
|    [t, s]|       [z]|       1.0| 1.2|0.3333333333333333|
|       [p]|       [r]|       1.0| 2.0|0.3333333333333333|
|       [p]|       [z]|       1.0| 1.2|0.3333333333333333|
+----------+----------+----------+----+------------------+
only showing top 5 rows
...
>>> new_data = spark.createDataFrame([(["t", "s"], )], ["items"])
>>> sorted(fpm.transform(new_data).first().newPrediction)
['x', 'y', 'z']
>>> import tempfile
>>> temp_path = tempfile.mkdtemp()  # any writable directory works
>>> model_path = temp_path + "/fpm_model"
>>> fpm.save(model_path)
>>> model2 = FPGrowthModel.load(model_path)
>>> fpm.transform(data).take(1) == model2.transform(data).take(1)
True
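The confidence, lift, and support columns in the associationRules output can be checked by hand. The sketch below recomputes them in plain Python for the rule [t, s] => [y], using the six transactions displayed above and the standard definitions of these metrics; the `support` helper is a hypothetical name introduced for the illustration.

```python
# The six transactions shown in the example on this page.
transactions = [
    ["r", "z", "h", "k", "p"],
    ["z", "y", "x", "w", "v", "u", "t", "s"],
    ["s", "x", "o", "n", "r"],
    ["x", "z", "y", "m", "t", "s", "q", "e"],
    ["z"],
    ["x", "z", "y", "r", "q", "t", "p"],
]


def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    wanted = set(itemset)
    return sum(wanted <= set(t) for t in transactions) / len(transactions)


antecedent, consequent = ["t", "s"], ["y"]

# support: share of transactions containing antecedent and consequent together.
rule_support = support(antecedent + consequent)
# confidence: how often the consequent appears when the antecedent does.
confidence = rule_support / support(antecedent)
# lift: confidence relative to the baseline frequency of the consequent.
lift = confidence / support(consequent)

print(confidence, lift, rule_support)
```

Both t and s occur together in two of the six transactions and y is present in both, so the confidence is 1.0; y appears in half of all transactions, which gives a lift of 2.0.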

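Note: This article was selected and compiled by 純淨天空 from the original English documentation of pyspark.ml.fpm.FPGrowth on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original author. Do not reproduce or copy this translation without permission or authorization.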