Python pyspark NGram用法及代碼示例

本文簡要介紹 pyspark.ml.feature.NGram 的用法。

用法: class pyspark.ml.feature.NGram(*, n=2, inputCol=None, outputCol=None)

將輸入字符串數組轉換為n-grams 數組的特征轉換器。輸入數組中的空值將被忽略。它返回一個 n-grams 數組，其中每個 n-gram 由空格分隔的字符串表示。當輸入為空時，返回一個空數組。當輸入數組長度小於 n(每個 n-gram 的元素數)時，不返回 n-grams。

1.5.0 版中的新函數。

例子：

>>> df = spark.createDataFrame([Row(inputTokens=["a", "b", "c", "d", "e"])])
>>> ngram = NGram(n=2)
>>> ngram.setInputCol("inputTokens")
NGram...
>>> ngram.setOutputCol("nGrams")
NGram...
>>> ngram.transform(df).head()
Row(inputTokens=['a', 'b', 'c', 'd', 'e'], nGrams=['a b', 'b c', 'c d', 'd e'])
>>> # Change n-gram length
>>> ngram.setParams(n=4).transform(df).head()
Row(inputTokens=['a', 'b', 'c', 'd', 'e'], nGrams=['a b c d', 'b c d e'])
>>> # Temporarily modify output column.
>>> ngram.transform(df, {ngram.outputCol: "output"}).head()
Row(inputTokens=['a', 'b', 'c', 'd', 'e'], output=['a b c d', 'b c d e'])
>>> ngram.transform(df).head()
Row(inputTokens=['a', 'b', 'c', 'd', 'e'], nGrams=['a b c d', 'b c d e'])
>>> # Must use keyword arguments to specify params.
>>> ngram.setParams("text")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
>>> ngramPath = temp_path + "/ngram"
>>> ngram.save(ngramPath)
>>> loadedNGram = NGram.load(ngramPath)
>>> loadedNGram.getN() == ngram.getN()
True
>>> loadedNGram.transform(df).take(1) == ngram.transform(df).take(1)
True

相關用法

注：本文由純淨天空篩選整理自spark.apache.org大神的英文原創作品 pyspark.ml.feature.NGram。非經特殊聲明，原始代碼版權歸原作者所有，本譯文未經允許或授權，請勿轉載或複製。