This article briefly introduces the usage of
pyspark.ml.feature.RegexTokenizer
. Usage:
class pyspark.ml.feature.RegexTokenizer(*, minTokenLength=1, gaps=True, pattern='\\s+', inputCol=None, outputCol=None, toLowercase=True)
A regex-based tokenizer that extracts tokens either by splitting the text using the provided regex pattern (in Java dialect), which is the default, or by repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens by a minimum length. It returns an array of strings that can be empty. (A short sketch of the gaps=False mode follows the examples below.)
New in version 1.4.0.
Examples:
>>> df = spark.createDataFrame([("A B c",)], ["text"])
>>> reTokenizer = RegexTokenizer()
>>> reTokenizer.setInputCol("text")
RegexTokenizer...
>>> reTokenizer.setOutputCol("words")
RegexTokenizer...
>>> reTokenizer.transform(df).head()
Row(text='A B c', words=['a', 'b', 'c'])
>>> # Change a parameter.
>>> reTokenizer.setParams(outputCol="tokens").transform(df).head()
Row(text='A B c', tokens=['a', 'b', 'c'])
>>> # Temporarily modify a parameter.
>>> reTokenizer.transform(df, {reTokenizer.outputCol: "words"}).head()
Row(text='A B c', words=['a', 'b', 'c'])
>>> reTokenizer.transform(df).head()
Row(text='A B c', tokens=['a', 'b', 'c'])
>>> # Must use keyword arguments to specify params.
>>> reTokenizer.setParams("text")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
>>> regexTokenizerPath = temp_path + "/regex-tokenizer"
>>> reTokenizer.save(regexTokenizerPath)
>>> loadedReTokenizer = RegexTokenizer.load(regexTokenizerPath)
>>> loadedReTokenizer.getMinTokenLength() == reTokenizer.getMinTokenLength()
True
>>> loadedReTokenizer.getGaps() == reTokenizer.getGaps()
True
>>> loadedReTokenizer.transform(df).take(1) == reTokenizer.transform(df).take(1)
True
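The examples above split on whitespace (the default, gaps=True). As a complement, here is a minimal sketch of the gaps=False mode described earlier, where the pattern matches the tokens themselves and minTokenLength filters out short matches. It assumes an active SparkSession bound to spark, as in the examples above; the sample sentence and column names are illustrative only.
>>> from pyspark.ml.feature import RegexTokenizer
>>> df = spark.createDataFrame([("Hi, I heard about Spark",)], ["text"])
>>> # Match runs of word characters instead of splitting on whitespace,
>>> # and drop tokens shorter than 2 characters ("I" is filtered out).
>>> wordTokenizer = RegexTokenizer(inputCol="text", outputCol="words",
...                                gaps=False, pattern="\\w+", minTokenLength=2)
>>> wordTokenizer.transform(df).head()
Row(text='Hi, I heard about Spark', words=['hi', 'heard', 'about', 'spark'])
Because toLowercase defaults to True, matched tokens come back lowercased; pass toLowercase=False to preserve the original case.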
Related Usage
- Python pyspark RegressionEvaluator usage and code examples
- Python pyspark RegressionMetrics usage and code examples
- Python pyspark RDD.saveAsTextFile usage and code examples
- Python pyspark RDD.keyBy usage and code examples
- Python pyspark RDD.sumApprox usage and code examples
- Python pyspark RowMatrix.numCols usage and code examples
- Python pyspark RowMatrix.computePrincipalComponents usage and code examples
- Python pyspark RDD.lookup usage and code examples
- Python pyspark RDD.zipWithIndex usage and code examples
- Python pyspark RDD.sampleByKey usage and code examples
- Python pyspark Rolling.mean usage and code examples
- Python pyspark Rolling.max usage and code examples
- Python pyspark RDD.coalesce usage and code examples
- Python pyspark RDD.subtract usage and code examples
- Python pyspark RDD.count usage and code examples
- Python pyspark RankingEvaluator usage and code examples
- Python pyspark RandomRDDs.uniformRDD usage and code examples
- Python pyspark RDD.groupWith usage and code examples
- Python pyspark RDD.distinct usage and code examples
- Python pyspark RDD.treeAggregate usage and code examples
- Python pyspark RowMatrix.computeSVD usage and code examples
- Python pyspark RowMatrix.multiply usage and code examples
- Python pyspark RandomForest.trainRegressor usage and code examples
- Python pyspark RandomRDDs.exponentialRDD usage and code examples
- Python pyspark RDD.mapPartitionsWithIndex usage and code examples
Note: This article was compiled by 純淨天空 from the original English work pyspark.ml.feature.RegexTokenizer on spark.apache.org. Unless otherwise stated, the copyright of the original code belongs to the original author; please do not reproduce or copy this translation without permission or authorization.