Python pyspark Word2Vec: usage and code examples

This article briefly introduces the usage of pyspark.ml.feature.Word2Vec.

Usage: class pyspark.ml.feature.Word2Vec(*, vectorSize=100, minCount=5, numPartitions=1, stepSize=0.025, maxIter=1, seed=None, inputCol=None, outputCol=None, windowSize=5, maxSentenceLength=1000)

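Word2Vec trains a model of Map(String, Vector): it maps each word to a numeric vector that can be used in further natural language processing or machine learning steps.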
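
To make the Map(String, Vector) idea concrete, here is a minimal pure-Python sketch rather than Spark code. The dictionary `word_vectors` and the helper `average_vectors` are hypothetical names with made-up values. The sketch assumes that a sentence vector is formed by summing the vectors of known words and dividing by the total token count, which is how `Word2VecModel.transform` is generally described as behaving; treat the exact details as an assumption.

```python
# Hypothetical word -> vector map, standing in for a trained model.
word_vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
}

def average_vectors(tokens, vectors, size):
    """Average the vectors of the tokens in one sentence.

    Tokens missing from `vectors` add nothing to the sum, but the sum is
    still divided by the total token count (an assumption of this sketch).
    """
    if not tokens:
        return [0.0] * size
    total = [0.0] * size
    for token in tokens:
        vec = vectors.get(token)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total]

sentence_vector = average_vectors(["a", "b", "a", "zzz"], word_vectors, size=2)
print(sentence_vector)
```

In this toy case the two occurrences of "a" and the single "b" sum to [2.0, 1.0], and dividing by the four tokens gives the sentence vector.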

New in version 1.4.0.
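
The `minCount=5` default in the signature controls which tokens enter the vocabulary: a token must appear at least that many times in the whole input. The following is a hedged pure-Python sketch of that filtering step, not Spark's implementation. The helper `build_vocab` is a hypothetical name, and the corpus mirrors the toy data used in the example on this page.

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Keep tokens whose total count across all sentences is >= min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}

# Same toy corpus as the doctest: the trailing space makes split(" ")
# produce one empty-string token at the end of each sentence.
sent = ("a b " * 100 + "a c " * 10).split(" ")
corpus = [sent, sent]

vocab = build_vocab(corpus, min_count=5)
print(sorted(vocab))
```

Under this sketch the empty-string token occurs only twice in total, so it falls below the threshold, which is consistent with the example's vector table listing only a, b, and c.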

Examples:

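>>> # Assumes an active SparkSession named `spark` and a writable directory `temp_path`.
>>> from pyspark.ml.feature import Word2Vec, Word2VecModel
>>> sent = ("a b " * 100 + "a c " * 10).split(" ")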
>>> doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
>>> word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
>>> word2Vec.setMaxIter(10)
Word2Vec...
>>> word2Vec.getMaxIter()
10
>>> word2Vec.clear(word2Vec.maxIter)
>>> model = word2Vec.fit(doc)
>>> model.getMinCount()
5
>>> model.setInputCol("sentence")
Word2VecModel...
>>> model.getVectors().show()
+----+--------------------+
|word|              vector|
+----+--------------------+
|   a|[0.0951...
|   b|[-1.202...
|   c|[0.3015...
+----+--------------------+
...
>>> model.findSynonymsArray("a", 2)
[('b', 0.015859...), ('c', -0.568079...)]
>>> from pyspark.sql.functions import format_number as fmt
>>> model.findSynonyms("a", 2).select("word", fmt("similarity", 5).alias("similarity")).show()
+----+----------+
|word|similarity|
+----+----------+
|   b|   0.01586|
|   c|  -0.56808|
+----+----------+
...
>>> model.transform(doc).head().model
DenseVector([-0.4833, 0.1855, -0.273, -0.0509, -0.4769])
>>> word2vecPath = temp_path + "/word2vec"
>>> word2Vec.save(word2vecPath)
>>> loadedWord2Vec = Word2Vec.load(word2vecPath)
>>> loadedWord2Vec.getVectorSize() == word2Vec.getVectorSize()
True
>>> loadedWord2Vec.getNumPartitions() == word2Vec.getNumPartitions()
True
>>> loadedWord2Vec.getMinCount() == word2Vec.getMinCount()
True
>>> modelPath = temp_path + "/word2vec-model"
>>> model.save(modelPath)
>>> loadedModel = Word2VecModel.load(modelPath)
>>> loadedModel.getVectors().first().word == model.getVectors().first().word
True
>>> loadedModel.getVectors().first().vector == model.getVectors().first().vector
True
>>> loadedModel.transform(doc).take(1) == model.transform(doc).take(1)
True
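
The similarity column returned by `findSynonyms` in the example is documented as a cosine similarity. As a rough illustration of that ranking, here is a pure-Python sketch. The vectors are invented toy values, and `cosine` and `find_synonyms` are hypothetical helpers, not part of the PySpark API.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def find_synonyms(word, vectors, num):
    """Return the `num` words most similar to `word`, excluding `word` itself."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]

# Invented vectors for illustration only; a real model learns these.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [1.0, 1.0],
    "c": [-1.0, 0.0],
}

result = find_synonyms("a", toy_vectors, 2)
print(result)
```

As in the example output, the query word itself is left out of the result and the remaining words are listed from most to least similar.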


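Note: This article was selected and compiled by 纯净天空 from the original English documentation for pyspark.ml.feature.Word2Vec at spark.apache.org. Unless otherwise stated, copyright in the original code belongs to the original authors. Please do not reproduce or copy this translation without permission.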