This article briefly introduces the usage of pyspark.ml.feature.MinHashLSH.

Usage:
class pyspark.ml.feature.MinHashLSH(*, inputCol=None, outputCol=None, seed=None, numHashTables=1)
LSH class for Jaccard distance. The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example,
Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)])
means there are 10 elements in the space, and the set contains elements 2, 3, and 5. Also, every input vector must have at least 1 non-zero index, and all non-zero values are treated as binary "1" values.

New in version 2.2.0.
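Because every non-zero entry is read as a binary "1", the model effectively operates on the set of non-zero indices of each vector. As an illustration (plain Python, not part of the original page), the JaccardDistance values reported in the example below can be reproduced by hand:

def jaccard_distance(a, b):
    # Treat each vector as the set of its non-zero indices:
    # distance = 1 - |A ∩ B| / |A ∪ B|
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

# Row id=0 below has non-zero indices {0, 1, 2}; row id=5 has {1, 2, 4}.
print(jaccard_distance({0, 1, 2}, {1, 2, 4}))  # 0.5, matching the join output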
Examples:
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.sql.functions import col
>>> data = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
...         (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
...         (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
>>> df = spark.createDataFrame(data, ["id", "features"])
>>> mh = MinHashLSH()
>>> mh.setInputCol("features")
MinHashLSH...
>>> mh.setOutputCol("hashes")
MinHashLSH...
>>> mh.setSeed(12345)
MinHashLSH...
>>> model = mh.fit(df)
>>> model.setInputCol("features")
MinHashLSHModel...
>>> model.transform(df).head()
Row(id=0, features=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), hashes=[DenseVector([6179668...
>>> data2 = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
...          (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
...          (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
>>> df2 = spark.createDataFrame(data2, ["id", "features"])
>>> key = Vectors.sparse(6, [1, 2], [1.0, 1.0])
>>> model.approxNearestNeighbors(df2, key, 1).collect()
[Row(id=5, features=SparseVector(6, {1: 1.0, 2: 1.0, 4: 1.0}), hashes=[DenseVector([6179668...
>>> model.approxSimilarityJoin(df, df2, 0.6, distCol="JaccardDistance").select(
...     col("datasetA.id").alias("idA"),
...     col("datasetB.id").alias("idB"),
...     col("JaccardDistance")).show()
+---+---+---------------+
|idA|idB|JaccardDistance|
+---+---+---------------+
|  0|  5|            0.5|
|  1|  4|            0.5|
+---+---+---------------+
...
>>> mhPath = temp_path + "/mh"
>>> mh.save(mhPath)
>>> mh2 = MinHashLSH.load(mhPath)
>>> mh2.getOutputCol() == mh.getOutputCol()
True
>>> modelPath = temp_path + "/mh-model"
>>> model.save(modelPath)
>>> model2 = MinHashLSHModel.load(modelPath)
>>> model.transform(df).head().hashes == model2.transform(df).head().hashes
True
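The constructor's numHashTables parameter (default 1) trades accuracy for cost: more tables lower the false-negative rate of approxSimilarityJoin and approxNearestNeighbors, at the price of extra computation and storage. A minimal sketch of this, reusing the df built in the example above (variable names here are illustrative, not from the original page):

from pyspark.ml.feature import MinHashLSH

# Same data as above, but with 5 hash tables instead of the default 1.
mh5 = MinHashLSH(inputCol="features", outputCol="hashes", seed=12345, numHashTables=5)
model5 = mh5.fit(df)
# The "hashes" column now holds a list of 5 one-element DenseVectors,
# one per hash table.
model5.transform(df).head().hashes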
Related usage
- Python pyspark MinMaxScaler usage and code examples
- Python pyspark MultiIndex.size usage and code examples
- Python pyspark MultiIndex.hasnans usage and code examples
- Python pyspark MultiIndex.to_numpy usage and code examples
- Python pyspark MultiIndex.levshape usage and code examples
- Python pyspark MultiIndex.max usage and code examples
- Python pyspark MultiIndex.drop usage and code examples
- Python pyspark MultiIndex.min usage and code examples
- Python pyspark MultiIndex.unique usage and code examples
- Python pyspark MultiIndex.rename usage and code examples
- Python pyspark MultiIndex.value_counts usage and code examples
- Python pyspark MatrixFactorizationModel usage and code examples
- Python pyspark MultiIndex.values usage and code examples
- Python pyspark MultiIndex.difference usage and code examples
- Python pyspark MultiIndex.sort_values usage and code examples
- Python pyspark MLUtils.loadLibSVMFile usage and code examples
- Python pyspark MultiIndex.spark.transform usage and code examples
- Python pyspark MaxAbsScaler usage and code examples
- Python pyspark MultiIndex.T usage and code examples
- Python pyspark MultiIndex usage and code examples
- Python pyspark MultiIndex.ndim usage and code examples
- Python pyspark MulticlassClassificationEvaluator usage and code examples
- Python pyspark MultiIndex.copy usage and code examples
- Python pyspark MultiIndex.to_frame usage and code examples
- Python pyspark MultiIndex.shape usage and code examples