

Python common.callMLlibFunc Function Code Examples

This article collects typical code examples of the Python function pyspark.mllib.common.callMLlibFunc. If you are wondering what callMLlibFunc does and how to use it, the curated examples below should help.


The following presents 15 code examples of the callMLlibFunc function, ordered by popularity by default.

Example 1: corr

    def corr(x, y=None, method=None):
        """
        Compute the correlation (matrix) for the input RDD(s) using the
        specified method.
        Methods currently supported: I{pearson (default), spearman}.

        If a single RDD of Vectors is passed in, a correlation matrix
        comparing the columns in the input RDD is returned. Use C{method=}
        to specify the method to be used for a single RDD input.
        If two RDDs of floats are passed in, a single float is returned.

        >>> x = sc.parallelize([1.0, 0.0, -2.0], 2)
        >>> y = sc.parallelize([4.0, 5.0, 3.0], 2)
        >>> zeros = sc.parallelize([0.0, 0.0, 0.0], 2)
        >>> abs(Statistics.corr(x, y) - 0.6546537) < 1e-7
        True
        >>> Statistics.corr(x, y) == Statistics.corr(x, y, "pearson")
        True
        >>> Statistics.corr(x, y, "spearman")
        0.5
        >>> from math import isnan
        >>> isnan(Statistics.corr(x, zeros))
        True
        >>> from pyspark.mllib.linalg import Vectors
        >>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
        ...                       Vectors.dense([6, 7, 0,  8]), Vectors.dense([9, 0, 0, 1])])
        >>> pearsonCorr = Statistics.corr(rdd)
        >>> print str(pearsonCorr).replace('nan', 'NaN')
        [[ 1.          0.05564149         NaN  0.40047142]
         [ 0.05564149  1.                 NaN  0.91359586]
         [        NaN         NaN  1.                 NaN]
         [ 0.40047142  0.91359586         NaN  1.        ]]
        >>> spearmanCorr = Statistics.corr(rdd, method="spearman")
        >>> print str(spearmanCorr).replace('nan', 'NaN')
        [[ 1.          0.10540926         NaN  0.4       ]
         [ 0.10540926  1.                 NaN  0.9486833 ]
         [        NaN         NaN  1.                 NaN]
         [ 0.4         0.9486833          NaN  1.        ]]
        >>> try:
        ...     Statistics.corr(rdd, "spearman")
        ...     print "Method name as second argument without 'method=' shouldn't be allowed."
        ... except TypeError:
        ...     pass
        """
        # Check inputs to determine whether a single value or a matrix is needed for output.
        # Since it's legal for users to use the method name as the second argument, we need to
        # check if y is used to specify the method name instead.
        if type(y) == str:
            raise TypeError("Use 'method=' to specify method name.")

        if not y:
            return callMLlibFunc("corr", x.map(_convert_to_vector), method).toArray()
        else:
            return callMLlibFunc("corr", x.map(float), y.map(float), method)
Developer: BViki, Project: spark, Lines: 54, Source: stat.py

Example 2: __init__

 def __init__(self, predictionAndLabels):
     sc = predictionAndLabels.ctx
     sql_ctx = SQLContext.getOrCreate(sc)
     df = sql_ctx.createDataFrame(predictionAndLabels,
                                  schema=sql_ctx._inferSchema(predictionAndLabels))
     java_model = callMLlibFunc("newRankingMetrics", df._jdf)
     super(RankingMetrics, self).__init__(java_model)
Developer: advancedxy, Project: spark, Lines: 7, Source: evaluation.py
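
A minimal sketch of exercising this constructor through the public API; the app name and the toy prediction/label pairs below are illustrative assumptions, not part of the original example:

    from pyspark import SparkContext
    from pyspark.mllib.evaluation import RankingMetrics

    sc = SparkContext(appName="ranking-demo")  # hypothetical app name
    # Each element pairs a predicted ranking with the ground-truth relevant items.
    predictionAndLabels = sc.parallelize([
        ([1, 6, 2, 7, 8], [1, 2, 3, 4, 5]),
        ([4, 1, 5, 6, 2], [1, 2, 3]),
    ])
    metrics = RankingMetrics(predictionAndLabels)
    print(metrics.precisionAt(5))       # precision at cutoff 5
    print(metrics.meanAveragePrecision)
    sc.stop()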

Example 3: fit

 def fit(self, data):
     """
     Computes a [[PCAModel]] that contains the principal components of the input vectors.
     :param data: source vectors
     """
     jmodel = callMLlibFunc("fitPCA", self.k, data)
     return PCAModel(jmodel)
Developer: ChenZhongPu, Project: Simba, Lines: 7, Source: feature.py
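
A minimal usage sketch, assuming the standard pyspark.mllib.feature.PCA wrapper (Spark 1.5+); the data and app name are made up for illustration:

    from pyspark import SparkContext
    from pyspark.mllib.feature import PCA
    from pyspark.mllib.linalg import Vectors

    sc = SparkContext(appName="pca-demo")  # hypothetical app name
    data = sc.parallelize([
        Vectors.dense([1.0, 0.0, 7.0]),
        Vectors.dense([2.0, 0.0, 3.0]),
        Vectors.dense([4.0, 0.0, 1.0]),
    ])
    model = PCA(2).fit(data)            # keep the top 2 principal components
    print(model.transform(data).collect())
    sc.stop()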

Example 4: train

 def train(cls, rdd, k, convergenceTol=1e-3, maxIterations=100, seed=None):
     """Train a Gaussian Mixture clustering model."""
     weight, mu, sigma = callMLlibFunc("trainGaussianMixture",
                                       rdd.map(_convert_to_vector), k,
                                       convergenceTol, maxIterations, seed)
     mvg_obj = [MultivariateGaussian(mu[i], sigma[i]) for i in range(k)]
     return GaussianMixtureModel(weight, mvg_obj)
Developer: Checkroth, Project: apriori, Lines: 7, Source: clustering.py
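
A minimal sketch of calling this method through the public GaussianMixture API; the toy 1-D data and app name are illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import GaussianMixture
    from pyspark.mllib.linalg import Vectors

    sc = SparkContext(appName="gmm-demo")  # hypothetical app name
    # Two well-separated 1-D blobs, so k=2 should converge quickly.
    data = sc.parallelize([
        Vectors.dense([-0.1]), Vectors.dense([-0.05]),
        Vectors.dense([9.8]), Vectors.dense([10.1]),
    ])
    model = GaussianMixture.train(data, k=2, convergenceTol=1e-3, seed=10)
    for w, g in zip(model.weights, model.gaussians):
        print(w, g.mu, g.sigma)
    sc.stop()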

Example 5: train

    def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False,
              seed=None):
        """
        Train a matrix factorization model given an RDD of ratings by users
        for a subset of products. The ratings matrix is approximated as the
        product of two lower-rank matrices of a given rank (number of
        features). To solve for these features, ALS is run iteratively with
        a configurable level of parallelism.

        :param ratings:
          RDD of `Rating` or (userID, productID, rating) tuple.
        :param rank:
          Rank of the feature matrices computed (number of features).
        :param iterations:
          Number of iterations of ALS.
          (default: 5)
        :param lambda_:
          Regularization parameter.
          (default: 0.01)
        :param blocks:
          Number of blocks used to parallelize the computation. A value
          of -1 will use an auto-configured number of blocks.
          (default: -1)
        :param nonnegative:
          A value of True will solve least-squares with nonnegativity
          constraints.
          (default: False)
        :param seed:
          Random seed for initial matrix factorization model. A value
          of None will use system time as the seed.
          (default: None)
        """
        model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
                              lambda_, blocks, nonnegative, seed)
        return MatrixFactorizationModel(model)
Developer: 0xqq, Project: spark, Lines: 35, Source: recommendation.py
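
A minimal sketch of training through the public ALS API; the toy ratings and app name are illustrative assumptions, not from the original source:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-demo")  # hypothetical app name
    ratings = sc.parallelize([
        Rating(1, 1, 5.0),
        Rating(1, 2, 1.0),
        Rating(2, 1, 5.0),
        Rating(2, 2, 1.0),
    ])
    model = ALS.train(ratings, rank=10, iterations=5, lambda_=0.01, seed=42)
    print(model.predict(2, 2))          # predicted rating for user 2, product 2
    sc.stop()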

Example 6: train

 def train(cls, rdd, k, maxIterations=100, initMode="random"):
     """
     :param rdd:
       An RDD of (i, j, s\ :sub:`ij`\) tuples representing the
       affinity matrix, which is the matrix A in the PIC paper.  The
       similarity s\ :sub:`ij`\ must be nonnegative.  This is a symmetric
       matrix and hence s\ :sub:`ij`\ = s\ :sub:`ji`\ . For any (i, j) with
       nonzero similarity, there should be either (i, j, s\ :sub:`ij`\) or
       (j, i, s\ :sub:`ji`\) in the input.  Tuples with i = j are ignored,
       because it is assumed s\ :sub:`ij`\ = 0.0.
     :param k:
       Number of clusters.
     :param maxIterations:
       Maximum number of iterations of the PIC algorithm.
       (default: 100)
     :param initMode:
       Initialization mode. This can be either "random" to use
       a random vector as vertex properties, or "degree" to use
       normalized sum similarities.
       (default: "random")
     """
     model = callMLlibFunc(
         "trainPowerIterationClusteringModel", rdd.map(_convert_to_vector), int(k), int(maxIterations), initMode
     )
     return PowerIterationClusteringModel(model)
Developer: ChineseDr, Project: spark, Lines: 25, Source: clustering.py
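
A minimal sketch of this method via the public PowerIterationClustering API; the affinity entries below (two obvious 3-node cliques) and the app name are illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import PowerIterationClustering

    sc = SparkContext(appName="pic-demo")  # hypothetical app name
    # (i, j, s_ij) affinity entries; only one direction per pair is needed.
    similarities = sc.parallelize([
        (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
        (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
    ])
    model = PowerIterationClustering.train(similarities, k=2, maxIterations=10)
    for a in model.assignments().collect():
        print(a.id, a.cluster)
    sc.stop()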

Example 7: _train

 def _train(cls, data, algo, categoricalFeaturesInfo,
            loss, numIterations, learningRate, maxDepth, maxBins):
     first = data.first()
     assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
     model = callMLlibFunc("trainGradientBoostedTreesModel", data, algo, categoricalFeaturesInfo,
                           loss, numIterations, learningRate, maxDepth, maxBins)
     return GradientBoostedTreesModel(model)
Developer: 0xqq, Project: spark, Lines: 7, Source: tree.py
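
This private helper is normally reached through GradientBoostedTrees.trainClassifier or trainRegressor. A minimal sketch, with toy data and app name as illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import GradientBoostedTrees

    sc = SparkContext(appName="gbt-demo")  # hypothetical app name
    data = sc.parallelize([
        LabeledPoint(0.0, [0.0]), LabeledPoint(0.0, [1.0]),
        LabeledPoint(1.0, [2.0]), LabeledPoint(1.0, [3.0]),
    ])
    model = GradientBoostedTrees.trainClassifier(
        data, categoricalFeaturesInfo={}, numIterations=10)
    print(model.predict([2.5]))
    sc.stop()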

Example 8: logNormalRDD

    def logNormalRDD(sc, mean, std, size, numPartitions=None, seed=None):
        """
        Generates an RDD comprised of i.i.d. samples from the log normal
        distribution with the input mean and standard deviation.

        :param sc: SparkContext used to create the RDD.
        :param mean: mean for the log Normal distribution
        :param std: std for the log Normal distribution
        :param size: Size of the RDD.
        :param numPartitions: Number of partitions in the RDD (default: `sc.defaultParallelism`).
        :param seed: Random seed (default: a random long integer).
        :return: RDD of float comprised of i.i.d. samples ~ log N(mean, std).

        >>> from math import sqrt, exp
        >>> mean = 0.0
        >>> std = 1.0
        >>> expMean = exp(mean + 0.5 * std * std)
        >>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
        >>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
        >>> stats = x.stats()
        >>> stats.count()
        1000
        >>> abs(stats.mean() - expMean) < 0.5
        True
        >>> from math import sqrt
        >>> abs(stats.stdev() - expStd) < 0.5
        True
        """
        return callMLlibFunc("logNormalRDD", sc._jsc, float(mean), float(std),
                             size, numPartitions, seed)
Developer: 01azzerghhuy, Project: spark, Lines: 30, Source: random.py

Example 9: update

    def update(self, data, decayFactor, timeUnit):
        """Update the centroids, according to data

        :param data:
          RDD with new data for the model update.
        :param decayFactor:
          Forgetfulness of the previous centroids.
        :param timeUnit:
          Can be "batches" or "points". If points, then the decay factor
          is raised to the power of number of new points and if batches,
          then decay factor will be used as is.
        """
        if not isinstance(data, RDD):
            raise TypeError("Data should be of an RDD, got %s." % type(data))
        data = data.map(_convert_to_vector)
        decayFactor = float(decayFactor)
        if timeUnit not in ["batches", "points"]:
            raise ValueError(
                "timeUnit should be 'batches' or 'points', got %s." % timeUnit)
        vectorCenters = [_convert_to_vector(center) for center in self.centers]
        updatedModel = callMLlibFunc(
            "updateStreamingKMeansModel", vectorCenters, self._clusterWeights,
            data, decayFactor, timeUnit)
        self.centers = array(updatedModel[0])
        self._clusterWeights = list(updatedModel[1])
        return self
Developer: 11wzy001, Project: spark, Lines: 26, Source: clustering.py
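
A minimal sketch of driving this update from a StreamingKMeansModel; the initial centers, weights, batch, and app name are illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import StreamingKMeansModel

    sc = SparkContext(appName="skm-demo")  # hypothetical app name
    # Seed the model with two centers of equal weight, then absorb one batch.
    model = StreamingKMeansModel([[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0])
    batch = sc.parallelize([[-0.1, -0.1], [1.1, 0.9]])
    model = model.update(batch, decayFactor=1.0, timeUnit="batches")
    print(model.centers)
    sc.stop()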

Example 10: _train

 def _train(
     cls,
     data,
     algo,
     numClasses,
     categoricalFeaturesInfo,
     numTrees,
     featureSubsetStrategy,
     impurity,
     maxDepth,
     maxBins,
     seed,
 ):
     first = data.first()
     assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
     if featureSubsetStrategy not in cls.supportedFeatureSubsetStrategies:
         raise ValueError("unsupported featureSubsetStrategy: %s" % featureSubsetStrategy)
     if seed is None:
         seed = random.randint(0, 1 << 30)
     model = callMLlibFunc(
         "trainRandomForestModel",
         data,
         algo,
         numClasses,
         categoricalFeaturesInfo,
         numTrees,
         featureSubsetStrategy,
         impurity,
         maxDepth,
         maxBins,
         seed,
     )
     return RandomForestModel(model)
Developer: CatKyo, Project: SparkInfoSystem, Lines: 33, Source: tree.py
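
This private helper is normally reached through RandomForest.trainClassifier or trainRegressor. A minimal sketch, with toy data and app name as illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    sc = SparkContext(appName="rf-demo")  # hypothetical app name
    data = sc.parallelize([
        LabeledPoint(0.0, [0.0]), LabeledPoint(0.0, [1.0]),
        LabeledPoint(1.0, [2.0]), LabeledPoint(1.0, [3.0]),
    ])
    model = RandomForest.trainClassifier(
        data, numClasses=2, categoricalFeaturesInfo={}, numTrees=3, seed=42)
    print(model.predict([2.5]))
    sc.stop()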

Example 11: train

    def train(cls, data, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000):
        """
        Finds the complete set of frequent sequential patterns in the
        input sequences of itemsets.

        :param data:
          The input data set, each element contains a sequence of
          itemsets.
        :param minSupport:
          The minimal support level of the sequential pattern, any
          pattern that appears more than (minSupport *
          size-of-the-dataset) times will be output.
          (default: 0.1)
        :param maxPatternLength:
          The maximal length of the sequential pattern; any pattern
          of length no greater than maxPatternLength will be output.
          (default: 10)
        :param maxLocalProjDBSize:
          The maximum number of items (including delimiters used in the
          internal storage format) allowed in a projected database before
          local processing. If a projected database exceeds this size,
          another iteration of distributed prefix growth is run.
          (default: 32000000)
        """
        model = callMLlibFunc("trainPrefixSpanModel",
                              data, minSupport, maxPatternLength, maxLocalProjDBSize)
        return PrefixSpanModel(model)
Developer: 0xqq, Project: spark, Lines: 27, Source: fpm.py
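
A minimal sketch of this method through the public PrefixSpan API; the toy sequence database and app name are illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.mllib.fpm import PrefixSpan

    sc = SparkContext(appName="prefixspan-demo")  # hypothetical app name
    # Each element is one sequence; each inner list is one itemset.
    data = sc.parallelize([
        [["a", "b"], ["c"]],
        [["a"], ["c", "b"], ["a", "b"]],
        [["a", "b"], ["e"]],
        [["f"]],
    ], 2)
    model = PrefixSpan.train(data, minSupport=0.5, maxPatternLength=5)
    for fs in model.freqSequences().collect():
        print(fs.sequence, fs.freq)
    sc.stop()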

Example 12: loadLabeledPoints

    def loadLabeledPoints(sc, path, minPartitions=None):
        """
        Load labeled points saved using RDD.saveAsTextFile.

        :param sc: Spark context
        :param path: file or directory path in any Hadoop-supported file
                     system URI
        :param minPartitions: min number of partitions
        :return: labeled data stored as an RDD of LabeledPoint

        >>> from tempfile import NamedTemporaryFile
        >>> from pyspark.mllib.util import MLUtils
        >>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 4.56e-7)])), \
                        LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
        >>> tempFile = NamedTemporaryFile(delete=True)
        >>> tempFile.close()
        >>> sc.parallelize(examples, 1).saveAsTextFile(tempFile.name)
        >>> loaded = MLUtils.loadLabeledPoints(sc, tempFile.name).collect()
        >>> type(loaded[0]) == LabeledPoint
        True
        >>> print examples[0]
        (1.1,(3,[0,2],[-1.23,4.56e-07]))
        >>> type(examples[1]) == LabeledPoint
        True
        >>> print examples[1]
        (0.0,[1.01,2.02,3.03])
        """
        minPartitions = minPartitions or min(sc.defaultParallelism, 2)
        return callMLlibFunc("loadLabeledPoints", sc, path, minPartitions)
Developer: BViki, Project: spark, Lines: 29, Source: util.py

Example 13: train

 def train(
     cls,
     rdd,
     k,
     maxIterations=100,
     runs=1,
     initializationMode="k-means||",
     seed=None,
     initializationSteps=5,
     epsilon=1e-4,
     initialModel=None,
 ):
     """Train a k-means clustering model."""
     clusterInitialModel = []
     if initialModel is not None:
         if not isinstance(initialModel, KMeansModel):
             raise Exception(
                 "initialModel is of " + str(type(initialModel)) + ". It needs " "to be of <type 'KMeansModel'>"
             )
         clusterInitialModel = [_convert_to_vector(c) for c in initialModel.clusterCenters]
     model = callMLlibFunc(
         "trainKMeansModel",
         rdd.map(_convert_to_vector),
         k,
         maxIterations,
         runs,
         initializationMode,
         seed,
         initializationSteps,
         epsilon,
         clusterInitialModel,
     )
     centers = callJavaFunc(rdd.context, model.clusterCenters)
     return KMeansModel([c.toArray() for c in centers])
Developer: BeforeRain, Project: spark, Lines: 34, Source: clustering.py
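
A minimal sketch of calling this method through the public KMeans API; the toy 2-D points and app name are illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="kmeans-demo")  # hypothetical app name
    data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
    model = KMeans.train(data, k=2, maxIterations=10,
                         initializationMode="random", seed=50)
    print(model.clusterCenters)
    print(model.predict([9.0, 9.0]))    # cluster index for a new point
    sc.stop()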

Example 14: gammaRDD

    def gammaRDD(sc, shape, scale, size, numPartitions=None, seed=None):
        """
        Generates an RDD comprised of i.i.d. samples from the Gamma
        distribution with the input shape and scale.

        :param sc: SparkContext used to create the RDD.
        :param shape: shape (> 0) parameter for the Gamma distribution
        :param scale: scale (> 0) parameter for the Gamma distribution
        :param size: Size of the RDD.
        :param numPartitions: Number of partitions in the RDD (default: `sc.defaultParallelism`).
        :param seed: Random seed (default: a random long integer).
        :return: RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).

        >>> from math import sqrt
        >>> shape = 1.0
        >>> scale = 2.0
        >>> expMean = shape * scale
        >>> expStd = sqrt(shape * scale * scale)
        >>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
        >>> stats = x.stats()
        >>> stats.count()
        1000
        >>> abs(stats.mean() - expMean) < 0.5
        True
        >>> abs(stats.stdev() - expStd) < 0.5
        True
        """
        return callMLlibFunc("gammaRDD", sc._jsc, float(shape),
                             float(scale), size, numPartitions, seed)
Developer: 01azzerghhuy, Project: spark, Lines: 29, Source: random.py

Example 15: train

    def train(self, rdd, k=4, maxIterations=20, minDivisibleClusterSize=1.0, seed=-1888008604):
        """
        Runs the bisecting k-means algorithm and returns the model.

        :param rdd:
          Training points as an `RDD` of `Vector` or convertible
          sequence types.
        :param k:
          The desired number of leaf clusters. The actual number could
          be smaller if there are no divisible leaf clusters.
          (default: 4)
        :param maxIterations:
          Maximum number of iterations allowed to split clusters.
          (default: 20)
        :param minDivisibleClusterSize:
          Minimum number of points (if >= 1.0) or the minimum proportion
          of points (if < 1.0) of a divisible cluster.
          (default: 1)
        :param seed:
          Random seed value for cluster initialization.
          (default: -1888008604 from classOf[BisectingKMeans].getName.##)
        """
        java_model = callMLlibFunc(
            "trainBisectingKMeans", rdd.map(_convert_to_vector),
            k, maxIterations, minDivisibleClusterSize, seed)
        return BisectingKMeansModel(java_model)
Developer: 11wzy001, Project: spark, Lines: 26, Source: clustering.py
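
A minimal sketch of this method through the public BisectingKMeans API; the toy 2-D points and app name are illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import BisectingKMeans

    sc = SparkContext(appName="bkm-demo")  # hypothetical app name
    data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
    model = BisectingKMeans.train(data, k=2, maxIterations=5)
    print(model.clusterCenters)
    print(model.predict([9.0, 9.0]))
    sc.stop()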


Note: The pyspark.mllib.common.callMLlibFunc examples in this article were compiled by 纯净天空 from open-source code and documentation platforms such as GitHub and MSDocs. The snippets were selected from open-source projects contributed by their respective developers, and copyright remains with the original authors; consult each project's license before redistributing or reusing the code. Please do not reproduce this article without permission.