Spark Machine Learning Library Guide [Spark 1.3.1]: Ensembles of Trees

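Below is the table of contents for the Ensembles of Trees chapter (see also Decision Trees; for other topics, see the table of contents of the full guide).

Gradient-Boosted Trees vs. Random Forests
Random Forests
- Basic algorithm
  - Training
  - Prediction
- Usage tips
- Examples
  - Classification
  - Regression
Gradient-Boosted Trees (GBTs)
- Basic algorithm
  - Losses
- Usage tips
- Examples
  - Classification
  - Regression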

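An ensemble method is a learning algorithm that builds a model composed of a set of base models. MLlib supports two major ensemble algorithms: GradientBoostedTrees and RandomForest. Both use decision trees as their base models.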

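Gradient-Boosted Trees vs. Random Forests

Both gradient-boosted trees (GBTs) and random forests are algorithms for learning ensembles of trees, but their training processes differ. There are several practical trade-offs:

GBTs train one tree at a time, so they can take longer to train than random forests. Random forests can train multiple trees in parallel.
- On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with random forests, and training smaller trees takes less time.
Random forests can be less prone to overfitting. Training more trees in a random forest reduces the likelihood of overfitting, whereas training more trees with GBTs increases it. (In statistical terms, random forests reduce variance by using more trees, while GBTs reduce bias by using more trees.)
Random forests can be easier to tune, since performance improves monotonically with the number of trees (whereas for GBTs performance can start to degrade once the number of trees grows too large).

In short, both algorithms can be effective, and the choice should be based on the particular dataset.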

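Random Forests

Random forests are ensembles of decision trees. They are among the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

MLlib supports random forests for binary classification, multiclass classification, and regression, using both continuous and categorical features. MLlib implements random forests on top of its existing decision tree implementation. See the Decision Trees guide for more information on trees.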

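Basic algorithm

Random forests train each decision tree independently, so training can be done in parallel. The algorithm injects randomness into the training process so that each tree is a bit different. Combining the predictions of many trees reduces the variance of the predictions and improves performance on test data.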
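The variance-reduction claim can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "tree" is modeled as the true value plus independent Gaussian noise, and the names `noisy_tree_prediction` and `ensemble_prediction` are hypothetical helpers, not part of MLlib. Real trees trained on overlapping data are correlated, so the reduction seen in practice is smaller than in this idealized setting.

```python
import random
import statistics


def noisy_tree_prediction(rng, true_value=10.0, noise=2.0):
    # Stand-in for one tree: the true value plus independent Gaussian noise.
    return rng.gauss(true_value, noise)


def ensemble_prediction(rng, num_trees):
    # Average the predictions of num_trees independent "trees".
    total = sum(noisy_tree_prediction(rng) for _ in range(num_trees))
    return total / num_trees


rng = random.Random(42)
# Repeat the experiment many times to estimate the spread of each predictor.
single = [ensemble_prediction(rng, 1) for _ in range(2000)]
forest = [ensemble_prediction(rng, 25) for _ in range(2000)]

var_single = statistics.pvariance(single)
var_forest = statistics.pvariance(forest)
print("variance with 1 tree:   %.3f" % var_single)
print("variance with 25 trees: %.3f" % var_forest)
```

The averaged predictor keeps the same expected value but has a much smaller spread, which is the effect the paragraph above describes.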

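Training

The randomness injected into the training process includes:

Subsampling the original dataset on each iteration to obtain a different training set (a.k.a. bootstrapping).
Considering a different random subset of features to split on at each tree node.

Apart from these randomizations, each decision tree in the forest is trained in the same way as an individual decision tree.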
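The two sources of randomness can be sketched in plain Python. This is an illustration of the general idea only and makes no claim about how MLlib implements sampling internally; `bootstrap_sample` and `random_feature_subset` are hypothetical helper names.

```python
import random


def bootstrap_sample(rows, rng):
    # Draw len(rows) rows uniformly at random WITH replacement,
    # so some rows repeat and others are left out.
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(num_features, subset_size, rng):
    # Pick subset_size distinct feature indices to consider at one node.
    return sorted(rng.sample(range(num_features), subset_size))


rng = random.Random(0)
rows = list(range(10))  # stand-in for 10 training instances

sample = bootstrap_sample(rows, rng)
features = random_feature_subset(num_features=9, subset_size=3, rng=rng)

print("bootstrap sample:", sample)
print("candidate features at this node:", features)
```

Each tree would get its own bootstrap sample, and each node its own feature subset, which is why no two trees in the forest end up identical.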

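Prediction

To make a prediction on a new instance, a random forest must aggregate the predictions of all its decision trees. This aggregation is done differently for classification and regression.

Classification: majority vote. Each tree's prediction counts as one vote for the corresponding class. The final result is the label that receives the most votes.

Regression: averaging. Each tree predicts a real value. The final result is the average of all the trees' predictions.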
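The two aggregation rules can be written in a few lines. This is a minimal sketch with hypothetical helper names (`majority_vote`, `average`); ties in the vote are broken arbitrarily here, which is an assumption of the sketch rather than documented MLlib behavior.

```python
from collections import Counter


def majority_vote(tree_predictions):
    # Classification: each tree casts one vote; return the most common label.
    # most_common(1) returns a list holding one (label, count) pair.
    return Counter(tree_predictions).most_common(1)[0][0]


def average(tree_predictions):
    # Regression: the forest predicts the mean of the per-tree values.
    return sum(tree_predictions) / float(len(tree_predictions))


class_votes = [1.0, 0.0, 1.0]       # predictions from three trees
regression_values = [2.0, 3.0, 4.0]  # predictions from three trees

print(majority_vote(class_votes))
print(average(regression_values))
```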

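Usage tips

Next we discuss the various parameters for using random forests. We omit some decision tree parameters, since they are covered in the Decision Trees chapter.

The first two parameters are the most important, and tuning them can often improve performance:

numTrees: Number of trees in the forest.
- Increasing the number of trees decreases the variance of the predictions, improving the model's accuracy at test time.
- Training time increases roughly linearly with the number of trees.
maxDepth: Maximum depth of each tree in the forest.
- Increasing the depth makes the model more expressive. However, deep trees take longer to train and are more prone to overfitting.
- In general, it is acceptable to train deeper trees in a random forest than when using a single decision tree. A single tree is more likely to overfit than a random forest (because averaging over the trees in the forest reduces variance).

The next two parameters generally do not require tuning. However, they can be tuned to speed up training.

subsamplingRate: Specifies the size of the dataset used to train each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
featureSubsetStrategy: Number of features to consider as candidates for splitting at each tree node. It is specified as a fraction or as a function of the total number of features. Decreasing this number speeds up training, but can hurt performance if it is set too low.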
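To make "a function of the total number of features" concrete, the sketch below maps a strategy name to a number of candidate features per node. The strategy names ("auto", "all", "sqrt", "log2", "onethird"), the rounding, and the behavior of "auto" reflect my understanding of this MLlib release and should be treated as assumptions; `features_per_node` is a hypothetical helper, not an MLlib function.

```python
import math


def features_per_node(strategy, num_features, num_trees=1, is_classification=True):
    # Assumed behavior of "auto": a single tree uses all features; a forest
    # uses sqrt for classification and one third for regression.
    if strategy == "auto":
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if is_classification else "onethird"

    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == "log2":
        return max(1, int(math.ceil(math.log2(num_features))))
    if strategy == "onethird":
        return int(math.ceil(num_features / 3.0))
    raise ValueError("unknown strategy: %r" % strategy)


for name in ["all", "sqrt", "log2", "onethird"]:
    print(name, features_per_node(name, num_features=100))
```

Fewer candidates per node means less work per split, which is where the training speed-up comes from; too few candidates and good splits are missed.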

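Examples

Classification

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD[LabeledPoint], and then perform classification using a random forest. The test error is computed at the end to measure the model's accuracy.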

from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair; tuple-unpacking lambdas are Python 2 only.
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification forest model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "myModelPath")
sameModel = RandomForestModel.load(sc, "myModelPath")

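Regression

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD[LabeledPoint], and then perform regression using a random forest. The mean squared error (MSE) is computed at the end to evaluate goodness of fit.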

from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair; tuple-unpacking lambdas are Python 2 only.
testMSE = labelsAndPredictions.map(lambda vp: (vp[0] - vp[1]) * (vp[0] - vp[1])).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "myModelPath")
sameModel = RandomForestModel.load(sc, "myModelPath")

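Gradient-Boosted Trees (GBTs)

Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

MLlib supports GBTs for binary classification and for regression, using both continuous and categorical features. MLlib implements GBTs on top of its existing decision tree implementation. See the Decision Trees guide for more information on trees.

Note: GBTs do not yet support multiclass classification. For multiclass problems, please use decision trees or random forests.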

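Basic algorithm

Gradient boosting iteratively trains a sequence of decision trees. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and compares the prediction with the true label. The dataset is then re-labeled to put more emphasis on training instances with poor predictions, so that in the next iteration the new decision tree helps correct the previous mistakes.

The specific mechanism for re-labeling instances is defined by a loss function. With each iteration, GBTs further reduce this loss function on the training data.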
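The loop described above can be sketched for the squared-error case, where the re-labeling step amounts to replacing each label with its current residual (true label minus current prediction). This is a toy illustration, not MLlib's implementation: it uses a single feature, depth-1 trees (stumps), and hypothetical helper names `fit_stump` and `boost`.

```python
def fit_stump(xs, targets):
    # Fit a depth-1 regression tree on one feature: choose the threshold
    # that minimizes squared error, predicting the mean on each side.
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, num_iterations, learning_rate):
    trees, losses = [], []
    predictions = [0.0] * len(xs)
    for _ in range(num_iterations):
        # Re-label: the new targets are the residuals of the current ensemble.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Add the new tree's (shrunken) output to the ensemble prediction.
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
    return trees, losses


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
trees, losses = boost(xs, ys, num_iterations=5, learning_rate=0.5)
for i, loss in enumerate(losses, 1):
    print("iteration %d: training squared error = %.4f" % (i, loss))
```

Instances that are still badly predicted carry the largest residuals, so they dominate the next tree's fit, and the training loss shrinks from one iteration to the next.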

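Losses

The table below lists the losses currently supported by GBTs in MLlib. Note that each loss applies to either classification or regression, not both.

Notation: N = number of instances; y_i = label of instance i; x_i = features of instance i; F(x_i) = the model's predicted label for instance i.

Loss	Task	Description
Log loss	Classification	Twice binomial negative log likelihood.
Squared error	Regression	Also called L2 loss. Default loss for regression tasks.
Absolute error	Regression	Also called L1 loss. Can be more robust to outliers than squared error.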
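The notation defined above is not used anywhere in the table as it stands, which suggests a formula column was lost. As a hedged reconstruction from the standard definitions of these losses (and, for log loss, from the "twice binomial negative log likelihood" description, assuming labels y_i in {-1, +1}), the three losses can be written as:

```latex
\begin{aligned}
\text{Log loss:} \quad & 2 \sum_{i=1}^{N} \log\left(1 + \exp\left(-2\, y_i F(x_i)\right)\right) \\
\text{Squared error:} \quad & \sum_{i=1}^{N} \left(y_i - F(x_i)\right)^2 \\
\text{Absolute error:} \quad & \sum_{i=1}^{N} \left|\, y_i - F(x_i) \,\right|
\end{aligned}
```

Squared error grows quadratically with the residual, so a single outlier can dominate the sum. Absolute error grows only linearly, which is why it is described as more robust to outliers.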

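Usage tips

Next we discuss the various parameters for GBTs. We omit the decision tree parameters, since they are covered in the Decision Trees chapter.

loss: The loss function, as described in the section above. Different losses can give significantly different results.
numIterations: Sets the number of boosting iterations. Each iteration produces one tree. Increasing this number makes the model more expressive and improves accuracy on the training data. However, if it is too large, accuracy on the test set may suffer.
learningRate: This parameter should not need to be tuned. If the algorithm's behavior seems unstable, decreasing this value may improve stability.
algo: The algorithm or task (classification vs. regression).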

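Examples

Classification

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD[LabeledPoint], and then perform classification using gradient-boosted trees with log loss. The test error is computed at the end to measure the model's accuracy.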

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.
#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
#         (b) Use more iterations in practice.
model = GradientBoostedTrees.trainClassifier(trainingData,
    categoricalFeaturesInfo={}, numIterations=3)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair; tuple-unpacking lambdas are Python 2 only.
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification GBT model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "myModelPath")
sameModel = GradientBoostedTreesModel.load(sc, "myModelPath")

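Regression

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD[LabeledPoint], and then perform regression using gradient-boosted trees with squared error as the loss. The mean squared error (MSE) is computed at the end to evaluate goodness of fit.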

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.
#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
#         (b) Use more iterations in practice.
model = GradientBoostedTrees.trainRegressor(trainingData,
    categoricalFeaturesInfo={}, numIterations=3)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair; tuple-unpacking lambdas are Python 2 only.
testMSE = labelsAndPredictions.map(lambda vp: (vp[0] - vp[1]) * (vp[0] - vp[1])).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression GBT model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "myModelPath")
sameModel = GradientBoostedTreesModel.load(sc, "myModelPath")