Spark MLlib Guide [Spark 1.3.1]: Ensembles of Trees

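Below is the table of contents for the chapter on ensembles of trees (see also the Decision Trees chapter; for other topics, see the table of contents of the full guide).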

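Gradient-Boosted Trees vs. Random Forests
Random Forests
- Basic algorithm
  - Training
  - Prediction
- Usage tips
- Examples
  - Classification
  - Regression
Gradient-Boosted Trees (GBTs)
- Basic algorithm
  - Losses
- Usage tips
- Examples
  - Classification
  - Regression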

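An ensemble method is a learning algorithm that creates a model composed of a set of base models. MLlib supports two major ensemble algorithms: GradientBoostedTrees and RandomForest. Both use decision trees as their base models.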

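Gradient-Boosted Trees vs. Random Forests

Both Gradient-Boosted Trees (GBTs) and Random Forests are algorithms for learning ensembles of trees, but their training processes differ. There are several practical trade-offs:

GBTs train one tree at a time, so they can take longer to train than Random Forests. Random Forests can train multiple trees in parallel.
- On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with Random Forests, and smaller trees take less time to train.
Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, whereas training more trees with GBTs increases it. (In statistical terms, Random Forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)
Random Forests can be easier to tune, since performance improves monotonically with the number of trees (whereas for GBTs performance can start to degrade once the number of trees grows too large).

In short, both algorithms can be effective, and the choice should be based on the particular dataset.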

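Random Forests

Random forests are ensembles of decision trees. They are among the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

MLlib supports random forests for binary classification, multiclass classification, and regression, using both continuous and categorical features. MLlib implements random forests on top of its existing decision tree implementation. See the decision tree guide for more information on trees.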

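Basic algorithm

Random forests train each decision tree independently, so training can be done in parallel. The algorithm injects randomness into the training process so that each tree is a bit different. Combining the predictions of multiple trees reduces the variance of the predictions and improves performance on test data.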
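The variance-reduction claim can be illustrated with a toy simulation. The sketch below does not use MLlib: it assumes each "tree" returns the true value plus independent Gaussian noise (the function names and noise model are hypothetical). Real trees in a forest are correlated, so the reduction in practice is smaller than in this idealized case.

```python
import random
import statistics

def noisy_tree_prediction(rng, true_value=10.0, noise_std=2.0):
    """Stand-in for one tree's prediction: the true value plus independent noise."""
    return rng.gauss(true_value, noise_std)

def forest_prediction(rng, num_trees):
    """Average the predictions of num_trees independent noisy 'trees'."""
    return sum(noisy_tree_prediction(rng) for _ in range(num_trees)) / num_trees

rng = random.Random(0)
# Repeat the experiment many times to estimate the spread of each predictor.
single = [forest_prediction(rng, 1) for _ in range(2000)]
forest = [forest_prediction(rng, 25) for _ in range(2000)]

var_single = statistics.pvariance(single)
var_forest = statistics.pvariance(forest)
print("variance with 1 tree:  ", round(var_single, 3))
print("variance with 25 trees:", round(var_forest, 3))
```

Under these assumptions the variance of the averaged prediction shrinks roughly in proportion to the number of trees.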

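Training

The randomness injected into the training process includes:

Subsampling the original dataset on each iteration to get a different training set (a.k.a. bootstrapping).
Considering different random subsets of features to split on at each tree node.

Apart from these randomizations, each decision tree in a random forest is trained in the same way as a standalone decision tree.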
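The two sources of randomness above can be sketched in plain Python. This is only an illustration, not MLlib's implementation; the helper names `bootstrap_sample` and `random_feature_subset` and the toy rows are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(num_features, subset_size, rng):
    """Pick subset_size distinct feature indices to consider at one node split."""
    return sorted(rng.sample(range(num_features), subset_size))

rng = random.Random(42)
# Toy dataset: six (label, features) rows with four features each.
rows = [(label, [float(label + j) for j in range(4)]) for label in range(6)]

sample = bootstrap_sample(rows, rng)          # training set for one tree
candidates = random_feature_subset(4, 2, rng)  # features tried at one node
print(len(sample), candidates)
```

Each tree would get its own bootstrap sample, and a fresh feature subset would be drawn at every node that is split.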

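Prediction

To make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. This aggregation is done differently for classification and regression.

Classification: Majority vote. Each tree's prediction counts as one vote for a class. The final prediction is the label that receives the most votes.

Regression: Averaging. Each tree predicts a real value. The final prediction is the average of the predictions of all trees.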
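The two aggregation rules can be sketched as follows. The function names are hypothetical, and how ties are broken in the vote is an assumption of this sketch (here, the label seen first among the tied ones wins), not something the text above specifies.

```python
from collections import Counter

def predict_classification(tree_predictions):
    """Majority vote: return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def predict_regression(tree_predictions):
    """Average the real-valued predictions of all trees."""
    return sum(tree_predictions) / float(len(tree_predictions))

class_votes = [1.0, 0.0, 1.0, 1.0, 0.0]  # one predicted label per tree
reg_outputs = [2.0, 3.0, 4.0]            # one predicted value per tree

vote_result = predict_classification(class_votes)
mean_result = predict_regression(reg_outputs)
print(vote_result, mean_result)
```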

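Usage tips

We now discuss the various parameters for using random forests. We omit some decision tree parameters, since they are covered in the decision tree guide.

The first two parameters are the most important, and tuning them can often improve performance:

numTrees: Number of trees in the forest.
- Increasing the number of trees decreases the variance of the predictions, improving the model's test-time accuracy.
- Training time increases roughly linearly in the number of trees.
maxDepth: Maximum depth of each tree in the forest.
- Increasing the depth makes the model more expressive. However, deep trees take longer to train and are also more prone to overfitting.
- In general, it is acceptable to train deeper trees in a random forest than when using a single decision tree. A single tree is more likely to overfit than a random forest (because averaging over the trees in the forest reduces variance).

The next two parameters generally do not require tuning. However, they can be tuned to speed up training.

subsamplingRate: This parameter specifies the size of the dataset used to train each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or as a function of the total number of features. Decreasing this number speeds up training, but can hurt performance if it is too low.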
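To make "a function of the total number of features" concrete, the sketch below maps a strategy name to a candidate-feature count. The strategy names ("auto", "all", "sqrt", "log2", "onethird"), the way "auto" is resolved, and the rounding are assumptions about MLlib's behavior in this version, so check them against the API documentation; the helper `num_features_per_node` is hypothetical.

```python
import math

def num_features_per_node(strategy, num_features, num_trees, is_classification):
    """Return how many features are candidates at each node (assumed rules)."""
    if strategy == "auto":
        # Assumption: a single tree uses every feature; a forest uses
        # sqrt for classification and one third for regression.
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if is_classification else "onethird"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == "log2":
        return max(1, int(math.ceil(math.log2(num_features))))
    if strategy == "onethird":
        return int(math.ceil(num_features / 3.0))
    raise ValueError("unknown featureSubsetStrategy: " + strategy)

print(num_features_per_node("auto", 100, 10, True))
print(num_features_per_node("onethird", 10, 10, False))
```

Smaller counts mean fewer candidate splits to evaluate per node, which is where the training speed-up comes from.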

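Examples

Classification

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform classification using a random forest. The test error is computed to measure the model's accuracy.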

from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification forest model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "myModelPath")
sameModel = RandomForestModel.load(sc, "myModelPath")

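Regression

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform regression using a random forest. The mean squared error (MSE) is computed at the end to evaluate goodness of fit.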

from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPredictions.map(lambda vp: (vp[0] - vp[1]) * (vp[0] - vp[1])).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "myModelPath")
sameModel = RandomForestModel.load(sc, "myModelPath")

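Gradient-Boosted Trees (GBTs)

Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

MLlib supports GBTs for binary classification and for regression, using both continuous and categorical features. MLlib implements GBTs on top of its existing decision tree implementation. See the decision tree guide for more information on trees.

Note: GBTs do not yet support multiclass classification. For multiclass problems, use decision trees or random forests.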

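Basic algorithm

Gradient boosting iteratively trains a sequence of decision trees. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label. The dataset is re-labeled to put more emphasis on training instances with poor predictions. Thus, in the next iteration, the decision tree helps correct the previous mistakes.

The specific mechanism for re-labeling instances is defined by a loss function. With each iteration, GBTs further reduce this loss function on the training data.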
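The iteration described above can be sketched for the squared-error case, where "re-labeling" amounts to fitting each new tree to the current residuals. This is a minimal pure-Python illustration with depth-1 trees on one-dimensional inputs, not MLlib's implementation; the helpers `fit_stump` and `boost`, the toy data, and the choice to scale every tree by the learning rate are all assumptions of the sketch.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on 1-D inputs by minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def squared_error(xs, ys, predict):
    """Sum of squared differences between labels and predictions."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))

def boost(xs, ys, num_iterations, learning_rate):
    """Train num_iterations stumps, each fit to the current residuals."""
    trees = []

    def predict(x):
        # The ensemble prediction is the scaled sum of all trees so far.
        return sum(learning_rate * tree(x) for tree in trees)

    losses = [squared_error(xs, ys, predict)]
    for _ in range(num_iterations):
        # Re-label: the new targets are what the ensemble still gets wrong.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
        losses.append(squared_error(xs, ys, predict))
    return predict, losses

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
predict, losses = boost(xs, ys, num_iterations=5, learning_rate=0.5)
print([round(loss, 3) for loss in losses])
```

On this toy data the training loss falls on every iteration, which is the behavior the paragraph above describes.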

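Losses

The table below lists the losses currently supported by GBTs in MLlib. Note that each loss applies to either classification or regression, not both.

Notation: N = number of instances. y_i = label of instance i. x_i = features of instance i. F(x_i) = the model's predicted label for instance i.

Loss	Task	Description
Log Loss	Classification	Twice binomial negative log likelihood.
Squared Error	Regression	Also called L2 loss. Default loss for regression tasks.
Absolute Error	Regression	Also called L1 loss. Can be more robust to outliers than Squared Error.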
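Using the notation defined above, these three losses are commonly written as shown below. This is a sketch under two assumptions that the table itself does not state: the log loss takes labels y_i in {-1, +1}, and each loss is a plain sum over instances (an averaged form would differ only by a factor of 1/N).

```latex
\begin{aligned}
\text{Log loss:} \quad & 2 \sum_{i=1}^{N} \log\left(1 + \exp\left(-2\, y_i F(x_i)\right)\right) \\
\text{Squared error:} \quad & \sum_{i=1}^{N} \left(y_i - F(x_i)\right)^2 \\
\text{Absolute error:} \quad & \sum_{i=1}^{N} \left|y_i - F(x_i)\right|
\end{aligned}
```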

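Usage tips

We now discuss the various parameters for using GBTs. We omit the decision tree parameters, since they are covered in the decision tree guide.

loss: The loss function described above. Different losses can give significantly different results, depending on the dataset.
numIterations: This sets the number of boosting iterations. Each iteration produces one tree. Increasing this number makes the model more expressive and improves accuracy on the training data. However, test-time accuracy may suffer if it is too large.
learningRate: This parameter usually does not need to be tuned. If the algorithm's behavior seems unstable, decreasing this value may improve stability.
algo: The algorithm or task (classification vs. regression).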
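Because too many iterations can hurt test-time accuracy, one common way to choose numIterations is to track the error on a held-out validation set and keep the iteration count that minimizes it. The sketch below shows only that selection step; the helper name and the error values are hypothetical, and nothing here is an MLlib API.

```python
def best_num_iterations(validation_errors):
    """Return the 1-based iteration count with the lowest validation error.

    validation_errors[i] is the error of the ensemble after i + 1 iterations;
    the earliest iteration wins on ties.
    """
    best_index = min(range(len(validation_errors)),
                     key=lambda i: validation_errors[i])
    return best_index + 1

# Hypothetical validation errors: they improve, then worsen as the model overfits.
validation_errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.23]
chosen = best_num_iterations(validation_errors)
print(chosen)
```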

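Examples

Classification

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform classification using Gradient-Boosted Trees with log loss. The test error is computed to measure the model's accuracy.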

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.
#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
#         (b) Use more iterations in practice.
model = GradientBoostedTrees.trainClassifier(trainingData,
    categoricalFeaturesInfo={}, numIterations=3)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification GBT model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "myModelPath")
sameModel = GradientBoostedTreesModel.load(sc, "myModelPath")

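Regression

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform regression using Gradient-Boosted Trees with squared error as the loss. The mean squared error (MSE) is computed at the end to evaluate goodness of fit.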

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.
#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
#         (b) Use more iterations in practice.
model = GradientBoostedTrees.trainRegressor(trainingData,
    categoricalFeaturesInfo={}, numIterations=3)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPredictions.map(lambda vp: (vp[0] - vp[1]) * (vp[0] - vp[1])).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression GBT model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "myModelPath")
sameModel = GradientBoostedTreesModel.load(sc, "myModelPath")