sklearn example: outlier detection on a real data set

Introduction to outlier detection on a real data set

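This example illustrates why robust covariance estimation is needed on a real data set. It is useful both for outlier detection and for gaining a better understanding of the structure of the data.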

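We selected two subsets of two variables each from the Boston housing data set to show what kind of analysis several outlier detection tools make possible. We work with two-dimensional examples so that the results can be visualized, but note that things are not this simple in high dimensions.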

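In both examples below, the main result is that the empirical covariance estimate, being non-robust, is strongly influenced by the heterogeneous structure of the observations. The robust covariance estimate is able to focus on the main mode of the data distribution, but it still assumes that the data are Gaussian distributed. This yields a somewhat biased estimate of the data structure, although one that remains accurate to some extent. The One-Class SVM does not assume any parametric form for the data distribution and can therefore model complex data shapes much better.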
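The contrast between the two covariance estimates can be sketched on synthetic data. The snippet below is an illustrative assumption rather than part of the original example: it uses made-up Gaussian samples instead of the Boston subsets, and the variable names (`inliers`, `outliers`, `emp_shift`, `mcd_shift`) are hypothetical. It compares `EmpiricalCovariance` with `MinCovDet`, the Minimum Covariance Determinant estimator on which `EllipticEnvelope` is based.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(0)

# Synthetic stand-in data (NOT the Boston subsets): a main Gaussian mode
# centred at the origin plus a smaller, distant group acting as contamination.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(40, 2))
X = np.vstack([inliers, outliers])

# Non-robust estimate: every observation contributes equally.
emp = EmpiricalCovariance().fit(X)
# Robust estimate: fitted on the most concentrated subset of the data.
mcd = MinCovDet(random_state=0).fit(X)

# Distance of each estimated location from the true centre of the main mode.
emp_shift = np.linalg.norm(emp.location_)
mcd_shift = np.linalg.norm(mcd.location_)

print("empirical location:", emp.location_)
print("robust location:   ", mcd.location_)
print("empirical covariance trace:", np.trace(emp.covariance_))
print("robust covariance trace:   ", np.trace(mcd.covariance_))
```

In this setup the empirical location is pulled toward the contaminating group and the empirical covariance is inflated, while the robust estimate should stay close to the main mode.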

First example

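The first example shows how robust covariance estimation helps to concentrate on the relevant cluster when a second cluster is present. Here many observations are confounded into a single point, which breaks down the empirical covariance estimate. Of course, some screening tools would have pointed out the presence of two clusters (support vector machines, Gaussian mixture models, univariate outlier detection, ...). Had this been a high-dimensional example, however, none of them could have been applied so easily.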
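As a rough illustration of such a screening step, the sketch below fits Gaussian mixture models with one and two components and compares their BIC scores. The data are synthetic and only loosely imitate a large cloud next to a small tight cluster; the cluster centres and spreads are assumptions chosen for the illustration, not values taken from the Boston data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)

# Synthetic two-cluster data (hypothetical centres and spreads).
cluster_a = rng.normal(loc=[5.0, 18.0], scale=[2.0, 2.0], size=(300, 2))
cluster_b = rng.normal(loc=[24.0, 20.0], scale=[0.3, 0.3], size=(100, 2))
X = np.vstack([cluster_a, cluster_b])

# Fit mixtures with 1 and 2 components and record the BIC of each model.
# A lower BIC indicates a better trade-off between fit and model complexity.
bic = {}
for k in (1, 2):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic[k] = gmm.bic(X)

print(bic)
```

A clearly lower BIC for the two-component model is a hint that a single Gaussian, and hence a single covariance estimate, does not describe the data well.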

Second example

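The second example shows the ability of the Minimum Covariance Determinant robust estimator to concentrate on the main mode of the data distribution: the location appears to be well estimated, although the covariance is hard to estimate because of the banana-shaped distribution. In any case, some of the more distant outliers can be removed. The One-Class SVM is able to capture the real data structure, but the difficulty lies in tuning its kernel bandwidth parameter so as to reach a good compromise between the shape of the data scatter matrix and the risk of overfitting the data.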
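The effect of the kernel bandwidth can be sketched by fitting `OneClassSVM` with several values of `gamma` and counting the support vectors. The curved point cloud below is synthetic and the three `gamma` values are arbitrary choices for illustration. A model that needs almost every training point as a support vector is a typical sign of overfitting.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Synthetic curved ("banana"-like) cloud; NOT the Boston data.
x = rng.uniform(-2.0, 2.0, size=300)
y = x ** 2 + rng.normal(scale=0.3, size=300)
X = np.column_stack([x, y])

# For an RBF kernel, a larger gamma means a narrower kernel, so the
# frontier follows individual training points more closely.
n_support = {}
for gamma in (0.01, 1.0, 100.0):
    clf = OneClassSVM(nu=0.1, gamma=gamma).fit(X)
    n_support[gamma] = clf.support_vectors_.shape[0]

print(n_support)
```

With a very large `gamma` the frontier tends to wrap around single observations, while a very small `gamma` gives a smooth frontier that cannot follow the curved shape. A usable value lies somewhere in between and has to be found for each data set.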

Code implementation [Python]


# -*- coding: utf-8 -*-
"""Outlier detection on a real data set (Boston housing)."""
print(__doc__)

# Author: Virgile Fritsch 
# License: BSD 3 clause

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt
import matplotlib.font_manager
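# Note: load_boston was removed in scikit-learn 1.2, so this script needs an
# older scikit-learn release to run as written.
from sklearn.datasets import load_boston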

# Load the data: two 2-D subsets of the Boston housing features
X1 = load_boston()['data'][:, [8, 10]]  # two clusters
X2 = load_boston()['data'][:, [5, 12]]  # "banana"-shaped

# Define the classifiers
classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.,
                                             contamination=0.261),
    "Robust Covariance (Minimum Covariance Determinant)":
    EllipticEnvelope(contamination=0.261),
    "OCSVM": OneClassSVM(nu=0.261, gamma=0.05)}
colors = ['m', 'g', 'b']
legend1 = {}
legend2 = {}

# Learn an outlier detection frontier with each classifier
xx1, yy1 = np.meshgrid(np.linspace(-8, 28, 500), np.linspace(3, 40, 500))
xx2, yy2 = np.meshgrid(np.linspace(3, 10, 500), np.linspace(-5, 45, 500))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    plt.figure(1)
    clf.fit(X1)
    Z1 = clf.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
    Z1 = Z1.reshape(xx1.shape)
    legend1[clf_name] = plt.contour(
        xx1, yy1, Z1, levels=[0], linewidths=2, colors=colors[i])
    plt.figure(2)
    clf.fit(X2)
    Z2 = clf.decision_function(np.c_[xx2.ravel(), yy2.ravel()])
    Z2 = Z2.reshape(xx2.shape)
    legend2[clf_name] = plt.contour(
        xx2, yy2, Z2, levels=[0], linewidths=2, colors=colors[i])

legend1_values_list = list(legend1.values())
legend1_keys_list = list(legend1.keys())

# Plot the results (= shape of the data points cloud)
plt.figure(1)  # two clusters
plt.title("Outlier detection on a real data set (boston housing)")
plt.scatter(X1[:, 0], X1[:, 1], color='black')
bbox_args = dict(boxstyle="round", fc="0.8")
arrow_args = dict(arrowstyle="->")
plt.annotate("several confounded points", xy=(24, 19),
             xycoords="data", textcoords="data",
             xytext=(13, 10), bbox=bbox_args, arrowprops=arrow_args)
plt.xlim((xx1.min(), xx1.max()))
plt.ylim((yy1.min(), yy1.max()))
plt.legend((legend1_values_list[0].collections[0],
            legend1_values_list[1].collections[0],
            legend1_values_list[2].collections[0]),
           (legend1_keys_list[0], legend1_keys_list[1], legend1_keys_list[2]),
           loc="upper center",
           prop=matplotlib.font_manager.FontProperties(size=12))
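# X1[:, 0] is column 8 (accessibility to radial highways) and is plotted on
# the x-axis; X1[:, 1] is column 10 (pupil-teacher ratio) on the y-axis.
plt.xlabel("accessibility to radial highways")
plt.ylabel("pupil-teacher ratio by town")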

legend2_values_list = list(legend2.values())
legend2_keys_list = list(legend2.keys())

plt.figure(2)  # "banana" shape
plt.title("Outlier detection on a real data set (boston housing)")
plt.scatter(X2[:, 0], X2[:, 1], color='black')
plt.xlim((xx2.min(), xx2.max()))
plt.ylim((yy2.min(), yy2.max()))
plt.legend((legend2_values_list[0].collections[0],
            legend2_values_list[1].collections[0],
            legend2_values_list[2].collections[0]),
           (legend2_keys_list[0], legend2_keys_list[1], legend2_keys_list[2]),
           loc="upper center",
           prop=matplotlib.font_manager.FontProperties(size=12))
plt.ylabel("% lower status of the population")
plt.xlabel("average number of rooms per dwelling")

plt.show()