异常检测

来源：互联网发布：网络销售贵金属好做吗编辑：程序博客网时间：2024/04/30 03:45

异常检测

异常检测(Anomaly Detection):异常检测就是从数据集中检测出异常样本，是一种无监督学习。

引例

飞机制造商在飞机引擎从生产线上流出时，会考虑进行异常检测，以防止不合格引擎对整机造成的巨大影响，而为了进行异常检测，通常就需要采集一些特征，比如会采集如下特征：

x1=引擎运转时产生的热量
x2=引擎的振荡频率

对于一系列的数据集（特征向量集合）：x(1),⋯,x(m)x(1),⋯,x(m), 这些数据都是正常样本，我们将其绘制到二维平面上：

如果一个新的测试样本居于样本布密度较大的地方如：

正常飞机引擎

那么我们有很大的把握认为这个测试样本是正常的。
反之如果一个新的测试样本远离分布集中的地方如：

异常飞机引擎

那么我们也有很大的把握认为这个测试样本是正常的。

小结：
如果我们拥有一个测试集x(1),⋯,x(m),我们根据已知的数据集建立模型p(x)，该模型可以将正常样本与异常样本分离。

建立模型

高斯分布(正态分布)

正态分布可以表示成X∼N(μ,δ2),表示X服从均值为μ,方差为δ2的正态分布。

P (x; μ, δ 2) = 1 2 π - - \sqrt δ e x p (- ( x - μ ) 2 2 δ 2)

参数估计：
若有

x(1),⋯,x(m) ，

x(i)∼N(μ,δ2)

μ = 1 m Σ m i = 1 x (i) δ 2 = 1 m Σ m i = 1 (x (i) - μ) 2

证明可以中最大似然估计。

异常检测算法

训练集：x(1),⋯,x(m) ，x(i)∼N(μi,δ2i)
建立模型：

P (X) = P (x (1); μ 1, δ 21) *, . . ., * P (x (m); μ m, δ 2 m) = Π m i = 1 P (x (i); μ i, δ 2 i)

参数拟合：

μ j = 1 m Σ m i = 1 x (i) j δ 2 j = 1 m Σ m i = 1 (x (i) j - μ j) 2

计算

P(X)

P (X) = Π n j = 1 1 2 π - - \sqrt δ j e x p (- ( x j - μ j ) 2 2 δ 2 j)

判断

P(X)是否小于

ϵ，若小于

ϵ则为异常。

异常检测算法的评估

对数据按6:2:2比例进行分配,分别为训练集，交叉验证集，测试集，训练集中全是无标签数据，异常数据在交叉验证集与测试集中按比例进行分配
通过训练集对参数进行拟合
对交叉验证集和测试集中的数据进行测试
由于异常样本的数量非常的少，导致预测十分偏斜，可以通过考察准确率，召回率，F1值来评估模型的效果。
通过交叉验证集来调节参数ϵ

异常检测与监督学习

因为我们可能已经知道了训练数据是否为异常数据，那么就难免有个疑惑我们为什么不用监督学习的算法比如logistics regression来做呢？
下面我们来比较一下异常检测与监督学习

项目异常检测逻辑回归样本异常样本数量少（0~20），大量负样本正负样本数量都很多应用欺诈检测，工业制造，数据中心的监测机器垃圾邮件分类，天气预报，癌症判断

注：大量的正样本可以让算法学习到正样本的特征，并且肯能出现的正样本与训练集中的正样本相似，而异常可能是从未出现过的异常。

数据处理

通常我们先画出特征值的柱状图，看其是否接近与高斯分布，若不是我们可以对特征值进行相关的处理，使其接近于高斯分布，例如取对数，取幂等等。特征值的分布越接近高斯分布则算法的效果越好。

多元高斯分布

我们不再单独考虑每个特征值的高斯分布而是考虑特征向量X的高斯分布

P (X; μ, Σ) = 1 ( 2 π ) 2 n | Σ | 1 2 e x p (- 1 2 (X - μ) τ Σ - 1 (X - μ))

算法流程

参数拟合

μ = 1 m Σ m i = 1 x (i) Σ = 1 m Σ m i = 1 (x (i) - μ) (x (i) - μ) τ

剩下的流程同高斯分布相同。

高斯分布与多元高斯分布比较

高斯分布多元高斯分布需要手动创建新的特征去捕获不正常变量值的组合自动捕获不同特征变量之间的相关性运算亮小，适应n很大的情况，即使m很小也可以运行的很好计算量大，m必须大于n,通常当m>=10时才考虑

注：如果发现Σ是不可逆的一般有两种情况

m< n
有冗余变量（变量间存在线性相关的关系）

MatlabCode

参数拟合Code

function [mu sigma2] = estimateGaussian(X)%ESTIMATEGAUSSIAN This function estimates the parameters of a %Gaussian distribution using the data in X%   [mu sigma2] = estimateGaussian(X), %   The input X is the dataset with each n-dimensional data point in one row%   The output is an n-dimensional vector mu, the mean of the data set%   and the variances sigma^2, an n x 1 vector% % Useful variables[m, n] = size(X);% You should return these values correctlymu = zeros(n, 1);sigma2 = zeros(n, 1);% ====================== YOUR CODE HERE ======================% Instructions: Compute the mean of the data and the variances%               In particular, mu(i) should contain the mean of%               the data for the i-th feature and sigma2(i)%               should contain variance of the i-th feature.%for i=1:n    mu(i)=sum(X(:,i))/m;end;for i=1:n    sigma2(i)=sum((X(:,i)-mu(i)).^2)/m;end;  % =============================================================end

更新ϵ

function [bestEpsilon bestF1] = selectThreshold(yval, pval)%SELECTTHRESHOLD Find the best threshold (epsilon) to use for selecting%outliers%   [bestEpsilon bestF1] = SELECTTHRESHOLD(yval, pval) finds the best%   threshold to use for selecting outliers based on the results from a%   validation set (pval) and the ground truth (yval).%bestEpsilon = 0;bestF1 = 0;F1 = 0;stepsize = (max(pval) - min(pval)) / 1000;for epsilon = min(pval):stepsize:max(pval)    % ====================== YOUR CODE HERE ======================    % Instructions: Compute the F1 score of choosing epsilon as the    %               threshold and place the value in F1. The code at the    %               end of the loop will compare the F1 score for this    %               choice of epsilon and set it to be the best epsilon if    %               it is better than the current choice of epsilon.    %                   % Note: You can use predictions = (pval < epsilon) to get a binary vector    %       of 0's and 1's of the outlier predictions    predicted = (pval<epsilon);    truepostive = sum((predicted==1)&(yval==1));    falsepostive = sum((predicted==1)&(yval==0));    falsenegative = sum((predicted==0)&(yval==1));    pre = truepostive/(truepostive+falsepostive);    rec = truepostive/(truepostive+falsenegative);    F1 = 2*pre*rec/(pre+rec);    % =============================================================    if F1 > bestF1       bestF1 = F1;       bestEpsilon = epsilon;    endendend

1 0