Imbalanced Data

来源：互联网发布：学编程半天不出成果编辑：程序博客网时间：2024/06/03 21:46

1:什么是Imbalanced Data

类不平衡（class-imbalance）是指在训练分类器中所使用的训练集的类别分布不均。比如说一个二分类问题，1000个训练样本，比较理想的情况是正类、负类样本的数量相差不多；而如果正类样本有995个、负类样本仅5个，就意味着存在类不平衡。

have a binary classification problem and one class is present with 60:1 ratio in my training set. I used the logistic regression and the result seems to just ignores one class.

I am working on a classification model. In my dataset I have three different labels to be classified, let them be A, B and C. But in the training dataset I have A dataset with 70% volume, B with 25% and C with 5%. Most of time my results are overfit to A. Can you please suggest how can I solve this problem?

What is Imbalanced Data?

Imbalanced data typically refers to a problem with classification problems where the classes are not represented equally.

For example, you may have a 2-class (binary) classification problem with 100 instances (rows). A total of 80 instances are labeled with Class-1 and the remaining 20 instances are labeled with Class-2.

This is an imbalanced dataset and the ratio of Class-1 to Class-2 instances is 80:20 or more concisely 4:1.

You can have a class imbalance problem on two-class classification problems as well as multi-class classification problems. Most techniques can be used on either.

8 Tactics To Combat Imbalanced Training Data

1) Can You Collect More Data?

You might think it’s silly, but collecting more data is almost always overlooked.

Can you collect more data? Take a second and think about whether you are able to gather more data on your problem.

A larger dataset might expose a different and perhaps more balanced perspective on the classes.

More examples of minor classes may be useful later when we look at resampling your dataset.

2) Try Changing Your Performance Metric

Accuracy is not the metric to use when working with an imbalanced dataset. We have seen that it is misleading.

There are metrics that have been designed to tell you a more truthful story when working with imbalanced classes.

I give more advice on selecting different performance measures in my post “Classification Accuracy is Not Enough: More Performance Measures You Can Use“.

In that post I look at an imbalanced dataset that characterizes the recurrence of breast cancer in patients.

From that post, I recommend looking at the following performance measures that can give more insight into the accuracy of the model than traditional classification accuracy:

Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).
Precision: A measure of a classifiers exactness.
Recall: A measure of a classifiers completeness
F1 Score (or F-score): A weighted average of precision and recall.

I would also advice you to take a look at the following:

Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
ROC Curves: Like precision and recall, accuracy is divided into sensitivity and specificity and models can be chosen based on the balance thresholds of these values.

You can learn a lot more about using ROC Curves to compare classification accuracy in our post “Assessing and Comparing Classifier Performance with ROC Curves“.

Still not sure? Start with kappa, it will give you a better idea of what is going on than classification accuracy.

3)对数据集进行重采样

可以使用一些策略该减轻数据的不平衡程度。该策略便是采样(sampling)，主要有两种采样方法来降低数据的不平衡性。

对小类的数据样本进行采样来增加小类的数据样本个数，即过采样（over-sampling ，采样的个数大于该类样本的个数）。对大类的数据样本进行采样来减少该类数据样本的个数，即欠采样（under-sampling，采样的次数少于该类样本的个素）。

采样算法往往很容易实现，并且其运行速度快，并且效果也不错。更详细的内容参见这里。
一些经验法则：

考虑对大类下的样本（超过1万、十万甚至更多）进行欠采样，即删除部分样本；

考虑对小类下的样本（不足1为甚至更少）进行过采样，即添加部分样本的副本；

考虑尝试随机采样与非随机采样两种采样方法；
考虑对各类别尝试不同的采样比例，比一定是1:1，有时候1:1反而不好，因为与现实情况相差甚远；

考虑同时使用过采样与欠采样。

4)尝试产生人工数据样本

一种简单的人工样本数据产生的方法便是，对该类下的所有样本每个属性特征的取值空间中随机选取一个组成新的样本，即属性值随机采样。你可以使用基于经验对属性值进行随机采样而构造新的人工样本，或者使用类似朴素贝叶斯方法假设各属性之间互相独立进行采样，这样便可得到更多的数据，但是无法保证属性之前的线性关系（如果本身是存在的）。
有一个系统的构造人工数据样本的方法SMOTE(Synthetic Minority Over-sampling Technique)。SMOTE是一种过采样算法，它构造新的小类样本而不是产生小类中已有的样本的副本，即该算法构造的数据是新样本，原数据集中不存在的。该基于距离度量选择小类别下两个或者更多的相似样本，然后选择其中一个样本，并随机选择一定数量的邻居样本对选择的那个样本的一个属性增加噪声，每次处理一个属性。这样就构造了更多的新生数据。具体可以参见原始论文。
这里有SMOTE算法的多个不同语言的实现版本：

Python: UnbalancedDataset模块提供了SMOTE算法的多种不同实现版本，以及多种重采样算法。
R: DMwR package。

Weka: SMOTE supervised filter。

5）尝试不同的分类算法

强烈建议不要对待每一个分类都使用自己喜欢而熟悉的分类算法。应该使用不同的算法对其进行比较，因为不同的算法使用于不同的任务与数据。具体可以参见“Why you should be Spot-Checking Algorithms on your Machine Learning Problems”。
决策树往往在类别不均衡数据上表现不错。它使用基于类变量的划分规则去创建分类树，因此可以强制地将不同类别的样本分开。目前流行的决策树算法有：C4.5、C5.0、CART和Random Forest等。基于R编写的决策树参见这里。基于Python的Scikit-learn的CART使用参见这里。
6）尝试对模型进行惩罚
你可以使用相同的分类算法，但是使用一个不同的角度，比如你的分类任务是识别那些小类，那么可以对分类器的小类样本数据增加权值，降低大类样本的权值（这种方法其实是产生了新的数据分布，即产生了新的数据集，译者注），从而使得分类器将重点集中在小类样本身上。一个具体做法就是，在训练分类器时，若分类器将小类样本分错时额外增加分类器一个小类样本分错代价，这个额外的代价可以使得分类器更加“关心”小类样本。如penalized-SVM和penalized-LDA算法。
Weka中有一个惩罚模型的通用框架CostSensitiveClassifier，它能够对任何分类器进行封装，并且使用一个自定义的惩罚矩阵对分错的样本进行惩罚。
如果你锁定一个具体的算法时，并且无法通过使用重采样来解决不均衡性问题而得到较差的分类结果。这样你便可以使用惩罚模型来解决不平衡性问题。但是，设置惩罚矩阵是一个复杂的事，因此你需要根据你的任务尝试不同的惩罚矩阵，并选取一个较好的惩罚矩阵。
7）尝试一个新的角度理解问题
我们可以从不同于分类的角度去解决数据不均衡性问题，我们可以把那些小类的样本作为异常点(outliers)，因此该问题便转化为异常点检测(anomaly detection)与变化趋势检测问题(change detection)。
异常点检测即是对那些罕见事件进行识别。如通过机器的部件的振动识别机器故障，又如通过系统调用序列识别恶意程序。这些事件相对于正常情况是很少见的。
变化趋势检测类似于异常点检测，不同在于其通过检测不寻常的变化趋势来识别。如通过观察用户模式或银行交易来检测用户行为的不寻常改变。
将小类样本作为异常点这种思维的转变，可以帮助考虑新的方法去分离或分类样本。这两种方法从不同的角度去思考，让你尝试新的方法去解决问题。
8）尝试创新
仔细对你的问题进行分析与挖掘，是否可以将你的问题划分成多个更小的问题，而这些小问题更容易解决。你可以从这篇文章In classification, how do you handle an unbalanced training set?中得到灵感。例如：

将你的大类压缩成小类；
使用One Class分类器（将小类作为异常点）；
使用集成方式，训练多个分类器，然后联合这些分类器进行分类；

相关链接：

http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

http://blog.csdn.net/heyongluoyao8/article/details/49408131

http://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650718717&idx=1&sn=85038d7c906c135120a8e1a2f7e565ad&scene=0#wechat_redirect

https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650724464&idx=1&sn=1f34358862bacfb4c7ea17c864d8c44d&chksm=871b1c0eb06c95180e717d8316b0380602f638a764530b4b9e35ac812c7c33799d3357d46f00&scene=0&key=0f5e635eeb6bf20a076ad60d7f11c6ef5c5c1c8f02873bc8b458381b629a1e2ae76174d0d4ba34331c71d095e3b3b92aa7fff5e1e11badeaf6c87ff90fd264f3dc6b1eb074eaccb2ac46e8f2d440cefd&ascene=0&uin=MTU1NTY3MTA0Mg%3D%3D&devicetype=iMac+MacBookPro12%2C1+OSX+OSX+10.11.6+build(15G1217)&version=12010310&nettype=WIFI&fontScale=100&pass_ticket=csWk%2BJXfpl7rA8r527fLqF%2BF3EZEeBKpFRjI%2BWMXoPf2PEtPt%2FLMrscLX4GBl7gg

阅读全文

1 0