Splitting Attribute Measures in Decision Tree Learning (ML)


Ref: Chapter 8 in Data Mining: Concepts and Techniques, 3rd Ed., by Han et al.

Information Gain

strong: easy to implement

weak: it is biased toward attributes with a large number of distinct values. Such an attribute produces many small partitions, so each partition may look pure and the gain appears high, yet the split generalizes poorly. For example, in traffic classification, using packet size as the splitting attribute yields a separate partition (child node) for values such as 20, 50, 100, 1200, and 1500 bytes, as the sketch below illustrates.
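As a concrete illustration, here is a minimal Python sketch of information gain following the Info(D) and Gain(A) definitions in the cited chapter. The packet-size toy data and the helper names (entropy, information_gain) are made up for this example, not taken from the source.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution of D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the entropy of
    the partitions induced by attribute A, weighted by partition size."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - info_a

# Hypothetical traffic-classification toy data: a many-valued attribute
# such as packet size produces many tiny, individually "pure" partitions,
# so its gain looks high even though the split generalizes poorly.
packet_size = [20, 50, 100, 1200, 1500, 20, 50]
app_class   = ['web', 'web', 'p2p', 'p2p', 'p2p', 'web', 'p2p']
print(information_gain(packet_size, app_class))  # deceptively high gain
```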

Information Gain + Gain Ratio

strong: mitigates the weakness of information gain by using the gain ratio to favor splitting attributes that produce relatively fewer partitions. It normalizes the gain by the split information, trading off the information gained with respect to the classes against the information generated with respect to the outcome partitions.

weak: as the split information approaches 0, the ratio becomes unstable. To avoid this, a constraint is added: the information gain of the selected test must be large.
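Continuing the sketch above (this reuses the information_gain helper), the gain ratio simply normalizes the gain by the split information. The split_info and gain_ratio names are again illustrative, and the zero guard below is one way, not the only way, to handle the instability just mentioned.

```python
import math
from collections import Counter

def split_info(values):
    """SplitInfo_A(D) = -sum_j |D_j|/|D| * log2(|D_j|/|D|): the potential
    information generated by splitting D into the partitions of A."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D). The ratio blows up as
    SplitInfo approaches 0 (one partition holds nearly all tuples),
    hence the requirement that the underlying gain itself be large."""
    si = split_info(values)
    if si == 0.0:  # single partition: the split carries no information
        return 0.0
    return information_gain(values, labels) / si
```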

Gini Index

The Gini index measures the impurity of a training data set or of a partition. It considers binary splits: for a discrete-valued attribute, the subset of values that gives the minimum Gini index is selected as that attribute's splitting subset.

weak: it becomes difficult to apply when the number of classes is large
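A minimal Python sketch of the Gini-index computation for one candidate binary split, following the Gini(D) definition in the cited chapter. The income/label toy data, the candidate subset, and the function names are hypothetical; a real learner would enumerate the nontrivial value subsets and keep the minimizer.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 over the class distribution of D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(values, labels, subset):
    """Gini_A(D) for the binary split D1 = {tuples with A in subset},
    D2 = the rest, weighted by partition size."""
    d1 = [y for v, y in zip(values, labels) if v in subset]
    d2 = [y for v, y in zip(values, labels) if v not in subset]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# Hypothetical toy data: evaluate one candidate splitting subset of the
# attribute "income"; the subset minimizing Gini_A(D) would be chosen.
income = ['low', 'medium', 'high', 'medium', 'low', 'high']
label  = ['yes', 'yes', 'no', 'yes', 'no', 'no']
print(gini_index(income, label, {'low', 'medium'}))  # 0.25
```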

