Rethinking "A refinement..."

Source: Internet | Editor: 程序博客网 | Date: 2024/05/17 19:17

The paper "Hierarchically classifying documents using very few words" gives a better explanation of why refinement works without overfitting. It proposes a new classification method organized as a hierarchy. Its procedure resembles that of "A refinement approach to handling model misfit in text categorization" (a binary classifier), but it is more complex and more manual (note that it is not a binary classifier). The hierarchy is constructed using mutual information and feature selection. The main idea is as follows:

"...The flattened classifier loses the intuition that topics that are close to each other in the hierarchy have a lot more in common with each other, in general, than topics that are far apart. Therefore, even when it is difficult to find the precise topic of a document, it may be easy to decide whether it is about "agriculture" or about "computers".
...
The key insight is that each of these subtasks is significantly simpler than the original task..."
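The two-level routing described in the quote can be sketched with a toy Naive Bayes classifier. Everything below — the topic names, the hierarchy, and the training documents — is invented for illustration and is not taken from the paper:

```python
from collections import Counter
import math

# Toy two-level hierarchy: route a document to a coarse topic first,
# then pick a subtopic within it. Topics, documents, and hierarchy
# are all invented for illustration.
HIERARCHY = {
    "agriculture": ["crops", "livestock"],
    "computers": ["hardware", "software"],
}

TRAIN = {
    "crops":     ["wheat corn harvest soil", "corn yield soil irrigation"],
    "livestock": ["cattle sheep grazing feed", "dairy cattle feed barn"],
    "hardware":  ["cpu memory chip circuit", "gpu chip transistor board"],
    "software":  ["compiler code python bug", "algorithm code library bug"],
}

def classify(doc, docs_by_class):
    """Multinomial Naive Bayes with add-one smoothing, trained on the fly."""
    counts = {c: Counter(" ".join(docs).split())
              for c, docs in docs_by_class.items()}
    vocab = {w for cnt in counts.values() for w in cnt}
    def log_prob(c):
        total = sum(counts[c].values())
        return sum(math.log((counts[c][w] + 1) / (total + len(vocab)))
                   for w in doc.split())
    return max(docs_by_class, key=log_prob)

def classify_hierarchical(doc):
    # Level 1: coarse topic, pooling each topic's subtopic training data.
    coarse = {top: [d for s in subs for d in TRAIN[s]]
              for top, subs in HIERARCHY.items()}
    top = classify(doc, coarse)
    # Level 2: choose only among that topic's subtopics.
    sub = classify(doc, {s: TRAIN[s] for s in HIERARCHY[top]})
    return top, sub

print(classify_hierarchical("corn soil harvest"))   # ('agriculture', 'crops')
print(classify_hierarchical("compiler bug code"))   # ('computers', 'software')
```

Each subtask sees fewer competing classes and a more focused vocabulary than a flat classifier over all four leaves, which is the "significantly simpler subtasks" point the quote makes.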

Compared with "A refinement...", that paper's procedure is implicit in the refinement approach: there is no mutual information used to decide which features belong to each node, as in a decision tree; rather, like boosting, it operates on misclassified examples. The effect should be the same: get rid of confusing, noisy, and irrelevant examples (or words) by selecting the misclassified examples (the correctly classified ones need not be considered). For binary classification, though, this explanation is problematic: the number of categories is one. I think the explanation should instead be: rather than semantic word noise, the noise in binary classification is due to data skew — the words in the training examples are not uniformly distributed, so the term P(w|c) is not properly normalized. Keeping the misclassified examples in play can alleviate this situation.
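The boosting-like reading above can be sketched as follows: train a Naive Bayes classifier on a skewed toy set, collect the training examples it misclassifies, upweight them, and retrain. This is only an illustration of the idea under assumed toy data and a simple duplication-based upweighting scheme — not the exact procedure of either paper:

```python
from collections import Counter
import math

# Skewed toy set: five "pos" documents vs. two "neg" documents.
# One "pos" document deliberately looks like a "neg" one.
TRAIN = [
    ("pos", "win win prize"),
    ("pos", "win prize money"),
    ("pos", "win money prize"),
    ("pos", "meeting win schedule"),
    ("pos", "meeting schedule deadline"),   # the hard example
    ("neg", "meeting schedule agenda"),
    ("neg", "meeting project deadline"),
]

def fit(examples):
    """Multinomial Naive Bayes with class priors and add-one smoothing."""
    counts = {"pos": Counter(), "neg": Counter()}
    n = Counter()
    for c, doc in examples:
        counts[c].update(doc.split())
        n[c] += 1
    vocab = {w for cnt in counts.values() for w in cnt}
    priors = {c: n[c] / len(examples) for c in counts}
    return counts, vocab, priors

def predict(doc, counts, vocab, priors):
    def log_prob(c):
        total = sum(counts[c].values())
        return math.log(priors[c]) + sum(
            math.log((counts[c][w] + 1) / (total + len(vocab)))
            for w in doc.split())
    return max(counts, key=log_prob)

# Stage 1: the skewed word distribution pulls the hard "pos" document
# across the decision boundary.
model1 = fit(TRAIN)
missed = [(c, d) for c, d in TRAIN if predict(d, *model1) != c]
print(missed)  # [('pos', 'meeting schedule deadline')]

# Stage 2: retrain with the misclassified examples duplicated
# (upweighted); the refined model now gets the hard example right
# without disturbing the "neg" documents.
model2 = fit(TRAIN + missed * 2)
print(predict("meeting schedule deadline", *model2))  # 'pos'
```

The point of the sketch is the skew argument: the mistake is not caused by semantically noisy words but by P(w|c) estimates distorted by the uneven class sizes, and concentrating on the misclassified examples corrects exactly those estimates.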

The next problem is overfitting. According to the explanation above, it is inevitable, because the word distribution the classifier reflects is just the distribution of the training examples. Perhaps the experiments in "A refinement..." are biased, especially on the second data collection, Usenet.
