All Machine Learning Models Have Flaws (by John Langford)


A very nice summary.


Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I've created a table (below) outlining the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.
  1. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their "home framework" through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students.
  2. Algorithms which conform to multiple approaches can have substantial value. "I don't really understand it yet, because I only understand it one way". Reinterpretation alone is not the goal; we want algorithmic guidance.
  3. We need to remain constantly open to new mathematical models of machine learning. It's common to forget the flaws of the model that you are most familiar with in evaluating other models, while the flaws of new models get exaggerated. The best way to avoid this is simply education.
  4. The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary of what is wrong with various frameworks for learning. To avoid being entirely negative, I added a column about what's right as well.

Name: Bayesian Learning
Methodology: You specify a prior probability distribution over data-makers, P(datamaker), then use Bayes law to find a posterior P(datamaker|x). True Bayesians integrate over the posterior to make predictions, while many simply use the world with largest posterior directly.
What's right: Handles the small data limit. Very flexible. Interpolates to engineering.
What's wrong:
  1. Information theoretically problematic. Explicitly specifying a reasonable prior is often hard.
  2. Computationally difficult problems are commonly encountered.
  3. Human intensive. Partly due to the difficulties above, and partly because "first specify a prior" is built into the framework, this approach is not very automatable.
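The mechanics of the Bayesian recipe above can be sketched in a few lines. This is a minimal illustrative example, not from the original post: the three candidate coin biases, the uniform prior, and the observed flips are all assumptions chosen for the sketch.

```python
# Bayesian recipe: prior over data-makers, likelihood of observed
# data, posterior via Bayes law. All numbers are illustrative.

biases = [0.3, 0.5, 0.7]     # candidate data-makers: P(heads)
prior = [1/3, 1/3, 1/3]      # prior P(datamaker)
data = [1, 1, 0, 1]          # observed flips (1 = heads)

def likelihood(p, flips):
    out = 1.0
    for f in flips:
        out *= p if f == 1 else (1 - p)
    return out

unnorm = [pr * likelihood(b, data) for b, pr in zip(biases, prior)]
z = sum(unnorm)
posterior = [u / z for u in unnorm]  # P(datamaker | x)

# Picking the largest posterior uses one world directly; a true
# Bayesian integrates over the posterior to predict the next flip.
map_bias = biases[max(range(len(biases)), key=lambda i: posterior[i])]
pred_heads = sum(p * b for p, b in zip(posterior, biases))
```

The gap between `map_bias` and `pred_heads` is exactly the distinction the methodology column draws between MAP users and true Bayesians.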
Name: Graphical/generative Models
Methodology: Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed or varying length data. Data-makers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist.
What's right: Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation.
What's wrong:
  1. Often (still) fails to fix problems with the Bayesian approach.
  2. In real world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence.
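How encoded conditional independencies speed up prediction can be sketched on the smallest possible graph. The chain A -> B -> C and all its probability tables below are illustrative assumptions for this sketch, not from the post.

```python
# A chain-structured graphical model A -> B -> C: C is independent
# of A given B, so the joint factorizes as P(a)P(b|a)P(c|b).
# All probability tables are illustrative assumptions.

p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}

# The factorization lets us marginalize one variable at a time
# (variable elimination) rather than summing over every joint
# configuration, which is what makes some graphs fast.
p_b = {b: sum(p_a[a] * p_b_given_a[a][b] for a in p_a) for b in (0, 1)}
p_c = {c: sum(p_b[b] * p_c_given_b[b][c] for b in (0, 1)) for c in (0, 1)}
```

The "what's wrong" point is visible here too: if C actually depends on A directly, the factorization above silently computes the wrong marginal.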
Name: Convex Loss Optimization
Methodology: Specify a loss function related to the world-imposed loss function which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optima.
What's right: Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable.
What's wrong:
  1. The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous.
  2. Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren't single layer linear combinations is often difficult.
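The convex-surrogate idea can be made concrete with the logistic loss standing in for the world-imposed 0-1 loss on a linear predictor. The example point and weight vectors below are illustrative assumptions for this sketch.

```python
import math

# The world imposes the nonconvex 0-1 loss; convex loss
# optimization substitutes a convex surrogate such as the
# logistic loss. Example data and weights are illustrative.

def zero_one_loss(w, x, y):
    # World-imposed loss: a step function, nonconvex in w.
    return 0.0 if y * sum(wi * xi for wi, xi in zip(w, x)) > 0 else 1.0

def logistic_loss(w, x, y):
    # Convex surrogate: smooth and convex in w.
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-margin))

x, y = [2.0, 1.0], 1
w1, w2 = [0.0, 0.0], [1.0, -1.0]
wm = [(a + b) / 2 for a, b in zip(w1, w2)]

# Convexity check: the loss at the midpoint of two weight vectors
# never exceeds the average of the losses at the endpoints.
lhs = logistic_loss(wm, x, y)
rhs = 0.5 * (logistic_loss(w1, x, y) + logistic_loss(w2, x, y))
```

The mismatch the "what's wrong" column warns about is visible in the two functions: minimizing `logistic_loss` is tractable precisely because it is not the loss the world actually charges.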
Name: Gradient Descent
Methodology: Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters.
What's right: Relatively computationally tractable due to (a) modularity of gradient descent (b) directly optimizing the quantity you want to predict.
What's wrong:
  1. Finicky. There are issues with parameter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system and there is little theoretical guidance.
  2. Overfitting is a significant issue.
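The recipe is short enough to write out in full. A minimal sketch on an assumed one-parameter architecture y = w * x with made-up data; the step size is exactly the kind of finicky choice the entry complains about.

```python
# Gradient descent: pick an architecture with free parameters,
# follow the gradient of the training loss. Data, architecture,
# and step size below are illustrative assumptions.

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x, y) pairs, y ~ 2x

w = 0.0        # the single free parameter of y = w * x
step = 0.01    # step size: too large diverges, too small crawls
for _ in range(500):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= step * grad
```

Here the loop converges to the least-squares solution; with a poor initialization or step size on a less friendly architecture it would not, which is the "little theoretical guidance" point.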
Name: Kernel-based Learning
Methodology: You choose a kernel K(x,x') between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning.
What's right: People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical (O(n^2), for instance).
What's wrong: Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n^2) is not efficient enough when there is much data.

Name: Boosting
Methodology: You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor.
What's right: A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy to apply for beginners.
What's wrong: The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.

Name: Online Learning with Experts
Methodology: You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret.
What's right: This is an effective automated method to extract performance from a pool of predictors.
What's wrong: Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.

Name: Learning Reductions
Methodology: You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner.
What's right: The reductions approach can yield highly automated learning algorithms.
What's wrong: The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.

Name: PAC Learning
Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner.
What's right: The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute.
What's wrong: There are not many substantial positive results, particularly when D is noisy. Data isn't IID in practice anyways.

Name: Statistical Learning Theory
Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses.
What's right: There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis.
What's wrong: The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).

Name: Decision Tree Learning
Methodology: Learning is a process of cutting up the input space and assigning predictions to pieces of the space.
What's right: Decision tree algorithms are well automated and can be quite fast.
What's wrong: There are learning problems which can not be solved by decision trees, but which are solvable. It's common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.

Name: Algorithmic Complexity
Methodology: Learning is about finding a program which correctly predicts the outputs given the inputs.
What's right: Any reasonable problem is learnable with a number of samples related to the description length of the program.
What's wrong: The theory literally suggests solving halting problems to solve machine learning.

Name: RL, MDP Learning
Methodology: Learning is about finding and acting according to a near optimal policy in an unknown Markov Decision Process.
What's right: We can learn and act with an amount of summed regret related to O(SA), where S is the number of states and A is the number of actions per state.
What's wrong: Has anyone counted the number of states in real world problems? We can't afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.

Name: RL, POMDP Learning
Methodology: Learning is about finding and acting according to a near optimal policy in a Partially Observed Markov Decision Process.
What's right: In a sense, we've made no assumptions, so algorithms have wide applicability.
What's wrong: All known algorithms scale badly with the number of hidden states.
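Of the frameworks above, the experts setting is concrete enough to sketch end to end: a master algorithm reweights a pool of base predictors multiplicatively so its cumulative loss tracks the best expert's. The two constant experts, the outcome sequence, and the learning rate below are illustrative assumptions for this sketch.

```python
import math

# Exponentially weighted master over a pool of experts. The master
# has no idea which expert is good; it only reweights them by their
# observed losses. All concrete choices here are illustrative.

T = 100
outcomes = [1 if t % 3 == 0 else 0 for t in range(T)]  # world's bits
experts = [lambda t: 0, lambda t: 1]                   # base predictors
eta = 0.5                                              # learning rate
weights = [1.0, 1.0]

master_loss = 0.0
expert_loss = [0.0, 0.0]
for t, y in enumerate(outcomes):
    preds = [e(t) for e in experts]
    total = sum(weights)
    master = sum(w * p for w, p in zip(weights, preds)) / total
    master_loss += abs(master - y)          # absolute loss in [0, 1]
    for i, p in enumerate(preds):
        loss = abs(p - y)
        expert_loss[i] += loss
        weights[i] *= math.exp(-eta * loss)  # multiplicative update
```

The automation claim in the "what's right" column shows up directly: the master ends up close to the better expert without being told which one it is. The "what's wrong" column shows up too: if both experts were bad, reweighting them could not help.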