Notes on Andrew Ng's Machine_Learning_Yearning

Source: Internet · Editor: 程序博客网 · Date: 2024/05/01 19:31

First, a note on positioning: this is not an introductory machine-learning book. It is a practical handbook for developers who already have some background and experience: what principles to follow when designing experiments, how to analyze the results you get, which methods can reduce the error rate on each data set, which error rates are actually worth driving down, and so on.

Some chapters are skipped below; I am only going over the highlights.

Chapter 1: Why Machine Learning Strategy.
Take a cat-recognition demo as an example. If the recognition accuracy is low, your team might come up with ideas like these:

  1. Get more data: Collect more pictures of cats.
  2. Collect a more diverse training set. For example, pictures of cats in unusual positions; cats with unusual coloration; pictures shot with a variety of camera settings; ….
  3. Train the algorithm longer, by running more gradient descent iterations.
  4. Try a bigger neural network, with more layers/hidden units/parameters.
  5. Try a smaller neural network.
  6. Try adding regularization (such as L2 regularization) to guard against over-fitting.
  7. Change the neural network architecture (activation function, number of hidden units, etc.).
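Option 6, L2 regularization, can be sketched in a few lines (a minimal numpy illustration invented for this note; `gd_step`, the learning rate, and the penalty strength are not from the book):

```python
import numpy as np

def gd_step(w, grad, lr=0.1, l2=0.01):
    """One gradient-descent step with an L2 penalty.

    The penalty adds l2 * w to the gradient, shrinking weights
    toward zero and discouraging over-fitting.
    """
    return w - lr * (grad + l2 * w)

w = np.array([1.0, -2.0])
w_plain = gd_step(w, np.zeros(2), l2=0.0)  # zero gradient, no penalty: w unchanged
w_reg = gd_step(w, np.zeros(2), l2=0.5)    # same gradient, but weights shrink toward zero
```

Even with a zero loss gradient, the regularized step moves every weight a little closer to zero; that constant pull is what keeps the network from fitting noise.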

Chapter 5: Your development and test sets
When you build the training and test sets from images downloaded from the web (a 70/30 split), results look great — but once the model runs on mobile phones, performance drops sharply because users' photos are lower resolution, blurrier, and have less ideal lighting.
To handle this mismatch between the data you train on and the data the app actually sees, we usually use three sets:
• Training set — Which you run your learning algorithm on.
• Dev (development) set — Which you use to tune parameters, select features, and make other decisions regarding the learning algorithm. Sometimes also called the holdout cross validation set.
• Test set — Which you use to evaluate the performance of the algorithm, but not to make any decisions regarding what learning algorithm or parameters to use.
Every experiment is judged against the dev set and test set, so choosing these two sets is especially important: their data should reflect what the system will actually see in the future. You can start by asking friends to beta-test and supply real photos for the dev and test sets, then switch to actual user data once you have users.
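That advice can be sketched as a split function (names like `make_splits`, `web_images`, and `user_images` are invented for illustration): train on the plentiful web images, but draw dev and test entirely from real user photos.

```python
import random

def make_splits(web_images, user_images, dev_frac=0.5, seed=0):
    """Train on web images; dev and test come entirely from user images,
    so evaluation reflects the distribution the app must do well on."""
    rng = random.Random(seed)
    user = user_images[:]          # copy before shuffling
    rng.shuffle(user)
    cut = int(len(user) * dev_frac)
    return web_images, user[:cut], user[cut:]  # train, dev, test

train, dev, test = make_splits([f"web{i}" for i in range(100)],
                               [f"user{i}" for i in range(20)])
```

Note the training set and the dev/test sets deliberately come from different sources here; Chapter 6's same-distribution requirement applies only between dev and test.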

Chapter 6: Your dev and test sets should come from the same distribution
If the dev set and test set come from different domains, the parameters you tune on the dev set will only produce worse results on the test set.
If the dev set and test set come from the same distribution, then good dev-set results combined with poor test-set results point clearly to over-fitting, and the usual anti-over-fitting tricks will fix it. If they come from different distributions, even a genuinely good algorithm may score poorly on the test set, and you can no longer tell what went wrong.

Chapter 7: How large do the dev/test sets need to be?
The dev set should be large enough to detect differences between algorithms that you are trying out. Dev sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you will have a good chance of detecting an improvement of 0.1%.
The dev set should be large enough — at least about 1,000 examples — that a 0.1% improvement from an algorithm change is detectable.
The larger the test set, the more confidence you can have in the result; the traditional heuristic was to use 30% of all data as the test set, and a common strategy is simply to keep the test set roughly the same size as the dev set.
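The 0.1% figure is easy to sanity-check: an improvement is only detectable if it flips a non-trivial number of dev-set examples. A back-of-envelope helper (invented for this note):

```python
def examples_changed(dev_size, improvement):
    """How many dev-set examples an accuracy improvement flips, on average."""
    return round(dev_size * improvement)

# A 0.1% improvement moves 10 examples on a 10,000-example dev set --
# visible. On a 1,000-example dev set it moves only 1 example,
# which is indistinguishable from noise.
ten_k = examples_changed(10_000, 0.001)
one_k = examples_changed(1_000, 0.001)
```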

Chapter 8: Establish a single-number evaluation metric for your team to optimize
Use a single evaluation metric. If you really care about both Precision and Recall, I recommend using one of the standard ways to combine them into a single number — even with two metrics, merge them into one. For example, use the harmonic mean: F1_score = 2/((1/Precision)+(1/Recall)). https://en.wikipedia.org/wiki/F1_score.
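The formula in code (a tiny sketch; this `f1_score` takes precomputed precision and recall, unlike library versions that take labels):

```python
def f1_score(precision, recall):
    """F1 = 2 / (1/P + 1/R), the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Two classifiers with different P/R trade-offs become directly comparable:
f1 = f1_score(0.95, 0.90)
```

The harmonic mean punishes imbalance: a classifier with P=1.0 but R=0.1 scores far lower than one with P=R=0.55, which is usually what you want.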

Chapter 9: Optimizing and satisficing metrics
Here’s another way to combine multiple evaluation metrics.
Pick one satisficing metric and one optimizing metric: as long as the satisficing metric stays within an acceptable range, you only push the optimizing metric further.
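A sketch of the idea (the latency budget and model dicts are invented for illustration): filter on the satisficing metric, then maximize the optimizing one.

```python
def pick_model(models, max_latency_ms=100):
    """Keep models whose latency satisfies the budget (satisficing metric),
    then pick the highest accuracy among them (optimizing metric)."""
    acceptable = [m for m in models if m["latency_ms"] <= max_latency_ms]
    return max(acceptable, key=lambda m: m["accuracy"])

models = [
    {"name": "A", "accuracy": 0.92, "latency_ms": 80},
    {"name": "B", "accuracy": 0.95, "latency_ms": 150},  # most accurate, but too slow
    {"name": "C", "accuracy": 0.90, "latency_ms": 60},
]
best = pick_model(models)  # A: best accuracy within the latency budget
```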

Chapter 11: When to change dev/test sets and metrics
For example, if your dev set + metric ranks classifier A above classifier B, but your team thinks that classifier B is actually superior for your product, then this might be a sign that you need to change your dev/test sets or your evaluation metric.

There are three main possible causes of the dev set/metric incorrectly rating classifier A higher:
1. The actual distribution you need to do well on is different from the dev/test sets. (The dev/test sets need more varied data — e.g. both adult cats and kittens.)
2. You have overfit to the dev set.
If you find that your dev set performance is much better than your test set performance, it is a sign that you have overfit to the dev set. In this case, get a fresh dev set.
Track your team's progress by looking at results weekly or monthly. But do not use the test set to make any decisions regarding the algorithm, including whether to roll back to the previous week's system. If you do so, you will start to overfit to the test set. The point is that any decision driven by the test score — even a rollback — is itself a form of tuning on the test set, after which the score stops being an unbiased estimate of real-world performance.
3. The metric is measuring something other than what the project needs to optimize.
For example, classifier A scores better than B, but A fails to catch some pornographic images, which is absolutely unacceptable. In that case change the evaluation metric: give pornographic images a much larger penalty weight. This is essentially special-casing what the product truly cannot tolerate.
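One way to build such a metric (a sketch with an invented `porn_weight=100`; the book leaves the weight choice to you): weight each example in both the numerator and the normalizer.

```python
def weighted_error(examples, porn_weight=100):
    """Error rate where a pornographic image slipping through costs
    `porn_weight` times an ordinary mistake."""
    weights = [porn_weight if e["is_porn"] else 1 for e in examples]
    mistakes = sum(w for w, e in zip(weights, examples) if e["wrong"])
    return mistakes / sum(weights)

# A makes fewer raw mistakes (4 vs 5) but lets one porn image through
a = ([{"wrong": True, "is_porn": False}] * 3
     + [{"wrong": True, "is_porn": True}]
     + [{"wrong": False, "is_porn": False}] * 96)
b = ([{"wrong": True, "is_porn": False}] * 5
     + [{"wrong": False, "is_porn": False}] * 95)
```

Under the weighted metric, B is now clearly rated better than A, matching what the product actually needs.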

Chapter 12: Takeaways: Setting up development and test sets
A summary of the preceding chapters.

Chapter 13: Error analysis: Look at dev set examples to evaluate ideas
Suppose a cat-recognition app fails on some dogs that look like cats. To decide whether to optimize for those dog photos, take these steps:
1. Get a sample of (say) 100 dev set examples that your system misclassified. I.e., examples that your system made an error on.
Pull out 100 misclassified images.
2. Look at these examples manually, and count what fraction of them are dog images.
Manually inspect these misclassified images and count how many of them are dogs.

If 5 of them are dog photos, then no matter how you tune, fixing the dog problem can remove at most 5% of the errors — 5% is the "ceiling." If the overall error rate is 10% (say the dev set has 1,000 examples), then even if you get every dog photo right, accuracy rises at most to 90% + 10% × 5% = 90.5%. Is that improvement worth all the time it would take? Almost certainly not.
But if 50 of the photos are dogs, then good optimization could push accuracy up to 95%. That is a very large improvement, and the work is worth doing.
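Both calculations follow one formula — new accuracy = old accuracy + error rate × fraction of errors fixed — captured in a small helper (invented for this note):

```python
def accuracy_ceiling(accuracy, dog_fraction):
    """Best accuracy achievable if every dog error were fixed.

    `dog_fraction` is the share of misclassified dev examples that are dogs.
    """
    error = 1 - accuracy
    return accuracy + error * dog_fraction

low = accuracy_ceiling(0.90, 0.05)   # 5 of 100 errors are dogs: ceiling 90.5%
high = accuracy_ceiling(0.90, 0.50)  # 50 of 100 errors are dogs: ceiling 95%
```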

So a large part of error analysis is manual verification: pull out the errors and study them carefully, and only then will you know what the system should prioritize improving. Don't skip it because it feels tedious — it saves far more time than guess-and-check across every idea for fixing errors.

Chapter 14: Evaluating multiple ideas in parallel during error analysis
Chapter 13 said that manually analyzing error samples is important; while doing it, also tally the errors by type, for example:
Dog (images containing dogs), Great cat (lions, tigers, and other big cats mistaken for cats), Blurry (blurry photos that were misclassified), and so on.
Some errors can belong to both "Great cat" and "Blurry," so the category percentages may sum to more than 100% — this doesn't matter.

Once the errors are categorized, attack the category with the most errors first — that gives the biggest improvement in the shortest time!
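The tally itself is a few lines (a sketch; the tag names follow the chapter's categories, everything else is invented):

```python
from collections import Counter

def tally_errors(errors):
    """Count error categories, most frequent first. One example may carry
    several tags, so category counts can sum past the number of errors."""
    counts = Counter(tag for e in errors for tag in e["tags"])
    return counts.most_common()

errors = [
    {"id": 1, "tags": ["Dog"]},
    {"id": 2, "tags": ["Blurry"]},
    {"id": 3, "tags": ["Great cat", "Blurry"]},  # counted in both categories
    {"id": 4, "tags": ["Blurry"]},
]
ranked = tally_errors(errors)  # "Blurry" leads, so fix blurriness first
```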

For now I only have the first 14 chapters, so this is as far as I've studied. Still, Ng is clearly putting real care into sharing his experience: running experiments by these principles saves a great deal of time and keeps the goals clear.
