Common causes of nans during training
Source: https://stackoverflow.com/questions/33962226/common-causes-of-nans-during-training
Good question.
I came across this phenomenon several times. Here are my observations:
Gradient blow up
Reason: large gradients throw the learning process off-track.
What you should expect: looking at the runtime log, watch the loss values per iteration. You'll notice that the loss starts to grow significantly from iteration to iteration; eventually the loss will be too large to be represented by a floating-point variable and will become nan.
What can you do: decrease the base_lr (in solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow-up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
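The blow-up can be reproduced outside Caffe. Here is a minimal numpy sketch (a plain quadratic, not a real net): with a too-large learning rate the iterates grow geometrically, overflow float32 to inf, and the very next update (inf - inf) turns into nan.

```python
import numpy as np

def final_loss(lr, steps=200):
    # Gradient descent on f(w) = w^2 in float32.
    # Stable for lr < 1; for lr = 1.5 the update w <- w - 3w = -2w doubles
    # |w| each step, overflows to inf, and then inf - inf yields nan.
    w = np.float32(1.0)
    with np.errstate(over='ignore', invalid='ignore'):
        for _ in range(steps):
            grad = np.float32(2.0) * w          # df/dw
            w = w - np.float32(lr) * grad
        return float(w * w)

print(final_loss(0.1))   # tiny value near 0
print(final_loss(1.5))   # nan
```

The same mechanism is what the training log shows: the loss first explodes to huge finite values, then becomes nan once an overflow feeds back into the update.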
Bad learning rate policy and params
Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead; this invalid rate multiplies all updates and thus invalidates all parameters.
What you should expect: looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:
... sgd_solver.cpp:106] Iteration 0, lr = -nan
What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file. For instance, if you use lr_policy: "poly" and you forget to define the max_iter parameter, you'll end up with lr = nan...
For more information about learning rate in caffe, see this thread.
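To see how the "poly" failure mode arises, here is a sketch of the policy's formula, lr = base_lr * (1 - iter/max_iter)^power (as implemented in Caffe's sgd_solver; the Python below is only illustrative). With max_iter left at its default of 0, the IEEE float division 0/0 at iteration 0 already yields nan, matching the log line above.

```python
import numpy as np

def poly_lr(base_lr, it, max_iter, power):
    # Caffe's "poly" policy: lr = base_lr * (1 - iter/max_iter)^power.
    # numpy follows IEEE-754, so 0/0 silently becomes nan, as in the C++ code.
    with np.errstate(invalid='ignore', divide='ignore'):
        return base_lr * (1.0 - np.float64(it) / max_iter) ** power

print(poly_lr(0.01, 0, 10000, 0.5))  # 0.01
print(poly_lr(0.01, 0, 0, 0.5))      # nan  (max_iter left at default 0)
```

Once lr is nan, every parameter update multiplies by it, so the whole net is corrupted in a single iteration.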
Faulty Loss function
Reason: sometimes the computation of the loss in the loss layers causes nans to appear, for example feeding the InfogainLoss layer with non-normalized values, using a custom loss layer with bugs, etc.
What you should expect: looking at the runtime log you probably won't notice anything unusual: the loss decreases gradually, and all of a sudden a nan appears.
What can you do: see if you can reproduce the error, add printouts to the loss layer, and debug the error.
For example: I once used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all, the computed loss produced nans. In that case, working with large enough batches (with respect to the number of labels in the set) was enough to avoid this error.
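A hypothetical version of that bug (a sketch, not the author's actual loss code) makes the mechanism concrete: dividing each class's accumulated penalty by its count in the batch turns an absent class into 0/0 = nan, which then poisons the sum.

```python
import numpy as np

def normalized_loss(sample_losses, labels, num_classes):
    # Hypothetical loss: per-class penalty normalized by label frequency.
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    per_class = np.bincount(labels, weights=sample_losses,
                            minlength=num_classes)
    with np.errstate(invalid='ignore'):
        # A class absent from the batch contributes 0.0 / 0.0 = nan.
        return float(np.sum(per_class / counts))

labels = np.array([0, 0, 1])            # class 2 never appears in this batch
losses = np.array([0.5, 0.3, 0.2])
print(normalized_loss(losses, labels, num_classes=3))  # nan
```

With all classes present the same code is perfectly well behaved, which is why the nan only shows up on the occasional unlucky batch.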
Faulty input
Reason: you have an input with nan in it!
What you should expect: once the learning process "hits" this faulty input, the output becomes nan. Looking at the runtime log you probably won't notice anything unusual: the loss decreases gradually, and all of a sudden a nan appears.
What can you do: re-build your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it, and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
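If your data is accessible as arrays (e.g. before it is packed into lmdb/hdf5), the same check can be done directly, without a dummy net. A minimal sketch:

```python
import numpy as np

def find_bad_samples(samples):
    # Sweep the dataset once and report the indices of samples that
    # contain any non-finite value (catches both nan and +/-inf).
    bad = []
    for i, x in enumerate(samples):
        if not np.all(np.isfinite(x)):
            bad.append(i)
    return bad

data = [np.ones((2, 2)),
        np.array([[1.0, np.nan], [0.0, 2.0]]),
        np.zeros((2, 2))]
print(find_bad_samples(data))  # [1]
```

np.isfinite is preferable to np.isnan here because an inf in the input corrupts training just as surely as a nan.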
Stride larger than kernel size in "Pooling" layer
For some reason, choosing stride > kernel_size for pooling may result in nans. For example:
layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}
results with nans in y.
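One plausible mechanism (a naive 1-D sketch, not Caffe's actual implementation): if the output length follows a ceil-style formula, a stride larger than the kernel can place the last window entirely past the end of the input, so the average over that empty window is 0/0 = nan.

```python
import numpy as np

def avg_pool1d(x, kernel, stride):
    # Naive 1-D average pooling; output length mimics a ceil formula.
    n_out = int(np.ceil((len(x) - kernel) / stride)) + 1
    out = np.empty(n_out)
    with np.errstate(invalid='ignore'):
        for i in range(n_out):
            window = x[i * stride : i * stride + kernel]  # may be empty
            out[i] = window.sum() / window.size           # 0/0 -> nan
    return out

# First window averages [1, 2, 3] -> 2.0; the second starts at index 5,
# past the end of the 4-element input, so it is empty -> nan.
print(avg_pool1d(np.array([1.0, 2.0, 3.0, 4.0]), kernel=3, stride=5))
```

With stride <= kernel every window overlaps the input, so the problem cannot occur.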
Instabilities in "BatchNorm"
It was reported that under some settings the "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in bvlc/caffe and PR #5136 is attempting to fix it.
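As an illustration of this kind of instability (not Caffe's code): batch normalization divides by sqrt(var + eps), so a feature that is constant over the batch has variance 0, and without a positive eps the normalization degenerates to 0/0 = nan.

```python
import numpy as np

def batch_norm(x, eps=0.0):
    # Normalize each feature column over the batch dimension.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    with np.errstate(invalid='ignore'):
        return (x - mean) / np.sqrt(var + eps)

x = np.full((4, 2), 3.0)          # features constant over the batch
print(batch_norm(x, eps=0.0))     # all nan: (x - mean) / sqrt(0) = 0/0
print(batch_norm(x, eps=1e-5))    # all 0.0: eps keeps the denominator > 0
```

This is why a small positive eps in the denominator is essential, and why near-constant activations are a common trigger for BatchNorm nans.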
Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' will make caffe print more debug information (including gradient magnitudes and activation values) to the log during training; this information can help in spotting gradient blow-ups and other problems in the training process.
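Once the log is verbose, a small script can pinpoint the first bad iteration. This sketch assumes lines of the "lr = -nan" / "loss = ..." form shown in the sample log line above; any other log format is an assumption.

```python
import re

def first_nonfinite_line(log_text):
    # Return (line number, line) of the first log line reporting a
    # non-finite lr or loss value, or None if the log is clean.
    pattern = re.compile(r'(?:loss|lr)\s*=\s*-?(?:nan|inf)', re.IGNORECASE)
    for lineno, line in enumerate(log_text.splitlines(), 1):
        if pattern.search(line):
            return lineno, line.strip()
    return None

log = """Iteration 0, lr = 0.01
Iteration 100, loss = 0.6931
Iteration 200, lr = -nan"""
print(first_nonfinite_line(log))  # (3, 'Iteration 200, lr = -nan')
```

Knowing the first iteration where a nan appears narrows the search to the layer or batch processed just before it.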