CNN architecture


How to choose a CNN architecture

Answer 1


  • 1.

Remember, deeper networks tend to perform better, at the cost of needing more data and a harder training problem.
Minibatch size is usually set to a few hundred.
Start with fewer filters, gradually increase the number, and monitor the error rate to see how it varies.

Very small filter sizes will capture very fine details of the image. On the other hand, a bigger filter size will leave out minute details in the image.
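As a concrete starting point (a minimal sketch in PyTorch; the 16 filters, 3x3 kernels, 32x32 RGB input and batch size of 256 are illustrative assumptions, not prescriptions from the answer above):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Start small: few filters and small (3x3) kernels; widen later if the error rate stalls.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # few filters to begin with
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),
)

# Minibatch size of a few hundred, as suggested above (dummy data for illustration).
data = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(data, batch_size=256, shuffle=True)
```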

  • 2.
    I agree that it involves a lot of intuition, experience and trial and error. My approach is typically to start with a relatively shallow architecture (say 2 convolutional layers + pooling layers, a fully-connected hidden layer and an output layer). I choose the filter sizes and pooling windows in such a way that the representation at the output of the convolutional part of the network is relatively small (e.g. a 4x4 feature map). Then, keep adding layers until performance stops improving :) I usually end up with 4 or 5 convolutional layers (some of them followed by pooling layers), two fully connected hidden layers and an output layer. But looking at this year’s ImageNet competition results, it looks like you can go a lot deeper than that for more complex tasks.

>For the number of units on each layer, you typically want to start small so you can evaluate more quickly. That usually means having fewer units in the first layer and an increasing number as you go up in the network (the first few layers get larger input feature maps, so they do more work, hence it’s faster to have fewer units there). Then just increase the number of units until you start overfitting / run out of patience :)
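A minimal sketch of that kind of starting architecture (assuming PyTorch and a 28x28 single-channel input; the answer fixes neither a framework nor an input size):

```python
import torch.nn as nn

# Two conv+pool stages sized so the convolutional output is a small 4x4 feature map,
# followed by one fully-connected hidden layer and an output layer.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5),   # 28x28 -> 24x24
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 12x12
    nn.Conv2d(16, 32, kernel_size=5),  # -> 8x8
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 4x4 feature maps
    nn.Flatten(),
    nn.Linear(32 * 4 * 4, 128),        # fully-connected hidden layer
    nn.ReLU(),
    nn.Linear(128, 10),                # output layer
)
```

Note that the filter counts grow with depth (16, then 32), matching the advice above to keep the early layers small since they operate on the largest feature maps.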

  • 3.

>I think the best approach is to downscale the input a lot initially so you can iterate quickly. You can always check later if increasing the resolution helps.
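For instance (a minimal sketch; `F.interpolate` and the 64x64 target resolution are illustrative choices):

```python
import torch
import torch.nn.functional as F

batch = torch.randn(32, 3, 256, 256)  # original high-resolution images

# Downscale aggressively at first so each training iteration is fast;
# revisit the resolution later if accuracy plateaus.
small = F.interpolate(batch, size=(64, 64), mode="bilinear", align_corners=False)
print(small.shape)  # torch.Size([32, 3, 64, 64])
```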


  • 4.

Unfortunately there is no general answer to this question; no principled method for determining these hyperparameters is known.
A conventional approach is to look for similar problems and deep learning architectures that have already been shown to work. Then a suitable architecture can be developed by experimentation.
However, conventional kernel sizes are 3x3, 5x5 and 7x7. A well-known architecture for classification is to alternate convolution and pooling layers (convolution, pooling, convolution, pooling, etc.) and put some fully connected layers on top. Just start off with a modest number of layers and increase the number while measuring your performance on the test set.
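One way to run that experiment (a sketch, assuming PyTorch; the channel schedule and 32x32 RGB input are arbitrary illustrative choices):

```python
import torch.nn as nn

def make_cnn(num_blocks, num_classes=10, in_channels=3, in_size=32):
    """Stack num_blocks of conv -> ReLU -> 2x2 max pool, then a linear classifier."""
    layers, c_in, c_out, size = [], in_channels, 16, in_size
    for _ in range(num_blocks):
        layers += [
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # halves the spatial size
        ]
        c_in, c_out, size = c_out, c_out * 2, size // 2
    layers += [nn.Flatten(), nn.Linear(c_in * size * size, num_classes)]
    return nn.Sequential(*layers)

# Grow the depth one block at a time; in practice, train each model and keep
# the depth at which measured performance stops improving.
for depth in (1, 2, 3):
    model = make_cnn(depth)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{depth} conv block(s): {n_params} parameters")
```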

Answer 2

The Number of hidden layers: The number of hidden layers required depends on the intrinsic complexity of your dataset; this can be understood by looking at what each layer achieves:

  • Zero hidden layers allow the network to model only a linear function. This is inadequate for most image recognition tasks.

  • One hidden layer allows the network to approximate an arbitrarily complex function (the universal approximation property). This is adequate for nearly all image recognition tasks.

  • Theoretically, two hidden layers offer little benefit over a single layer; in practice, however, especially complex tasks may benefit from an additional layer. This should be treated with caution, as a second layer can cause over-fitting. Using more than two hidden layers is almost never beneficial.


To cut a long story short, if you have time then test both one and two hidden layers to see which achieves the most satisfactory results. If you do not have time then you should take a punt on a single hidden layer, and you will not go far wrong.
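If you do run that comparison, it can be as simple as swapping classifier heads on top of the same convolutional features (a sketch; `feature_dim`, the layer widths and the 10 classes are placeholder choices):

```python
import torch.nn as nn

feature_dim, num_classes = 512, 10  # assumed size of the flattened conv output

one_hidden = nn.Sequential(
    nn.Linear(feature_dim, 128), nn.ReLU(),
    nn.Linear(128, num_classes),
)

two_hidden = nn.Sequential(
    nn.Linear(feature_dim, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, num_classes),
)
# Train a model with each head and keep whichever validates better.
```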

The Number of convolutional layers: In my experience, the more convolutional layers the better (within reason, as each convolutional layer reduces the number of input features to the fully connected layers), although after about two or three layers the accuracy gain becomes rather small, so you need to decide whether your main focus is generalisation accuracy or training time. That said, all image recognition tasks are different, so the best method is simply to try incrementing the number of convolutional layers one at a time until you are satisfied by the result.

The number of nodes per hidden layer: …Yet again, there is no magic formula for deciding upon the number of nodes; it is different for each task. A rough guide is to use a number of nodes 2/3 the size of the previous layer, with the first layer 2/3 the size of the final feature maps. This, however, is just a rough guide and depends again on the dataset. Another commonly used option is to start with an excessive number of nodes, then remove the unnecessary ones through pruning.
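The 2/3 heuristic is easy to compute (a toy sketch; the 512-unit flattened feature map is an assumed example):

```python
# Rough "each layer is 2/3 the size of the previous one" sizing heuristic.
def hidden_sizes(flattened_features, num_hidden_layers=2):
    sizes, size = [], flattened_features
    for _ in range(num_hidden_layers):
        size = max(1, (2 * size) // 3)
        sizes.append(size)
    return sizes

print(hidden_sizes(512))  # [341, 227]
```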

Max pooling window size: I have always applied max pooling straight after convolution so am perhaps not qualified to make suggestions on the window size you should use. That said, 19x19 max pooling seems overly severe since it literally throws most of your data away. Perhaps you should look at a more conventional LeNet network layout:

http://deeplearning.net/tutorial/lenet.html

https://www.youtube.com/watch?v=n6hpQwq7Inw

There you repeatedly perform convolution (usually 5x5 or 3x3) followed by max pooling (usually with a 2x2 pooling window, although 4x4 can be necessary for large input images).
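To see why a 19x19 window is so much more destructive than the conventional 2x2 (a toy calculation; the 28x28 feature map is an assumed example):

```python
# Each k x k max pool keeps roughly 1 of every k*k activations.
size = 28
for stage in (1, 2):
    size //= 2  # a 2x2 pool halves each spatial dimension
    print(f"after 2x2 pool #{stage}: {size}x{size} ({size * size} values)")

# A single 19x19 pool over a 28x28 map leaves only a 1x1 output:
print(f"after one 19x19 pool: 1x1 (1 value out of {28 * 28})")
```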

In conclusion: the best way to find a suitable network layout is literally to perform trial-and-error tests. Lots of tests. There is no one-size-fits-all network, and only you know the intrinsic complexity of your dataset. The most effective way of performing the necessary number of tests is through some form of cross validation.
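A sketch of such an architecture search with k-fold cross validation (the data and the `train_and_score` routine are placeholders for your own dataset and training loop):

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic stand-in data; in practice X, y are your images and labels.
X = np.random.randn(100, 32 * 32 * 3)
y = np.random.randint(0, 10, size=100)

def train_and_score(depth, X_tr, y_tr, X_val, y_val):
    """Placeholder: build a net with `depth` conv blocks, train it, return validation accuracy."""
    return np.random.rand()  # stub score, for illustration only

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for depth in (1, 2, 3):
    scores = [train_and_score(depth, X[tr], y[tr], X[va], y[va])
              for tr, va in kf.split(X)]
    print(f"depth={depth}: mean CV score {np.mean(scores):.3f}")
```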

Answer 3

To some degree, yes: a recent paper by Google researchers describes how to choose good Inception architectures. Inception nets achieve very high performance on a constrained parameter budget, so this is as good a place to start as any, and it’s recent. Here’s the link: Rethinking the Inception Architecture for Computer Vision.
They do not offer any hard quantitative rules, but rather guidelines that they used and believe have helped them achieve good performance in the recent ImageNet competitions.
For example, some of the principles they discuss are:


  • Use stacks of smaller-receptive-field convolutional layers instead of a single large-receptive-field convolutional layer, e.g. a stack of two 3x3 conv layers has the same receptive field as a single 5x5 conv layer (and a stack of three covers 7x7), with fewer parameters. This idea isn’t new; it was also discussed in Return of the Devil in the Details: Delving Deep into Convolutional Nets by the Oxford VGG team. It is motivated by the need to be parameter efficient, and it has the dual effect of more representational capacity, since more layers introduce more nonlinearity. (See the parameter-count sketch below.)

  • Something this article mentions that I haven’t seen elsewhere in the literature is factorizing convolutional layers into deeper, asymmetric ones. So instead of having a single 7x7 conv layer, we would have a 1x7 conv layer followed by a 7x1 conv layer. This adds more depth, and I believe it is parameter efficient as well.

  • Balance the depth and width of your net, and use high-dimensional representations. This is one of the principles behind their Inception modules, which concatenate multiple convolutional layers together. So even if you have a small spatial size in your conv net, Inception modules give you a high-dimensional representation via multi-scale convolutional concatenation: 1x1, 3x3, 3x3-3x3 and max pooling, all put together. These Inception modules have a “width”, since they can be interpreted as performing multiple operations in parallel. They go even further with new Inception modules which have factorized convolution sizes: 1x3, 3x1, etc.

  • Use 1x1 conv layers (Network in Network style) to reduce dimensionality. They use a lot of dimensionality-reduction techniques to achieve parameter efficiency, and believe this is effective because adjacent feature maps have highly correlated outputs. This makes sense, as natural images are known to exhibit local statistical properties consistent with this, so reducing dimensionality via 1x1 NIN layers does not have a disastrous effect on representational power.

There’s more in the article; I think it can offer some insight into what you are asking about. They are talking about some very core concepts of conv-net architectural design.
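A quick way to see the parameter arithmetic behind these guidelines (a toy calculation; the 256-channel width and the 4x reduction are assumed examples, and biases are ignored):

```python
# Weights in a kh x kw convolution mapping c_in channels to c_out channels.
def conv_params(kh, kw, c_in, c_out):
    return kh * kw * c_in * c_out

C = 256  # assumed channel width

# Stacking small filters vs one large filter (same receptive field, fewer weights):
print("one 5x5:  ", conv_params(5, 5, C, C))                            # 1,638,400
print("two 3x3:  ", 2 * conv_params(3, 3, C, C))                        # 1,179,648
print("one 7x7:  ", conv_params(7, 7, C, C))                            # 3,211,264
print("1x7 + 7x1:", conv_params(1, 7, C, C) + conv_params(7, 1, C, C))  #   917,504

# 1x1 (Network in Network style) reduction before a 3x3:
reduced = C // 4
print("3x3 direct:      ", conv_params(3, 3, C, C))                     #   589,824
print("1x1 reduce + 3x3:",
      conv_params(1, 1, C, reduced) + conv_params(3, 3, reduced, C))    #   163,840
```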