A visual proof that neural nets can compute any function
One of the most striking facts about neural networks is that they can compute any function at all. That is, suppose someone hands you some complicated, wiggly function, f(x).
No matter what the function, there is guaranteed to be a neural network so that for every possible input, x, the value f(x) (or some close approximation to it) is output from the network.
This result holds even if the function has many inputs, f = f(x1, …, xm), and many outputs.
This result tells us that neural networks have a kind of universality. No matter what function we want to compute, we know that there is a neural network which can do the job.
What's more, this universality theorem holds even if we restrict our networks to have just a single layer intermediate between the input and the output neurons - a so-called single hidden layer. So even very simple network architectures can be extremely powerful.
The universality theorem is well known by people who use neural networks. But why it's true is not so widely understood. Most of the explanations available are quite technical. For instance, one of the original papers proving the result* did so using the Hahn-Banach theorem, the Riesz Representation theorem, and some Fourier analysis. If you're a mathematician the argument is not difficult to follow, but it's not so easy for most people. That's a pity, since the underlying reasons for universality are simple and beautiful.
*Approximation by superpositions of a sigmoidal function, by George Cybenko (1989). The result was very much in the air at the time, and several groups proved closely related results. Cybenko's paper contains a useful discussion of much of that work. Another important early paper is Multilayer feedforward networks are universal approximators, by Kurt Hornik, Maxwell Stinchcombe, and Halbert White (1989). This paper uses the Stone-Weierstrass theorem to arrive at similar results.
In this chapter I give a simple and mostly visual explanation of the universality theorem. We'll go step by step through the underlying ideas. You'll understand why it's true that neural networks can compute any function. You'll understand some of the limitations of the result. And you'll understand how the result relates to deep neural networks.
To follow the material in the chapter, you do not need to have read earlier chapters in this book. Instead, the chapter is structured to be enjoyable as a self-contained essay. Provided you have just a little basic familiarity with neural networks, you should be able to follow the explanation. I will, however, provide occasional links to earlier material, to help fill in any gaps in your knowledge.
Universality theorems are a commonplace in computer science, so much so that we sometimes forget how astonishing they are. But it's worth reminding ourselves: the ability to compute an arbitrary function is truly remarkable. Almost any process you can imagine can be thought of as function computation. Consider the problem of naming a piece of music based on a short sample of the piece. That can be thought of as computing a function. Or consider the problem of translating a Chinese text into English. Again, that can be thought of as computing a function (actually, computing one of many functions, since there are often many acceptable translations of a given piece of text). Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting. Again, that can be thought of as a kind of function computation (ditto the remark about translation and there being many possible functions). Universality means that, in principle, neural networks can do all these things and many more.
Of course, just because we know a neural network exists that can (say) translate Chinese text into English, that doesn't mean we have good techniques for constructing or even recognizing such a network. This limitation applies also to traditional universality theorems for models such as Boolean circuits. But, as we've seen earlier in the book, neural networks have powerful algorithms for learning functions. That combination of learning algorithms + universality is an attractive mix. Up to now, the book has focused on the learning algorithms. In this chapter, we focus on universality, and what it means.
Two caveats
Before explaining why the universality theorem is true, I want to mention two caveats to the informal statement "a neural network can compute any function".
First, this doesn't mean that a network can be used to exactly compute any function. Rather, we can get an approximation that is as good as we want. By increasing the number of hidden neurons we can improve the approximation. For instance, earlier I illustrated a network computing some function f(x) using three hidden neurons. For most functions only a low-quality approximation will be possible using three hidden neurons. By increasing the number of hidden neurons (say, to five) we can typically get a better approximation.
And we can do still better by further increasing the number of hidden neurons.
To make this statement more precise, suppose we're given a function f(x) which we'd like to compute to within some desired accuracy ε > 0. The guarantee is that by using enough hidden neurons we can always find a neural network whose output g(x) satisfies |g(x) − f(x)| < ε for all inputs x. In other words, the approximation will be good to within the desired accuracy for every possible input.
The second caveat is that the class of functions which can be approximated in the way described are the continuous functions. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to approximate using a neural net. This is not surprising, since our neural networks compute continuous functions of their input. However, even if the function we'd really like to compute is discontinuous, it's often the case that a continuous approximation is good enough. If that's so, then we can use a neural network. In practice, this is not usually an important limitation.
Summing up, a more precise statement of the universality theorem is that neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision. In this chapter we'll actually prove a slightly weaker version of this result, using two hidden layers instead of one. In the problems I'll briefly outline how the explanation can, with a few tweaks, be adapted to give a proof which uses only a single hidden layer.
Universality with one input and one output
To understand why the universality theorem is true, let's start by understanding how to construct a neural network which approximates a function with just one input and one output:
It turns out that this is the core of the problem of universality. Once we've understood this special case it's actually pretty easy to extend to functions with many inputs and many outputs.
To build insight into how to construct a network to compute f, let's start with a network containing just a single hidden layer, with two hidden neurons, and an output layer containing a single output neuron:
To get a feel for how components in the network work, let's focus on the top hidden neuron. In the diagram below, click on the weight, w, and drag the mouse a little ways to the right to increase w. You can immediately see how the function computed by the top hidden neuron changes.
As we learnt earlier in the book, what's being computed by the hidden neuron is σ(wx + b), where σ(z) ≡ 1/(1 + e^(−z)) is the sigmoid function. Up to now we've made frequent use of this algebraic form. But for the proof of universality we will obtain more insight by ignoring the algebra entirely, and instead manipulating and observing the shape shown in the graph. This won't just give us a better feel for what's going on, it will also give us a proof of universality that applies to activation functions other than the sigmoid function.
To get started on this proof, try clicking on the bias, b, in the diagram above, and dragging to the right to increase it. You'll see that as the bias increases the graph moves to the left, but its shape doesn't change.
Next, click and drag to the left in order to decrease the bias. You'll see that as the bias decreases the graph moves to the right, but, again, its shape doesn't change.
Next, decrease the weight to around 2 or 3. You'll see that as you decrease the weight, the curve broadens out. You may need to change the bias as well, in order to keep the curve in-frame.
Finally, increase the weight up past w = 100. As you do, the curve gets steeper, until eventually it begins to look like a step function. Try to adjust the bias so the step occurs near x = 0.3.
We can simplify our analysis quite a bit by increasing the weight so much that the output really is a step function, to a very good approximation. Below I've plotted the output from the top hidden neuron when the weight is w = 999.
It's actually quite a bit easier to work with step functions than general sigmoid functions. The reason is that in the output layer we add up contributions from all the hidden neurons. It's easy to analyze the sum of a bunch of step functions, but rather more difficult to reason about what happens when you add up a bunch of sigmoid shaped curves. And so it makes things much easier to assume that our hidden neurons are outputting step functions. More concretely, we do this by fixing the weight w to be some very large value, and then setting the position of the step by modifying the bias. Of course, treating the output as a step function is an approximation, but it's a very good approximation, and for now we'll treat it as exact. I'll come back later to discuss the impact of deviations from this approximation.
At what value of x does the step occur? Put another way, how does the position of the step depend upon the weight and bias?
To answer this question, try modifying the weight and bias in the diagram above (you may need to scroll back a bit). Can you figure out how the position of the step depends on w and b? With a little work you should be able to convince yourself that the position of the step is proportional to b, and inversely proportional to w.
In fact, the step is at position s = −b/w, as you can see by modifying the weight and bias in the following diagram:
It will greatly simplify our lives to describe hidden neurons using just a single parameter, s, which is the step position, s = −b/w. Try modifying s in the following diagram, in order to get used to the new parameterization.
As noted above, we've implicitly set the weight w on the input to be some large value - big enough that the step function is a very good approximation. We can easily convert a neuron parameterized in this way back into the conventional model, by choosing the bias b = −ws.
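To make the step-point picture concrete, here is a minimal numerical sketch (the helper names are my own, not part of the chapter): it evaluates σ(wx + b) for a neuron described by a step point s, converting back via b = −ws, and shows that with a large weight the output is essentially 0 below s and 1 above it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_neuron_output(x, s, w=1000.0):
    """Sigmoid neuron with step point s: uses the conventional bias b = -w*s."""
    b = -w * s
    return sigmoid(w * x + b)

x = np.linspace(0, 1, 11)
print(np.round(hidden_neuron_output(x, s=0.4), 3))
# With w = 1000 the output jumps from ~0 to ~1 as x crosses the step point s = 0.4.
```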
Up to now we've been focusing on the output from just the top hidden neuron. Let's take a look at the behavior of the entire network. In particular, we'll suppose the hidden neurons are computing step functions parameterized by step points s1 (top neuron) and s2 (bottom neuron), with respective output weights w1 and w2. Here's the network:
What's being plotted on the right is the weighted output w1 a1 + w2 a2 from the hidden layer. Here, a1 and a2 are the outputs from the top and bottom hidden neurons, respectively. These outputs are often known as the neurons' activations.
Try increasing and decreasing the step point s1 of the top hidden neuron. Get a feel for how this changes the weighted output from the hidden layer. It's worth understanding what happens when s1 goes past s2: the graph changes shape, since we move from a situation where the top hidden neuron is the first to be activated to one where the bottom hidden neuron is the first to be activated.
Similarly, try manipulating the step point s2 of the bottom hidden neuron, and get a feel for how this changes the combined output from the hidden neurons.
Try increasing and decreasing each of the output weights. Notice how this rescales the contribution from the respective hidden neurons. What happens when one of the weights is zero?
Finally, try setting w1 to be 0.8 and w2 to be −0.8. You get a "bump" function, which starts at point s1, ends at point s2, and has height 0.8.
Of course, we can rescale the bump to have any height at all. Let's use a single parameter, h, to denote the height. To reduce clutter I'll also remove the s1 and w1 labels from the diagram.
Try changing the value of h up and down, to see how the height of the bump changes. Try changing the height so it's negative, and observe what happens. And try changing the step points to see how that changes the shape of the bump.
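Here is a rough sketch of the bump construction in code (my own illustrative helpers, idealizing the large-weight sigmoid neurons as exact step functions): a pair of hidden neurons with step points s1 < s2 and output weights h and −h produces a bump of height h between s1 and s2.

```python
import numpy as np

def step(x, s):
    """Idealized hidden-neuron output: 0 below the step point s, 1 at or above it."""
    return (x >= s).astype(float)

def bump(x, s1, s2, h):
    """Weighted output h*a1 - h*a2 from a pair of step neurons: height h on [s1, s2)."""
    return h * step(x, s1) - h * step(x, s2)

x = np.linspace(0, 1, 11)
print(bump(x, s1=0.3, s2=0.7, h=0.8))   # 0.8 inside [0.3, 0.7), 0 outside
```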
You'll notice, by the way, that we're using our neurons in a way that can be thought of not just in graphical terms, but in more conventional programming terms, as a kind of if-then-else statement, e.g.:
    if input >= step point:
        add 1 to the weighted output
    else:
        add 0 to the weighted output
For the most part I'm going to stick with the graphical point of view. But in what follows you may sometimes find it helpful to switch points of view, and think about things in terms of if-then-else.
We can use our bump-making trick to get two bumps, by gluing two pairs of hidden neurons together into the same network:
I've suppressed the weights here, simply writing the h values for each pair of hidden neurons. Try increasing and decreasing both h values, and observe how it changes the graph. Move the bumps around by changing the step points.
More generally, we can use this idea to get as many peaks as we want, of any height. In particular, we can divide the interval [0, 1] up into a large number, N, of subintervals, and use N pairs of hidden neurons to set up peaks of any desired height. Let's see how this works for N = 5.
You can see that there are five pairs of hidden neurons. The step points for the respective pairs of neurons are 0, 1/5, then 1/5, 2/5, and so on, out to 4/5, 5/5. These values are fixed - they make it so we get five evenly spaced bumps on the graph.
Each pair of neurons has a value of h associated to it. Remember, the connections output from the neurons have weights h and −h (not marked). Click on one of the h values, and drag the mouse to the right or left to change the value. As you do so, watch the function change. By changing the output weights we're actually designing the function!
Contrariwise, try clicking on the graph, and dragging up or down to change the height of any of the bump functions. As you change the heights, you can see the corresponding change in h values. And, although it's not shown, there is also a change in the corresponding output weights, which are +h and −h.
In other words, we can directly manipulate the function appearing in the graph on the right, and see that reflected in the h values on the left.
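The N-pairs-of-neurons idea is easy to sketch in code as well (again my own illustrative helpers, not the chapter's interactive widget): divide [0, 1] into N subintervals, give the j-th pair of step neurons output weights h_j and −h_j, and the weighted output from the hidden layer becomes a freely designable sequence of bumps.

```python
import numpy as np

def step(x, s):
    return (x >= s).astype(float)

def weighted_hidden_output(x, heights):
    """Sum of N bumps on [0, 1]: the j-th pair of step neurons contributes
    height heights[j] on the j-th subinterval [j/N, (j+1)/N)."""
    N = len(heights)
    total = np.zeros_like(x, dtype=float)
    for j, h in enumerate(heights):
        total += h * step(x, j / N) - h * step(x, (j + 1) / N)
    return total

x = np.linspace(0, 1, 1000)
heights = [0.2, 0.9, -0.3, 0.5, 0.1]          # five freely chosen h values
out = weighted_hidden_output(x, heights)
print(out[50], out[250], out[450])            # samples from the 1st, 2nd, 3rd bumps
```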
Time for a challenge.
Let's think back to the function I plotted at the beginning of the chapter:
I didn't say it at the time, but what I plotted is actually the function f(x) = 0.2 + 0.4 x² + 0.3 x sin(15x) + 0.05 cos(50x), plotted over x from 0 to 1, and with the y axis taking values from 0 to 1.
That's obviously not a trivial function.
You're going to figure out how to compute it using a neural network.
In our networks above we've been analyzing the weighted combination ∑_j w_j a_j output from the hidden neurons. We now know how to get a lot of control over this quantity. But, as I noted earlier, this quantity is not what's output from the network. What's output from the network is σ(∑_j w_j a_j + b), where b is the bias on the output neuron. Is there some way we can achieve control over the actual output from the network?
The solution is to design a neural network whose hidden layer has a weighted output given by σ⁻¹ ∘ f(x), where σ⁻¹ is just the inverse of the σ function.
If we can do this, then the output from the network as a whole will be a good approximation to f(x). (Here I've set the bias on the output neuron to 0.)
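As a minimal sketch of that idea (illustrative code with my own helper names): set the height of the j-th bump to σ⁻¹(f(x_j)), where x_j is the midpoint of the j-th subinterval. The weighted output from the hidden layer then approximates σ⁻¹ ∘ f, and applying σ at the output neuron gives back an approximation to f.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_inverse(y):
    return np.log(y / (1.0 - y))

def f(x):
    """A stand-in target function taking values strictly between 0 and 1."""
    return 0.3 + 0.4 * np.sin(6 * x) ** 2

N = 100                                       # number of bumps
x = np.linspace(0, 1, 1000)
midpoints = (np.arange(N) + 0.5) / N
heights = sigmoid_inverse(f(midpoints))       # design the weighted hidden output

# Piecewise-constant weighted output: each x falls inside exactly one bump.
bins = np.minimum((x * N).astype(int), N - 1)
weighted_output = heights[bins]

network_output = sigmoid(weighted_output)     # output neuron, bias 0
print("max deviation from f:", np.max(np.abs(network_output - f(x))))
```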
Your challenge, then, is to design a neural network to approximate the goal function shown just above. To learn as much as possible, I want you to solve the problem twice. The first time, please click on the graph, directly adjusting the heights of the different bump functions. You should find it fairly easy to get a good match to the goal function. How well you're doing is measured by the average deviation between the goal function and the function the network is actually computing. Your challenge is to drive the average deviation as low as possible. You complete the challenge when you drive the average deviation to 0.40 or below.
Once you've done that, click on "Reset" to randomly re-initialize the bumps. The second time you solve the problem, resist the urge to click on the graph. Instead, modify the h values on the left-hand side, and again attempt to drive the average deviation to 0.40 or below.
You've now figured out all the elements necessary for the network to approximately compute the function f(x)! It's only a coarse approximation, but we could easily do much better, merely by increasing the number of pairs of hidden neurons, allowing more bumps.
In particular, it's easy to convert all the data we have found back into the standard parameterization used for neural networks. Let me just recap quickly how that works.
The first layer of weights all have some large, constant value, say w = 1000.
The biases on the hidden neurons are just b = −ws, where s is the step point for that neuron.
The final layer of weights are determined by the h values: the pair of hidden neurons responsible for a given bump have output weights h and −h, where h is the height chosen for that bump.
Finally, the bias on the output neuron is 0.
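As an illustrative sketch (my own code; the value 1000 is just a stand-in for "some large constant weight"), here is that conversion written out as a literal one-hidden-layer network in the standard weights-and-biases form:

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -500, 500)                 # avoid overflow warnings in exp
    return 1.0 / (1.0 + np.exp(-z))

def build_network(step_points, heights, big_w=1000.0):
    """Standard parameters for N pairs of hidden neurons.
    step_points: list of (s_left, s_right) per bump; heights: the h value per bump."""
    w1, b1, w2 = [], [], []
    for (s_left, s_right), h in zip(step_points, heights):
        w1 += [big_w, big_w]                          # first-layer weights: large constant
        b1 += [-big_w * s_left, -big_w * s_right]     # hidden biases: b = -w*s
        w2 += [h, -h]                                 # output weights: +h and -h
    return np.array(w1), np.array(b1), np.array(w2), 0.0   # output bias 0

def network_output(x, w1, b1, w2, b_out):
    hidden = sigmoid(np.outer(x, w1) + b1)    # hidden activations, shape (len(x), 2N)
    return sigmoid(hidden @ w2 + b_out)

step_points = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]
heights = [1.0, -0.5, 0.8, 0.3, -1.2]
params = build_network(step_points, heights)
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
print(np.round(network_output(x, *params), 3))   # ~ sigmoid(h_j) within each subinterval
```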
That's everything: we now have a complete description of a neural network which does a pretty good job computing our original goal function. And we understand how to improve the quality of the approximation by increasing the number of hidden neurons.
What's more, there was nothing special about our original goal function. We could have used this procedure to compute any continuous function from [0, 1] to [0, 1]. In essence, we're using our single-layer neural networks to build a lookup table for the function. And we'll be able to build on this idea to construct a general proof of universality.
Many input variables
Let's extend our results to the case of many input variables. This sounds complicated, but all the ideas we need can be understood in the case of just two inputs. So let's address the two-input case.
We'll start by considering what happens when we have two inputs to a neuron:
Here, we have inputs x and y, with corresponding weights w1 and w2, and a bias b on the neuron. Let's set the weight w2 to 0, and then play around with the first weight, w1, and the bias, b, to see how they affect the output from the neuron:
[Interactive diagram: Output]
As you can see, with w2 = 0 the input y makes no difference to the output from the neuron. It's as though x is the only input.
Given this, what do you think happens when we increase the weight w1 to w1 = 100, with w2 remaining 0? If you don't immediately see the answer, ponder the question for a bit, and then check whether you've got it right.
Just as in our earlier discussion, as the input weight gets larger the output approaches a step function. The difference is that now the step function is in three dimensions. Also as before, we can move the location of the step point around by modifying the bias. The actual location of the step point is s_x ≡ −b/w1.
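A quick numerical sketch of the two-input case (illustrative code): with w2 = 0 and w1 large, the output σ(w1 x + w2 y + b) is a step in the x direction located at s_x = −b/w1, and the value of y is irrelevant.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_input_neuron(x, y, w1=1000.0, w2=0.0, b=-400.0):
    return sigmoid(w1 * x + w2 * y + b)

# The step sits at s_x = -b/w1 = 0.4, independent of y:
xs = np.array([0.2, 0.39, 0.41, 0.8])
for y in (0.0, 0.5, 1.0):
    print("y =", y, "->", np.round(two_input_neuron(xs, y), 3))
```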
Let's redo the above using the position of the step as the parameter:
[Interactive diagram: Output]
Here, we assume the weight on the x input has some large value - I've used w1 = 1000 - and the weight w2 = 0. The number on the neuron is the step point, and the little x above the number reminds us that the step is in the x direction. Of course, it's also possible to get a step function in the y direction, by making the weight on the y input very large (say, w2 = 1000), and setting the weight on the x input to 0, i.e., w1 = 0:
[Interactive diagram: Output]
The number on the neuron is again the step point, and in this case the little y above the number reminds us that the step is in the y direction. I could have explicitly marked the weights on the x and y inputs, but decided not to, since it would make the diagram rather cluttered. But do keep in mind that the little y marker implicitly tells us that the y weight is large, and the x weight is 0.
We can use the step functions we've just constructed to compute a three-dimensional bump function. To do this, we use two neurons, each computing a step function in the x direction. Then we combine those step functions with weights h and −h, respectively, where h is the desired height of the bump. It's all illustrated in the following diagram:
[Interactive diagram: Weighted output from hidden layer]
Try changing the value of the height, h. Observe how it relates to the height of the bump.
Also, try changing the step point associated to the top hidden neuron. Witness how it changes the shape of the bump. What happens when you move it past the step point associated to the bottom hidden neuron?
We've figured out how to make a bump function in the x direction. Of course, we can easily make a bump function in the y direction, by using two step functions in the y direction. Recall that we do this by making the weight large on the y input, and the weight 0 on the x input. Here's the result:
[Interactive diagram: Weighted output from hidden layer]
This looks nearly identical to the earlier network! The only thing explicitly shown as changing is that there are now little y markers on our hidden neurons. That reminds us that they're producing y step functions, not x step functions, and so the weight is very large on the y input, and zero on the x input, not vice versa. As before, I decided not to show this explicitly, in order to avoid clutter.
Let's consider what happens when we add up two bump functions, one in the x direction, the other in the y direction, both of height h:
[Interactive diagram: Weighted output from hidden layer]
To simplify the diagram I've dropped the connections with zero weight. For now, I've left in the little x and y markers on the hidden neurons, to remind you in what directions the bump functions are being computed. We'll drop those markers later, since they're implied by the input variable.
Try varying the parameter h. As you can see, this causes the output weights to change, and also the heights of both the x and y bump functions.
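Here's a small sketch of what adding the two bumps produces (illustrative code, idealizing the hidden neurons as exact step functions): the combined weighted output is 2h inside the central rectangle, h on the surrounding plateau, and 0 elsewhere.

```python
import numpy as np

def step(v, s):
    return (v >= s).astype(float)

def x_bump(x, y, s1, s2, h):
    """Bump in the x direction; the y input has weight 0 and is ignored."""
    return h * step(x, s1) - h * step(x, s2)

def y_bump(x, y, s1, s2, h):
    """Bump in the y direction; the x input has weight 0 and is ignored."""
    return h * step(y, s1) - h * step(y, s2)

h = 0.5
xx, yy = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
combined = x_bump(xx, yy, 0.25, 0.75, h) + y_bump(xx, yy, 0.25, 0.75, h)
print(combined)
# 2h = 1.0 in the central square, h = 0.5 on the plateau arms, 0 in the corners.
```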
What we've built looks a little like a tower function:
If we could build such tower functions, then we could use them to approximate arbitrary functions, just by adding up many towers of different heights, and in different locations:
Of course, we haven't yet figured out how to build a tower function. What we have constructed looks like a central tower, of height 2h, with a surrounding plateau, of height h.
But we can make a tower function. Remember that earlier we saw neurons can be used to implement a type of if-then-else statement:
    if input >= threshold:
        output 1
    else:
        output 0
That was for a neuron with just a single input. What we want is to apply a similar idea to the combined output from the hidden neurons:
    if combined output from hidden neurons >= threshold:
        output 1
    else:
        output 0
If we choose the threshold appropriately - say, a value of 3h/2, which is sandwiched between the height of the plateau (h) and the height of the central tower (2h) - we could squash the plateau down to zero, and leave just the tower standing.
Can you see how to do this? Try experimenting with the following network to figure it out. Note that we're now plotting the output from the entire network, not just the weighted output from the hidden layer. This means we add a bias term to the weighted output from the hidden layer, and apply the sigma function. Can you find values for h and b which produce a tower? This is a bit tricky, so if you think about it for a while and remain stuck, here are two hints: (1) to get the output neuron to show the right kind of if-then-else behaviour, we need the input weights (all h or −h) to be large; and (2) the value of b determines the scale of the if-then-else threshold.
[Interactive diagram: Output]
With our initial parameters, the output looks like a flattened version of the earlier diagram, with its tower and plateau. To get the desired behaviour, we increase the parameter h until it becomes large. That gives the if-then-else thresholding behaviour. Then, to get the threshold right, we choose the bias b ≈ −3h/2. Try it, and see how it works!
Here's what it looks like, when we use h = 10:
Even for this relatively modest value of h, we get a pretty good tower function. And, of course, we can make it as good as we want by increasing h still further, and keeping the bias as b = −3h/2.
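Here's a minimal sketch of the tower construction (illustrative code, with exact step functions standing in for large-weight sigmoid neurons): the four hidden neurons feed the output neuron with weights h, −h, h, −h, and the output neuron applies σ with bias b = −3h/2, which squashes the plateau (height h) to roughly 0 while keeping the central tower (height 2h) at roughly 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(v, s):
    """Idealized hidden-neuron output: 0 below the step point s, 1 at or above it."""
    return 1.0 if v >= s else 0.0

def tower(x, y, x1=0.3, x2=0.7, y1=0.3, y2=0.7, h=10.0):
    """~1 inside the rectangle [x1, x2) x [y1, y2), ~0 everywhere else."""
    weighted = (h * step(x, x1) - h * step(x, x2)
                + h * step(y, y1) - h * step(y, y2))
    return sigmoid(weighted - 1.5 * h)        # output-neuron bias b = -3h/2

print(round(tower(0.5, 0.5), 3))   # central tower: ~1
print(round(tower(0.5, 0.1), 3))   # plateau: squashed to ~0
print(round(tower(0.1, 0.1), 3))   # outside: ~0
```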
Let's try gluing two such networks together, in order to compute two different tower functions. To make the respective roles of the two sub-networks clear I've put them in separate boxes, below: each box computes a tower function, using the technique described above. The graph on the right shows the weighted output from the second hidden layer, that is, a weighted combination of tower functions.
[Interactive diagram: Weighted output]
In particular, you can see that by modifying the weights in the final layer you can change the height of the output towers.
The same idea can be used to compute as many towers as we like. We can also make them as thin as we like, and whatever height we like. As a result, we can ensure that the weighted output from the second hidden layer approximates any desired function of two variables:
In particular, by making the weighted output from the second hidden layer a good approximation to σ⁻¹ ∘ f, we ensure the output from our network will be a good approximation to any desired function, f.
What about functions of more than two variables?
Let's try three variables, x1, x2, x3. The following network can be used to compute a tower function in four dimensions:
Here, the x1, x2, x3 denote inputs to the network. The s1, t1 and so on are step points for neurons - that is, all the weights in the first layer are large, and the biases are set to give the step points s1, t1, s2, t2, s3, t3. The weights in the second layer alternate +h, −h, where h is some very large number. And the output bias is −5h/2.
This network computes a function which is 1 provided three conditions are met: x1 is between s1 and t1; x2 is between s2 and t2; and x3 is between s3 and t3. Everywhere else the network outputs 0. That is, it's a kind of tower which is 1 in a little region of input space, and 0 everywhere else.
By gluing together many such networks we can get as many towers as we want, and so approximate an arbitrary function of three variables. Exactly the same idea works in m dimensions. The only change needed is to make the output bias (−m + 1/2)h, in order to get the right kind of sandwiching behaviour to level out the plateau.
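An illustrative sketch of the m-dimensional generalization (my own code, again using exact step functions): each input gets a pair of step neurons with output weights +h and −h, and the output bias (−m + 1/2)h leaves the output near 1 only where every coordinate lies in its interval.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(v, s):
    return 1.0 if v >= s else 0.0

def tower_m(xs, intervals, h=20.0):
    """~1 when every coordinate xs[j] lies inside intervals[j] = (s_j, t_j), ~0 otherwise."""
    m = len(xs)
    weighted = sum(h * step(x, s) - h * step(x, t)
                   for x, (s, t) in zip(xs, intervals))
    return sigmoid(weighted + (-m + 0.5) * h)    # output bias (-m + 1/2)h

intervals = [(0.2, 0.4), (0.5, 0.7), (0.1, 0.9)]
print(round(tower_m([0.3, 0.6, 0.5], intervals), 3))   # all three conditions hold: ~1
print(round(tower_m([0.3, 0.8, 0.5], intervals), 3))   # one condition fails: ~0
```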
Okay, so we now know how to use neural networks to approximate a real-valued function of many variables. What about vector-valued functions f(x1, …, xm) ∈ R^n? Of course, such a function can be regarded as just n separate real-valued functions, f^1(x1, …, xm), f^2(x1, …, xm), and so on. So we create a network approximating f^1, another network for f^2, and so on. And then we simply glue all these networks together. So that's also easy to cope with.
Problem
- We've seen how to use networks with two hidden layers to approximate an arbitrary function. Can you find a proof showing that it's possible with just a single hidden layer? As a hint, try working in the case of just two input variables, and showing that: (a) it's possible to get step functions not just in the x or y directions, but in an arbitrary direction; (b) by adding up many of the constructions from part (a) it's possible to approximate a tower function which is circular in shape, rather than rectangular; (c) using these circular towers, it's possible to approximate an arbitrary function. To do part (c) it may help to use ideas from a bit later in this chapter.
Extension beyond sigmoid neurons
We've proved that networks made up of sigmoid neurons can compute any function. Recall that in a sigmoid neuron the inputs x1, x2, … result in the output σ(∑_j w_j x_j + b), where the w_j are the weights, b is the bias, and σ is the sigmoid function.
What if we consider a different type of neuron, one using some other activation function, s(z)?
That is, we'll assume that if our neuron has inputs x1, x2, …, weights w1, w2, … and bias b, then the output is s(∑_j w_j x_j + b).
We can use this activation function to get a step function, just as we did with the sigmoid. Try ramping up the weight in the following, say to w = 100:
Just as with the sigmoid, this causes the activation function to contract, and ultimately it becomes a very good approximation to a step function. Try changing the bias, and you'll see that we can set the position of the step to be wherever we choose. And so we can use all the same tricks as before to compute any desired function.
What properties does s(z) need to satisfy in order for this to work? We do need to assume that s(z) is well-defined as z → −∞ and z → ∞. These two limits are the two values taken on by our step function. We also need to assume that these limits are different from one another. If they weren't, there'd be no step, simply a flat graph! But provided the activation function s(z) satisfies these properties, neurons based on such an activation function are universal for computation.
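As a small sketch of that observation (illustrative code; the rescaled arctan activation is my own stand-in, not something from the chapter): any activation s(z) with well-defined and different limits as z → −∞ and z → +∞ contracts toward a two-valued step function as the input weight grows.

```python
import numpy as np

def s(z):
    """Some activation with limits 0 (as z -> -inf) and 1 (as z -> +inf)."""
    return 0.5 + np.arctan(z) / np.pi

x = np.linspace(0, 1, 5)                       # [0, 0.25, 0.5, 0.75, 1]
for w in (1.0, 10.0, 1000.0):
    b = -w * 0.5                               # place the step at x = -b/w = 0.5
    print("w =", w, "->", np.round(s(w * x + b), 3))
# As w grows, the outputs approach the two limiting values, jumping near x = 0.5.
```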
Problems
- Earlier in the book we met another type of neuron known as a rectified linear unit. Explain why such neurons don't satisfy the conditions just given for universality. Find a proof of universality showing that rectified linear units are universal for computation.
- Suppose we consider linear neurons, i.e., neurons with the activation function s(z) = z. Explain why linear neurons don't satisfy the conditions just given for universality. Show that such neurons can't be used to do universal computation.
Fixing up the step functions
Up to now, we've been assuming that our neurons can produce step functions exactly. That's a pretty good approximation, but it is only an approximation. In fact, there will be a narrow window of failure, illustrated in the following graph, in which the function behaves very differently from a step function:
In these windows of failure the explanation I've given for universality will fail.
Now, it's not a terrible failure. By making the weights input to the neurons big enough we can make these windows of failure as small as we like. Certainly, we can make the window much narrower than I've shown above - narrower, indeed, than our eye could see. So perhaps we might not worry too much about this problem.
Nonetheless, it'd be nice to have some way of addressing the problem.
In fact, the problem turns out to be easy to fix. Let's look at the fix for neural networks computing functions with just one input and one output. The same ideas work also to address the problem when there are more inputs and outputs.
In particular, suppose we want our network to compute some function, f. As before, we do this by trying to design our network so that the weighted output from the hidden layer of neurons is σ⁻¹ ∘ f(x).
If we were to do this using the technique described earlier, we'd use the hidden neurons to produce a sequence of bump functions:
Again, I've exaggerated the size of the windows of failure, in order to make them easier to see. It should be pretty clear that if we add all these bump functions up we'll end up with a reasonable approximation to σ⁻¹ ∘ f(x), except within the windows of failure.
Suppose that instead of using the approximation just described, we use a set of hidden neurons to compute an approximation to half our original goal function, i.e., to σ⁻¹ ∘ f(x) / 2. Of course, this looks just like a scaled-down version of the last graph.
And suppose we use another set of hidden neurons to compute an approximation to σ⁻¹ ∘ f(x) / 2, but with the bases of the bumps shifted by half the width of a bump.
Now we have two different approximations to σ⁻¹ ∘ f(x) / 2. If we add up the two approximations we'll get an overall approximation to σ⁻¹ ∘ f(x). That overall approximation will still have failures in small windows, but the problem will be much less than before. The reason is that points in a failure window for one approximation won't be in a failure window for the other. And so the approximation will be roughly a factor of two better in those windows.
We could do even better by adding up a large number, M, of overlapping approximations to the function σ⁻¹ ∘ f(x) / M. Provided the windows of failure are narrow enough, a point will only ever be in one window of failure. And provided we're using a large enough number M of overlapping approximations, the result will be an excellent overall approximation.
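Here's a rough numerical sketch of the overlapping-approximations idea (illustrative code, with a deliberately modest weight so that the transitions between bumps are far from sharp): each of the M copies approximates σ⁻¹ ∘ f / M with its bump boundaries shifted by a fraction of a bump width, and summing the copies should give a noticeably better approximation to f than a single full-strength copy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_inverse(y):
    return np.log(y / (1.0 - y))

def f(x):
    """A stand-in target function taking values strictly between 0 and 1."""
    return 0.3 + 0.4 * np.sin(6 * x) ** 2

def hidden_weighted_output(x, N, scale, offset, w=200.0):
    """Approximation to sigma^{-1}(f)/scale built from N sigmoid bumps whose
    boundaries are shifted by `offset`. The modest weight w gives wide transitions."""
    out = np.zeros_like(x)
    for j in range(N):
        s1, s2 = j / N + offset, (j + 1) / N + offset
        h = sigmoid_inverse(f((s1 + s2) / 2)) / scale
        out += h * (sigmoid(w * (x - s1)) - sigmoid(w * (x - s2)))
    return out

x = np.linspace(0.05, 0.95, 2000)       # stay away from the ends of [0, 1]
single = sigmoid(hidden_weighted_output(x, N=50, scale=1, offset=0.0))

M = 5
overlapped = sigmoid(sum(hidden_weighted_output(x, N=50, scale=M, offset=k / (M * 50))
                         for k in range(M)))

print("single copy,    max error:", np.max(np.abs(single - f(x))))
print("5 overlapping,  max error:", np.max(np.abs(overlapped - f(x))))
```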
Conclusion
The explanation for universality we've discussed is certainly not a practical prescription for how to compute using neural networks! In this, it's much like proofs of universality for NAND gates and the like. For this reason, I've focused mostly on trying to make the construction clear and easy to follow, and not on optimizing the details of the construction. However, you may find it a fun and instructive exercise to see if you can improve the construction.
Although the result isn't directly useful in constructing networks, it's important because it takes off the table the question of whether any particular function is computable using a neural network. The answer to that question is always "yes". So the right question to ask is not whether any particular function is computable, but rather what's a good way to compute the function.
The universality construction we've developed uses just two hidden layers to compute an arbitrary function. Furthermore, as we've discussed, it's possible to get the same result with just a single hidden layer. Given this, you might wonder why we would ever be interested in deep networks, i.e., networks with many hidden layers. Can't we simply replace those networks with shallow, single hidden layer networks?
While in principle that's possible, there are good practical reasons to use deep networks. As argued in Chapter 1, deep networks have a hierarchical structure which makes them particularly well adapted to learn the hierarchies of knowledge that seem to be useful in solving real-world problems. Put more concretely, when attacking problems such as image recognition, it helps to use a system that understands not just individual pixels, but also increasingly more complex concepts: from edges to simple geometric shapes, all the way up through complex, multi-object scenes. In later chapters, we'll see evidence suggesting that deep networks do a better job than shallow networks at learning such hierarchies of knowledge. To sum up: universality tells us that neural networks can compute any function; and empirical evidence suggests that deep networks are the networks best adapted to learn the functions useful in solving many real-world problems.
from: http://neuralnetworksanddeeplearning.com/chap4.html