
Derivative of the activation function

Started by Viro, February 20, 2003 05:33 AM
7 comments, last by Viro 21 years, 6 months ago
In the standard backpropagation algorithm, when you train the hidden layer, my understanding is that you need to calculate the derivative of the activation function. Does this just apply if your activation function is a sigmoid (or bipolar sigmoid), or does it apply for all of them (e.g. gaussian, binary, linear, etc.)? I'm just a little confused about this. Thanks. EDIT: Made changes to make the question sensible. [edited by - Viro on February 20, 2003 10:46:34 AM]
"Linux is not about free software, it is about community," -- Steve Balmer, Microsoft Chief Executive.
I have no idea why you might think that. Where did you get your equations from?



ai-junkie.com
Sorry, I'm probably not getting my idea across properly, so I'll provide some pseudocode of what I'm currently doing. It's not the inverse, it's the derivative.

To calculate the output of each output unit, the equation I'm using is

sum_inputs = sum of all (weights * activation of hidden node)
activation = f(sum_inputs).

The function f is the activation function, which can be sigmoid, binary, etc. At the moment it is the sigmoid.
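
For concreteness, here is a minimal sketch of that forward pass in Python (the function and variable names are just my own illustrative choices):

import math

def sigmoid(x):
    # Logistic activation: maps any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def output_unit(weights, hidden_activations):
    # sum_inputs = sum of all (weight * activation of hidden node)
    sum_inputs = sum(w * a for w, a in zip(weights, hidden_activations))
    # activation = f(sum_inputs), with f = sigmoid for now
    return sigmoid(sum_inputs)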

Now in the backpropagation of the error, I need to calculate the error information term, and this involves the derivative of the activation function (I know, my original post wasn't clear. Inverse? wha? Sorry, lack of sleep).

Here's my equation for calculating the error term for each output unit.

error_term = (target_output - actual_output) * f'(sum_inputs).
weight_change = learning_rate * error_term * activation_of_hidden_unit.

The problem I'm having is with f'. Do I need it if I'm using other activation functions like linear or binary?


[edited by - Viro on February 20, 2003 12:27:16 PM]
"Linux is not about free software, it is about community," -- Steve Balmer, Microsoft Chief Executive.
To cut a long story short, you need to calculate the direction in which to change each weight so as to reduce the error. To calculate this direction (the gradient), you need the derivative of the activation function.

You were right in saying that the error term for an output unit is
error_term = (target_output - actual_output) * f'(sum_inputs)

The derivative of the activation function is required in order to perform gradient descent learning. Without the gradient, you can't calculate the direction in which to change the weight. Therefore, you cannot perform gradient descent learning if the activation function is not differentiable.

For example, if you're using the sigmoid activation function, your error_term for the output unit would be calculated as follows:

error_term = (target_output - actual_output) * (actual_output * (1 - actual_output))

where,

f(x) = 1 / (1 + exp(-x)) = sigmoid activation
f'(x) = f(x) * (1 - f(x)) = derivative of sigmoid activation
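
As a quick sanity check, here is that identity as a small Python sketch (the function names are mine, not from the post):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # f'(x) = f(x) * (1 - f(x))
    fx = sigmoid(x)
    return fx * (1.0 - fx)

def output_error_term(target_output, actual_output):
    # Reuses the activation already computed in the forward pass,
    # since f'(sum_inputs) = actual_output * (1 - actual_output).
    return (target_output - actual_output) * actual_output * (1.0 - actual_output)

print(sigmoid_prime(0.0))  # 0.25, the sigmoid's maximum slope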


Hope this helps to clear a few things up

Thanks, I know I need to do that for the sigmoid activation function, but what about the binary step function and the linear activation function?

What do I do with those?
"Linux is not about free software, it is about community," -- Steve Balmer, Microsoft Chief Executive.
A binary step function is discontinuous and therefore not differentiable. Therefore, you can't use a step function for gradient descent, and gradient descent is necessary for implementing backpropagation.

For a linear activation,
f(x) = x
f'(x) = 1

So, your error term for an output unit which has a linear activation would simply be,
error_term = (target_output - actual_output)

Remember, if your network uses only linear activations then it can only compute linear functions, so it can only solve problems that are linearly separable. Unless you know that your problem is linearly separable, using linear activations is pointless.

The sigmoid activation is nonlinear, and hence a network that uses multiple layers of units with sigmoid activations can represent nonlinear functions. Also, a sigmoid activation squashes its output between 0 and 1 for the logistic form, and between -1 and 1 for the hyperbolic tangent form. For these reasons the sigmoid activation is very popular for neural networks.


I should have added that there is a way of changing the slope of the sigmoid activation function in such a way that it behaves like a step function.

You can define the sigmoid function as,

f(x) = 1 / (1 + exp(-a * x))

where a is a constant slope parameter greater than 0.

As the slope parameter approaches infinity, the activation function becomes a step function. However, the sigmoid function can still be differentiated and thus used for backpropagation.

And just in case, the derivative of this activation function is now defined as
f'(x) = a * f(x) * (1 - f(x))
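
A small Python sketch of the sloped sigmoid and its derivative (the parameter name a follows the post; the demo values are mine):

import math

def sigmoid(x, a=1.0):
    # Logistic activation with slope parameter a > 0.
    return 1.0 / (1.0 + math.exp(-a * x))

def sigmoid_prime(x, a=1.0):
    # f'(x) = a * f(x) * (1 - f(x))
    fx = sigmoid(x, a)
    return a * fx * (1.0 - fx)

# As a grows, the transition around x = 0 sharpens toward a step:
for a in (1.0, 5.0, 50.0):
    print(a, sigmoid(-0.1, a), sigmoid(0.1, a))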


quote: Original post by Mr Nonsense
You can define the sigmoid function as,

f(x) = 1 / (1 + exp(-a * x))


That doesn't change your neuron much. This is just equivalent to scaling the weights by a. The training will do that automatically for you, if it "thinks" it's a good idea.

In other words, both networks will converge to the same limit. The one that Mr Nonsense proposed will just have all its weights divided by a.
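
The equivalence is easy to check numerically; a quick sketch (all values are arbitrary):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

a = 5.0
weights = [0.2, -0.7]
inputs = [1.0, 0.5]

net = sum(w * i for w, i in zip(weights, inputs))
net_scaled = sum(a * w * i for w, i in zip(weights, inputs))

# Sloped sigmoid with weights w gives the same output as a plain
# sigmoid with weights a * w:
print(sigmoid(a * net), sigmoid(net_scaled))  # prints the same value twice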

Backpropagation allowed the training of multi-layer networks by assuming a continuously differentiable activation function. Before that (1986), networks usually had linear or step activations. Linear means that adding more layers is pointless (as pointed out by Mr Nonsense). Step means that the derivative is zero everywhere except at the origin (so backprop would make a zero or infinite update to the weights). If you do the multiplication proposed by Mr Nonsense, then you may end up with the same problem: weights are not updated at all or are updated too much (how do you choose your learning rate?).

What you should do is initialize your network with low weights (equivalent to dividing all weights by a large constant a, as indicated by alvaro). That way the neuron activation has a slope over a large input area, so most training inputs will "propose" an update direction for the weights. A neuron with a sigmoid activation and small weights mimics a linear neuron (except that the output is 0.5 at the origin; if you do not want this, take a tanh, which is a sigmoid that goes from -1 to 1 and is zero at the origin).

So if you have a task that can be solved by a linear network, the weights will change but remain small; the network will actually approximate the linear network that can do the job. If training results in higher weights, then this tells us that a linear network cannot do the job. The neurons start behaving nonlinearly (according to the sigmoid), which is equivalent to having a large slope parameter a: for part of the input space they have a slope, and for the rest the derivative is almost zero. This means that each neuron in the hidden layer will eventually be updated based on a (hopefully different) fraction of the training data.
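
A sketch of that kind of low-weight initialization (the range 0.1 is my own arbitrary choice, not from the post):

import random

def init_weights(n_inputs, n_units, scale=0.1):
    # Small random weights keep each unit near the middle of the sigmoid,
    # where it behaves almost linearly.
    return [[random.uniform(-scale, scale) for _ in range(n_inputs)]
            for _ in range(n_units)]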

BTW: if you want to understand backprop, you have to think about what is being propagated back.
It is the sensitivity of the error (SSE or MSE) to the total input (sum of weights times input values) of each neuron. In other words, it is the derivative of the error with respect to the total input. For a neuron in a hidden layer the chain rule applies, and if you write it out you'll see that part of it is exactly the same as what has to be computed for the layers that come after it. So going backward from output to input using backprop is just a smarter way of reusing computations already made. If you computed the sensitivities directly instead, you would compute certain values more than once, but the result would be exactly the same.
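
To make the reuse of computations concrete, here is a minimal Python sketch for a network with one input, one hidden unit, and one output unit (the tiny network and all values are purely illustrative):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Forward pass.
x = 0.5
w_hidden, w_output = 0.3, -0.8
h = sigmoid(w_hidden * x)         # hidden activation
o = sigmoid(w_output * h)         # network output
target = 1.0

# Output delta: sensitivity of the error to the output unit's total input.
delta_o = (target - o) * o * (1.0 - o)

# The hidden delta reuses delta_o via the chain rule; this reuse is
# exactly what "propagating the error back" refers to.
delta_h = delta_o * w_output * h * (1.0 - h)

learning_rate = 0.1
w_output += learning_rate * delta_o * h
w_hidden += learning_rate * delta_h * x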
