Error-Backpropagation

Introduction

Much early research in neural networks was abandoned because of the severe limitations of single-layer linear networks. Multilayer networks were not "discovered" until much later, but even then there were no good training algorithms. It was not until the '80s that backpropagation became widely known.
People in the field joke about this because backprop is really just applying the chain rule to compute the gradient of the cost function. How many years should it take to rediscover the chain rule?? Of course, it isn't really this simple. Backprop also refers to the very efficient method that was discovered for computing the gradient.

Note: Multilayer nets are much harder to train than single-layer networks: convergence is much slower, and speed-up techniques are more complicated.

Method of Training: Backpropagation

Define a cost function (e.g., the mean square error):
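A standard form, written here in notation assumed for this sketch (t_k: target value for output unit k; y_k: the network's output at that unit, for a single training pattern):

E = \frac{1}{2} \sum_k (t_k - y_k)^2

The factor of 1/2 is conventional and cancels when the derivative is taken; summing (or averaging) over all training patterns gives the total cost.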

where the activation y at the output layer is given by
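(continuing the assumed notation: w_{kj} are the hidden-to-output weights, h_j the hidden activations, and g the activation function, e.g. a sigmoid)

y_k = g(\mathrm{net}_k), \quad \mathrm{net}_k = \sum_j w_{kj} h_j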

and where
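(assumed notation: v_{ji} are the input-to-hidden weights and x_i the inputs)

h_j = g(\mathrm{net}_j), \quad \mathrm{net}_j = \sum_i v_{ji} x_i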

Written out more explicitly, the cost function is
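(substituting the output-layer expression, still a sketch in the assumed notation)

E = \frac{1}{2} \sum_k \left( t_k - g\!\left( \sum_j w_{kj} h_j \right) \right)^2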

or all at once:
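(substituting the hidden-layer expression as well)

E = \frac{1}{2} \sum_k \left( t_k - g\!\left( \sum_j w_{kj} \, g\!\left( \sum_i v_{ji} x_i \right) \right) \right)^2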

Computing the gradient for the hidden-to-output weights:
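A sketch of the chain-rule computation in the assumed notation:

\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial w_{kj}} = -(t_k - y_k) \, g'(\mathrm{net}_k) \, h_j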

Computing the gradient for the input-to-hidden weights:
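Again as a sketch: the chain rule now runs through every output unit that hidden unit j feeds,

\frac{\partial E}{\partial v_{ji}} = \sum_k \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial h_j} \frac{\partial h_j}{\partial \mathrm{net}_j} \frac{\partial \mathrm{net}_j}{\partial v_{ji}} = -\left[ \sum_k (t_k - y_k) \, g'(\mathrm{net}_k) \, w_{kj} \right] g'(\mathrm{net}_j) \, x_i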

Summary of Gradients

hidden-to-output weights:
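(in the assumed notation, with the output "delta" defined just below)

\frac{\partial E}{\partial w_{kj}} = -\delta_k \, h_j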

where
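(assumed notation; g' is the derivative of the activation function)

\delta_k = (t_k - y_k) \, g'(\mathrm{net}_k)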

input-to-hidden weights:
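(same sketch, with the hidden "delta" defined just below)

\frac{\partial E}{\partial v_{ji}} = -\delta_j \, x_i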

where
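(assumed notation)

\delta_j = g'(\mathrm{net}_j) \sum_k \delta_k \, w_{kj}

To make the summary concrete, here is a minimal NumPy sketch of one forward/backward pass on a single training pattern, assuming sigmoid units throughout; the function and variable names are illustrative, not taken from these notes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(x, t, V, W):
    # Hypothetical two-layer sigmoid net; cost E = 0.5 * sum((t - y)**2) on one pattern.
    # Assumed shapes: V (n_hidden, n_inputs), W (n_outputs, n_hidden),
    #                 x (n_inputs,), t (n_outputs,).
    # Forward pass.
    net_h = V @ x                  # net input to hidden units
    h = sigmoid(net_h)             # hidden activations h_j
    net_o = W @ h                  # net input to output units
    y = sigmoid(net_o)             # outputs y_k

    # Backward pass.  For the sigmoid, g'(net) = g(net) * (1 - g(net)).
    delta_out = (t - y) * y * (1.0 - y)            # delta_k = (t_k - y_k) g'(net_k)
    delta_hid = (W.T @ delta_out) * h * (1.0 - h)  # delta_j = g'(net_j) sum_k delta_k w_kj

    dE_dW = -np.outer(delta_out, h)   # dE/dw_kj = -delta_k h_j
    dE_dV = -np.outer(delta_hid, x)   # dE/dv_ji = -delta_j x_i
    return dE_dV, dE_dW

# One gradient-descent step on a pattern would then look like:
#   dE_dV, dE_dW = backprop_gradients(x, t, V, W)
#   V -= learning_rate * dE_dV
#   W -= learning_rate * dE_dW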