This part describes single layer neural networks, including some of the classical approaches

to the neural computing and learning problem. In the first part of this chapter we discuss the

representational power of single layer networks and their learning algorithms, and give

some examples of using the networks. In the second part we will discuss the representational

limitations of single layer networks.

Two ‘classical’ models will be described in the first part of the chapter: the Perceptron,

proposed by Rosenblatt (Rosenblatt, 1959) in the late 50’s and the Adaline, presented in the

early 60’s by Widrow and Hoff (Widrow & Hoff, 1960).

### Networks with threshold activation functions

A single layer feed-forward network consists of one or more output neurons o, each of which is

connected with a weighting factor wio to all of the inputs i. In the simplest case the network

has only two inputs and a single output, as sketched in the figure

(we leave the output index o out). The input of the neuron is the weighted sum of the inputs plus the bias term. The output of the network is formed by the activation of the output neuron, which is some function of the input:

$$ y = F\left( \sum_{i=1}^{2} w_i x_i + \theta \right) $$

The activation function F can be linear, so that we have a linear network, or nonlinear. In this section we consider the threshold (or Heaviside or sgn) function:

$$ F(s) = \begin{cases} 1 & \text{if } s > 0 \\ -1 & \text{otherwise.} \end{cases} $$

The output of the network thus is either +1 or -1, depending on the input. The network can now be used for a classification task: it can decide whether an input pattern belongs to one of two classes. If the total input is positive, the pattern will be assigned to class +1; if the total input is negative, the sample will be assigned to class -1. The separation between the two classes in this case is a straight line, given by the equation:

$$ w_1 x_1 + w_2 x_2 + \theta = 0 $$
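As an illustration, a minimal sketch of such a threshold unit in Python (the weights, bias, and test points are arbitrary choices, not taken from the text):

```python
# Two-input threshold unit: output is +1 or -1 depending on the sign of
# the weighted sum plus bias. Weights below are illustrative only.
def threshold_unit(x1, x2, w1=1.0, w2=1.0, theta=-0.5):
    """Return +1 if the weighted input plus bias is positive, else -1."""
    s = w1 * x1 + w2 * x2 + theta
    return 1 if s > 0 else -1

# Points on either side of the line x1 + x2 - 0.5 = 0 fall into
# different classes.
print(threshold_unit(1.0, 1.0))  # class +1
print(threshold_unit(0.0, 0.0))  # class -1
```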

We will describe two learning methods for these types of networks: the ‘perceptron’ learning rule and the ‘delta’ or ‘LMS’ rule. Both methods are iterative procedures that adjust the weights. A learning sample is presented to the network. For each weight the new value is computed by adding a correction to the old value. The threshold is updated in the same way:

$$ w_i(t+1) = w_i(t) + \Delta w_i(t), \qquad \theta(t+1) = \theta(t) + \Delta\theta(t) $$

### Perceptron learning rule and convergence theorem

Suppose we have a set of learning samples consisting of an input vector x and a desired output

d(x). For a classification task, d(x) is usually +1 or -1. The perceptron learning rule is very

simple and can be stated as follows:

1. Start with random weights for the connections;
2. Select an input vector x from the set of training samples;
3. If y ≠ d(x) (the perceptron gives an incorrect response), modify all connections wi according to: Δwi = d(x)xi;
4. Go back to 2.

Note that the procedure is very similar to the Hebb rule; the only difference is that, when the network responds correctly, no connection weights are modified. Besides modifying the weights, we must also modify the threshold θ. This θ is considered as a connection w0 between the output neuron and a ‘dummy’ predicate unit which is always on: x0 = 1. Given the perceptron learning rule as stated above, this threshold is modified according to:

$$ \Delta\theta = \begin{cases} 0 & \text{if the perceptron responds correctly;} \\ d(x) & \text{otherwise.} \end{cases} $$
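The steps above can be sketched in Python as follows (the AND-style training set, the number of epochs, and the random seed are illustrative assumptions):

```python
import random

# Sketch of the perceptron learning rule; the threshold is treated as a
# weight w[0] on a dummy input x0 = 1, as in the text.
def train_perceptron(samples, epochs=50, seed=0):
    rng = random.Random(seed)
    n = len(samples[0][0])
    w = [rng.uniform(-1, 1) for _ in range(n + 1)]  # w[0] is the threshold
    for _ in range(epochs):
        for x, d in samples:
            xs = [1.0] + list(x)  # prepend dummy input x0 = 1
            y = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1
            if y != d:            # update only on an incorrect response
                w = [wi + d * xi for wi, xi in zip(w, xs)]
    return w

# AND function with inputs and targets coded as -1/+1 (linearly separable).
data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w = train_perceptron(data)

def predict(x):
    return 1 if w[0] + w[1] * x[0] + w[2] * x[1] > 0 else -1

print([predict(x) for x, _ in data])  # [-1, -1, -1, 1] after convergence
```

Because the data are linearly separable, the convergence theorem guarantees that the loop stops making corrections after a finite number of mistakes.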

### The adaptive linear element (Adaline)

An important generalisation of the perceptron training algorithm was presented by Widrow and

Hoff as the ‘least mean square’ (LMS) learning procedure, also known as the delta rule. The

main functional difference with the perceptron training rule is the way the output of the system is used in the learning rule. The perceptron learning rule uses the output of the threshold function (either -1 or +1) for learning. The delta rule uses the net output without further mapping into output values -1 or +1.

The learning rule was applied to the ‘adaptive linear element,’ also named Adaline, developed

by Widrow and Hoff (Widrow & Hoff, 1960). In a simple physical implementation

this device consists of a set of controllable resistors connected to a circuit which can sum up

currents caused by the input voltage signals. Usually the central block, the summer, is also

followed by a quantiser which outputs either +1 or -1, depending on the polarity of the sum.

Although the adaptive process is here exemplified in a case when there is only one output,

it may be clear that a system with many parallel outputs is directly implementable by multiple

units of the above kind.

If the input conductances are denoted by $w_i$, $i = 0, 1, \ldots, n$, and the input and output signals by $x_i$ and $y$, respectively, then the output of the central block is defined to be:

$$ y = \sum_{i=1}^{n} w_i x_i + \theta $$

where θ = w0. The purpose of this device is to yield a given value $y = d^p$ at its output when the set of values $x_i^p$, $i = 1, 2, \ldots, n$, is applied at the inputs. The problem is to determine the coefficients $w_i$, $i = 0, 1, \ldots, n$, in such a way that the input-output response is correct for a large

number of arbitrarily chosen signal sets. If an exact mapping is not possible, the average error

must be minimised, for instance, in the sense of least squares. An adaptive operation means

that there exists a mechanism by which the wi can be adjusted, usually iteratively, to attain the

correct values.

### Networks with linear activation functions: the delta rule

For a single layer network with an output unit with a linear activation function the output is simply given by:

$$ y = \sum_{j} w_j x_j + \theta $$

Such a simple network is able to represent a linear relationship between the value of the

output unit and the value of the input units. By thresholding the output value, a classifier can

be constructed (such as Widrow’s Adaline), but here we focus on the linear relationship and use

the network for a function approximation task. In high dimensional input spaces the network

represents a (hyper)plane, and multiple output units may, of course, be defined as well.

Suppose we want to train the network such that a hyperplane is fitted as well as possible

to a set of training samples consisting of input values xp and desired (or target) output values

$d^p$. For every given input sample, the output of the network differs from the target value $d^p$ by $(d^p - y^p)$, where $y^p$ is the actual output for this pattern. The delta rule now uses a cost- or error-function based on these differences to adjust the weights.

The error function, as indicated by the name least mean square, is the summed squared error. That is, the total error E is defined to be

$$ E = \sum_p E^p = \frac{1}{2} \sum_p \left( d^p - y^p \right)^2 $$

where the index p ranges over the set of input patterns and Ep represents the error on pattern

p. The LMS procedure finds the values of all the weights that minimise the error function by a

method called gradient descent. The idea is to make a change in the weight proportional to the negative of the derivative of the error as measured on the current pattern with respect to each weight:

$$ \Delta_p w_j = -\gamma \frac{\partial E^p}{\partial w_j} $$

where γ is a constant of proportionality. The derivative is

$$ \frac{\partial E^p}{\partial w_j} = \frac{\partial E^p}{\partial y^p} \, \frac{\partial y^p}{\partial w_j}. $$

Because of the linear units,

$$ \frac{\partial y^p}{\partial w_j} = x_j \qquad \text{and} \qquad \frac{\partial E^p}{\partial y^p} = -\left( d^p - y^p \right), $$

such that

$$ \Delta_p w_j = \gamma\, \delta^p x_j, $$

where $\delta^p = d^p - y^p$ is the difference between the target output and the actual output for pattern $p$.
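A minimal sketch of this procedure in Python, fitting a linear unit to samples drawn from an assumed target function (the target weights, learning rate γ, sample count, and seeds are arbitrary choices for illustration):

```python
import random

# Sketch of the delta (LMS) rule: w_j <- w_j + gamma * delta * x_j,
# with delta = d - y and linear output y = sum_j w_j x_j + theta.
def train_delta(samples, gamma=0.05, epochs=200, seed=1):
    rng = random.Random(seed)
    n = len(samples[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    theta = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, d in samples:
            y = sum(wj * xj for wj, xj in zip(w, x)) + theta  # linear output
            delta = d - y                       # delta^p = d^p - y^p
            w = [wj + gamma * delta * xj for wj, xj in zip(w, x)]
            theta += gamma * delta              # threshold as weight on x0 = 1
    return w, theta

# Samples from an assumed target d = 2*x1 - x2 + 0.5 (illustrative only).
rng = random.Random(42)
samples = []
for _ in range(30):
    x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    samples.append(((x1, x2), 2 * x1 - x2 + 0.5))

w, theta = train_delta(samples)
print([round(wj, 2) for wj in w], round(theta, 2))  # approaches [2.0, -1.0] and 0.5
```

Since the samples are noise-free and the target is itself linear, the iteration drives the error to zero and the weights approach the target coefficients.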

The delta rule modifies the weights appropriately for target and actual outputs of either polarity

and for both continuous and binary input and output units. These characteristics have opened

up a wealth of new applications.