Intelligent Systems
Neural Networks
Agenda
 Motivation
 Technical Solution
 (Artificial) Neural Networks
 Perceptron
 Artificial Neural Network Structures
 Learning and Generalization
 Expressiveness of Multi-Layer Perceptrons
 Illustration by Larger Examples
 Summary
MOTIVATION
Motivation

A main motivation behind neural networks is the fact that symbolic rules do not reflect reasoning processes performed by humans.

Biological neural systems can capture highly parallel computations based on representations that are distributed over many neurons.

They learn and generalize from training data; there is no need to program everything explicitly...

They are very noise tolerant and offer better resistance to noise than symbolic systems.
Motivation

Neural networks are strong in:
 Learning from a set of examples
 Optimizing solutions via constraints and cost functions
 Classification: grouping elements in classes
 Speech recognition, pattern matching
 Nonparametric statistical analysis and regressions
TECHNICAL SOLUTIONS
(Artificial) Neural Networks
What are Neural Networks?
 A biological neural network is composed of groups of chemically connected or functionally associated neurons.
 A single neuron may be connected to many other neurons and the total number of neurons and connections in a network may be extensive.
 Neurons are highly specialized cells that transmit impulses within animals to cause a change in a target cell such as a muscle effector cell or a glandular cell.
 Impulses are noisy “spike trains” of electrical potential.
 Apart from this electrical signaling, the diffusion of neurotransmitters (endogenous chemicals) can also convey the impulses.
What are Neural Networks?
 Connections, called synapses, are usually formed from axons to dendrites; other connections are possible as well.
 The axon is the primary conduit through which the neuron transmits impulses to neurons downstream in the signal chain.
 A human has about 10^11 neurons of more than 20 types, 10^14 synapses, and a cycle time of 1–10 ms.
 For comparison: there are about 10^11 stars in a galaxy, and about 10^11 galaxies in the universe.

Not only the impulse carries value, but also the frequency of oscillation; hence neural networks have an even higher complexity.
What are Neural Networks?

Metaphorically speaking, a thought is a specific self-oscillation of a network of neurons.

The topology of the network determines its resonance.

However, it is the resonance in the brain's interaction with the environment and with itself that creates, reinforces or decouples interaction patterns.

The brain is not a static device, but a device that is created through usage…
What are Neural Networks?

What we refer to as Neural Networks, in this course, are mostly Artificial Neural Networks (ANN).

ANN are approximations of biological neural networks and are either built as physical devices or simulated on computers.

ANN are parallel computational entities that consist of multiple simple processing units that are connected in specific ways in order to perform the desired tasks.

Remember: ANN are computationally primitive approximations of the real biological brains.
Perceptron
Artificial Neurons – the McCulloch-Pitts “Unit”
 Output is a „squashed“ linear function of the inputs: a_{i} ← g(in_{i}) = g(Σ_{j} W_{j,i} a_{j})
 A clear oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do.
Activation Functions

(a) is a step function or threshold function (also called signum function)

(b) is the sigmoid function 1/(1+e^{-x})

Changing the bias weight W_{0,i} moves the threshold location
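The two activation functions can be sketched as follows; a minimal illustration (the function names and the exact threshold convention are our own):

```python
import math

def step(z):
    # Step / threshold activation: output 1 once the weighted input
    # reaches the threshold (here folded into z via the bias weight).
    return 1 if z >= 0 else 0

def sigmoid(z):
    # Sigmoid activation 1 / (1 + e^(-z)): a smooth, differentiable
    # "squashing" of the weighted input into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))
```

Because the bias weight is simply added into z, changing it shifts the point where step flips from 0 to 1, i.e. it moves the threshold location.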
Perceptron
 McCulloch-Pitts neurons can be connected together in any desired way to build an artificial neural network.
 A construct of one input layer of neurons that feed forward to one output layer of neurons is called a Perceptron.
Expressiveness of Perceptrons
 A perceptron with g = step function can model Boolean functions and linear classification:
 As we will see, a perceptron can represent AND, OR, NOT, but not XOR
 A perceptron represents a linear separator for the input space
Expressiveness of Perceptrons (2)
 Threshold perceptrons can represent only linearly separable functions (i.e. functions for which such a separation hyperplane exists)
 Such perceptrons have limited expressivity.
 There exists an algorithm that can fit a threshold perceptron to any linearly separable training set.
Example: Logical Functions

McCulloch and Pitts: a Boolean function can be implemented with an artificial neuron (with the exception of non-linearly-separable functions such as XOR).
Example: Finding Weights for AND Operation
 There are two input weights W_{1} and W_{2} and a threshold W_{0}. For each training pattern the perceptron needs to satisfy the following equation:
 out = g(W_{1}*a_{1} + W_{2}*a_{2} – W_{0}) = sign(W_{1}*a_{1} + W_{2}*a_{2} – W_{0})
 For a binary AND there are four training data items available that lead to four inequalities:
 W_{1}*0 + W_{2}*0 – W_{0} < 0 ⇒ W_{0} > 0
 W_{1}*0 + W_{2}*1 – W_{0} < 0 ⇒ W_{2} < W_{0}
 W_{1}*1 + W_{2}*0 – W_{0} < 0 ⇒ W_{1} < W_{0}
 W_{1}*1 + W_{2}*1 – W_{0} ≥ 0 ⇒ W_{1} + W_{2} ≥ W_{0}
 There are obviously infinitely many solutions that realize a logical AND, e.g. W_{1} = 1, W_{2} = 1 and W_{0} = 1.5.
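The solution can be checked directly; a small sketch (the function name and the truth-table dictionary are our own, the weights are from the example):

```python
def perceptron(x1, x2, w1, w2, w0):
    # out = g(w1*x1 + w2*x2 - w0), with g the step function (>= 0 -> 1),
    # matching the inequalities used for the AND training patterns.
    return 1 if w1 * x1 + w2 * x2 - w0 >= 0 else 0

# One of the infinitely many weight settings that realize a logical AND:
W1, W2, W0 = 1, 1, 1.5
truth_table = {(x1, x2): perceptron(x1, x2, W1, W2, W0)
               for x1 in (0, 1) for x2 in (0, 1)}
```

truth_table then reproduces the AND function: only input (1, 1) yields output 1.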
Limitations of Simple Perceptrons
 XOR:
 W_{1}*0 + W_{2}*0 – W_{0} < 0 ⇒ W_{0} > 0
 W_{1}*0 + W_{2}*1 – W_{0} ≥ 0 ⇒ W_{2} ≥ W_{0}
 W_{1}*1 + W_{2}*0 – W_{0} ≥ 0 ⇒ W_{1} ≥ W_{0}
 W_{1}*1 + W_{2}*1 – W_{0} < 0 ⇒ W_{1} + W_{2} < W_{0}
 The 2nd and 3rd inequalities are not compatible with inequality 4, and there is no solution to the XOR problem.
 XOR requires two separation hyperplanes!
 There is thus a need for more complex networks that combine simple perceptrons to address more sophisticated classification tasks.
Artificial Neural Network Structures
Neural Network Structures
 Mathematically, artificial neural networks are represented by weighted directed graphs.
 In more practical terms, a neural network has activations flowing between processing units via one-way connections.
 There are three common artificial neural network architectures:
 Single-Layer Feed-Forward (Perceptron)
 Multi-Layer Feed-Forward
 Recurrent Neural Network
Single-Layer Feed-Forward
 A Single-Layer Feed-Forward structure is a simple perceptron and thus has one input layer of neurons that feeds forward to one output layer, but no hidden layers.

 Feed-forward networks implement functions and have no internal state (this of course also holds for multi-layer perceptrons).
Single-Layer Feed-Forward: Example
 Output units all operate separately – no shared weights
 Adjusting the weights moves the location, orientation, and steepness of the cliff (i.e., the separation hyperplane).
Multi-Layer Feed-Forward
 Multi-Layer Feed-Forward structures have:
 one input layer
 one output layer
 one or many hidden layers of processing units
 The hidden layers lie between the input and the output layer, and are thus hidden from the outside world: they receive no input from it and produce no output to it.
Multi-Layer Feed-Forward (2)
 Multi-Layer Perceptrons (MLP) have fully connected layers.
 The number of hidden units is typically chosen by hand; the more layers and units, the more complex the network (see Step 2 of Building Neural Networks).
 Hidden layers enlarge the space of hypotheses that the network can represent.
 Learning done by backpropagation algorithm → errors are backpropagated from the output layer to the hidden layers.
Simple MLP Example

XOR Problem: Recall that XOR cannot be modeled with a Single-Layer Feed-Forward perceptron.
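With one hidden layer the problem becomes solvable. The weights below are hand-picked for illustration (the OR/AND decomposition is one common choice, not necessarily the one used on the slide): one hidden unit computes OR, the other AND, and the output unit fires for "OR but not AND", i.e. two separation hyperplanes combined.

```python
def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    h_or = step(x1 + x2 - 0.5)       # hidden unit 1: fires when x1 OR x2
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2: fires when x1 AND x2
    return step(h_or - h_and - 0.5)  # output: OR and not AND = XOR
```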
Expressiveness of MLPs
 1 hidden layer can represent all continuous functions (to arbitrary accuracy)
 2 hidden layers can represent all functions
 Combining two oppositefacing threshold functions makes a ridge.
 Combining two perpendicular ridges makes a bump.
 Add bumps of various sizes and locations to fit any surface.
 The more hidden units, the more bumps.
Number of Hidden Layers

Rule of Thumb 1
Even if the function to learn is slightly nonlinear, the generalization may be better with a simple linear model than with a complicated nonlinear model if there is too little data, or too much noise, to estimate the nonlinearities accurately.

Rule of Thumb 2
If there is only one input, there seems to be no advantage to using more than one hidden layer; things get much more complicated when there are two or more inputs.
Example: Number of Hidden Layers
1st layer draws linear boundaries
2nd layer combines the boundaries
3rd layer can generate arbitrary boundaries
Recurrent Network
 Recurrent networks have at least one feedback connection:
 They have directed cycles with delays: they have internal states (like flip-flops), can oscillate, etc.
 The response to an input depends on the initial state, which may depend on previous inputs.
 This creates an internal state of the network that allows it to exhibit dynamic temporal behavior and offers a means to model short-term memory.
Building Neural Networks

Building a neural network for particular problems requires multiple steps:
 Determine the input and outputs of the problem;
 Start from the simplest imaginable network, e.g. a single feedforward perceptron;
 Find the connection weights to produce the required output from the given training data input;
 Ensure that the training data passes successfully, and test the network with other training/testing data;
 Go back to Step 3 if performance is not good enough;
 Repeat from Step 2 if Step 5 still lacks performance; or
 Repeat from Step 1 if the network still does not perform well enough.
Learning and Generalization
Learning and Generalization
 Neural networks have two important aspects to fulfill:
 They must learn decision surfaces from training data, so that training data (and test data) are classified correctly;
 They must be able to generalize based on the learning process, in order to classify data they have never seen before.
 Note that there is an important trade-off between the learning behavior and the generalization of a neural network (the problem of overfitting):
 The better a network learns to successfully classify a training sequence (that might contain errors) the less flexible it is with respect to arbitrary data.
Learning vs. Generalization
 Noise in the actual data is never a good thing, since it limits the accuracy of generalization that can be achieved no matter how extensive the training set is.
 Non-perfect learning is better in this case!
 „Perfect“ learning achieves the dotted separation, while the desired one is in fact given by the solid line.
 However, injecting artificial noise (so-called jitter) into the inputs during training is one of several ways to improve generalization.
Estimation of Generalization Error
 There are many methods for estimating generalization errors.
 Single-sample statistics
 Statistical theory provides estimators for the generalization error in nonlinear models with a "large" training set.
 Split-sample or hold-out validation.
 The most commonly used method: reserve some data as a "test set”, which must not be used during training.
 The test set must represent the cases that the ANN should generalize to.
 Running the trained network on this test set provides an unbiased estimate of the generalization error.
 The disadvantage of splitsample validation is that it reduces the amount of data available for both training and validation.
Estimation of Generalization Error
 Cross-validation (e.g., leave-one-out).
 In kfold crossvalidation the data is divided into k subsets, and the network is trained k times, each time leaving one subset out for computing the error.
 The “crossing” makes this method an improvement over split-sample validation; it allows all data to be used for training.
 The disadvantage of crossvalidation is that the network must be retrained many times (k times in kfold crossing).
 Bootstrapping.
 Bootstrapping works on random subsamples (random shares) that are chosen from the full data set.
 Any data item may be selected any number of times for validation.
 The subsamples are repeatedly analyzed.
 No matter which method is applied, the estimate of the generalization error of the best network will be optimistic.
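The k-fold scheme can be sketched as follows (a minimal sketch; the function name and the round-robin fold assignment are our own choices):

```python
def k_fold_splits(data, k):
    # Partition the data into k disjoint folds (round-robin assignment),
    # then yield (training set, validation set) pairs so that every
    # item serves for validation exactly once.
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation
```

The network would be trained k times, once per yielded pair, and the k validation errors averaged, which is exactly why the method requires retraining many times.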
Algorithm for ANN Learning

Learning is based on training data, and aims at appropriate weights for the perceptrons in a network.

Direct computation is in the general case not feasible.

Instead, the weights are initially assigned at random, and learning becomes an iterative adjustment process.

 In the case of single perceptrons, learning becomes the process of moving hyperplanes around; parametrized over time t: W_{i}(t+1) = W_{i}(t) + ΔW_{i}(t)
Perceptron Learning
 The squared error for an example with input x and desired output y is: \[ E = \frac{1}{2} Err^{2} = \frac{1}{2} (y - g_{w}(x))^{2} \]
 Perform optimization search by gradient descent: \[ \frac{\partial E}{\partial W_{j}} = Err \times \frac{\partial Err}{\partial W_{j}} = Err \times \frac{\partial}{\partial W_{j}} \Big( y - g\Big(\sum_{j=0}^{n} W_{j}x_{j}\Big) \Big) = -Err \times g'(in) \times x_{j} \]
 Simple weight update rule: W_{j} ← W_{j} + α × Err × g'(in) × x_{j}
 Positive error ⇒ increase network output
 increase weights on positive inputs,
 decrease on negative inputs
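For a sigmoid unit, where g'(in) = g(in)(1 - g(in)), one such gradient step can be sketched as follows; the list layout and names are our own, with x[0] as the fixed bias input:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_weights(w, x, y, alpha=0.1):
    # One step of  W_j <- W_j + alpha * Err * g'(in) * x_j
    # for a single sigmoid unit; w and x include the bias term.
    in_ = sum(wj * xj for wj, xj in zip(w, x))
    out = sigmoid(in_)
    err = y - out                 # Err = y - g_w(x)
    g_prime = out * (1.0 - out)   # derivative of the sigmoid at in_
    return [wj + alpha * err * g_prime * xj for wj, xj in zip(w, x)]
```

A positive error indeed raises the weights on positive inputs and lowers them on negative ones, pushing the output toward the target.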
Perceptron Learning (2)
 The weight updates need to be applied repeatedly for each weight W_{j} in the network, and for each example in the training set.
 One such cycle through all training examples is called an epoch of training.
 Eventually, usually after many epochs, the weight changes converge towards zero and the training process terminates.
 If the problem can be solved with a separation hyperplane at all (i.e. a suitable set of weights exists), perceptron learning is guaranteed to find such a set of weights within a finite number of epochs.
Perceptron Learning (3)
 The perceptron learning rule converges to a consistent function for any linearly separable data set.
 A perceptron learns the majority function easily, while decision-tree learning is hopeless at it.
 Decision-tree learning handles the restaurant function easily, while a perceptron cannot even represent it.
Example Functions
 Majority Function: the output is false when n/2 or more inputs are false, and true otherwise.
 Restaurant Function: decide whether to wait for a table or not:
 How many guests ( patron ) are in the restaurant?
 What is the estimated waiting time?
 Any alternative restaurant nearby?
 Are we hungry ?
 Do we have a reservation ?
 Is it Friday or Saturday ?
 Is there a comfortable bar area to wait in?
 Is it raining outside?
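The majority function fits a single threshold unit directly: give every input weight 1 and set the threshold to n/2 (a sketch under those assumptions; the restaurant function, by contrast, has no such single-hyperplane encoding):

```python
def majority(inputs):
    # Fires when more than half of the (0/1) inputs are 1:
    # every weight is 1 and the threshold is n/2.
    n = len(inputs)
    return 1 if sum(inputs) - n / 2 > 0 else 0
```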
Back-Propagation Learning
 The errors (and therefore the learning) propagate backwards from the output layer to the hidden layers.

Learning at the output layer is the same as for singlelayer perceptron:
 W_{j} ← W_{j} + α × Err × g'(in) × x_{j}
 Hidden layer neurons get a "blame" assigned for the error (back-propagation of error), giving greater responsibility to neurons connected by stronger weights.
 Backpropagation of error updates the weights of the hidden layer; the principle thus stays the same.
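The principle can be sketched for a tiny network with one hidden layer of sigmoid units (the sizes, names, and fixed bias input -1 are our illustrative choices, not the slide's):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    # Each weight list starts with the bias weight, paired with input -1.
    h = [sigmoid(w[0] * -1 + sum(wj * xj for wj, xj in zip(w[1:], x)))
         for w in w_hidden]
    out = sigmoid(w_out[0] * -1 + sum(wj * hj for wj, hj in zip(w_out[1:], h)))
    return h, out

def backprop_step(x, y, w_hidden, w_out, alpha=0.5):
    h, out = forward(x, w_hidden, w_out)
    delta_out = (y - out) * out * (1 - out)  # Err * g'(in) at the output
    # Each hidden unit's "blame": its share of the output error,
    # proportional to the connecting weight, times its own g'(in).
    delta_h = [w_out[i + 1] * delta_out * h[i] * (1 - h[i])
               for i in range(len(h))]
    # Update output-layer weights (same rule as for a single perceptron).
    w_out = [w_out[0] + alpha * delta_out * -1] + \
            [w_out[i + 1] + alpha * delta_out * h[i] for i in range(len(h))]
    # Update hidden-layer weights using the back-propagated deltas.
    w_hidden = [[w[0] + alpha * d * -1] +
                [w[j + 1] + alpha * d * x[j] for j in range(len(x))]
                for w, d in zip(w_hidden, delta_h)]
    return w_hidden, w_out
```

Repeating backprop_step over the training set drives the squared error down; the update applied at each layer is exactly the single-layer rule.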
Back-Propagation Learning (2)
 Training curve for 100 restaurant examples converges to a perfect fit to the training data
 MLPs are quite good for complex pattern recognition tasks, but resulting hypotheses cannot be understood easily

Typical problems: slow convergence, local minima
ILLUSTRATION BY AN EXAMPLE
Perceptron Learning Example
 The applied algorithm is as follows
 Initialize the weights and threshold to small random numbers.
 Present a vector x to the neuron inputs and calculate the output.
 Update the weights according to the error.
 Applied learning function: W_{j}(t+1) ← W_{j}(t) + α × (y − g_{w}(x)) × x_{j}
 Example with two inputs x_{1}, x_{2}
Perceptron Learning Example
 Data: (0,0)→0, (1,0)→0, (0,1)→1, (1,1)→1
 Initialization (random): W_{1}(0) = 0.92, W_{2}(0) = 0.62, W_{0}(0) = 0.22; learning rate α = 0.1

Training – epoch 1:
out1 = sign(0.92*0 + 0.62*0 – 0.22) = sign(–0.22) = 0 ✓
out2 = sign(0.92*1 + 0.62*0 – 0.22) = sign(0.7) = 1 ✗ → update:
W_{1}(1) = 0.92 + 0.1 * (0 – 1) * 1 = 0.82
W_{2}(1) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W_{0}(1) = 0.22 + 0.1 * (0 – 1) * (–1) = 0.32
out3 = sign(0.82*0 + 0.62*1 – 0.32) = sign(0.3) = 1 ✓
out4 = sign(0.82*1 + 0.62*1 – 0.32) = sign(1.12) = 1 ✓
Perceptron Learning Example

Training – epoch 2:
out1 = sign(0.82*0 + 0.62*0 – 0.32) = sign(–0.32) = 0 ✓
out2 = sign(0.82*1 + 0.62*0 – 0.32) = sign(0.5) = 1 ✗ → update:
W_{1}(2) = 0.82 + 0.1 * (0 – 1) * 1 = 0.72
W_{2}(2) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W_{0}(2) = 0.32 + 0.1 * (0 – 1) * (–1) = 0.42
out3 = sign(0.72*0 + 0.62*1 – 0.42) = sign(0.2) = 1 ✓
out4 = sign(0.72*1 + 0.62*1 – 0.42) = sign(0.92) = 1 ✓

Training – epoch 3:
out1 = sign(0.72*0 + 0.62*0 – 0.42) = sign(–0.42) = 0 ✓
out2 = sign(0.72*1 + 0.62*0 – 0.42) = sign(0.3) = 1 ✗ → update:
W_{1}(3) = 0.72 + 0.1 * (0 – 1) * 1 = 0.62
W_{2}(3) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W_{0}(3) = 0.42 + 0.1 * (0 – 1) * (–1) = 0.52
Perceptron Learning Example
out3 = sign(0.62*0 + 0.62*1 – 0.52) = sign(0.1) = 1 ✓
out4 = sign(0.62*1 + 0.62*1 – 0.52) = sign(0.72) = 1 ✓

Training – epoch 4:
out1 = sign(0.62*0 + 0.62*0 – 0.52) = sign(–0.52) = 0 ✓
out2 = sign(0.62*1 + 0.62*0 – 0.52) = sign(0.1) = 1 ✗ → update:
W_{1}(4) = 0.62 + 0.1 * (0 – 1) * 1 = 0.52
W_{2}(4) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W_{0}(4) = 0.52 + 0.1 * (0 – 1) * (–1) = 0.62
out3 = sign(0.52*0 + 0.62*1 – 0.62) = sign(0) = 0 ✗ → update:
W_{1}(4) = 0.52 + 0.1 * (1 – 0) * 0 = 0.52
W_{2}(4) = 0.62 + 0.1 * (1 – 0) * 1 = 0.72
W_{0}(4) = 0.62 + 0.1 * (1 – 0) * (–1) = 0.52
out4 = sign(0.52*1 + 0.72*1 – 0.52) = sign(0.72) = 1 ✓
Perceptron Learning Example

Finally (epoch 5, with W_{1} = 0.52, W_{2} = 0.72, W_{0} = 0.52, no further weight changes occur):
out1 = sign(0.52*0 + 0.72*0 – 0.52) = 0 ✓
out2 = sign(0.52*1 + 0.72*0 – 0.52) = sign(0) = 0 ✓
out3 = sign(0.52*0 + 0.72*1 – 0.52) = sign(0.2) = 1 ✓
out4 = sign(0.52*1 + 0.72*1 – 0.52) = sign(0.72) = 1 ✓
All four patterns are classified correctly, and training terminates.
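The whole run can be reproduced with a short training loop (a sketch; the strictly-positive output convention and the epoch limit are our reading of the trace, and floating-point rounding may make intermediate values differ slightly from the slides):

```python
def train_perceptron(data, w1, w2, w0, alpha=0.1, max_epochs=100):
    # Perceptron learning rule on 2-input patterns; the bias input is
    # fixed at -1, so out = 1 iff w1*x1 + w2*x2 - w0 > 0.
    for _ in range(max_epochs):
        changed = False
        for (x1, x2), y in data:
            out = 1 if w1 * x1 + w2 * x2 - w0 > 0 else 0
            if out != y:
                err = y - out
                w1 += alpha * err * x1
                w2 += alpha * err * x2
                w0 += alpha * err * -1
                changed = True
        if not changed:  # a full epoch without updates: converged
            break
    return w1, w2, w0

data = [((0, 0), 0), ((1, 0), 0), ((0, 1), 1), ((1, 1), 1)]
w1, w2, w0 = train_perceptron(data, 0.92, 0.62, 0.22)
```

Since the data is linearly separable, the loop is guaranteed to terminate with weights that classify all four patterns correctly.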
Further Application Examples
 There are endless further examples:
 Handwriting Recognition
 Time Series Prediction
 Kernel Machines (Support Vector Machines)
 Data Compression
 Financial Prediction
 Speech Recognition
 Computer Vision
 Protein Structures
 ...
SUMMARY
Summary
 Most brains have lots of neurons, and each neuron approximately behaves like a linear-threshold unit.
 Perceptrons (one-layer networks) approximate neurons, but are as such insufficiently expressive.
 Multi-layer networks are sufficiently expressive and can be trained, e.g. via error back-propagation, to generalize from data sets.
 Multi-layer networks allow for the modeling of arbitrary separation boundaries, while single-layer perceptrons only provide linear boundaries.
 Endless number of applications: Handwriting Recognition, Time Series Prediction, Bioinformatics, Kernel Machines (Support Vector Machines), Data Compression, Financial Prediction, Speech Recognition, Computer Vision, Protein Structures, ...
REFERENCES
References

Mandatory Readings:
 McCulloch, W.S. & Pitts, W. (1943): A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics 5, pp. 115–133.
 Rosenblatt, F. (1958): The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review 65, pp. 386–408.

Further Readings:
 Elman, J. L. (1990): Finding Structure in Time. Cognitive Science 14, pp. 179–211.
 Gallant, S. I. (1990): Perceptron-Based Learning Algorithms. IEEE Transactions on Neural Networks 1(2), pp. 179–191.
 Rumelhart, D.E., Hinton, G. E. & Williams, R. J. (1986): Learning Representations by Back-Propagating Errors. Nature 323, pp. 533–536.
 Supervised learning demo (perceptron learning rule) at http://lcn.epfl.ch/tutorial/english/perceptron/html/
References

Wikipedia References:
 http://en.wikipedia.org/wiki/Biological_neural_networks
 http://en.wikipedia.org/wiki/Artificial_neural_network
 http://en.wikipedia.org/wiki/Perceptron
 http://en.wikipedia.org/wiki/Feedforward_neural_network
 http://en.wikipedia.org/wiki/Recurrent_neural_networks
 http://en.wikipedia.org/wiki/Back_propagation
Questions?