### Agenda

• Motivation
• Technical Solution
• (Artificial) Neural Networks
• Perceptron
• Artificial Neural Network Structures
• Learning and Generalization
• Expressiveness of Multi-Layer Perceptrons
• Illustration by Larger Examples
• Summary

### Motivation

• A main motivation behind neural networks is the fact that symbolic rules do not reflect reasoning processes performed by humans.

• Biological neural systems can capture highly parallel computations based on representations that are distributed over many neurons.

• They learn and generalize from training data; no need for programming it all...

• They are very noise tolerant – better resistance than symbolic systems.

### Motivation

• Neural networks are strong in:

• Learning from a set of examples
• Optimizing solutions via constraints and cost functions
• Classification: grouping elements in classes
• Speech recognition, pattern matching
• Non-parametric statistical analysis and regressions

### What are Neural Networks?

• A biological neural network is composed of groups of chemically connected or functionally associated neurons.
• A single neuron may be connected to many other neurons and the total number of neurons and connections in a network may be extensive.
• Neurons are highly specialized cells that transmit impulses within animals to cause a change in a target cell such as a muscle effector cell or a glandular cell.
• Impulses are noisy “spike trains” of electrical potential.
• Apart from the electrical signaling, the diffusion of neurotransmitters (endogenous chemicals) also carries signals between neurons.

### What are Neural Networks?

• Connections, called synapses, are usually formed from axons to dendrites; other kinds of connections are also possible.
• The axon is the primary conduit through which the neuron transmits impulses to neurons downstream in the signal chain
• A human has about 10^11 neurons of > 20 types, 10^14 synapses, and 1 ms–10 ms cycle times.
• For comparison: there are roughly 10^11 stars in a galaxy, and 10^11 galaxies in the universe.
• Not only the impulse carries value, but also the frequency of oscillation; hence neural networks have an even higher complexity.

### What are Neural Networks?

• Metaphorically speaking, a thought is a specific self-oscillation of a network of neurons.

• The topology of the network determines its resonance.

• However, it is the resonance in the brain's interaction with the environment and with itself that creates, reinforces or decouples interaction patterns.

• The brain is not a static device, but a device that is created through usage…

### What are Neural Networks?

• What we refer to as Neural Networks, in this course, are mostly Artificial Neural Networks (ANN).

• ANN are approximations of biological neural networks and are built of physical devices, or simulated on computers.

• ANN are parallel computational entities that consist of multiple simple processing units that are connected in specific ways in order to perform the desired tasks.

• Remember: ANN are computationally primitive approximations of real biological brains.

### Artificial Neurons - McCulloch-Pitts „Unit“

• Output is a „squashed“ linear function of the inputs: $a_i \leftarrow g(in_i) = g\big(\sum_j W_{j,i}\, a_j\big)$
• A clear oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do.

### Activation Functions

(a) is a step or threshold function (e.g. the signum function)
(b) is the sigmoid function $1/(1+e^{-x})$
• Changing the bias weight W0,i moves the threshold location
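The two activation functions can be sketched in a few lines of Python (a minimal illustration, not tied to any particular library):

```python
import math

def step(x, threshold=0.0):
    """Hard threshold activation: fires (1) if the weighted input reaches the threshold."""
    return 1 if x >= threshold else 0

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x): a smooth 'squashing' of the input."""
    return 1.0 / (1.0 + math.exp(-x))
```

Note that a non-zero threshold is equivalent to a bias weight on a constant input, which is how the bias weight W0,i shifts the threshold location.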

### Perceptron

• McCulloch-Pitts neurons can be connected together in any desired way to build an artificial neural network.
• A construct of one input layer of neurons that feed forward to one output layer of neurons is called a Perceptron.

### Expressiveness of Perceptrons

• A perceptron with g = step function can model Boolean functions and linear classification:
• As we will see, a perceptron can represent AND, OR, NOT, but not XOR
• A perceptron represents a linear separator for the input space

### Expressiveness of Perceptrons (2)

• Threshold perceptrons can represent only linearly separable functions (i.e. functions for which such a separation hyperplane exists)
• Such perceptrons have limited expressivity.
• There exists an algorithm that can fit a threshold perceptron to any linearly separable training set.

### Example: Logical Functions

• McCulloch and Pitts: a Boolean function can be implemented with an artificial neuron (but not XOR).

### Example: Finding Weights for AND Operation

• There are two input weights W1 and W2 and a threshold W0. For each training pattern the perceptron needs to satisfy the following equation:
out = g(W1*a1 + W2*a2 – W0) = sign(W1*a1 + W2*a2 – W0)
• For a binary AND there are four training data items available that lead to four inequalities:
• W1*0 + W2*0 – W0 < 0        ⇒ W0 > 0
• W1*0 + W2*1 – W0 < 0        ⇒ W2 < W0
• W1*1 + W2*0 – W0 < 0        ⇒ W1 < W0
• W1*1 + W2*1 – W0 ≥ 0        ⇒ W1 + W2 ≥ W0
• There is an obvious infinite number of solutions that realize a logical AND; e.g. W1 = 1, W2 = 1 and W0 = 1.5.
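The weights found above can be checked directly against the AND truth table; a small sketch (the function name `perceptron_and` is ours, not from the slides):

```python
def perceptron_and(a1, a2, W1=1.0, W2=1.0, W0=1.5):
    """Threshold perceptron with the slide's AND weights W1 = 1, W2 = 1, W0 = 1.5."""
    return 1 if W1 * a1 + W2 * a2 - W0 >= 0 else 0
```

Any other weights satisfying the four inequalities (e.g. W1 = W2 = 2, W0 = 3) would work equally well.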

### Limitations of Simple Perceptrons

• XOR:
• W1*0 + W2*0 – W0 < 0        ⇒ W0 > 0
• W1*0 + W2*1 – W0 ≥ 0        ⇒ W2 ≥ W0
• W1*1 + W2*0 – W0 ≥ 0        ⇒ W1 ≥ W0
• W1*1 + W2*1 – W0 < 0        ⇒ W1 + W2 < W0
• The 2nd and 3rd inequalities are not compatible with inequality 4, and there is no solution to the XOR problem.
• XOR requires two separation hyperplanes!
• There is thus a need for more complex networks that combine simple perceptrons to address more sophisticated classification tasks.

### Neural Network Structures

• Mathematically, artificial neural networks are represented by weighted directed graphs.
• In more practical terms, a neural network has activations flowing between processing units via one-way connections.

• There are three common artificial neural network architectures known:
• Single-Layer Feed-Forward (Perceptron)
• Multi-Layer Feed-Forward
• Recurrent Neural Network

### Single-Layer Feed-Forward

• A Single-Layer Feed-Forward Structure is a simple perceptron, and thus has:
• one input layer
• one output layer
• NO feed-back connections
• Feed-forward networks implement functions, have no internal state (of course also valid for multi-layer perceptrons).

### Single-Layer Feed-Forward: Example

• Output units all operate separately – no shared weights
• Adjusting weights moves the location, orientation, and steepness of cliff (i.e., the separation hyperplane).

### Multi-Layer Feed-Forward

• Multi-Layer Feed-Forward Structures have:
• one input layer
• one output layer
• one or many hidden layers of processing units
• The hidden layers are between the input and the output layer, and thus hidden from the outside world: no input from the world, no output to the world.

### Multi-Layer Feed-Forward (2)

• Multi-Layer Perceptrons (MLP) have fully connected layers.
• The number of hidden units is typically chosen by hand; the more layers, the more complex the network (see Step 2 of Building Neural Networks)
• Hidden layers enlarge the space of hypotheses that the network can represent.
• Learning done by back-propagation algorithm → errors are back-propagated from the output layer to the hidden layers.

### Simple MLP Example

• XOR Problem: Recall that XOR cannot be modeled with a Single-Layer Feed-Forward perceptron.
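One hand-wired solution (weights chosen by us for illustration, not taken from the lecture): the hidden units compute OR and AND of the inputs, and the output fires for "OR but not AND", which is exactly XOR:

```python
def step(x):
    """Hard threshold activation."""
    return 1 if x >= 0 else 0

def xor_mlp(a, b):
    """XOR via one hidden layer of threshold units; all weights set by hand."""
    h_or  = step(a + b - 0.5)   # fires if at least one input is 1 (OR)
    h_and = step(a + b - 1.5)   # fires only if both inputs are 1 (AND)
    return step(h_or - h_and - 0.5)  # "OR but not AND"
```

Each hidden unit contributes one separation line; their combination yields the two hyperplanes that XOR requires.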

### Expressiveness of MLPs

• 1 hidden layer can represent all continuous functions
• 2 hidden layers can represent all functions
• Combining two opposite-facing threshold functions makes a ridge.
• Combining two perpendicular ridges makes a bump.
• Add bumps of various sizes and locations to fit any surface.
• The more hidden units, the more bumps.

### Number of Hidden Layers

• Rule of Thumb 1
Even if the function to learn is slightly non-linear, generalization may be better with a simple linear model than with a complicated non-linear model if there is too little data or too much noise to estimate the non-linearities accurately.

• Rule of Thumb 2
If there is only one input, there seems to be no advantage to using more than one hidden layer; things get much more complicated when there are two or more inputs.

### Example: Number of Hidden Layers

• The 1st layer draws linear boundaries.
• The 2nd layer combines the boundaries.
• The 3rd layer can generate arbitrary boundaries.

### Recurrent Network

• Recurrent networks have at least one feedback connection:
• They have directed cycles with delays: they have internal states (like flip flops), can oscillate, etc.
• The response to an input depends on the initial state which may depend on previous inputs.
• This creates an internal state of the network which allows it to exhibit dynamic temporal behaviour; offers means to model short-time memory

### Building Neural Networks

• Building a neural network for particular problems requires multiple steps:
1. Determine the input and outputs of the problem;
2. Start from the simplest imaginable network, e.g. a single feed-forward perceptron;
3. Find the connection weights to produce the required output from the given training data input;
4. Ensure that the training data passes successfully, and test the network with other training/testing data;
5. Go back to Step 3 if performance is not good enough;
6. Repeat from Step 2 if Step 5 still lacks performance; or
7. Repeat from Step 1 if the network in Step 6 still does not perform well enough.

### Learning and Generalization

• Neural networks have two important aspects to fulfill:
• They must learn decision surfaces from training data, so that training data (and test data) are classified correctly;
• They must be able to generalize based on the learning process, in order to classify data sets it has never seen before.
• Note that there is an important trade-off between the learning behavior and the generalization of a neural network (called over-fitting)
• The better a network learns to successfully classify a training sequence (that might contain errors) the less flexible it is with respect to arbitrary data.

### Learning vs. Generalization

• Noise in the actual data is never a good thing, since it limits the accuracy of generalization that can be achieved no matter how extensive the training set is.
• Non-perfect learning is better in this case!
• „Perfect“ learning achieves the dotted separation, while the desired one is in fact given by the solid line.
• However, injecting artificial noise (so-called jitter) into the inputs during training is one of several ways to improve generalization

### Estimation of Generalization Error

• There are many methods for estimating generalization errors.
• Single-sample statistics
• Statistical theory provides estimators for the generalization error in non-linear models with a "large" training set.
• Split-sample or hold-out validation.
• The most commonly used method: reserve some data as a "test set”, which must not be used during training.
• The test set must represent the cases that the ANN should generalize to.
• Evaluating on this randomly chosen test set provides an unbiased estimate of the generalization error.
• The disadvantage of split-sample validation is that it reduces the amount of data available for training.

### Estimation of Generalization Error

• Cross-validation (e.g., leave one out).
• In k-fold cross-validation the data is divided into k subsets, and the network is trained k times, each time leaving one subset out for computing the error.
• The “crossing” makes this method an improvement over split-sampling; it allows all data to be used for training.
• The disadvantage of cross-validation is that the network must be re-trained many times (k times in k-fold crossing).
• Bootstrapping.
• Bootstrapping works on random sub-samples (random shares) that are chosen from the full data set.
• Any data item may be selected any number of times for validation.
• The sub-samples are repeatedly analyzed.
• No matter which method is applied, the estimate of the generalization error of the best network will be optimistic.
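The k-fold splitting described above can be sketched as a small index generator (a minimal illustration; real projects would typically use a library routine such as scikit-learn's `KFold`):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation over n items.

    Each of the k folds serves once as the validation set while the
    remaining k-1 folds form the training set, so every item is used
    for both training and validation across the k runs.
    """
    # Distribute n items over k folds as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    idx = list(range(n))
    start = 0
    for size in fold_sizes:
        val = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, val
        start += size
```

Setting k = n gives leave-one-out cross-validation; the price, as noted above, is k full re-trainings of the network.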

### Algorithm for ANN Learning

• Learning is based on training data, and aims at appropriate weights for the perceptrons in a network.

• Direct computation is in the general case not feasible.

• Starting from an initial random assignment of weights, learning becomes an iterative adjustment process.

• In the case of single perceptrons, learning becomes the process of moving hyperplanes around; parametrized over time t: Wi(t+1) = Wi(t) + Δ Wi(t)

### Perceptron Learning

• The squared error for an example with input x and desired output y is:
• $E = \frac{1}{2} Err^{2} = \frac{1}{2} (y-g_{w}(x))^{2}$
• Perform optimization search by gradient descent:
• $\frac{\partial E}{\partial W_{j}} = Err \times \frac{\partial Err}{\partial W_{j}} = Err \times \frac{\partial }{\partial W_{j}} (y-g (\sum_{j=0}^{n} W_{j}x_{j})) = - Err \times g'(in) \times x_{j}$
• Simple weight update rule: $W_j \leftarrow W_j + \alpha \times Err \times g'(in) \times x_j$
• Positive error ⇒ increase network output
• increase weights on positive inputs,
• decrease on negative inputs
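A single numeric step of this rule for one sigmoid unit, with hypothetical weights and inputs chosen only for illustration (the bias is treated as weight W0 on a constant input x0 = 1):

```python
import math

def g(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

w = [0.1, 0.4, -0.2]   # W0 (bias weight), W1, W2 -- hypothetical values
x = [1.0, 0.5, 1.0]    # x0 = 1 (bias input), x1, x2
y = 1.0                # desired output
alpha = 0.5            # learning rate

in_ = sum(wj * xj for wj, xj in zip(w, x))   # weighted input
err = y - g(in_)                             # Err = y - g_w(x)
g_prime = g(in_) * (1.0 - g(in_))            # sigmoid derivative g'(in)

# Update rule: Wj <- Wj + alpha * Err * g'(in) * xj
w = [wj + alpha * err * g_prime * xj for wj, xj in zip(w, x)]

new_in = sum(wj * xj for wj, xj in zip(w, x))
```

Since the error here is positive and the inputs are positive, every weight increases, which pushes the output up and shrinks the error, as stated above.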

### Perceptron Learning (2)

• The weight updates need to be applied repeatedly for each weight Wj in the network, and for each training example in the training set.
• One such cycle through all training examples is called an epoch of training.
• Eventually, mostly after many epochs, the weight changes converge towards zero and the training process terminates.
• The perceptron learning process always finds a set of weights that solves the problem correctly within a finite number of epochs, provided such a set of weights exists.
• In other words: if the problem can be solved with a separation hyperplane, the learning rule converges to a correct solution in finitely many iterations.

### Perceptron Learning (3)

• Perceptron learning rule converges to a consistent function for any linearly separable data set
• Perceptron learns majority function easily, Decision-Tree is hopeless
• Decision-Tree learns restaurant function easily, perceptron cannot represent it.

### Example Functions

• Majority Function: the output is false when n/2 or more inputs are false, and true otherwise.
• Restaurant Function: decide whether to wait for a table or not:
• How many guests ( patron ) are in the restaurant?
• What is the estimated waiting time?
• Any alternative restaurant nearby?
• Are we hungry ?
• Do we have a reservation ?
• Is it Friday or Saturday ?
• Is there a comfortable bar area to wait in?
• Is it raining outside?

### Back-Propagation Learning

• The errors (and therefore the learning) propagate backwards from the output layer to the hidden layers.
• Learning at the output layer is the same as for single-layer perceptron:

Wj ← Wj + α × Err × g'(in) × xj
• Hidden layer neurons get a "blame" assigned for the error (back-propagation of error), giving greater responsibility to neurons connected by stronger weights.
• Back-propagation of error updates the weights of the hidden layer; the principle thus stays the same.

### Back-Propagation Learning (2)

• Training curve for 100 restaurant examples converges to a perfect fit to the training data
• MLPs are quite good for complex pattern recognition tasks, but resulting hypotheses cannot be understood easily
• Typical problems: slow convergence, local minima
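A minimal back-propagation sketch for the XOR problem, using a small sigmoid MLP; the architecture, random seed, learning rate, and iteration count are arbitrary choices of ours, not the lecture's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR training data: inputs and desired outputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 2 inputs -> 4 hidden sigmoid units -> 1 output sigmoid unit.
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
alpha = 0.5

losses = []
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)        # hidden layer activations
    out = sigmoid(h @ W2 + b2)      # network output
    err = y - out
    losses.append(float(np.mean(err ** 2)))
    # Backward pass: output delta, then hidden "blame" weighted by W2.
    d_out = err * out * (1 - out)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # Gradient updates, output layer first (same rule as the single perceptron).
    W2 += alpha * h.T @ d_out; b2 += alpha * d_out.sum(axis=0)
    W1 += alpha * X.T @ d_hid; b1 += alpha * d_hid.sum(axis=0)
```

The squared error drops steadily over training, though as noted above convergence can be slow and the search can get trapped in local minima depending on the initial random weights.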

### Perceptron Learning Example

• The applied algorithm is as follows
• Initialize the weights and threshold to small random numbers.
• Present a vector x to the neuron inputs and calculate the output.
• Update the weights according to the error.
• Applied learning function: Wj(t+1) ← Wj(t) + α × (y-gw(x)) × xj
• Example with two inputs x1, x2

### Perceptron Learning Example

• Data: (0,0)→0, (1,0)→0, (0,1)→1, (1,1)→1
• Initialization: W1(0) = 0.9286, W2(0) = 0.62, W0(0) = 0.2236, α = 0.1 (weights rounded to two decimals in the following).
• Training – epoch 1:
 out1 = sign(0.92*0 + 0.62*0 – 0.22) = sign(–0.22) = 0
 out2 = sign(0.92*1 + 0.62*0 – 0.22) = sign(0.70) = 1 ✗
  W1(1) = 0.92 + 0.1 * (0 – 1) * 1 = 0.82
  W2(1) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
  W0(1) = 0.22 + 0.1 * (0 – 1) * (–1) = 0.32
 out3 = sign(0.82*0 + 0.62*1 – 0.32) = sign(0.30) = 1
 out4 = sign(0.82*1 + 0.62*1 – 0.32) = sign(1.12) = 1

### Perceptron Learning Example

• Training – epoch 2:

 out1 = sign(0.82*0 + 0.62*0 – 0.32) = sign(–0.32) = 0
 out2 = sign(0.82*1 + 0.62*0 – 0.32) = sign(0.50) = 1 ✗
  W1(2) = 0.82 + 0.1 * (0 – 1) * 1 = 0.72
  W2(2) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
  W0(2) = 0.32 + 0.1 * (0 – 1) * (–1) = 0.42
 out3 = sign(0.72*0 + 0.62*1 – 0.42) = sign(0.20) = 1
 out4 = sign(0.72*1 + 0.62*1 – 0.42) = sign(0.92) = 1
• Training – epoch 3:

 out1 = sign(0.72*0 + 0.62*0 – 0.42) = sign(–0.42) = 0
 out2 = sign(0.72*1 + 0.62*0 – 0.42) = sign(0.30) = 1 ✗
  W1(3) = 0.72 + 0.1 * (0 – 1) * 1 = 0.62
  W2(3) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
  W0(3) = 0.42 + 0.1 * (0 – 1) * (–1) = 0.52

### Perceptron Learning Example

 out3 = sign(0.62*0 + 0.62*1 – 0.52) = sign(0.10) = 1
 out4 = sign(0.62*1 + 0.62*1 – 0.52) = sign(0.72) = 1
• Training – epoch 4:

 out1 = sign(0.62*0 + 0.62*0 – 0.52) = sign(–0.52) = 0
 out2 = sign(0.62*1 + 0.62*0 – 0.52) = sign(0.10) = 1 ✗
  W1(4) = 0.62 + 0.1 * (0 – 1) * 1 = 0.52
  W2(4) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
  W0(4) = 0.52 + 0.1 * (0 – 1) * (–1) = 0.62
 out3 = sign(0.52*0 + 0.62*1 – 0.62) = sign(0.00) = 0 ✗
  W1(4) = 0.52 + 0.1 * (1 – 0) * 0 = 0.52
  W2(4) = 0.62 + 0.1 * (1 – 0) * 1 = 0.72
  W0(4) = 0.62 + 0.1 * (1 – 0) * (–1) = 0.52
 out4 = sign(0.52*1 + 0.72*1 – 0.52) = sign(0.72) = 1

### Perceptron Learning Example

• Finally, after further epochs, the weights converge (here to W1 = 0.12, W2 = 0.82, W0 = 0.42) and all four patterns are classified correctly:

 out1 = sign(0.12*0 + 0.82*0 – 0.42) = sign(–0.42) = 0
 out2 = sign(0.12*1 + 0.82*0 – 0.42) = sign(–0.30) = 0
 out3 = sign(0.12*0 + 0.82*1 – 0.42) = sign(0.40) = 1
 out4 = sign(0.12*1 + 0.82*1 – 0.42) = sign(0.52) = 1
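The whole worked example can be reproduced with a short training loop (initial weights rounded to two decimals as in the epochs above; the exact final weights depend on rounding and may therefore differ from the ones shown):

```python
def sign(x):
    """Threshold output as used in the slides: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

# Training data: (x1, x2) -> y, i.e. (0,0)->0, (1,0)->0, (0,1)->1, (1,1)->1.
data = [((0, 0), 0), ((1, 0), 0), ((0, 1), 1), ((1, 1), 1)]

w1, w2, w0 = 0.92, 0.62, 0.22   # rounded initial weights from the slides
alpha = 0.1                     # learning rate

for epoch in range(50):
    errors = 0
    for (x1, x2), y in data:
        out = sign(w1 * x1 + w2 * x2 - w0)
        err = y - out
        if err != 0:            # update only on misclassification
            errors += 1
            w1 += alpha * err * x1
            w2 += alpha * err * x2
            w0 += alpha * err * (-1)   # threshold acts as weight on input -1
    if errors == 0:             # a full epoch without errors: converged
        break
```

Because the data is linearly separable, the perceptron convergence theorem guarantees the loop terminates after finitely many updates.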

### Further Application Examples

• There are endless further examples:
• Handwriting Recognition
• Time Series Prediction
• Kernel Machines (Support Vector Machines)
• Data Compression
• Financial Prediction
• Speech Recognition
• Computer Vision
• Protein Structures
• ...

### Summary

• Most brains have vast numbers of neurons; each neuron can be approximated by a linear-threshold unit.
• Perceptrons (one-layer networks) approximate neurons, but are as such insufficiently expressive.
• Multi-layer networks are sufficiently expressive; can be trained to deal with generalized data sets, i.e. via error back-propagation.
• Multi-layer networks allow for the modeling of arbitrary separation boundaries, while single-layer perceptrons only provide linear boundaries.
• Endless number of applications: Handwriting Recognition, Time Series Prediction, Bioinformatics, Kernel Machines (Support Vector Machines), Data Compression, Financial Prediction, Speech Recognition, Computer Vision, Protein Structures...

### References

• McCulloch, W.S. & Pitts, W. (1943): A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics 5, pp. 115-133.
• Rosenblatt, F. (1958): The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Reviews 65, pp. 386-408.
• Elman, J. L. (1990): Finding structure in time. Cognitive Science 14, pp.179–211.
• Gallant, S. I. (1990): Perceptron-based learning algorithms. IEEE Transactions on Neural Networks 1 (2), pp. 179-191.
• Rumelhart, D.E., Hinton, G. E. & Williams, R. J. (1986): Learning representations by back-propagating errors. Nature 323, pp. 533-536.
• Supervised learning demo (perceptron learning rule) at http://lcn.epfl.ch/tutorial/english/perceptron/html/

### References

Wikipedia References:
• http://en.wikipedia.org/wiki/Biological_neural_networks
• http://en.wikipedia.org/wiki/Artificial_neural_network
• http://en.wikipedia.org/wiki/Perceptron
• http://en.wikipedia.org/wiki/Feedforward_neural_network
• http://en.wikipedia.org/wiki/Recurrent_neural_networks
• http://en.wikipedia.org/wiki/Back_propagation