Rule Learning
Slides in this lecture are partially based
on [1], Section 3 “Decision Tree Learning”.
MOTIVATION
Research on learning in AI is made up of diverse subfields (cf. [5])
Problem solving methods
Discrimination nets
Machine Learning is a central research area in AI for acquiring knowledge [1].
It is a means to alleviate the knowledge-acquisition "bottleneck" problem in expert systems [4].
I = {milk, bread, butter, beer}
Sample database:

ID  Milk  Bread  Butter  Beer
 1    1     1      1      0
 2    0     1      1      0
 3    0     0      0      1
 4    1     1      1      0
 5    0     1      0      0
A possible rule would be {milk, bread} → {butter}
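Support and confidence for such a rule can be checked directly against the sample database. A minimal sketch (the function names are mine, not from the lecture):

```python
# Support and confidence of the rule {milk, bread} -> {butter} over
# the sample database above (one set of items per transaction).
transactions = [
    {"milk", "bread", "butter"},  # ID 1
    {"bread", "butter"},          # ID 2
    {"beer"},                     # ID 3
    {"milk", "bread", "butter"},  # ID 4
    {"bread"},                    # ID 5
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs and rhs together) divided by support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"milk", "bread", "butter"}, transactions))       # 0.4
print(confidence({"milk", "bread"}, {"butter"}, transactions))  # 1.0
```

The rule holds in IDs 1 and 4 (support 2/5 = 0.4) and is never contradicted when milk and bread occur together (confidence 1.0).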
TECHNICAL SOLUTIONS
Many inductive knowledge acquisition algorithms generate ("induce") classifiers in the form of decision trees.
Intermediate nodes represent tests
Decision trees classify instances by sorting them down the tree from the root to some leaf node which provides the classification of the instance.
{Outlook = Sunny, Temperature = Hot,
Humidity = High, Wind = Strong}:
Negative Instance
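This sorting-down procedure is easy to sketch. The tree below is the usual PlayTennis tree from [1]; the nested-tuple representation is an assumption for illustration:

```python
# A decision tree as nested tuples: inner nodes are (attribute, branches),
# leaves carry the classification. Tree follows the PlayTennis example in [1].
tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(tree, instance):
    """Sort the instance down the tree from the root to a leaf."""
    while isinstance(tree, tuple):   # inner node: test one attribute
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree                      # leaf: the classification

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(tree, instance))  # No  (a negative instance)
```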
People are able to understand decision tree models after a brief explanation.
Decision trees are easily interpreted
Other techniques are usually specialised in analysing datasets with only one type of variable; neural networks, for example, can be used only with numerical variables.
If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic. By contrast, an artificial neural network is a black-box model: the explanation of its results is too complex to be comprehended.
Large amounts of data can be analysed on personal computers in a time short enough to enable stakeholders to make decisions based on the analysis.
RULE LEARNING TASKS
Learning of correct rules that cover a maximal number of positive examples but no negative ones. We call this the maximal requirement for each rule.
Learning of a minimal rule set that covers all positive examples. We call this the minimal requirement for the rule set.
To illustrate the two rule learning tasks let’s consider a data set, containing both positive and negative examples, as depicted in the following chart
We want to learn a correct rule that covers a maximal number of positive examples but no negative examples (the maximal requirement for each rule).
If x=3 and y=3 then class = positive
The minimal requirement for the rule set is about learning a minimal rule set that covers all positive examples.
The minimal requirement for the rule set is similar to the minimal set cover problem [18].
The minimal requirement for the rule set, which basically translates to building the optimal linear decision tree, is NP-hard [19].
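The set-cover connection suggests the classic greedy approximation: repeatedly pick the rule that covers the most still-uncovered positive examples. A toy sketch, with made-up rules and coverage sets:

```python
# Greedy approximation for the set-cover view of the minimal rule set:
# each candidate rule covers a subset of the positive examples; pick
# repeatedly the rule covering most uncovered positives. The rules and
# coverage sets below are invented for illustration.
def greedy_rule_set(positives, coverage):
    """coverage: dict rule_name -> set of positive examples it covers."""
    uncovered, chosen = set(positives), []
    while uncovered:
        best = max(coverage, key=lambda r: len(coverage[r] & uncovered))
        if not coverage[best] & uncovered:
            break  # remaining positives are not coverable by any rule
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen

positives = {1, 2, 3, 4, 5}
coverage = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {4, 5}, "r4": {1}}
print(greedy_rule_set(positives, coverage))  # ['r1', 'r3']
```

The greedy choice does not guarantee the minimal set, but it comes within a logarithmic factor of it, which is essentially the best one can hope for given the NP-hardness noted above.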
RULE LEARNING APPROACHES
Rule learning can be seen as a search problem in the lattice of possible rules
Combining specialization and generalization – the search procedure can start at any arbitrary point in the lattice and can move freely up or down as needed.
Specialization and Generalization are dual search directions in a given rule set.
SPECIALIZATION
This is done by adding additional premises to the antecedent of a rule, or by restricting the range of an attribute which is used in an antecedent.
This carries the risk of ending up with results that are not maximally general.
Some examples of (heuristic) specialization algorithms are the following: ID3, AQ, C4, CN2, CABRO, FOIL, or PRISM; references at the end of the lecture.
THE ID3 ALGORITHM
Core question: Which attribute is the best classifier?
ID3 uses a statistical measure called information gain that measures how well a given attribute separates the training examples according to their target classification.
The entropy of a collection S with respect to the target classification is

  Entropy(S) = - Σ_i p_i log2 p_i

where p_i is the proportion of S belonging to class i.
Information gain measures the expected reduction in entropy caused by partitioning the examples according to an attribute A:

  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

First term: entropy of the original collection S; second term: expected value of the entropy after S is partitioned using attribute A (S_v is the subset of S for which attribute A has value v).
Gain(S,A): The expected reduction in entropy caused by knowing the value of attribute A.
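These quantities can be computed directly from class counts. The sketch below (function names are mine) checks the Outlook split of the PlayTennis data from [1]: S = [9+, 5-] splits into Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]:

```python
from math import log2

def entropy(counts):
    """Entropy of a collection given its class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(parent_counts, partitions):
    """Gain(S, A) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)."""
    total = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(part) / total * entropy(part) for part in partitions)

# Outlook split of the PlayTennis data from [1]:
gain_outlook = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(gain_outlook)  # ~0.2467 (quoted as 0.246 in the lecture)
```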
ID3 uses information gain to select the best attribute at each step in growing the tree.
ID3’s hypothesis space is the complete space of finite discrete-valued functions, relative to the available attributes.
ID3 maintains only a single current hypothesis as it searches through the space of trees (earlier, perhaps better, candidates are eliminated).
ID3 performs no backtracking in its search; it may converge to locally optimal solutions that are not globally optimal.
ID3 uses all training data at each step to make statistically based decisions regarding how to refine its current hypothesis.
ID3 searches a complete hypothesis space.
Termination condition: Discovery of a hypothesis consistent with the data
ID3’s inductive bias is a preference bias (it prefers certain hypotheses over others).
Alternatively: A restriction bias categorically restricts considered hypotheses.
…or why prefer short hypotheses?
William of Occam was one of the first to discuss this question, around the year 1320.
Several arguments have been discussed since then.
Very complex hypotheses fail to generalize correctly to subsequent data.
Target attribute: PlayTennis (values: yes / no)
ID3 determines information gain for each candidate attribute (e.g., Outlook, Temperature, Humidity, and Wind) and selects the one with the highest information gain
Gain(S, Outlook) = 0.246 ; Gain(S, Humidity) = 0.151; Gain(S, Wind) = 0.048; Gain(S, Temperature)=0.029
Determining new attributes at the "Sunny" branch, using only the examples classified there:
The training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero).
GENERALIZATION
The generalization procedure stops if no more premises to remove exist.
However, generalization of course risks deriving final results that are not most specific.
RELAX is an example of a generalization-based algorithm; references at the end of the lecture.
THE RELAX ALGORITHM
1) x ∩ ¬ y ∩ z → C
2) ¬ y ∩ z → C
3) x ∩ z → C
4) x ∩ ¬ y → C
5) x → C
6) ¬ y → C
7) z → C
8) → C
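RELAX explores generalizations of a most specific rule by dropping premises: the eight rules above are exactly the subsets of the premise set {x, ¬y, z}. A minimal sketch of this enumeration:

```python
from itertools import combinations

# RELAX-style generalization: starting from the most specific rule
# x ∩ ¬y ∩ z -> C, every subset of the premise set is a candidate
# generalization -- exactly the eight rules listed above.
def generalizations(premises):
    """All rules obtained by dropping premises, most specific first."""
    return [sorted(c) for r in range(len(premises), -1, -1)
            for c in combinations(premises, r)]

rules = generalizations(["x", "not y", "z"])
print(len(rules))   # 8
print(rules[-1])    # []  -- the premise-free rule "-> C"
```

In general a rule with n premises has 2^n candidate generalizations, which is why RELAX-style search needs heuristics to prune this space.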
COMBINING SPECIALIZATION AND GENERALIZATION
THE JOJO ALGORITHM
In general, it cannot be determined which search direction is the better one.
JoJo is an algorithm that combines both search directions in one heuristic search procedure.
JoJo can start at an arbitrary point in the lattice of complexes and generalizes and specializes as long as quality and correctness can be improved, i.e., until a local optimum is reached or no more search resources are available (e.g., time, memory).
Either of the two strategies can be used interchangeably, which makes JoJo more expressive than comparable algorithms that apply the two in a fixed order (e.g., ID3).
Horizontal position (chosen premises)
Reminder: JoJo can start at any arbitrary point, while specialization requires a highly general point and generalization requires a most specific point.
In general, it is possible to carry out several runs of JoJo with different starting points. Rules that were already produced can be used as subsequent starting points.
Criteria for choosing a vertical position:
Approximation of possible length, or experience from earlier runs.
Random production of rules; distribution by means of the average correctness of rules with the same length (so-called quality criterion).
Start with a small sample or very limited resources to discover a real starting point from an arbitrary one.
Randomly chosen starting point (same average expectation of success as starting with ‘Top‘ or ‘Bottom‘).
Heuristic: few positive examples and maximally specific descriptions suggest long rules; few negative examples and maximally general descriptions suggest rather short rules.
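The combined search can be sketched as a hill-climb that considers one-step generalizations and specializations at once. Everything below (the example data, the premise representation, and the quality score) is an illustrative assumption, not JoJo's actual criteria:

```python
# Toy sketch of JoJo's combined search: from an arbitrary starting rule
# (a set of premises), evaluate all one-step generalizations (drop a
# premise) and specializations (add a premise) and move while a quality
# score improves. Examples are sets of attributes.
def covers(premises, example):
    return premises <= example  # rule fires if all premises hold

def quality(premises, positives, negatives):
    # crude stand-in for JoJo's quality/correctness criteria:
    # reward covered positives, penalise covered negatives
    return (sum(covers(premises, e) for e in positives)
            - 2 * sum(covers(premises, e) for e in negatives))

def jojo(start, attributes, positives, negatives):
    current = set(start)
    while True:
        neighbours = [current - {p} for p in sorted(current)]                # generalize
        neighbours += [current | {a} for a in sorted(attributes - current)]  # specialize
        best = max(neighbours, key=lambda n: quality(n, positives, negatives))
        if quality(best, positives, negatives) <= quality(current, positives, negatives):
            return current  # local optimum in the lattice of rules
        current = best

positives = [{"a", "b"}, {"a", "b", "c"}]
negatives = [{"a"}, {"c"}]
# Starting mid-lattice at {a}, the search specializes to {a, b}:
print(sorted(jojo({"a"}, {"a", "b", "c"}, positives, negatives)))  # ['a', 'b']
```

Starting at a different point the same procedure may move upward instead, which is exactly the freedom that the fixed-direction algorithms lack.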
REFINEMENT OF RULES WITH JOJO
Refinement of rules refers to the modification of a given rule set based on additional examples.
The input to the task is a so-called hypothesis (a set of rules) and a set of old and new positive and negative examples.
The output of the algorithm is a refined set of rules and the total set of examples.
The new set of rules is correct, complete, nonredundant and (if necessary) minimal.
There is a four-step procedure for the refinement of rules:
Rules that become incorrect because of new examples are refined: incorrect rules are replaced by new rules that cover the positive examples, but not the new negative ones.
Complete the rule set to cover new positive examples.
Redundant rules are deleted to correct the rule set.
Steps 1 and 2 are subject to the algorithm JoJo that integrates generalization and specialization via a heuristic search procedure.
Replace a rule by a new set of rules that cover the positive examples, but not the negative ones.
Compute new correct rules that cover the not yet considered positive examples (up to a threshold).
Remove rules that are more specific than other rules (i.e. rules that have premises that are a superset of the premises of another rule).
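The redundancy-removal step (dropping rules whose premises are a superset of another rule's premises) can be sketched as follows; the rule representation is an assumption:

```python
# Delete redundant rules: a rule is redundant if another rule with the
# same conclusion has a strict subset of its premises (the more general
# rule already covers everything the specific one does).
def remove_redundant(rules):
    """rules: list of (premises, conclusion) pairs, premises as frozensets."""
    return [r for r in rules
            if not any(other is not r and other[1] == r[1] and other[0] < r[0]
                       for other in rules)]

rules = [(frozenset({"x"}), "C"),
         (frozenset({"x", "y"}), "C"),  # redundant: {x} -> C is more general
         (frozenset({"z"}), "C")]
print([sorted(p) for p, _ in remove_redundant(rules)])  # [['x'], ['z']]
```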
ILLUSTRATION BY A LARGER EXAMPLE: ID3
EXTENSIONS
THE C4.5 ALGORITHM

ID3 would search for new refinements at the bottom of the left tree.
Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing overfitting to occur.
Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy.
Rule accuracy is estimated on the training set using a pessimistic estimate: C4.5 calculates the standard deviation and takes the lower bound as the estimate of rule performance.
Central question: How to select the best value for a threshold c?
Approach:
Sort examples according to the continuous attribute A.
Identify adjacent examples that differ in their target classification.
Generate a set of candidate thresholds midway between the corresponding values of A (c must always lie at such a boundary, see [5]).
Information gain for the first threshold is higher.
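The threshold construction can be sketched on the Temperature example from [1] (six examples with values 40, 48, 60, 72, 80, 90 and labels No, No, Yes, Yes, Yes, No); the helper names are mine:

```python
from math import log2

# Candidate thresholds for a continuous attribute: sort by the attribute,
# place a candidate midway between adjacent examples whose class differs,
# then score each candidate by information gain.
def entropy(labels):
    counts = [labels.count(c) for c in set(labels)]
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def candidate_thresholds(examples):
    """examples: (value, label) pairs; midpoints at class boundaries."""
    examples = sorted(examples)
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(examples, examples[1:]) if la != lb]

def gain(examples, threshold):
    labels = [l for _, l in examples]
    left = [l for v, l in examples if v < threshold]
    right = [l for v, l in examples if v >= threshold]
    return entropy(labels) - (len(left) / len(labels) * entropy(left)
                              + len(right) / len(labels) * entropy(right))

examples = [(40, "No"), (48, "No"), (60, "Yes"),
            (72, "Yes"), (80, "Yes"), (90, "No")]
print(candidate_thresholds(examples))               # [54.0, 85.0]
print(gain(examples, 54.0) > gain(examples, 85.0))  # True
```

Only two candidates arise (54 and 85), and the first one wins on information gain, consistent with the statement above.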