### Agenda

• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting?—Pattern Evaluation Methods
• Summary

### What Is Frequent Pattern Analysis?

• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
• What products were often purchased together?— Beer and diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
• Applications
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

### Why Is Freq. Pattern Mining Important?

• Freq. pattern: An intrinsic and important property of datasets
• Foundation for many essential data mining tasks
• Association, correlation, and causality analysis
• Sequential, structural (e.g., sub-graph) patterns
• Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
• Classification: discriminative, frequent pattern analysis
• Cluster analysis: frequent pattern-based clustering
• Data warehousing: iceberg cube and cube-gradient
• Semantic data compression: fascicles

### Basic Concepts: Frequent Patterns

• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (Absolute) support, or support count, of X: the frequency (number of occurrences) of itemset X
• (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold

### Basic Concepts: Association Rules

• Goal: find all rules X → Y with minimum support and confidence
• Support, s: the probability that a transaction contains X ∪ Y
• Confidence, c: the conditional probability that a transaction having X also contains Y
• Let minsup = 50%, minconf = 50%
• Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
• Association rules (many more exist; verified in the sketch below):
• Beer → Diaper (60%, 100%)
• Diaper → Beer (60%, 75%)
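
As a concrete check of these numbers, here is a minimal Python sketch, assuming the small five-transaction basket that the counts above suggest (the exact transaction contents are illustrative, not taken from the slide):

```python
# Illustrative transactions consistent with the counts above
# (Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3).
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(t >= itemset for t in transactions) / len(transactions)

def confidence(x, y):
    """Confidence of rule x -> y: sup(x ∪ y) / sup(x)."""
    return support(x | y) / support(x)

print(support({"Beer", "Diaper"}))       # 0.6  (60%)
print(confidence({"Beer"}, {"Diaper"}))  # 1.0  (100%)
print(confidence({"Diaper"}, {"Beer"}))  # 0.75 (75%)
```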

### Closed Patterns and Max-Patterns

• A long pattern contains a combinatorial number of sub-patterns. For example, {a1, …, a100} contains

$\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$

sub-patterns!
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
• The set of closed patterns is a lossless compression of the frequent patterns
• Reduces the number of patterns and rules

### Closed Patterns and Max-Patterns

• Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
• Min_sup = 1
• What is the set of closed itemsets?
• <a1, …, a100>: 1
• <a1, …, a50>: 2
• What is the set of max-patterns?
• <a1, …, a100>: 1
• What is the set of all frequent patterns?
• Every non-empty subset of {a1, …, a100}: 2^100 - 1 patterns in total, far too many to list! (A brute-force check on a scaled-down version follows.)
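
The 100-item version is infeasible to enumerate, but a brute-force sketch on a scaled-down analogue (6 and 3 items instead of 100 and 50; all helper names are mine) makes the definitions concrete:

```python
from itertools import chain, combinations

# Scaled-down analogue of the exercise: <a1..a6> and <a1..a3>, min_sup = 1.
db = [frozenset(f"a{i}" for i in range(1, 7)),
      frozenset(f"a{i}" for i in range(1, 4))]
min_sup = 1

def powerset(items):
    s = sorted(items)
    return chain.from_iterable(combinations(s, k) for k in range(1, len(s) + 1))

# Support count of every candidate itemset over the union of all items.
all_items = frozenset().union(*db)
support = {frozenset(c): sum(frozenset(c) <= t for t in db)
           for c in powerset(all_items)}
frequent = {x: n for x, n in support.items() if n >= min_sup}

# Closed: no frequent superset has the same support.
closed = {x for x in frequent
          if not any(y > x and frequent[y] == frequent[x] for y in frequent)}
# Max: no frequent superset at all.
maximal = {x for x in frequent if not any(y > x for y in frequent)}

print(len(frequent))  # 2**6 - 1 = 63 frequent patterns
print(len(closed))    # 2 closed patterns: {a1..a6} (sup 1), {a1,a2,a3} (sup 2)
print(len(maximal))   # 1 max-pattern: {a1..a6}
```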

### Computational Complexity of Frequent Itemset Mining

• How many itemsets may potentially be generated in the worst case?
• The number of frequent itemsets to be generated is sensitive to the minsup threshold
• When minsup is low, there can exist an exponential number of frequent itemsets
• The worst case: M^N, where M is the number of distinct items and N is the maximum transaction length
• The worst-case complexity vs. the expected probability
• Example: suppose Walmart carries 10^4 kinds of products
• The chance of picking up one particular product: 10^(-4)
• The chance of picking up a particular set of 10 products: ~10^(-40)
• What is the chance that this particular set of 10 products appears 10^3 times among 10^9 transactions? (See the back-of-the-envelope calculation below.)
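
A minimal back-of-the-envelope sketch: assuming each transaction independently contains the particular 10-product set with probability ~10^(-40), the count over 10^9 transactions is roughly Poisson with mean λ = 10^(-31), and the dominant term of the tail P(count ≥ 10^3) is λ^k / k!:

```python
import math

p = 1e-40    # chance a random transaction contains the 10-product set
n = 1e9      # number of transactions
lam = p * n  # expected number of occurrences = 1e-31
k = 1000     # being "frequent" here would require ~10^3 occurrences

# log10 of the dominant Poisson tail term lam^k / k!
log10_prob = k * math.log10(lam) - math.lgamma(k + 1) / math.log(10)
print(round(log10_prob))  # about -33568, i.e. probability ~10^-33568:
                          # effectively impossible
```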

### Scalable Frequent Itemset Mining Methods

• Apriori: A Candidate Generation-and-Test Approach
• Improving the Efficiency of Apriori
• FPGrowth: A Frequent Pattern-Growth Approach
• ECLAT: Frequent Pattern Mining with Vertical Data Format

### The Downward Closure Property and Scalable Mining Methods

• The downward closure property of frequent patterns
• Any subset of a frequent itemset must be frequent
• If {beer, diaper, nuts} is frequent, so is {beer, diaper}
• i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
• Scalable mining methods: Three major approaches
• Apriori (Agrawal & Srikant@VLDB’94)
• Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
• Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)

### Apriori: A Candidate Generation & Test Approach

• Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94; Mannila, et al. @KDD’94)
• Method:
• Initially, scan DB once to get frequent 1-itemset
• Generate length (k+1) candidate itemsets from length k frequent itemsets
• Test the candidates against DB
• Terminate when no frequent or candidate set can be generated

### The Apriori Algorithm (Pseudo-Code)

Pseudocode (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

```
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
```
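
A minimal runnable Python version of this pseudocode, as a sketch (function and variable names are my own; the example database mirrors the earlier association-rule slide):

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets from one scan.
    item_counts = Counter(i for t in transactions for i in t)
    Lk = {frozenset([i]) for i, c in item_counts.items() if c >= min_count}
    frequent = {x: item_counts[next(iter(x))] for x in Lk}
    k = 1
    while Lk:
        # Generate C(k+1) by self-joining Lk, then prune by the Apriori
        # principle: every k-subset of a candidate must itself be in Lk.
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count surviving candidates in one scan of the database.
        counts = Counter()
        for t in transactions:
            counts.update(c for c in Ck1 if c <= t)
        Lk = {c for c in Ck1 if counts[c] >= min_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
      {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
print(apriori(db, min_count=3))
# {Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Beer, Diaper}:3
```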

### Implementation of Apriori

• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of candidate generation (traced in the sketch below)
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
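
A sketch of the ordered self-join and prune steps reproducing this example, with itemsets represented as sorted tuples (names are mine):

```python
from itertools import combinations

def gen_candidates(Lk):
    # Itemsets are sorted tuples; p and q join if they agree on all but
    # the last item and p's last item precedes q's (ordered self-join).
    Lset = set(Lk)
    k = len(next(iter(Lset)))
    joined = [p + (q[-1],) for p in Lset for q in Lset
              if p[:-1] == q[:-1] and p[-1] < q[-1]]
    # Prune: every k-subset of a (k+1)-candidate must itself be frequent.
    return [c for c in joined if all(s in Lset for s in combinations(c, k))]

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"),
      ("a","c","e"), ("b","c","d")]
print(gen_candidates(L3))
# [('a','b','c','d')] -- acde is pruned because ('a','d','e') is not in L3
```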

### How to Count Supports of Candidates?

• Why is counting the supports of candidates a problem?
• The total number of candidates can be huge
• One transaction may contain many candidates
• Method:
• Candidate itemsets are stored in a hash-tree
• A leaf node of the hash-tree contains a list of itemsets and counts
• An interior node contains a hash table
• Subset function: finds all the candidates contained in a transaction (simplified sketch below)
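
A real implementation uses the hash-tree described above; the simplified sketch below conveys the subset-function idea with a plain hash set standing in for the tree (all names are mine):

```python
from collections import Counter
from itertools import combinations

def count_supports(transactions, candidates):
    """Count candidate k-itemsets (sorted tuples) in one database scan.

    Simplification: a hash set replaces the hash-tree; the subset
    function enumerates each transaction's k-subsets and probes it.
    """
    cand = set(candidates)
    k = len(next(iter(cand)))
    counts = Counter()
    for t in transactions:
        counts.update(s for s in combinations(sorted(t), k) if s in cand)
    return counts

db = [{"a","b","c","d"}, {"a","c","d"}, {"b","c"}]
print(count_supports(db, [("a","c"), ("b","c"), ("c","d")]))
# each of the three candidate pairs has support count 2
```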

### Candidate Generation: An SQL Implementation

• SQL Implementation of candidate generation
• Suppose the items in L(k-1) are listed in an order
• Step 1: self-joining L(k-1)

```sql
insert into Ck
select p.item1, p.item2, …, p.item(k-1), q.item(k-1)
from   L(k-1) p, L(k-1) q
where  p.item1 = q.item1 and … and p.item(k-2) = q.item(k-2)
  and  p.item(k-1) < q.item(k-1)
```

• Step 2: pruning

```
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in L(k-1)) then delete c from Ck
```
• Use object-relational extensions like UDFs, BLOBs, and Table functions for efficient implementation [S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD’98]

### Further Improvement of the Apriori Method

• Major computational challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Improving Apriori: general ideas
• Reduce passes of transaction database scans
• Shrink number of candidates
• Facilitate support counting of candidates

### Partition: Scan Database Only Twice

• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Scan 1: partition the database and find local frequent patterns
• Scan 2: consolidate global frequent patterns (two scans in total; sketched below)
• A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
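
A minimal two-scan sketch of the partition idea (the brute-force local miner and all names are mine; a real system would mine each in-memory partition with Apriori):

```python
from collections import Counter
from itertools import combinations

def local_frequent(transactions, min_count, max_len=3):
    # Brute-force local miner: count every subset (up to max_len) of
    # every transaction in this in-memory partition.
    counts = Counter()
    for t in transactions:
        for k in range(1, min(len(t), max_len) + 1):
            counts.update(combinations(sorted(t), k))
    return {s for s, c in counts.items() if c >= min_count}

def partition_mine(transactions, min_sup, n_parts=2, max_len=3):
    # Scan 1: any globally frequent itemset must be locally frequent
    # (at the proportional threshold) in at least one partition.
    size = -(-len(transactions) // n_parts)  # ceiling division
    candidates = set()
    for i in range(0, len(transactions), size):
        part = transactions[i:i + size]
        candidates |= local_frequent(part,
                                     max(1, int(min_sup * len(part))), max_len)
    # Scan 2: count the global candidates over the whole database.
    counts = Counter()
    for t in transactions:
        st = set(t)
        counts.update(c for c in candidates if st.issuperset(c))
    return {c: n for c, n in counts.items()
            if n >= min_sup * len(transactions)}

db = [{"a","b"}, {"a","b","c"}, {"a"}, {"b","c"}]
print(partition_mine(db, min_sup=0.5))
# frequent: a:3, b:3, c:2, ab:2, bc:2
```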

### DHP: Reduce the Number of Candidates

• A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent
• Candidates: a, b, c, d, e
• Hash entries (buckets of 2-itemsets), e.g., {ab, ad, ae}, {bd, be, de}, …
• Frequent 1-itemsets: a, b, d, e
• ab is not a candidate 2-itemset if the total count of its bucket {ab, ad, ae} is below the support threshold (sketched below)
• J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD’95
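
A minimal sketch of the bucket-count filter (the hash function and names are mine; DHP also trims transactions, which is omitted here):

```python
from collections import Counter
from itertools import combinations

def dhp_candidate_pairs(transactions, min_count, n_buckets=7):
    # One pass: count 1-items and hash every 2-itemset into a bucket.
    item_counts, buckets = Counter(), Counter()
    for t in transactions:
        item_counts.update(t)
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    L1 = {i for i, c in item_counts.items() if c >= min_count}
    # A pair survives only if both items are frequent AND its bucket count
    # reaches the threshold: a bucket count below min_count proves every
    # pair hashed into that bucket is infrequent.
    return [p for p in combinations(sorted(L1), 2)
            if buckets[hash(p) % n_buckets] >= min_count]

db = [{"a","b"}, {"a","b","c"}, {"b","d"}]
print(dhp_candidate_pairs(db, min_count=2))  # [('a', 'b')]
```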

### Sampling for Frequent Patterns

• Select a sample of the original database and mine frequent patterns within the sample using Apriori
• Scan the database once to verify the frequent itemsets found in the sample; only the border of the closure of the frequent patterns needs to be checked
• Example: check abcd instead of ab, ac, …, etc.
• Scan the database again to find missed frequent patterns (simplified sketch below)
• H. Toivonen. Sampling large databases for association rules. In VLDB’96
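
A simplified sketch of the sample-then-verify idea (a brute-force sample miner with a lowered threshold; Toivonen's algorithm additionally checks the negative border to detect misses, which is omitted here, and all names are mine):

```python
import random
from collections import Counter
from itertools import combinations

def sample_mine(transactions, min_sup, frac=0.2, slack=0.8, max_len=3):
    # Mine a random sample at a lowered threshold (min_sup * slack) so
    # itemsets frequent in the full DB are unlikely to be missed.
    sample = random.sample(transactions, max(1, int(len(transactions) * frac)))
    counts = Counter()
    for t in sample:
        for k in range(1, min(len(t), max_len) + 1):
            counts.update(combinations(sorted(t), k))
    threshold = max(1, int(min_sup * slack * len(sample)))
    candidates = {s for s, c in counts.items() if c >= threshold}
    # One scan of the full database verifies the candidates' true supports.
    full = Counter()
    for t in transactions:
        st = set(t)
        full.update(c for c in candidates if st.issuperset(c))
    return {c: n for c, n in full.items()
            if n >= min_sup * len(transactions)}
```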

### Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

• Bottlenecks of the Apriori approach
• Candidate generation and test
• Often generates a huge number of candidates
• The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00)
• Depth-first search
• Avoid explicit candidate generation
• Major philosophy: grow long patterns from short ones using local frequent items only (sketched below)
• “abc” is a frequent pattern
• Get all transactions having “abc”, i.e., project the DB on abc: DB|abc
• “d” is a local frequent item in DB|abc → abcd is a frequent pattern
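
A minimal recursive sketch of this philosophy via database projection, without the FP-tree (real FP-growth orders items by descending frequency and mines a compact tree; all names here are mine):

```python
from collections import Counter

def pattern_growth(db, min_count, suffix=()):
    # For each locally frequent item x, emit suffix+(x,), then recurse on
    # the database projected onto x (keeping only items ordered after x,
    # so every pattern is generated exactly once).
    counts = Counter(i for t in db for i in t)
    freq = {}
    for x in sorted(i for i, c in counts.items() if c >= min_count):
        pattern = suffix + (x,)
        freq[pattern] = counts[x]
        projected = [{i for i in t if i > x} for t in db if x in t]
        freq.update(pattern_growth(projected, min_count, pattern))
    return freq

db = [{"a","b","c"}, {"a","b"}, {"b","c"}]
print(pattern_growth(db, min_count=2))
# {('a',): 2, ('a','b'): 2, ('b',): 3, ('b','c'): 2, ('c',): 2}
```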

### Partition Patterns and Databases

• Frequent patterns can be partitioned into subsets according to f-list
• F-list = f-c-a-b-m-p
• Patterns containing p
• Patterns having m but no p
• …
• Patterns having c but no a, b, m, or p
• Pattern f
• Completeness and non-redundancy

### Find Patterns Having P From P-conditional Database

• Starting at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item p
• Accumulate all the transformed prefix paths of item p to form p’s conditional pattern base

### From Conditional Pattern-bases to Conditional FP-trees

• For each pattern-base
• Accumulate the count for each item in the base
• Construct the FP-tree for the frequent items of the pattern base

### A Special Case: Single Prefix Path in FP-tree

• Suppose a (conditional) FP-tree T has a shared single prefix-path P
• Mining can be decomposed into two parts
• Reduction of the single prefix path into one node
• Concatenation of the mining results of the two parts

### Benefits of the FP-tree Structure

• Completeness
• Preserve complete information for frequent pattern mining
• Never break a long pattern of any transaction
• Compactness
• Reduce irrelevant info—infrequent items are gone
• Items in frequency descending order: the more frequently occurring, the more likely to be shared
• Never larger than the original database (not counting node-links and count fields)

### The Frequent Pattern Growth Mining Method

• Idea: Frequent pattern growth
• Recursively grow frequent patterns by pattern and database partition
• Method
• For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
• Repeat the process on each newly created conditional FP-tree
• Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern

### Scaling FP-growth by Database Projection

• What if the FP-tree cannot fit in memory?
• DB projection
• First partition a database into a set of projected DBs
• Then construct and mine FP-tree for each projected DB
• Parallel projection vs. partition projection techniques
• Parallel projection
• Project the DB in parallel for each frequent item
• Parallel projection is space costly
• All the partitions can be processed in parallel
• Partition projection
• Partition the DB based on the ordered frequent items
• Passing the unprocessed parts to the subsequent partitions

### Advantages of the Pattern Growth Approach

• Divide-and-conquer:
• Decompose both the mining task and DB according to the frequent patterns obtained so far
• Lead to focused search of smaller databases
• Other factors
• No candidate generation, no candidate test
• Compressed database: FP-tree structure
• No repeated scan of entire database
• Basic ops: counting local freq items and building sub FP-tree, no pattern search and matching
• A good open-source implementation and refinement of FPGrowth
• FPGrowth+ (Grahne and J. Zhu, FIMI'03)

### Further Improvements of Mining Methods

• AFOPT (Liu, et al. @ KDD’03)
• A “push-right” method for mining condensed frequent pattern (CFP) tree
• Carpenter (Pan, et al. @ KDD’03)
• Mine datasets with few rows but numerous columns
• Construct a row-enumeration tree for efficient mining
• FPgrowth+ (Grahne and Zhu, FIMI’03)
• Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, Nov. 2003
• TD-Close (Liu, et al, SDM’06)

### Extension of Pattern Growth Mining Methodology

• Mining closed frequent itemsets and max-patterns
• CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, FIMI’03)
• Mining sequential patterns
• PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)
• Mining graph patterns
• gSpan (ICDM’02), CloseGraph (KDD’03)
• Constraint-based mining of frequent patterns
• Convertible constraints (ICDE’01), gPrune (PAKDD’03)
• Computing iceberg data cubes with complex measures
• H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)
• Pattern-growth-based Clustering
• MaPle (Pei, et al., ICDM’03)
• Pattern-Growth-Based Classification
• Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)

### ECLAT: Mining by Exploring Vertical Data Format

• Vertical format: t(AB) = {T11, T25, …}
• tid-list: list of trans.-ids containing an itemset
• Deriving frequent patterns based on vertical intersections (sketched below)
• t(X) = t(Y): X and Y always happen together
• t(X) ⊂ t(Y): transaction having X always has Y
• Using diffset to accelerate mining
• Only keep track of differences of tids
• t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
• Diffset (XY, X) = {T2}
• Eclat (Zaki et al. @KDD’97)
• Mining Closed patterns using vertical format: CHARM (Zaki & Hsiao@SDM’02)
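
A minimal tid-list ECLAT sketch using plain intersections (diffsets, as described above, would store only the tid differences; names are mine):

```python
def verticalize(transactions):
    # Horizontal -> vertical format: item -> set of transaction ids.
    v = {}
    for tid, t in enumerate(transactions):
        for item in t:
            v.setdefault(item, set()).add(tid)
    return v

def eclat(tidsets, min_count, prefix=()):
    # Depth-first: extend the prefix with each frequent item; the support
    # of an extended itemset is the size of the intersected tid-list.
    freq = {}
    items = sorted(i for i, tids in tidsets.items() if len(tids) >= min_count)
    for n, x in enumerate(items):
        freq[prefix + (x,)] = len(tidsets[x])
        ext = {y: tidsets[x] & tidsets[y] for y in items[n + 1:]}
        freq.update(eclat(ext, min_count, prefix + (x,)))
    return freq

db = [{"a","b"}, {"a","b","c"}, {"a"}]
print(eclat(verticalize(db), min_count=2))
# {('a',): 3, ('a','b'): 2, ('b',): 2}
```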

### Mining Frequent Closed Patterns: CLOSET

• Flist: list of all frequent items in support ascending order
• Flist: d-a-f-e-c
• Divide search space
• Patterns having d
• Patterns having d but no a, etc.
• Find frequent closed pattern recursively
• Every transaction having d also has cfa → cfad is a frequent closed pattern
• J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00.

### CLOSET+: Mining Closed Itemsets by Pattern-Growth

• Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
• Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X’s descendants in the set-enumeration tree can be pruned
• Hybrid tree projection
• Bottom-up physical tree-projection
• Top-down pseudo tree-projection
• Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header table at higher levels
• Efficient subset checking

### CHARM: Mining by Exploring Vertical Data Format

• Vertical format: t(AB) = {T11, T25, …}
• tid-list: list of trans.-ids containing an itemset
• Deriving closed patterns based on vertical intersections
• t(X) = t(Y): X and Y always happen together
• t(X) ⊂ t(Y): transaction having X always has Y
• Using diffset to accelerate mining
• Only keep track of differences of tids
• t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
• Diffset (XY, X) = {T2}
• Eclat/MaxEclat (Zaki et al. @KDD’97), VIPER(P. Shenoy et al.@SIGMOD’00), CHARM (Zaki & Hsiao@SDM’02)

### Interestingness Measure: Correlations (Lift)

• Example: play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
• Measure of dependent/correlated events: lift (defined and computed below)
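
Lift compares the observed co-occurrence with what independence would predict; the values below are derived from the percentages above (B = plays basketball, C = eats cereal, and s(B) = 0.40 / 0.667 ≈ 0.6 follows from the stated support and confidence):

$lift(A,B) = \frac{P(A \cup B)}{P(A)P(B)}$

$lift(B, C) = \frac{0.40}{0.60 \times 0.75} \approx 0.89 < 1$ (B and C are negatively correlated)

$lift(B, \neg C) = \frac{0.20}{0.60 \times 0.25} \approx 1.33 > 1$ (B and ¬C are positively correlated)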

### Are Lift and χ² Good Measures of Correlation?

• Support and confidence are not good indicators of correlation
• Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD’02)
• Which are good ones?

### Which Null-Invariant Measure Is Better?

• IR (Imbalance Ratio): measures the imbalance of two itemsets A and B in rule implications

$IR(A,B) = \frac{|sup(A) - sup(B)|}{sup(A) + sup(B) - sup(A \cup B)}$

• Kulczynski and the Imbalance Ratio (IR) together present a clear picture for all three datasets, D4 through D6 (both measures are computed in the sketch below)
• D4 is balanced & neutral
• D5 is imbalanced & neutral
• D6 is very imbalanced & neutral
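
A small sketch computing both measures from support counts. The Kulczynski definition, Kulc(A,B) = ½(P(A|B) + P(B|A)), follows the null-invariance literature cited in the references (Wu, Chen & Han, PKDD’07); the numbers are illustrative, not the slide’s actual D4–D6 data:

```python
def kulczynski(sup_a, sup_b, sup_ab):
    # Kulc(A,B) = 0.5 * (P(B|A) + P(A|B)); null-invariant.
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

def imbalance_ratio(sup_a, sup_b, sup_ab):
    # IR(A,B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B))
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# Balanced & neutral: Kulc = 0.5, IR = 0
print(kulczynski(1000, 1000, 500), imbalance_ratio(1000, 1000, 500))
# Very imbalanced yet still neutral: Kulc ≈ 0.5, IR ≈ 0.99
print(kulczynski(100, 10000, 99), imbalance_ratio(100, 10000, 99))
```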

### Summary

• Basic concepts: association rules, support-confidence framework, closed and max-patterns
• Scalable frequent pattern mining methods
• Apriori (Candidate generation & test)
• Projection-based (FPgrowth, CLOSET+, ...)
• Vertical format approach (ECLAT, CHARM, ...)
• Which patterns are interesting?
• Pattern evaluation methods

### References

• Basic Concepts of Frequent Pattern Mining (Association Rules)
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93.
• (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
• (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99.
• (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95
• Apriori and Its Improvements
• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
• J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
• H. Toivonen. Sampling large databases for association rules. VLDB'96.
• S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97.
• S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.

### References (cont'd)

• Depth-First, Projection-Based FP Mining
• R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing:02.
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD’00.
• J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02.
• J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM'02.
• J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD'03.
• G. Liu, H. Lu, W. Lou, J. X. Yu. On Computing, Storing and Querying Frequent Patterns. KDD'03.
• G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, Nov. 2003

### References (cont'd)

• Vertical Format and Row Enumeration Methods
• M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. DAMI:97.
• Zaki and Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining, SDM'02.
• C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. KDD’02.
• F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding Closed Patterns in Long Biological Datasets. KDD'03.
• H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.

### References (cont'd)

• Mining Correlations and Interesting Rules
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
• S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.
• C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
• P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02.
• E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.
• T. Wu, Y. Chen and J. Han, “Association Mining in Large Databases: A Re-Examination of Its Measures”, PKDD'07
• Frequent Pattern Mining Applications
• Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE’98.
• H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB'99.
• T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or How to Build a Data Quality Browser. SIGMOD'02.
• K. Wang, S. Zhou, J. Han. Profit Mining: From Patterns to Actions. EDBT’02.
