
Agenda
- Basic Concepts
- Frequent Itemset Mining Methods
- Which Patterns Are Interesting?—Pattern Evaluation Methods
- Summary
What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: Finding inherent regularities in data
- What products were often purchased together?— Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?
- Applications
- Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
- Freq. pattern: An intrinsic and important property of datasets
- Foundation for many essential data mining tasks
- Association, correlation, and causality analysis
- Sequential, structural (e.g., sub-graph) patterns
- Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
- Classification: discriminative, frequent pattern analysis
- Cluster analysis: frequent pattern-based clustering
- Data warehousing: iceberg cube and cube-gradient
- Semantic data compression: fascicles
- Broad applications
Basic Concepts: Frequent Patterns
Basic Concepts: Association Rules
Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns, e.g., \( \{a_{1}, \ldots, a_{100}\} \) contains
\[ \binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30} \]
sub-patterns!
- Solution: Mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
- Closed patterns are a lossless compression of frequent patterns (a small sketch follows this list)
- Reducing the # of patterns and rules
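A minimal Python sketch of the two definitions, assuming the frequent itemsets and their supports are already available as a dict from frozenset to count (the input format and names are illustrative, not from the slides):

from typing import Dict, FrozenSet, Tuple

def closed_and_max(freq: Dict[FrozenSet[str], int]) -> Tuple[dict, dict]:
    """freq: dict mapping frozenset(items) -> support count."""
    closed, maximal = {}, {}
    for X, sup in freq.items():
        # X is closed: no proper superset of X has the same support
        if not any(X < Y and freq[Y] == sup for Y in freq):
            closed[X] = sup
        # X is maximal: no proper superset of X is frequent at all
        if not any(X < Y for Y in freq):
            maximal[X] = sup
    return closed, maximal

# Toy check: {a,b} is absorbed by {a,b,c} (same support), so only
# {a}:4 and {a,b,c}:3 are closed; {a,b,c} is the only max-pattern.
freq = {frozenset("a"): 4, frozenset("ab"): 3, frozenset("abc"): 3}
print(closed_and_max(freq))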
Closed Patterns and Max-Patterns
- Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, Min_sup = 1
- What is the set of closed itemsets?
- <a1, …, a100>: 1
- <a1, …, a50>: 2
- What is the set of max-patterns?
- <a1, …, a100>: 1
- What is the set of all frequent patterns?
- All 2^100 - 1 non-empty subsets of {a1, …, a100}: far too many to enumerate!
Computational Complexity of Frequent Itemset Mining
- How many itemsets may potentially be generated in the worst case?
- The number of frequent itemsets to be generated is sensitive to the minsup threshold
- When minsup is low, there can be an exponential number of frequent itemsets
- The worst case: M^N, where M is the # of distinct items and N is the max length of transactions
- The worst-case complexity vs. the expected probability
- Ex. Suppose Walmart has 10^4 kinds of products
- The chance of picking up one particular product: 10^(-4)
- The chance of picking up a particular set of 10 products: ~10^(-40)
- What is the chance that this particular set of 10 products appears 10^3 times in 10^9 transactions?
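Under the independence assumption above, the expected number of occurrences is easy to pin down:

\[ E[\text{count}] = 10^{9} \times 10^{-40} = 10^{-31}, \]

so the chance of this particular set appearing even once, let alone 10^3 times, is negligible: the worst-case bound says little about what happens in realistic data.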
Scalable Frequent Itemset Mining Methods
- Apriori: A Candidate Generation-and-Test Approach
- Improving the Efficiency of Apriori
- FPGrowth: A Frequent Pattern-Growth Approach
- ECLAT: Frequent Pattern Mining with Vertical Data Format
The Downward Closure Property and Scalable Mining Methods
- The downward closure property of frequent patterns
- Any subset of a frequent itemset must be frequent
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}
- i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: Three major approaches
- Apriori (Agrawal & Srikant@VLDB’94)
- Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
- Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
Apriori: A Candidate Generation & Test Approach
- Apriori pruning principle: If there is any itemset that is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’94)
- Method:
- Initially, scan DB once to get frequent 1-itemset
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Test the candidates against DB
- Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != Ø; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Implementation of Apriori
- How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning
- Example of Candidate-generation
- L3={abc, abd, acd, ace, bcd}
- Self-joining: L3*L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4 = {abcd}
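A minimal Python sketch of this join-and-prune step, with itemsets represented as sorted tuples (the function name and representation are assumptions for illustration):

from itertools import combinations

def apriori_gen(Lk):
    """Generate candidate (k+1)-itemsets from the frequent k-itemsets Lk."""
    Lk = set(Lk)
    k = len(next(iter(Lk)))
    joined = set()
    # Step 1: self-join -- merge pairs that agree on the first k-1 items
    for p in Lk:
        for q in Lk:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    # Step 2: prune -- drop candidates having an infrequent k-subset
    return {c for c in joined if all(s in Lk for s in combinations(c, k))}

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3))  # {('a','b','c','d')}; acde is pruned since ade not in L3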
How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
- The total number of candidates can be huge
- One transaction may contain many candidates
- Method:
- Candidate itemsets are stored in a hash-tree
- Leaf node of hash-tree contains a list of itemsets and counts
- Interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction
Counting Supports of Candidates Using Hash Tree
Candidate Generation: An SQL Implementation
- SQL Implementation of candidate generation
- Suppose the items in each itemset of L(k-1) are listed in a fixed order
- Step 1: self-joining L(k-1)
insert into Ck
select p.item1, p.item2, …, p.item(k-1), q.item(k-1)
from L(k-1) p, L(k-1) q
where p.item1 = q.item1 and … and p.item(k-2) = q.item(k-2)
  and p.item(k-1) < q.item(k-1)
- Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in L(k-1)) then delete c from Ck
- Use object-relational extensions like UDFs, BLOBs, and table functions for efficient implementation [S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD’98]
Further Improvement of the Apriori Method
- Major computational challenges
- Multiple scans of transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
- Improving Apriori: general ideas
- Reduce passes of transaction database scans
- Shrink number of candidates
- Facilitate support counting of candidates
Partition: Scan Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Scan 1: partition database and find local frequent patterns
- Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
DHP: Reduce the Number of Candidates
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Candidates: a, b, c, d, e
- Hash entries
- {ab, ad, ae}
- {bd, be, de}
- …
- Frequent 1-itemset: a, b, d, e
- ab is not a candidate 2-itemset if the total count of its hash bucket {ab, ad, ae} is below the support threshold (see the sketch after this list)
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD’95
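A rough Python sketch of the hashing idea, assuming a fixed number of buckets (helper names and toy data are invented for illustration):

from collections import Counter
from itertools import combinations

def dhp_first_scan(db, minsup, nbuckets=7):
    """While counting 1-itemsets, hash every 2-subset of each transaction
    into a bucket; a pair can only be a candidate if its bucket is frequent."""
    item_cnt, bucket_cnt = Counter(), Counter()
    for t in db:
        item_cnt.update(t)
        for pair in combinations(sorted(t), 2):
            bucket_cnt[hash(pair) % nbuckets] += 1
    L1 = {i for i, c in item_cnt.items() if c >= minsup}
    def may_be_frequent(pair):  # a necessary, not sufficient, condition
        return (set(pair) <= L1
                and bucket_cnt[hash(pair) % nbuckets] >= minsup)
    return L1, may_be_frequent

db = [{"a","b","d"}, {"a","b","e"}, {"b","d","e"}, {"a","d","e"}]
L1, ok = dhp_first_scan(db, minsup=2)
print(ok(("a","b")))  # True: ab's bucket count is at least ab's own 2 occurrences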
Sampling for Frequent Patterns
- Select a sample of original database, mine frequent patterns within sample using Apriori
- Scan database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
- Example: check abcd instead of ab, ac, …, etc.
- Scan database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB’96
DIC: Reduce Number of Scans
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
- Bottlenecks of the Apriori approach
- Breadth-first (i.e., level-wise) search
- Candidate generation and test
- Often generates a huge number of candidates
- The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
- Depth-first search
- Avoid explicit candidate generation
- Major philosophy: Grow long patterns from short ones using local frequent items only
- “abc” is a frequent pattern
- Get all transactions having “abc”, i.e., project DB on abc: DB|abc
- “d” is a local frequent item in DB|abc → abcd is a frequent pattern
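A tiny Python sketch of this projection step (DB|abc); the set representation and function names are assumptions for illustration:

from collections import Counter

def project(db, pattern):
    """DB|pattern: transactions containing the pattern, with it removed."""
    p = set(pattern)
    return [set(t) - p for t in db if p <= set(t)]

def local_frequent(db, minsup):
    cnt = Counter(i for t in db for i in t)
    return {i for i, c in cnt.items() if c >= minsup}

db = [{"a","b","c","d"}, {"a","b","c","e"}, {"a","b","c","d"}, {"b","c"}]
print(local_frequent(project(db, "abc"), minsup=2))  # {'d'}: so abcd is frequent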
Construct FP-tree from a Transaction Database
Partition Patterns and Databases
- Frequent patterns can be partitioned into subsets according to f-list
- F-list = f-c-a-b-m-p
- Patterns containing p
- Patterns having m but no p
- …
- Patterns having c but none of a, b, m, p
- Pattern f
- Completeness and non-redundancy
Find Patterns Having P From P-conditional Database
- Starting at the frequent item header table in the FP-tree
- Traverse the FP-tree by following the link of each frequent item p
- Accumulate all transformed prefix paths of item p to form p’s conditional pattern base
From Conditional Pattern-bases to Conditional FP-trees
- For each pattern-base
- Accumulate the count for each item in the base
- Construct the FP-tree for the frequent items of the pattern base
Recursion: Mining Each Conditional FP-tree
A Special Case: Single Prefix Path in FP-tree
- Suppose a (conditional) FP-tree T has a shared single prefix-path P
- Mining can be decomposed into two parts
- Reduction of the single prefix path into one node
- Concatenation of the mining results of the two parts
Benefits of the FP-tree Structure
- Completeness
- Preserve complete information for frequent pattern mining
- Never break a long pattern of any transaction
- Compactness
- Reduce irrelevant info—infrequent items are gone
- Items in frequency descending order: the more frequently occurring, the more likely to be shared
- Never larger than the original database (not counting node-links and count fields)
The Frequent Pattern Growth Mining Method
- Idea: Frequent pattern growth
- Recursively grow frequent patterns by pattern and database partition
- Method
- For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
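The whole loop fits in a short program. Below is a compact, unoptimized Python sketch of FP-tree construction plus recursive mining over conditional pattern bases; it is a didactic rendering under simple assumptions, not the implementation from the paper:

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_tree(db, minsup):
    """db: list of (items, count) pairs. Returns root, node-links, item supports."""
    freq = Counter()
    for t, c in db:
        for i in t:
            freq[i] += c
    freq = {i: c for i, c in freq.items() if c >= minsup}
    root, links = Node(None, None), defaultdict(list)
    for t, c in db:
        node = root
        # keep frequent items only, in descending-frequency (f-list) order
        for i in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                links[i].append(node.children[i])
            node = node.children[i]
            node.count += c
    return root, links, freq

def fp_growth(db, minsup, suffix=()):
    """Yield every (frequent pattern, support) by recursive tree projection."""
    _, links, freq = build_tree(db, minsup)
    for item in freq:
        pattern = (item,) + suffix
        yield pattern, freq[item]
        cond_db = []  # item's conditional pattern base: its prefix paths
        for node in links[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                cond_db.append((path, node.count))
        yield from fp_growth(cond_db, minsup, pattern)

db = [(list(t), 1) for t in ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]]
for pat, sup in fp_growth(db, minsup=3):
    print(pat, sup)  # e.g. ('p',) 3 and ('c', 'p') 3 among the output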
Scaling FP-growth by Database Projection
- What if the FP-tree cannot fit in memory?
- DB projection
- First partition a database into a set of projected DBs
- Then construct and mine FP-tree for each projected DB
- Parallel projection vs. partition projection techniques
- Parallel projection
- Project the DB in parallel for each frequent item
- Parallel projection is space costly
- All the partitions can be processed in parallel
- Partition projection
- Partition the DB based on the ordered frequent items
- Passing the unprocessed parts to the subsequent partitions
Partition-Based Projection
FP-Growth vs. Apriori: Scalability With the Support Threshold
FP-Growth vs. Tree-Projection: Scalability with the Support Threshold
Advantages of the Pattern Growth Approach
- Divide-and-conquer:
- Decompose both the mining task and DB according to the frequent patterns obtained so far
- Lead to focused search of smaller databases
- Other factors
- No candidate generation, no candidate test
- Compressed database: FP-tree structure
- No repeated scan of entire database
- Basic operations: counting local frequent items and building sub-FP-trees; no pattern search and matching
- A good open-source implementation and refinement of FPGrowth
- FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Further Improvements of Mining Methods
- AFOPT (Liu, et al. @ KDD’03)
- A “push-right” method for mining condensed frequent pattern (CFP) tree
- Carpenter (Pan, et al. @ KDD’03)
- Mine data sets with small rows but numerous columns
- Construct a row-enumeration tree for efficient mining
- FPgrowth+ (Grahne and Zhu, FIMI’03)
- Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, Nov. 2003
- TD-Close (Liu, et al, SDM’06)
Extension of Pattern Growth Mining Methodology
- Mining closed frequent itemsets and max-patterns
- CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)
- Mining sequential patterns
- PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)
- Mining graph patterns
- gSpan (ICDM’02), CloseGraph (KDD’03)
- Constraint-based mining of frequent patterns
- Convertible constraints (ICDE’01), gPrune (PAKDD’03)
- Computing iceberg data cubes with complex measures
- H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)
- Pattern-growth-based Clustering
- MaPle (Pei, et al., ICDM’03)
- Pattern-Growth-Based Classification
- Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)
ECLAT: Mining by Exploring Vertical Data Format
- Vertical format: t(AB) = {T11, T25, …}
- tid-list: list of trans.-ids containing an itemset
- Deriving frequent patterns based on vertical intersections
- t(X) = t(Y): X and Y always happen together
- t(X) ⊂ t(Y): transaction having X always has Y
- Using diffset to accelerate mining
- Only keep track of differences of tids (see the sketch after this list)
- t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
- Diffset (XY, X) = {T2}
- Eclat (Zaki et al. @KDD’97)
- Mining Closed patterns using vertical format: CHARM (Zaki & Hsiao@SDM’02)
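A minimal Python sketch of vertical mining by tid-list intersection (a diffset variant would store set differences instead; the data and names here are illustrative):

def eclat(tidlists, minsup, prefix=()):
    """tidlists: dict item -> set of transaction ids (vertical format)."""
    items = sorted(tidlists)
    for i, item in enumerate(items):
        tids = tidlists[item]
        if len(tids) < minsup:
            continue
        pattern = prefix + (item,)
        yield pattern, len(tids)
        # grow the pattern: intersect the tid-list with each later item's
        yield from eclat({j: tidlists[j] & tids for j in items[i+1:]},
                         minsup, pattern)

vdb = {"a": {1, 2, 4}, "b": {1, 2, 3}, "d": {1, 3, 4}, "e": {2, 3, 4}}
for pat, sup in eclat(vdb, minsup=2):
    print(pat, sup)  # e.g. ('a', 'b') 2; diffset(ab, a) would store just {4}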
Mining Frequent Closed Patterns: CLOSET
- Flist: list of all frequent items in support ascending order
- Flist: d-a-f-e-c
- Divide search space
- Patterns having d
- Patterns having d but no a, etc.
- Find frequent closed pattern recursively
- Every transaction having d also has cfa → cfad is a frequent closed pattern
- J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00.
CLOSET+: Mining Closed Itemsets by Pattern-Growth
- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), X and all of X’s descendants in the set enumeration tree can be pruned
- Hybrid tree projection
- Bottom-up physical tree-projection
- Top-down pseudo tree-projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header table at higher levels
- Efficient subset checking
MaxMiner: Mining Max-Patterns
CHARM: Mining by Exploring Vertical Data Format
- Same vertical tid-list format and diffset techniques as Eclat (above)
- Deriving closed patterns based on vertical intersections
- t(X) = t(Y): X and Y always happen together
- t(X) ⊂ t(Y): transaction having X always has Y
- Eclat/MaxEclat (Zaki et al. @KDD’97), VIPER(P. Shenoy et al.@SIGMOD’00), CHARM (Zaki & Hsiao@SDM’02)
Visualization of Association Rules: Plane Graph

Visualization of Association Rules: Rule Graph

Visualization of Association Rules (SGI/MineSet 3.0)

Interestingness Measure: Correlations (Lift)
- play basketball ⇒ eat cereal [40%, 66.7%] is misleading
- The overall % of students eating cereal is 75% > 66.7%.
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift
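Plugging the slide’s numbers into lift makes the negative correlation explicit (here P(A ∪ B) denotes, as usual in this context, the probability that a transaction contains both A and B):

\[ lift(A \Rightarrow B) = \frac{P(A \cup B)}{P(A)P(B)} = \frac{conf(A \Rightarrow B)}{sup(B)} = \frac{0.667}{0.75} \approx 0.89 < 1, \]

so playing basketball and eating cereal are negatively correlated despite the seemingly high confidence.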
Are Lift and χ² Good Measures of Correlation?
- “Buy walnuts ⇒ buy milk [1%, 80%]” is misleading if 85% of customers buy milk
- Support and confidence are not good to indicate correlations
- Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD’02)
- Which are good ones?

Null-Invariant Measures

Comparison of Interestingness Measures

Analysis of DBLP Coauthor Relationships

Which Null-Invariant Measure Is Better?
- IR (Imbalance Ratio): measures the imbalance of two itemsets A and B in rule implications
\[ IR(A,B) = \frac{|sup(A) - sup(B)|}{sup(A) + sup(B) - sup(A \cup B)} \]
- Kulczynski and Imbalance Ratio (IR) together present a clear picture for all three datasets D4 through D6 (a hypothetical worked example follows this list)
- D4 is balanced & neutral
- D5 is imbalanced & neutral
- D6 is very imbalanced & neutral
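As a sanity check of the two measures together (the numbers below are hypothetical, chosen for illustration rather than taken from D4 through D6): with sup(A) = sup(B) = 1000 and sup(A ∪ B) = 500,

\[ Kulc(A,B) = \frac{1}{2}\left(\frac{sup(A \cup B)}{sup(A)} + \frac{sup(A \cup B)}{sup(B)}\right) = \frac{1}{2}(0.5 + 0.5) = 0.5, \qquad IR(A,B) = \frac{|1000 - 1000|}{1000 + 1000 - 500} = 0, \]

i.e., a neutral (Kulc = 0.5) and perfectly balanced (IR = 0) pair, the D4-style case.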

Summary
- Basic concepts: association rules, support-confidence framework, closed and max-patterns
- Scalable frequent pattern mining methods
- Apriori (Candidate generation & test)
- Projection-based (FPgrowth, CLOSET+, ...)
- Vertical format approach (ECLAT, CHARM, ...)
- Which patterns are interesting?
- Pattern evaluation methods
References
- Basic Concepts of Frequent Pattern Mining (Association Rules)
- R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93.
- (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
- (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99.
- (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95
- Apriori and Its Improvements
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
- J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
- H. Toivonen. Sampling large databases for association rules. VLDB'96.
- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.
References (cont')
- Depth-First, Projection-Based FP Mining
- R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing:02.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation . SIGMOD’ 00.
- J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02.
- J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM'02.
- J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD'03.
- G. Liu, H. Lu, W. Lou, J. X. Yu. On Computing, Storing and Querying Frequent Patterns. KDD'03.
- G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, Nov. 2003
References (cont')
- Vertical Format and Row Enumeration Methods
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. DAMI:97.
- Zaki and Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining, SDM'02.
- C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. KDD’02.
- F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding Closed Patterns in Long Biological Datasets. KDD'03.
- H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
References (cont')
- Mining Correlations and Interesting Rules
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
- S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.
- C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
- P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02.
- E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.
- T. Wu, Y. Chen and J. Han, “Association Mining in Large Databases: A Re-Examination of Its Measures”, PKDD'07
- Frequent Pattern Mining Applications
- Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE’98.
- H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB'99.
- T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or How to Build a Data Quality Browser. SIGMOD'02.
- K. Wang, S. Zhou, J. Han. Profit Mining: From Patterns to Actions. EDBT’02.
