Data Mining
Introduction
Agenda
Why Data Mining?
Why Data Mining?
Evolution of Sciences: New Data Science Era
What Is Data Mining?
What Is Data Mining?
Knowledge Discovery (KDD) Process
Example: A Web Mining Framework
Data Mining in Business Intelligence
KDD Process: A Typical View from ML and Statistics
Which View Do You Prefer?
A Multi-Dimensional View of Data Mining
Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
Data Mining: On What Kinds of Data?
What Kinds of Patterns Can Be Mined?
Data Mining Function: Generalization
Data Mining Function: Association and Correlation Analysis
Data Mining Function: Classification
Data Mining Function: Cluster Analysis
Data Mining Function: Outlier Analysis
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Structure and Network Analysis
Evaluation of Knowledge
What Kinds of Technologies Are Used?
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
What Kinds of Applications Are Targeted?
Applications of Data Mining
Major Issues in Data Mining
Major Issues in Data Mining
Major Issues in Data Mining (cont'd)
A Brief History of Data Mining and Data Mining Society
A Brief History of Data Mining Society
Conferences and Journals on Data Mining
Where to Find References? DBLP, CiteSeer, Google
Recommended Reference Books
Summary
Getting to Know Your Data
Agenda
Data Objects and Attribute Types
Types of Data Sets
Important Characteristics of Structured Data
Data Objects
Attributes
Attribute Types
Numeric Attribute Types
Discrete vs. Continuous Attributes
Basic Statistical Descriptions of Data
Basic Statistical Descriptions of Data
Measuring the Central Tendency
Measuring the Central Tendency
Measuring the Central Tendency (cont'd)
Symmetric vs. Skewed Data
Measuring the Dispersion of Data
Measuring the Dispersion of Data
Boxplot Analysis
Visualization of Data Dispersion: 3-D Boxplots
Properties of Normal Distribution Curve
Graphic Displays of Basic Statistical Descriptions
Histogram Analysis
Histograms Often Tell More than Boxplots
Quantile Plot
Quantile-Quantile (Q-Q) Plot
Scatter Plot
Positively and Negatively Correlated Data
Uncorrelated Data
Data Visualization
Data Visualization
Pixel-Oriented Visualization Techniques
Pixel-Oriented Visualization Techniques
Laying Out Pixels in Circle Segments
Geometric Projection Visualization Techniques
Geometric Projection Visualization Techniques
Direct Data Visualization
Scatterplot Matrices
Landscapes
Parallel Coordinates
Parallel Coordinates of a Data Set
Icon-Based Visualization Techniques
Icon-Based Visualization Techniques
Chernoff Faces
Stick Figure
Hierarchical Visualization Techniques
Hierarchical Visualization Techniques
Dimensional Stacking
Dimensional Stacking
Worlds-within-Worlds
Tree-Map
InfoCube
Three-D Cone Trees
Visualizing Complex Data and Relations
Measuring Data Similarity and Dissimilarity
Similarity and Dissimilarity
Data Matrix and Dissimilarity Matrix
Proximity Measure for Nominal Attributes
Proximity Measure for Binary Attributes
Dissimilarity between Binary Variables
Standardizing Numeric Data
Example: Data Matrix and Dissimilarity Matrix
Distance on Numeric Data: Minkowski Distance
Special Cases of Minkowski Distance
Example: Minkowski Distance
Ordinal Variables
Attributes of Mixed Type
Cosine Similarity
Example: Cosine Similarity
Summary
References
Data Preprocessing
Agenda
Data Preprocessing: An Overview
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
Data Cleaning
Data Cleaning
Incomplete (Missing) Data
How to Handle Missing Data?
Noisy Data
How to Handle Noisy Data?
Data Cleaning as a Process
Data Integration
Data Integration
Handling Redundancy in Data Integration
Correlation Analysis (Nominal Data)
Chi-Square Calculation: An Example
Correlation Analysis (Numeric Data)
Visually Evaluating Correlation
Correlation (viewed as linear relationship)
Covariance (Numeric Data)
Co-Variance: An Example
Data Reduction
Data Reduction Strategies
Data Reduction 1: Dimensionality Reduction
Mapping Data to a New Space
What Is Wavelet Transform?
Wavelet Transformation
Wavelet Decomposition
Why Wavelet Transform?
Principal Component Analysis (PCA)
Principal Component Analysis (Steps)
Attribute Subset Selection
Heuristic Search in Attribute Selection
Attribute Creation (Feature Generation)
Data Reduction 2: Numerosity Reduction
Parametric Data Reduction: Regression and Log-Linear Models
Regression Analysis
Regression Analysis and Log-Linear Models
Histogram Analysis
Clustering
Sampling
Types of Sampling
Sampling: With or without Replacement
Sampling: Cluster or Stratified Sampling
Data Cube Aggregation
Data Reduction 3: Data Compression
Data Compression
Data Transformation and Data Discretization
Data Transformation
Normalization
Discretization
Data Discretization Methods
Simple Discretization: Binning
Binning Methods for Data Smoothing
Discretization Without Using Class Labels (Binning vs. Clustering)
Discretization by Classification & Correlation Analysis
Concept Hierarchy Generation
Concept Hierarchy Generation for Nominal Data
Automatic Concept Hierarchy Generation
Summary
References
Data Warehousing and On-line Analytical Processing
Agenda
Data Warehouse: Basic Concepts
What is a Data Warehouse?
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
Data Warehouse—Time Variant
Data Warehouse—Nonvolatile
OLTP vs. OLAP
Why a Separate Data Warehouse?
Data Warehouse: A Multi-Tiered Architecture
Three Data Warehouse Models
Extraction, Transformation, and Loading (ETL)
Metadata Repository
Data Warehouse Modeling: Data Cube and OLAP
From Tables and Spreadsheets to Data Cubes
Cube: A Lattice of Cuboids
Conceptual Modeling of Data Warehouses
Example of Star Schema
Example of Snowflake Schema
Example of Fact Constellation
A Concept Hierarchy: Dimension (location)
Data Cube Measures: Three Categories
View of Warehouses and Hierarchies
Multidimensional Data
A Sample Data Cube
Cuboids Corresponding to the Cube
Typical OLAP Operations
Typical OLAP Operations
A Star-Net Query Model
Browsing a Data Cube
Data Warehouse Design and Usage
Design of Data Warehouse: A Business Analysis Framework
Data Warehouse Design Process
Data Warehouse Development: A Recommended Approach
Data Warehouse Usage
From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)
Data Warehouse Implementation
Efficient Data Cube Computation
The “Compute Cube” Operator
Indexing OLAP Data: Bitmap Index
Indexing OLAP Data: Join Indices
Efficient Processing OLAP Queries
OLAP Server Architectures
Data Generalization by Attribute-Oriented Induction
Attribute-Oriented Induction
Attribute-Oriented Induction: An Example
Class Characterization: An Example
Basic Principles of Attribute-Oriented Induction
Attribute-Oriented Induction: Basic Algorithm
Presentation of Generalized Results
Mining Class Comparisons
Concept Description vs. Cube-Based OLAP
Summary
References
References (cont'd)
Data Cube Technology
Agenda
Data Cube Computation: Preliminary Concepts
Data Cube: A Lattice of Cuboids
Data Cube: A Lattice of Cuboids
Cube Materialization: Full Cube vs. Iceberg Cube
Iceberg Cube, Closed Cube & Cube Shell
Roadmap for Efficient Computation
General Heuristics (Agarwal et al. VLDB’96)
Data Cube Computation Methods
Data Cube Computation Methods
Multi-Way Array Aggregation
Multi-Way Array Aggregation
Multi-way Array Aggregation for Cube Computation (MOLAP)
Multi-way Array Aggregation for Cube Computation (3-D to 2-D)
Multi-way Array Aggregation for Cube Computation (2-D to 1-D)
Multi-Way Array Aggregation for Cube Computation (Method Summary)
BUC
Bottom-Up Computation (BUC)
BUC: Partitioning
Star-Cubing
Star-Cubing: An Integrating Method
Iceberg Pruning in Shared Dimensions
Cell Trees
Star Attributes and Star Nodes
Example: Star Reduction
Star Tree
Star-Cubing Algorithm—DFS on Lattice Tree
Multi-Way Aggregation
Star-Cubing Algorithm—DFS on Star-Tree
Multi-Way Star-Tree Aggregation
Multi-Way Aggregation (2)
High-Dimensional OLAP
The Curse of Dimensionality
Motivation of High-D OLAP
Fast High-D OLAP with Minimal Cubing
Properties of Proposed Method
Example Computation
1-D Inverted Indices
Shell Fragment Cubes: Ideas
Shell Fragment Cubes: Size and Design
ID_Measure Table
The Frag-Shells Algorithm
Frag-Shells (cont'd)
Online Query Computation: Query
Online Query Computation: Method
Online Query Computation: Sketch
Experiment: Size vs. Dimensionality (50 and 100 cardinality)
Experiments on Real World Data
Processing Advanced Queries by Exploring Data Cube Technology
Processing Advanced Queries by Exploring Data Cube Technology
Statistical Surveys and OLAP
Surveys: Sample vs. Whole Population
Problems for Drilling in Multidim. Space
OLAP on Survey (i.e., Sampling) Data
Challenges for OLAP on Sampling Data
Example 1: Confidence Interval
Confidence Interval
Efficient Computing Confidence Interval Measures
Example 2: Query Expansion
Boosting Confidence by Query Expansion
Intra-Cuboid Expansion: Choice 1
Intra-Cuboid Expansion: Choice 2
Query Expansion
Intra-Cuboid Expansion
Inter-Cuboid Expansion
Query Expansion Experiments
Multidimensional Data Analysis in Cube Space
Ranking Cube
Ranking Cubes – Efficient Computation of Ranking queries
Ranking Cube: Partition Data on Both Selection and Ranking Dimensions
Materialize Ranking-Cube
Search with Ranking-Cube: Simultaneously Push Selection and Ranking
Processing Ranking Query: Execution Trace
Ranking Cube: Methodology and Extension
Sampling Cube
Prediction Cubes: Data Mining in Multi-Dimensional Cube Space
Data Mining in Cube Space
Prediction Cubes
How to Determine the Prediction Power of an Attribute?
Efficient Computation of Prediction Cubes
Complex Aggregation at Multiple Granularities: Multi-Feature Cubes
Discovery-Driven Exploration of Data Cubes
Discovery-Driven Exploration of Data Cubes
Kinds of Exceptions and their Computation
Examples: Discovery-Driven Data Cubes
H-Cubing
H-Cubing: Using H-Tree Structure
H-tree: A Prefix Hyper-tree
Computing Cells Involving “City”
Computing Cells Involving Month But No City
Computing Cells Involving Only Cust_grp
Summary
References
References (cont'd)
References (cont'd)
Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods
Agenda
Basic Concepts
What Is Frequent Pattern Analysis?
Why Is Freq. Pattern Mining Important?
Basic Concepts: Frequent Patterns
Basic Concepts: Association Rules
Closed Patterns and Max-Patterns
Closed Patterns and Max-Patterns
Computational Complexity of Frequent Itemset Mining
Frequent Itemset Mining Methods
Scalable Frequent Itemset Mining Methods
The Downward Closure Property and Scalable Mining Methods
Apriori: A Candidate Generation-and-Test Approach
Apriori: A Candidate Generation & Test Approach
The Apriori Algorithm—An Example
The Apriori Algorithm (Pseudo-Code)
Implementation of Apriori
How to Count Supports of Candidates?
Counting Supports of Candidates Using Hash Tree
Candidate Generation: An SQL Implementation
Improving the Efficiency of Apriori
Further Improvement of the Apriori Method
Partition: Scan Database Only Twice
DHP: Reduce the Number of Candidates
Sampling for Frequent Patterns
DIC: Reduce Number of Scans
FPGrowth: A Frequent Pattern-Growth Approach
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Construct FP-tree from a Transaction Database
Partition Patterns and Databases
Find Patterns Having P From P-conditional Database
From Conditional Pattern-bases to Conditional FP-trees
Recursion: Mining Each Conditional FP-tree
A Special Case: Single Prefix Path in FP-tree
Benefits of the FP-tree Structure
The Frequent Pattern Growth Mining Method
Scaling FP-growth by Database Projection
Partition-Based Projection
FP-Growth vs. Apriori: Scalability With the Support Threshold
FP-Growth vs. Tree-Projection: Scalability with the Support Threshold
Advantages of the Pattern Growth Approach
Further Improvements of Mining Methods
Extension of Pattern Growth Mining Methodology
ECLAT: Mining by Exploring Vertical Data Format
Mining Closed Frequent Patterns and Max-Patterns
Mining Frequent Closed Patterns: CLOSET
CLOSET+: Mining Closed Itemsets by Pattern-Growth
MaxMiner: Mining Max-Patterns
CHARM: Mining by Exploring Vertical Data Format
Visualization of Association Rules: Plane Graph
Visualization of Association Rules: Rule Graph
Visualization of Association Rules (SGI/MineSet 3.0)
Which Patterns Are Interesting?—Pattern Evaluation Methods
Interestingness Measure: Correlations (Lift)
Are Lift and χ² Good Measures of Correlation?
Null-Invariant Measures
Comparison of Interestingness Measures
Analysis of DBLP Coauthor Relationships
Which Null-Invariant Measure Is Better?
Summary
References
References (cont'd)
References (cont'd)
References (cont'd)
Advanced Frequent Pattern Mining
Agenda
Pattern Mining in Multi-Level, Multi-Dimensional Space
Mining Multiple-Level Association Rules
Multi-level Association: Flexible Support and Redundancy filtering
Mining Multi-Dimensional Association
Mining Quantitative Associations
Static Discretization of Quantitative Attributes
Quantitative Association Rules Based on Statistical Inference Theory [Aumann and Lindell@DMKD’03]
Negative and Rare Patterns
Defining Negative Correlated Patterns (I)
Defining Negative Correlated Patterns (II)
Constraint-Based Frequent Pattern Mining
Constraint-based (Query-Directed) Mining
Constraints in Data Mining
Meta-Rule Guided Mining
Constraint-Based Frequent Pattern Mining
Pattern Space Pruning with Anti-Monotonicity Constraints
Pattern Space Pruning with Monotonicity Constraints
Data Space Pruning with Data Anti-monotonicity
Pattern Space Pruning with Succinctness
Apriori + Constraint
Constrained Apriori : Push a Succinct Constraint Deep
Constrained FP-Growth: Push a Succinct Constraint Deep
Constrained FP-Growth: Push a Data Anti-monotonic Constraint Deep
Constrained FP-Growth: Push a Data Anti-monotonic Constraint Deep
Convertible Constraints: Ordering Data in Transactions
Strongly Convertible Constraints
Can Apriori Handle Convertible Constraints?
Pattern Space Pruning w. Convertible Constraints
Handling Multiple Constraints
Constraint-Based Mining — A General Picture
What Constraints Are Convertible?
Mining High-Dimensional Data and Colossal Patterns
Mining Colossal Frequent Patterns
Colossal Patterns: A Motivating Example
Colossal Pattern Set: Small but Interesting
Mining Colossal Patterns: Motivation and Philosophy
Alas, A Show of Colossal Pattern Mining!
Methodology of Pattern-Fusion Strategy
Observation: Colossal Patterns and Core Patterns
Robustness of Colossal Patterns
Example: Core Patterns
Robustness of Colossal Patterns
Colossal Patterns Correspond to Dense Balls
Idea of Pattern-Fusion Algorithm
Pattern-Fusion: The Algorithm
Why Is Pattern-Fusion Efficient?
Pattern-Fusion Leads to Good Approximation
Experimental Setting
Experimental Results on Diagn
Experimental Results on ALL
Experimental Results on REPLACE
Experimental Results on REPLACE
Mining Compressed or Approximate Patterns
Mining Compressed Patterns: δ-clustering
Redundancy-Aware Top-k Patterns
Pattern Exploration and Application
How to Understand and Interpret Patterns?
A Dictionary Analogy
Semantic Analysis with Context Models
Annotating DBLP Co-authorship & Title Pattern
Summary
References
References (cont'd)
References (cont'd)
References (cont'd)
References (cont'd)
Classification: Basic Concepts
Agenda
Classification: Basic Concepts
Supervised vs. Unsupervised Learning
Prediction Problems: Classification vs. Numeric Prediction
Classification—A Two-Step Process
Process (1): Model Construction
Process (2): Using the Model in Prediction
Decision Tree Induction
Decision Tree Induction: An Example
Algorithm for Decision Tree Induction
Brief Review of Entropy
Attribute Selection Measure: Information Gain (ID3/C4.5)
Attribute Selection: Information Gain
Computing Information-Gain for Continuous-Valued Attributes
Gain Ratio for Attribute Selection (C4.5)
Gini Index (CART, IBM IntelligentMiner)
Computation of Gini Index
Comparing Attribute Selection Measures
Other Attribute Selection Measures
Overfitting and Tree Pruning
Enhancements to Basic Decision Tree Induction
Classification in Large Databases
Scalability Framework for RainForest
Rainforest: Training Set and Its AVC Sets
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
Presentation of Classification Results
Visualization of a Decision Tree in SGI/MineSet 3.0
Interactive Visual Mining by Perception-Based Classification (PBC)
Bayes Classification Methods
Bayesian Classification: Why?
Bayes’ Theorem: Basics
Prediction Based on Bayes’ Theorem
Classification Is to Derive the Maximum Posteriori
Naïve Bayes Classifier
Naïve Bayes Classifier: Training Dataset
Naïve Bayes Classifier: An Example
Avoiding the Zero-Probability Problem
Naïve Bayes Classifier: Comments
Rule-Based Classification
Using IF-THEN Rules for Classification
Rule Extraction from a Decision Tree
Rule Induction: Sequential Covering Method
Sequential Covering Algorithm
Rule Generation
How to Learn-One-Rule?
Model Evaluation and Selection
Model Evaluation and Selection
Evaluation Metrics
Classifier Evaluation Metrics: Confusion Matrix
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
Classifier Evaluation Metrics: Precision and Recall, and F-measures
Classifier Evaluation Metrics: Example
Methods for Estimating a Classifier’s Accuracy
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
Evaluating Classifier Accuracy: Bootstrap
Comparing Classifiers
Confidence Intervals
Estimating Confidence Intervals: Classifier Models M1 vs. M2
Estimating Confidence Intervals: Null Hypothesis
Estimating Confidence Intervals: t-test
Estimating Confidence Intervals: Table for t-distribution
Estimating Confidence Intervals: Statistical Significance
Cost-Benefit Analysis and ROC Curves
Model Selection: ROC Curves
Issues Affecting Model Selection
Issues: Evaluating Classification Methods
Predictor Error Measures
Scalable Decision Tree Induction Methods
Data Cube-Based Decision-Tree Induction
Techniques to Improve Classification Accuracy: Ensemble Methods
Ensemble Methods: Increasing the Accuracy
Bagging: Bootstrap Aggregation
Boosting
Adaboost (Freund and Schapire, 1997)
Random Forest (Breiman 2001)
Classification of Class-Imbalanced Data Sets
Summary
References
References (cont'd)
References (cont'd)
References (cont'd)
Classification: Advanced Methods
Agenda
Bayesian Belief Networks
Bayesian Belief Networks
A Bayesian Network and Some of Its CPTs
How Are Bayesian Networks Constructed?
Training Bayesian Networks: Several Scenarios
Classification by Backpropagation
Classification by Backpropagation
Neuron: A Hidden/Output Layer Unit
How A Multi-Layer Neural Network Works
Defining a Network Topology
A Multi-Layer Feed-Forward Neural Network
Backpropagation
Efficiency and Interpretability
Neural Network as a Classifier
Support Vector Machines
Classification: A Mathematical Mapping
Discriminative Classifiers
Perceptron & Winnow
SVM—Support Vector Machines
SVM—History and Applications
SVM—General Philosophy