Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets
Evolution of Sciences: New Data Science Era
Before 1600: Empirical science
1600-1950s: Theoretical science
Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s: Computational science
Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, physics, or linguistics).
Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
1990s-now: Data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible
Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes
Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process
Example: A Web Mining Framework
Web mining usually involves
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Presentation of the mining results
Patterns and knowledge to be used or stored in a knowledge base
Data Mining in Business Intelligence
KDD Process: A Typical View from ML and Statistics
This is a view from typical machine learning and statistics communities
Which View Do You Prefer?
KDD vs. ML/Stat. vs. Business Intelligence
Depending on the data, applications, and your focus
Data Mining vs. Data Exploration
Business intelligence view
Warehouse, data cube, reporting but not much mining
Business objects vs. data mining tools
Supply chain example: mining vs. OLAP vs. presentation tools
Data presentation vs. data exploration
Multi-Dimensional View of Data Mining
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
Knowledge to be mined (or: Data mining functions)
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
The World-Wide Web
Data Mining Function: Generalization
Information integration and data warehouse construction
Data cleaning, transformation, integration, and multidimensional data model
Data cube technology
Scalable methods for computing (i.e., materializing) multidimensional aggregates
OLAP (online analytical processing)
Multidimensional concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
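To make "materializing multidimensional aggregates" concrete, here is a minimal sketch that computes the sum of one measure for every subset of dimensions, which is what a full data cube contains. The fact table, the dimension names, and the function name compute_cube are all hypothetical and chosen only for illustration; the loop is the naive approach, not one of the scalable methods the slide refers to.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy fact table: (region, season, rainfall_mm).
ROWS = [
    ("north", "summer", 10),
    ("north", "winter", 40),
    ("south", "summer", 5),
    ("south", "winter", 25),
]
DIMENSIONS = ("region", "season")


def compute_cube(rows, dimensions):
    """Sum the measure (last column) for every subset of dimensions.

    Returns {group_by_dims: {dim_values: total}}. The empty tuple is the
    apex cuboid, i.e. the grand total over all rows.
    """
    cube = {}
    indices = range(len(dimensions))
    for k in range(len(dimensions) + 1):
        for subset in combinations(indices, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in subset)
                totals[key] += row[-1]
            cube[tuple(dimensions[i] for i in subset)] = dict(totals)
    return cube


cube = compute_cube(ROWS, DIMENSIONS)
```

Looking up cube[("region",)] gives the per-region totals, which is the kind of roll-up an OLAP tool would use to contrast a drier region with a wetter one. With d dimensions there are 2^d group-bys, which is why materializing the cube efficiently is a research topic in its own right.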
Data Mining Function: Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in your Walmart?
Association, correlation vs. causality
A typical association rule
Diaper → Beer [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large datasets?
How to use such patterns for classification, clustering, and other applications?
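The two numbers attached to a rule such as Diaper → Beer are its support and confidence. The sketch below computes both on a small invented set of transactions, and adds lift as one common way to ask whether associated items are also correlated. Lift is an assumption introduced here for illustration; the slide does not name a specific correlation measure, and the transactions are not real data.

```python
# Hypothetical toy transactions; items and counts are invented.
TRANSACTIONS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's support.

    A value near 1.0 suggests independence; above 1.0 suggests a
    positive correlation, below 1.0 a negative one.
    """
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


rule_support = support({"diaper", "beer"}, TRANSACTIONS)
rule_confidence = confidence({"diaper"}, {"beer"}, TRANSACTIONS)
rule_lift = lift({"diaper"}, {"beer"}, TRANSACTIONS)
```

A rule can have high confidence simply because its consequent is popular overall, which is why a measure that also accounts for the consequent's own support is useful when judging correlation.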
Data Mining Function: Classification
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
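As a rough illustration of constructing a model from training examples, the sketch below learns a single threshold on gas mileage and uses it to label unseen cars. The training data, the "low"/"high" class labels, and the function names are hypothetical; a one-threshold rule (a decision stump) is just about the simplest possible classifier and stands in here for the richer models used in practice.

```python
# Hypothetical training examples: (miles per gallon, class label).
TRAINING = [
    (15.0, "low"), (18.0, "low"), (21.0, "low"),
    (30.0, "high"), (34.0, "high"), (38.0, "high"),
]


def train_stump(examples):
    """Pick the threshold that misclassifies the fewest training examples.

    Candidate thresholds are midpoints between consecutive sorted values;
    values at or below the threshold are predicted as "low".
    """
    values = sorted(v for v, _ in examples)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(
            ("low" if v <= threshold else "high") != label
            for v, label in examples
        )

    return min(candidates, key=errors)


def predict(threshold, mpg):
    """Label an unseen car using the learned threshold."""
    return "low" if mpg <= threshold else "high"


threshold = train_stump(TRAINING)
```

Calling predict(threshold, 17.0) labels a car the model has never seen, which is the "future prediction" step: the model is built once from labeled examples and then applied to new, unlabeled ones.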