Agenda

  • Why Data Mining?
  • What Is Data Mining?
  • A Multi-Dimensional View of Data Mining
  • What Kinds of Data Can Be Mined?
  • What Kinds of Patterns Can Be Mined?
  • What Kinds of Technologies Are Used?
  • What Kinds of Applications Are Targeted?
  • Major Issues in Data Mining
  • A Brief History of Data Mining and Data Mining Society
  • Summary



Why Data Mining?

  • The Explosive Growth of Data: from terabytes to petabytes
    • Data collection and data availability
      • Automated data collection tools, database systems, Web, computerized society
    • Major sources of abundant data
      • Business: Web, e-commerce, transactions, stocks, …
      • Science: Remote sensing, bioinformatics, scientific simulation, …
      • Society and everyone: news, digital cameras, YouTube
  • We are drowning in data, but starving for knowledge!
  • “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets


Evolution of Sciences: New Data Science Era

  • Before 1600: Empirical science
  • 1600-1950s: Theoretical science
    • Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
  • 1950s-1990s: Computational science
    • Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
    • Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
  • 1990-now: Data science
    • The flood of data from new scientific instruments and simulations
    • The ability to economically store and manage petabytes of data online
    • The Internet and computing Grid that makes all these archives universally accessible
    • Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes
    • Data mining is a major new challenge!
  • Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002


What Is Data Mining?

  • Data mining (knowledge discovery from data)
    • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
    • Data mining: a misnomer?
  • Alternative names
    • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • Watch out: Is everything “data mining”? No; the following are not:
    • Simple search and query processing
    • (Deductive) expert systems


Knowledge Discovery (KDD) Process

  • This is a view from typical database systems and data warehousing communities
  • Data mining plays an essential role in the knowledge discovery process



Example: A Web Mining Framework

  • Web mining usually involves
    • Data cleaning
    • Data integration from multiple sources
    • Warehousing the data
    • Data cube construction
    • Data selection for data mining
    • Data mining
    • Presentation of the mining results
    • Patterns and knowledge to be used or stored into knowledge-base


Data Mining in Business Intelligence




KDD Process: A Typical View from ML and Statistics

  • This is a view from typical machine learning and statistics communities


Which View Do You Prefer?

  • Which view do you prefer?
    • KDD vs. ML/Stat. vs. Business Intelligence
    • Depending on the data, applications, and your focus
  • Data Mining vs. Data Exploration
    • Business intelligence view
      • Warehouse, data cube, reporting but not much mining
    • Business objects vs. data mining tools
    • Supply chain example: mining vs. OLAP vs. presentation tools
    • Data presentation vs. data exploration


Multi-Dimensional View of Data Mining

  • Data to be mined
    • Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
  • Knowledge to be mined (or: Data mining functions)
    • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
    • Descriptive vs. predictive data mining
    • Multiple/integrated functions and mining at multiple levels
  • Techniques utilized
    • Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
  • Applications adapted
    • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.


Data Mining: On What Kinds of Data?

  • Database-oriented data sets and applications
    • Relational database, data warehouse, transactional database
  • Advanced data sets and advanced applications
    • Data streams and sensor data
    • Time-series data, temporal data, sequence data (incl. bio-sequences)
    • Structured data, graphs, social networks and multi-linked data
    • Object-relational databases
    • Heterogeneous databases and legacy databases
    • Spatial data and spatiotemporal data
    • Multimedia database
    • Text databases
    • The World-Wide Web


Data Mining Function: Generalization

  • Information integration and data warehouse construction
    • Data cleaning, transformation, integration, and multidimensional data model
  • Data cube technology
    • Scalable methods for computing (i.e., materializing) multidimensional aggregates
    • OLAP (online analytical processing)
  • Multidimensional concept description: Characterization and discrimination
    • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
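
The idea of materializing multidimensional aggregates can be sketched in a few lines. The following is a minimal illustration, not a scalable cube algorithm: it sums one measure for every subset of the dimensions (every cuboid). The function name `compute_cube` and the toy sales records are assumptions invented for this example.

```python
from collections import defaultdict
from itertools import combinations


def compute_cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`, i.e. for every cuboid."""
    cube = {}
    for k in range(len(dims) + 1):
        for group_by in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                # The group key is the row's values on the chosen dimensions.
                key = tuple(row[d] for d in group_by)
                totals[key] += row[measure]
            cube[group_by] = dict(totals)
    return cube


# Hypothetical toy sales records, invented for illustration.
sales = [
    {"region": "dry", "item": "umbrella", "units": 2},
    {"region": "dry", "item": "sunscreen", "units": 9},
    {"region": "wet", "item": "umbrella", "units": 8},
    {"region": "wet", "item": "sunscreen", "units": 1},
]

cube = compute_cube(sales, ("region", "item"), "units")
print(cube[()])            # grand total (apex cuboid)
print(cube[("region",)])   # totals per region, e.g. dry vs. wet
```

With n dimensions this naive loop builds 2^n cuboids and rescans the data for each one, which is why scalable cube computation is a topic in its own right.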


Data Mining Function: Association and Correlation Analysis

  • Frequent patterns (or frequent itemsets)
    • What items are frequently purchased together in your Walmart?
  • Association, correlation vs. causality
    • A typical association rule
      • Diaper → Beer [0.5%, 75%] (support, confidence)
    • Are strongly associated items also strongly correlated?
  • How to mine such patterns and rules efficiently in large datasets?
  • How to use such patterns for classification, clustering, and other applications?
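
Support and confidence, and the question of whether a strong association is also a positive correlation, can be made concrete with a small sketch. The baskets below are invented for illustration and chosen so that Diaper → Beer has 75% confidence; lift is used here as one common correlation measure, which is an assumption of this example rather than something the slide prescribes.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(transactions, itemset):
    """Fraction of transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support; 1.0 means independent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical baskets, invented for illustration.
baskets = (
    [{"diaper", "beer"}] * 3
    + [{"diaper"}]
    + [{"beer"}] * 5
    + [{"milk"}]
)

print(support(baskets, {"diaper", "beer"}))       # 3 of 10 baskets
print(confidence(baskets, {"diaper"}, {"beer"}))  # 3 of the 4 diaper baskets
print(lift(baskets, {"diaper"}, {"beer"}))        # below 1 on this toy data
```

On this toy data the rule has high confidence, yet the lift is below 1 because beer appears in 80% of all baskets anyway: the items are strongly associated but slightly negatively correlated.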


Data Mining Function: Classification

  • Classification and label prediction
    • Construct models (functions) based on some training examples
    • Describe and distinguish classes or concepts for future prediction
      • E.g., classify countries based on (climate), or classify cars based on (gas mileage)
    • Predict some unknown class labels
  • Typical methods
    • Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
  • Typical applications:
    • Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
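
As a rough sketch of "construct a model from training examples, then predict unknown labels", the code below trains a decision stump (a one-level decision tree) on a single numeric attribute. It follows the cars-by-gas-mileage example above; the mileage values, labels, and function names are assumptions invented for illustration, and a real decision tree learner would use an impurity measure rather than raw error counts.

```python
def train_stump(values, labels):
    """Return (threshold, label_below, label_above) with the fewest training errors."""
    classes = sorted(set(labels))
    points = sorted(set(values))
    if len(points) < 2:
        raise ValueError("need at least two distinct attribute values")
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for t in thresholds:
        for below in classes:
            for above in classes:
                errors = sum(
                    (below if v <= t else above) != y
                    for v, y in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, t, below, above)
    _, t, below, above = best
    return t, below, above


def predict(stump, value):
    """Apply the learned threshold to an unseen value."""
    t, below, above = stump
    return below if value <= t else above


# Hypothetical training data (miles per gallon), invented for illustration.
mpg = [15, 18, 21, 24, 33, 36, 40, 45]
label = ["low", "low", "low", "low", "high", "high", "high", "high"]

stump = train_stump(mpg, label)
print(stump)               # (28.5, 'low', 'high')
print(predict(stump, 20))  # low
print(predict(stump, 38))  # high
```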


Data Mining Function: Cluster Analysis

  • Unsupervised learning (i.e., class labels are unknown)
  • Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
  • Principle: maximize intra-class similarity and minimize inter-class similarity
  • Many methods and applications
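
One of the many methods, k-means, illustrates the principle above: points are grouped so that each is close to its own cluster centre and far from the others. This is a minimal sketch with a naive, deterministic initialization (the first k points); the house coordinates and function names are assumptions invented for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            centroid(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical house locations, invented for illustration.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

centroids, clusters = kmeans(houses, k=2)
for c, members in zip(centroids, clusters):
    print(c, members)
```

The result depends on the initial centroids in general; practical implementations use randomized or smarter seeding and several restarts.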


Data Mining Function: Outlier Analysis

  • Outlier analysis
    • Outlier: A data object that does not comply with the general behavior of the data
    • Noise or exception? One person’s garbage could be another person’s treasure
    • Methods: by-product of clustering or regression analysis, …
    • Useful in fraud detection, rare events analysis
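
To illustrate outliers as a by-product of regression analysis, the sketch below fits a least-squares line and flags points whose residual is unusually large. The two-standard-deviation cutoff, the data, and the function names are assumptions chosen for this example, not a recommended rule.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def regression_outliers(xs, ys, n_sigmas=2.0):
    """Indices of points whose residual exceeds n_sigmas residual std devs."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > n_sigmas * spread]


# Hypothetical data: points on y = 2x + 1 with one corrupted value.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 40

outlier_indices = regression_outliers(xs, ys)
print(outlier_indices)  # [5]
```

Whether the flagged point is noise to discard or an exception worth investigating (for example, a fraudulent transaction) depends on the application.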


Time and Ordering: Sequential Pattern, Trend and Evolution Analysis

  • Sequence, trend and evolution analysis
    • Trend, time-series, and deviation analysis: e.g., regression and value prediction
    • Sequential pattern mining
      • e.g., first buy digital camera, then buy large SD memory cards
    • Periodicity analysis
    • Motifs and biological sequence analysis
      • Approximate and consecutive motifs
    • Similarity-based analysis
  • Mining data streams
    • Ordered, time-varying, potentially infinite, data streams
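
The sequential pattern example above (first a digital camera, then SD memory cards) differs from a plain itemset in that order matters. A minimal sketch of counting the support of one ordered pattern follows; the purchase histories and function names are assumptions invented for illustration, and real sequential pattern miners also have to enumerate candidate patterns efficiently.

```python
def contains_in_order(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (gaps are allowed)."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched item,
    # so later pattern items must appear after earlier ones.
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)


# Hypothetical purchase histories, one list per customer, oldest first.
histories = [
    ["camera", "tripod", "sd_card"],
    ["sd_card", "camera"],
    ["camera", "sd_card"],
    ["laptop", "mouse"],
]

print(sequence_support(histories, ["camera", "sd_card"]))  # 0.5
print(sequence_support(histories, ["sd_card", "camera"]))  # 0.25
```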


Structure and Network Analysis

  • Graph mining
    • Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
  • Information network analysis
    • Social networks: actors (objects, nodes) and relationships (edges)
      • e.g., author networks in CS, terrorist networks
    • Multiple heterogeneous networks
      • A person could be in multiple information networks: friends, family, classmates, …
    • Links carry a lot of semantic information: Link mining
  • Web mining
    • Web is a big information network: from PageRank to Google
    • Analysis of Web information networks
      • Web community discovery, opinion mining, usage mining, …
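
Since PageRank is mentioned above as an example of treating the Web as an information network, here is a simplified power-iteration sketch of the idea: a page is important if important pages link to it. The four-page graph, the damping value of 0.85, the fixed iteration count, and the function name are assumptions for illustration; the sketch also assumes every link target appears as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict: page -> outlinks."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web graph, invented for illustration.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

On this toy graph page "c" ranks highest because three of the four pages link to it, and page "d" ranks lowest because nothing links to it.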


Evaluation of Knowledge

  • Is all mined knowledge interesting?
    • One can mine a tremendous amount of “patterns” and knowledge
    • Some may fit only certain dimension space (time, location, …)
    • Some may not be representative, may be transient, …
  • Evaluation of mined knowledge → directly mine only interesting knowledge?
    • Descriptive vs. predictive
    • Coverage
    • Typicality vs. novelty
    • Accuracy
    • Timeliness


Data Mining: Confluence of Multiple Disciplines




Why Confluence of Multiple Disciplines?

  • Tremendous amount of data
    • Algorithms must be highly scalable to handle, e.g., terabytes of data
  • High-dimensionality of data
    • A microarray may have tens of thousands of dimensions
  • High complexity of data
    • Data streams and sensor data
    • Time-series data, temporal data, sequence data
    • Structured data, graphs, social networks and multi-linked data
    • Heterogeneous databases and legacy databases
    • Spatial, spatiotemporal, multimedia, text and Web data
    • Software programs, scientific simulations
  • New and sophisticated applications


Applications of Data Mining

  • Web page analysis: from web page classification, clustering to PageRank & HITS algorithms
  • Collaborative analysis & recommender systems
  • Basket data analysis to targeted marketing
  • Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
  • Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
  • From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining


Major Issues in Data Mining

  • Mining Methodology
    • Mining various and new kinds of knowledge
    • Mining knowledge in multi-dimensional space
    • Data mining: An interdisciplinary effort
    • Boosting the power of discovery in a networked environment
    • Handling noise, uncertainty, and incompleteness of data
    • Pattern evaluation and pattern- or constraint-guided mining
  • User Interaction
    • Interactive mining
    • Incorporation of background knowledge
    • Presentation and visualization of data mining results


Major Issues in Data Mining (cont'd)

  • Efficiency and Scalability
    • Efficiency and scalability of data mining algorithms
    • Parallel, distributed, stream, and incremental mining methods
  • Diversity of data types
    • Handling complex types of data
    • Mining dynamic, networked, and global data repositories
  • Data mining and society
    • Social impacts of data mining
    • Privacy-preserving data mining
    • Invisible data mining


A Brief History of Data Mining Society

  • 1989 IJCAI Workshop on Knowledge Discovery in Databases
    • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
  • 1991-1994 Workshops on Knowledge Discovery in Databases
    • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
  • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
    • Journal of Data Mining and Knowledge Discovery (1997)
  • ACM SIGKDD conferences since 1998 and SIGKDD Explorations
  • More conferences on data mining
    • PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.
  • ACM Transactions on KDD (2007)


Conferences and Journals on Data Mining

  • KDD Conferences
    • ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
    • SIAM Data Mining Conf. (SDM)
    • (IEEE) Int. Conf. on Data Mining (ICDM)
    • European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD)
    • Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
    • Int. Conf. on Web Search and Data Mining (WSDM)
  • Other related conferences
    • DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
    • Web and IR conferences: WWW, SIGIR, WSDM
    • ML conferences: ICML, NIPS
    • PR conferences: CVPR
  • Journals
    • Data Mining and Knowledge Discovery (DAMI or DMKD)
    • IEEE Trans. on Knowledge and Data Eng. (TKDE)
    • KDD Explorations
    • ACM Trans. on KDD


Where to Find References? DBLP, CiteSeer, Google

  • Data mining and KDD (SIGKDD: CDROM)
    • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
    • Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
  • Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
    • Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
    • Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
  • AI & Machine Learning
    • Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
    • Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
  • Web and IR
    • Conferences: SIGIR, WWW, CIKM, etc.
    • Journals: WWW: Internet and Web Information Systems
  • Statistics
    • Conferences: Joint Stat. Meeting, etc.
    • Journals: Annals of Statistics, etc.
  • Visualization
    • Conference proceedings: CHI, ACM-SIGGraph, etc.
    • Journals: IEEE Trans. Visualization and Computer Graphics, etc.


Recommended Reference Books

  • E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011
  • S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
  • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000
  • T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
  • U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
  • U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
  • J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011
  • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009
  • B. Liu, Web Data Mining, Springer 2006
  • T. M. Mitchell, Machine Learning, McGraw Hill, 1997
  • Y. Sun and J. Han, Mining Heterogeneous Information Networks, Morgan & Claypool, 2012
  • P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005
  • S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
  • I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005


Summary

  • Data mining: Discovering interesting patterns and knowledge from massive amounts of data
  • A natural evolution of science and information technology, in great demand, with wide applications
  • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
  • Mining can be performed on a variety of data
  • Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
  • Data mining technologies and applications
  • Major issues in data mining




Creator: sidraaslam

Contributors:
rabinkumar


Licensed under the Creative Commons
Attribution ShareAlike CC-BY-SA license

