


Agenda

  • Why Data Mining?
  • What Is Data Mining?
  • A Multi-Dimensional View of Data Mining
  • What Kinds of Data Can Be Mined?
  • What Kinds of Patterns Can Be Mined?
  • What Kinds of Technologies Are Used?
  • What Kinds of Applications Are Targeted?
  • Major Issues in Data Mining
  • A Brief History of Data Mining and Data Mining Society
  • Summary



Why Data Mining?

  • The Explosive Growth of Data: from terabytes to petabytes
    • Data collection and data availability
      • Automated data collection tools, database systems, Web, computerized society
    • Major sources of abundant data
      • Business: Web, e-commerce, transactions, stocks, …
      • Science: Remote sensing, bioinformatics, scientific simulation, …
      • Society and everyone: news, digital cameras, YouTube
  • We are drowning in data, but starving for knowledge!
  • “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets


Evolution of Sciences: New Data Science Era

  • Before 1600: Empirical science
  • 1600-1950s: Theoretical science
    • Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
  • 1950s-1990s: Computational science
    • Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
    • Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
  • 1990-now: Data science
    • The flood of data from new scientific instruments and simulations
    • The ability to economically store and manage petabytes of data online
    • The Internet and computing Grid that makes all these archives universally accessible
    • Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes
    • Data mining is a major new challenge!
  • Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002


What Is Data Mining?

  • Data mining (knowledge discovery from data)
    • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
    • Data mining: a misnomer?
  • Alternative names
    • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • Watch out: Is everything “data mining”?
    • Simple search and query processing
    • (Deductive) expert systems


Knowledge Discovery (KDD) Process

  • This is a view from typical database systems and data warehousing communities
  • Data mining plays an essential role in the knowledge discovery process



Example: A Web Mining Framework

  • Web mining usually involves
    • Data cleaning
    • Data integration from multiple sources
    • Warehousing the data
    • Data cube construction
    • Data selection for data mining
    • Data mining
    • Presentation of the mining results
    • Patterns and knowledge to be used or stored into knowledge-base


Data Mining in Business Intelligence




KDD Process: A Typical View from ML and Statistics

  • This is a view from typical machine learning and statistics communities


Which View Do You Prefer?

  • Which view do you prefer?
    • KDD vs. ML/Stat. vs. Business Intelligence
    • Depending on the data, applications, and your focus
  • Data Mining vs. Data Exploration
    • Business intelligence view
      • Warehouse, data cube, reporting but not much mining
    • Business objects vs. data mining tools
    • Supply chain example: mining vs. OLAP vs. presentation tools
    • Data presentation vs. data exploration


Multi-Dimensional View of Data Mining

  • Data to be mined
    • Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
  • Knowledge to be mined (or: Data mining functions)
    • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
    • Descriptive vs. predictive data mining
    • Multiple/integrated functions and mining at multiple levels
  • Techniques utilized
    • Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
  • Applications adapted
    • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.


Data Mining: On What Kinds of Data?

  • Database-oriented data sets and applications
    • Relational database, data warehouse, transactional database
  • Advanced data sets and advanced applications
    • Data streams and sensor data
    • Time-series data, temporal data, sequence data (incl. bio-sequences)
    • Structure data, graphs, social networks and multi-linked data
    • Object-relational databases
    • Heterogeneous databases and legacy databases
    • Spatial data and spatiotemporal data
    • Multimedia database
    • Text databases
    • The World-Wide Web


Data Mining Function:  Generalization

  • Information integration and data warehouse construction
    • Data cleaning, transformation, integration, and multidimensional data model
  • Data cube technology
    • Scalable methods for computing (i.e., materializing) multidimensional aggregates
    • OLAP (online analytical processing)
  • Multidimensional concept description: Characterization and discrimination
    • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region


Data Mining Function: Association and Correlation Analysis

  • Frequent patterns (or frequent itemsets)
    • What items are frequently purchased together in your Walmart?
  • Association, correlation vs. causality
    • A typical association rule
      • Diaper → Beer [0.5%, 75%] (support, confidence)
    • Are strongly associated items also strongly correlated?
  • How to mine such patterns and rules efficiently in large datasets?
  • How to use such patterns for classification, clustering, and other applications?
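
The two numbers attached to an association rule can be computed directly from transaction counts. The sketch below is illustrative only: the baskets are invented toy data, and the helper names (`count_containing`, `support`, `confidence`) are assumptions, not part of any particular mining library.

```python
from typing import FrozenSet, Iterable, List


def count_containing(transactions: List[FrozenSet[str]], items: Iterable[str]) -> int:
    """Number of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(1 for t in transactions if wanted <= t)


def support(transactions, antecedent, consequent) -> float:
    """Fraction of all transactions that contain antecedent and consequent together."""
    both = frozenset(antecedent) | frozenset(consequent)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent) -> float:
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    base = count_containing(transactions, antecedent)
    if base == 0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return count_containing(transactions, both) / base


# Hypothetical toy baskets, invented for illustration only.
baskets = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"diaper", "beer", "bread"}),
]

rule_support = support(baskets, {"diaper"}, {"beer"})
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})
print(f"diaper -> beer [{rule_support:.0%}, {rule_confidence:.0%}]")
```

On these five baskets the rule holds in 3 of 5 transactions (support 60%) and in 3 of the 4 transactions that contain diapers (confidence 75%). This brute-force counting does not scale; efficient mining on large datasets is a separate question.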


Data Mining Function: Classification

  • Classification and label prediction
    • Construct models (functions) based on some training examples
    • Describe and distinguish classes or concepts for future prediction
      • E.g., classify countries based on (climate), or classify cars based on (gas mileage)
    • Predict some unknown class labels
  • Typical methods
    • Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
  • Typical applications:
    • Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …


Data Mining Function: Cluster Analysis

  • Unsupervised learning (i.e., Class label is unknown)
  • Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
  • Principle: Maximizing intra-class similarity & minimizing interclass similarity
  • Many methods and applications


Data Mining Function: Outlier Analysis

  • Outlier analysis
    • Outlier: A data object that does not comply with the general behavior of the data
    • Noise or exception? ― One person’s garbage could be another person’s treasure
    • Methods: by-product of clustering or regression analysis, …
    • Useful in fraud detection, rare events analysis


Time and Ordering: Sequential Pattern, Trend and Evolution Analysis

  • Sequence, trend and evolution analysis
    • Trend, time-series, and deviation analysis: e.g., regression and value prediction
    • Sequential pattern mining
      • e.g., first buy digital camera, then buy large SD memory cards
    • Periodicity analysis
    • Motifs and biological sequence analysis
      • Approximate and consecutive motifs
    • Similarity-based analysis
  • Mining data streams
    • Ordered, time-varying, potentially infinite, data streams


Structure and Network Analysis

  • Graph mining
    • Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
  • Information network analysis
    • Social networks: actors (objects, nodes) and relationships (edges)
      • e.g., author networks in CS, terrorist networks
    • Multiple heterogeneous networks
      • A person could be in multiple information networks: friends, family, classmates, …
    • Links carry a lot of semantic information: Link mining
  • Web mining
    • Web is a big information network: from PageRank to Google
    • Analysis of Web information networks
      • Web community discovery, opinion mining, usage mining, …


Evaluation of Knowledge

  • Is all mined knowledge interesting?
    • One can mine a tremendous number of “patterns” and knowledge
    • Some may fit only certain dimension space (time, location, …)
    • Some may not be representative, may be transient, …
  • Evaluation of mined knowledge → directly mine only interesting knowledge?
    • Descriptive vs. predictive
    • Coverage
    • Typicality vs. novelty
    • Accuracy
    • Timeliness


Data Mining: Confluence of Multiple Disciplines




Why Confluence of Multiple Disciplines?

  • Tremendous amount of data
    • Algorithms must be highly scalable to handle data volumes such as terabytes
  • High-dimensionality of data
    • Micro-array may have tens of thousands of dimensions
  • High complexity of data
    • Data streams and sensor data
    • Time-series data, temporal data, sequence data
    • Structure data, graphs, social networks and multi-linked data
    • Heterogeneous databases and legacy databases
    • Spatial, spatiotemporal, multimedia, text and Web data
    • Software programs, scientific simulations
  • New and sophisticated applications


Applications of Data Mining

  • Web page analysis: from web page classification, clustering to PageRank & HITS algorithms
  • Collaborative analysis & recommender systems
  • Basket data analysis to targeted marketing
  • Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
  • Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
  • From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining


Major Issues in Data Mining

  • Mining Methodology
    • Mining various and new kinds of knowledge
    • Mining knowledge in multi-dimensional space
    • Data mining: An interdisciplinary effort
    • Boosting the power of discovery in a networked environment
    • Handling noise, uncertainty, and incompleteness of data
    • Pattern evaluation and pattern- or constraint-guided mining
  • User Interaction
    • Interactive mining
    • Incorporation of background knowledge
    • Presentation and visualization of data mining results


Major Issues in Data Mining (cont')

  • Efficiency and Scalability
    • Efficiency and scalability of data mining algorithms
    • Parallel, distributed, stream, and incremental mining methods
  • Diversity of data types
    • Handling complex types of data
    • Mining dynamic, networked, and global data repositories
  • Data mining and society
    • Social impacts of data mining
    • Privacy-preserving data mining
    • Invisible data mining


A Brief History of Data Mining Society

  • 1989 IJCAI Workshop on Knowledge Discovery in Databases
    • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
  • 1991-1994 Workshops on Knowledge Discovery in Databases
    • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
  • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
    • Journal of Data Mining and Knowledge Discovery (1997)
  • ACM SIGKDD conferences since 1998 and SIGKDD Explorations
  • More conferences on data mining
    • PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.
  • ACM Transactions on KDD (2007)


Conferences and Journals on Data Mining

  • KDD Conferences
    • ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
    • SIAM Data Mining Conf. (SDM)
    • (IEEE) Int. Conf. on Data Mining (ICDM)
    • European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD)
    • Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
    • Int. Conf. on Web Search and Data Mining (WSDM)
  • Other related conferences
    • DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
    • Web and IR conferences: WWW, SIGIR, WSDM
    • ML conferences: ICML, NIPS
    • PR conferences: CVPR,
  • Journals
    • Data Mining and Knowledge Discovery (DAMI or DMKD)
    • IEEE Trans. On Knowledge and Data Eng. (TKDE)
    • KDD Explorations
    • ACM Trans. on KDD


Where to Find References? DBLP, CiteSeer, Google

  • Data mining and KDD (SIGKDD: CDROM)
    • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
    • Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
  • Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
    • Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
    • Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
  • AI & Machine Learning
    • Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
    • Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
  • Web and IR
    • Conferences: SIGIR, WWW, CIKM, etc.
    • Journals: WWW: Internet and Web Information Systems,
  • Statistics
    • Conferences: Joint Stat. Meeting, etc.
    • Journals: Annals of statistics, etc.
  • Visualization
    • Conference proceedings: CHI, ACM-SIGGraph, etc.
    • Journals: IEEE Trans. visualization and computer graphics, etc.


Recommended Reference Books

  • E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011
  • S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
  • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000
  • T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
  • U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
  • U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
  • J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. , 2011
  • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009
  • B. Liu, Web Data Mining, Springer 2006
  • T. M. Mitchell, Machine Learning, McGraw Hill, 1997
  • Y. Sun and J. Han, Mining Heterogeneous Information Networks, Morgan & Claypool, 2012
  • P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
  • S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
  • I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005


Summary

  • Data mining: Discovering interesting patterns and knowledge from massive amounts of data
  • A natural evolution of science and information technology, in great demand, with wide applications
  • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
  • Mining can be performed on a variety of data
  • Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
  • Data mining technologies and applications
  • Major issues in data mining




Agenda

  • Data Objects and Attribute Types
  • Basic Statistical Descriptions of Data
  • Data Visualization
  • Measuring Data Similarity and Dissimilarity
  • Summary 


Types of Data Sets

  • Record
    • Relational records
    • Data matrix, e.g., numerical matrix, crosstabs
    • Document data: text documents: term-frequency vector
    • Transaction data
  • Graph and network
    • World Wide Web
    • Social or information networks
    • Molecular Structures
  • Ordered
    • Video data: sequence of images
    • Temporal data: time-series
    • Sequential Data: transaction sequences
    • Genetic sequence data
  • Spatial, image and multimedia:
    • Spatial data: maps
    • Image data:
    • Video data:


Important Characteristics of Structured Data

  • Dimensionality
    • Curse of dimensionality
  • Sparsity
    • Only presence counts
  • Resolution
    • Patterns depend on the scale
  • Distribution
    • Centrality and dispersion


Data Objects

  • Data sets are made up of data objects.
  • A data object represents an entity.
  • Examples:
    • sales database: customers, store items, sales
    • medical database: patients, treatments
    • university database: students, professors, courses
  • Also called samples, examples, instances, data points, objects, tuples.
  • Data objects are described by attributes.
  • Database rows -> data objects; columns -> attributes.


Attributes

  • Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object.
    • E.g., customer_ID, name, address
  • Types:
    • Nominal
    • Binary
    • Ordinal
    • Numeric: quantitative
      • Interval-scaled
      • Ratio-scaled


Attribute Types

  • Nominal: categories, states, or “names of things”
    • Hair_color = {auburn, black, blond, brown, grey, red, white}
    • marital status, occupation, ID numbers, zip codes
  • Binary
    • Nominal attribute with only 2 states (0 and 1)
    • Symmetric binary: both outcomes equally important
      • e.g., gender
    • Asymmetric binary: outcomes not equally important.
      • e.g., medical test (positive vs. negative)
      • Convention: assign 1 to most important outcome (e.g., HIV positive)
  • Ordinal
    • Values have a meaningful order (ranking) but magnitude between successive values is not known.
    • Size = {small, medium, large}, grades, army rankings


Numeric Attribute Types

  • Quantity (integer or real-valued)
  • Interval
      • Measured on a scale of equal-sized units
      • Values have order
        • E.g., temperature in °C or °F, calendar dates
      • No true zero-point
  • Ratio
      • Inherent zero-point
      • We can speak of a value as being a multiple (or ratio) of another value (10 K is twice as high as 5 K).
        • e.g., temperature in Kelvin, length, counts, monetary quantities


Discrete vs. Continuous Attributes

  • Discrete Attribute
    • Has only a finite or countably infinite set of values
      • E.g., zip codes, profession, or the set of words in a collection of documents
    • Sometimes, represented as integer variables
    • Note: Binary attributes are a special case of discrete attributes
  • Continuous Attribute
    • Has real numbers as attribute values
      • E.g., temperature, height, or weight
    • Practically, real values can only be measured and represented using a finite number of digits
    • Continuous attributes are typically represented as floating-point variables


Basic Statistical Descriptions of Data

  • Motivation
    • To better understand the data: central tendency, variation and spread
  • Data dispersion characteristics
    • median, max, min, quantiles, outliers, variance, etc.
  • Numerical dimensions correspond to sorted intervals
    • Data dispersion: analyzed with multiple granularities of precision
    • Boxplot or quantile analysis on sorted intervals
  • Dispersion analysis on computed measures
    • Folding measures into numerical dimensions
    • Boxplot or quantile analysis on the transformed cube


Measuring the Central Tendency

  •  Mean (algebraic measure) (sample vs. population):  

\[ {\bar{x}}=\frac{1}{n} \sum_{i=1}^{n}x_{i} \]
\[ \mu = \frac{\sum x}{N} \]

    • Note: n is sample size and N is population size.
    • Weighted arithmetic mean:

\[ {\bar{x}}=\frac{\sum_{i=1}^{n}w_{i}x_{i} }{\sum_{i=1}^{n}w_{i}}  \]

    • Trimmed mean: chopping extreme values

  • Median:
    • Middle value if odd number of values, or average of the middle two values otherwise
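    • Estimated by interpolation (for grouped data), where L_1 is the lower boundary of the median interval, n is the number of values, (Σ freq)_l is the sum of the frequencies of all intervals below the median interval, freq_median is the frequency of the median interval, and width is its width:

\[ median = L_{1} + \left(\frac{\frac{n}{2}-\left(\sum freq\right)_{l}}{freq_{median}}\right) width \]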
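
The measures of central tendency above can be checked on a small data set. The following is a minimal sketch using Python's standard `statistics` module; the helper names, the `fraction` parameter of the trimmed mean, and the sample values are assumptions made for illustration.

```python
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end (assumed convention)."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values chopped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def grouped_median(lower, n, cum_freq_below, freq_median, width):
    """Interpolated median for grouped data: L1 + ((n/2 - cum_freq_below) / freq_median) * width."""
    return lower + (n / 2 - cum_freq_below) / freq_median * width


# Illustrative values (e.g., salaries in thousands); 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))               # arithmetic mean
print(median(salaries))             # average of the two middle values
print(trimmed_mean(salaries, 0.1))  # drops the smallest and largest value
print(weighted_mean([1, 2, 3], [1, 1, 2]))
print(grouped_median(lower=20, n=100, cum_freq_below=40, freq_median=30, width=10))
```

The trimmed mean is lower than the plain mean here because the single extreme value 110 no longer pulls the average upward.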



Measuring the Central Tendency (cont')

  • Mode
    • Value that occurs most frequently in the data
    • Unimodal, bimodal, trimodal
    • Empirical formula:  
       \[ (mean-mode) = 3 \times (mean-median) \]



Symmetric vs. Skewed Data

  • Median, mean and mode of symmetric, positively and negatively skewed data 



Measuring the Dispersion of Data

  • Quartiles, outliers and boxplots
    • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
    • Inter-quartile range: IQR = Q3 – Q1
    • Five number summary: min, Q1, median, Q3, max
    • Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
    • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
  • Variance and standard deviation (sample: s, population: σ)
    • Variance: (algebraic, scalable computation)
\[ {s^{2}}=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2=\frac{1}{n-1}[\sum_{i=1}^{n}x_{i}^2-\frac{1}{n}(\sum_{i=1}^{n}x_{i})^2] \]

\[ {\sigma ^{2}}=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu)^2=\frac{1}{N}\sum_{i=1}^{N}x_{i}^2-\mu ^2 \]
    • Standard deviation s (or σ) is the square root of variance s² (or σ²)
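
The second form of each variance formula is what makes the computation "scalable": it needs only the running count, sum, and sum of squares, so it can be computed in one pass over the data. A minimal sketch follows, with function names and sample data chosen for illustration. One caveat that the slide does not discuss: the one-pass form can lose precision through cancellation when the mean is large relative to the spread.

```python
import math


def sample_variance_two_pass(xs):
    """s^2 = 1/(n-1) * sum((x_i - mean)^2); needs the mean first, so two passes."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def sample_variance_one_pass(xs):
    """s^2 = 1/(n-1) * [sum(x_i^2) - (sum(x_i))^2 / n]; a single pass over the data."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = (1/N) * sum(x_i^2) - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values with mean 5

print(sample_variance_two_pass(data))
print(sample_variance_one_pass(data))
print(population_variance(data), math.sqrt(population_variance(data)))
```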


Boxplot Analysis

  • Five-number summary of a distribution
    • Minimum, Q1, Median, Q3, Maximum
  • Boxplot
    • Data is represented with a box
    • The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
    • The median is marked by a line within the box
    • Whiskers: two lines outside the box extended to Minimum and Maximum
    • Outliers: points beyond a specified outlier threshold, plotted individually
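
A five-number summary and a simple outlier check can be sketched as follows. Quartile conventions differ between textbooks and tools; this example assumes the "inclusive" interpolation method of Python's `statistics.quantiles` and the common 1.5 × IQR threshold, and the sample values are illustrative.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(salaries))
print(iqr_outliers(salaries))
```

With this quartile convention, Q1 = 49.25 and Q3 = 64.75, so the upper threshold is 88 and only the value 110 would be plotted individually as an outlier.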


Visualization of Data Dispersion: 3-D Boxplots



Properties of Normal Distribution Curve

  • The normal (distribution) curve
    • From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
    • From μ–2σ to μ+2σ: contains about 95% of it
    • From μ–3σ to μ+3σ: contains about 99.7% of it
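
These percentages follow from the normal distribution function: the fraction of measurements within k standard deviations of the mean equals erf(k / √2). A short check using only the standard library (the function name `fraction_within` is an assumption):

```python
import math


def fraction_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
```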



Graphic Displays of Basic Statistical Descriptions

 
  • Boxplot: graphic display of five-number summary
  • Histogram: x-axis shows values, y-axis represents frequencies
  • Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
  • Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
  • Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane


Histogram Analysis

  • Histogram: Graph display of tabulated frequencies, shown as bars
  • It shows what proportion of cases fall into each of several categories
  • Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
  • The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent



Histograms Often Tell More than Boxplots

  • The two histograms shown on the left may have the same boxplot representation
    • The same values for: min, Q1, median, Q3, max
  • But they have rather different data distributions



Quantile Plot

  • Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
  • Plots quantile information
    • For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
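
The points of a quantile plot can be generated as below. The slide does not state how the fraction for each value is computed; this sketch assumes the common convention f_i = (i − 0.5) / n for the i-th smallest of n values, and the function name is illustrative.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Each (f, x) pair is one point: f on the horizontal axis, x on the vertical axis.
for f, x in quantile_plot_points([40, 10, 30, 20]):
    print(f"{f:.3f}  {x}")
```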


Quantile-Quantile (Q-Q) Plot

  • Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
  • View: Is there a shift in going from one distribution to another?
  • Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.


Scatter plot

  • Provides a first look at bivariate data to see clusters of points, outliers, etc.
  • Each pair of values is treated as a pair of coordinates and plotted as points in the plane


Positively and Negatively Correlated Data




Uncorrelated Data




Data Visualization

  • Why data visualization?
    • Gain insight into an information space by mapping data onto graphical primitives
    • Provide qualitative overview of large data sets
    • Search for patterns, trends, structure, irregularities, relationships among data
    • Help find interesting regions and suitable parameters for further quantitative analysis
    • Provide a visual proof of computer representations derived
  • Categorization of visualization methods:
    • Pixel-oriented visualization techniques
    • Geometric projection visualization techniques
    • Icon-based visualization techniques
    • Hierarchical visualization techniques
    • Visualizing complex data and relations


Pixel-Oriented Visualization Techniques

  • For a data set of m dimensions, create m windows on the screen, one for each dimension
  • The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows
  • The colors of the pixels reflect the corresponding values



Laying Out Pixels in Circle Segments

  • To save space and show the connections among multiple dimensions, space filling is often done in a circle segment  


Geometric Projection Visualization Techniques

  • Visualization of geometric transformations and projections of the data
  • Methods
    • Direct visualization
    • Scatterplot and scatterplot matrices
    • Landscapes
    • Projection pursuit technique: Help users find meaningful projections of multidimensional data
    • Prosection views
    • Hyperslice
    • Parallel coordinates


Direct Data Visualization

  • Ribbons with Twists Based on Vorticity


Scatterplot Matrices

  • Matrix of scatterplots (x-y diagrams) of the k-dimensional data [a total of (k² − k)/2 distinct scatterplots]



Landscapes

  • Visualization of the data as perspective landscape
  • The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data 


Parallel Coordinates

  • n equidistant axes which are parallel to one of the screen axes and correspond to the attributes
  • The axes are scaled to the [minimum, maximum] range of the corresponding attribute
  • Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
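
The scaling step can be sketched without any plotting library: each attribute is mapped onto its own axis so that the attribute's minimum lands at 0 and its maximum at 1, and each record becomes a list of (axis index, height) vertices. The function name and the choice to place constant attributes at mid-height are assumptions for illustration.

```python
def to_polylines(records):
    """Convert records into polylines with one vertex per attribute axis.

    Each vertex is (axis_index, height), where height is the attribute value
    rescaled to [0, 1] using that attribute's [minimum, maximum] range.
    """
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    polylines = []
    for record in records:
        line = []
        for axis, value in enumerate(record):
            span = highs[axis] - lows[axis]
            # A constant attribute has no range; put it at mid-height (assumed choice).
            height = 0.5 if span == 0 else (value - lows[axis]) / span
            line.append((axis, height))
        polylines.append(line)
    return polylines


records = [(1.0, 10.0, 5.0), (3.0, 20.0, 5.0), (2.0, 30.0, 5.0)]
for line in to_polylines(records):
    print(line)
```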



Parallel Coordinates of a Data Set



Icon-Based Visualization Techniques

  • Visualization of the data values as features of icons
  • Typical visualization methods
    • Chernoff Faces
    • Stick Figures
  • General techniques
    • Shape coding: Use shape to represent certain information encoding
    • Color icons: Use color icons to encode more information
    • Tile bars: Use small icons to represent the relevant feature vectors in document retrieval


Chernoff Faces

  • A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
  • The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
  • REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993
  • Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html 



Stick Figure



Hierarchical Visualization Techniques

  • Visualization of the data using a hierarchical partitioning into subspaces
  • Methods
    • Dimensional Stacking
    • Worlds-within-Worlds
    • Tree-Map
    • Cone Trees
    • InfoCube


Dimensional Stacking


  • Partitioning of the n-dimensional attribute space in 2-D subspaces, which are ‘stacked’ into each other
  • Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.
  • Adequate for data with ordinal attributes of low cardinality
  • But, difficult to display more than nine dimensions
  • Important to map dimensions appropriately


Dimensional Stacking

Used by permission of M. Ward, Worcester Polytechnic Institute

  • Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes


Worlds-within-Worlds

  • Assign the function and two most important parameters to innermost world
  • Fix all other parameters at constant values; draw other worlds (1-, 2-, or 3-dimensional) choosing these as the axes
  • Software that uses this paradigm
    • N–vision: Dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer) 
    • Auto Visual: Static interaction by means of queries


Tree-Map

  • Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
  • The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)



InfoCube

  • A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes
  • The outermost cubes correspond to the top level data, while the subnodes or the lower level data are represented as smaller cubes inside the outermost cubes, and so on


Three-D Cone Trees

   
 
  • 3D cone tree visualization technique works well for up to a thousand nodes or so
  • First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node
  • Cannot avoid overlaps when projected to 2D
  • G. Robertson, J. Mackinlay, S. Card. “Cone Trees: Animated 3D Visualizations of Hierarchical Information”, ACM SIGCHI'91
  • Graph from Nadeau Software Consulting website: Visualize a social network data set that models the way an infection spreads from one person to the next


Visualizing Complex Data and Relations

  • Visualizing non-numerical data: text and social networks
  • Tag cloud: visualizing user-generated tags
    • The importance of a tag is represented by font size/color
  • Besides text data, there are also methods to visualize relationships, such as visualizing social networks

Newsmap: Google News Stories in 2005


Similarity and Dissimilarity

  • Similarity
    • Numerical measure of how alike two data objects are
    • Value is higher when objects are more alike
    • Often falls in the range [0,1]
  • Dissimilarity (e.g., distance)
    • Numerical measure of how different two data objects are
    • Lower when objects are more alike
    • Minimum dissimilarity is often 0
    • Upper limit varies
  • Proximity refers to a similarity or dissimilarity


Data Matrix and Dissimilarity Matrix

   
 
  • Data matrix
    • n data points with p dimensions
    • Two modes
  • Dissimilarity matrix
    • n data points, but registers only the distance
    • A triangular matrix
    • Single mode
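
The relationship between the two structures can be sketched as follows: the data matrix holds n rows of p attribute values, while the dissimilarity matrix holds only pairwise distances and can be stored as a lower triangle. Euclidean distance (via `math.dist`, available in Python 3.8+) is assumed here as the distance measure, and the four points are illustrative.

```python
import math


def dissimilarity_matrix(points):
    """Lower-triangular dissimilarity matrix: row i holds d(i, 0), ..., d(i, i)."""
    return [
        [math.dist(points[i], points[j]) for j in range(i + 1)]
        for i in range(len(points))
    ]


# Data matrix: n = 4 objects, p = 2 attributes.
data_matrix = [(1, 2), (3, 5), (2, 0), (4, 5)]

for row in dissimilarity_matrix(data_matrix):
    print("  ".join(f"{d:.2f}" for d in row))
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.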


Proximity Measure for Nominal Attributes

  • Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
  • Method 1: Simple matching
    • m : # of matches, p : total # of variables

\[ d(i,j)=\frac{p-m}{p} \]

  • Method 2: Use a large number of binary attributes
    • creating a new binary attribute for each of the M nominal states
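
Both methods are short enough to sketch directly. The function names and the example objects below are assumptions for illustration.

```python
def nominal_dissimilarity(a, b):
    """Simple matching: d = (p - m) / p, with m matches among p nominal attributes."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == state else 0 for state in states]


# Two hypothetical objects described by (color, shape, size).
obj_i = ("red", "round", "small")
obj_j = ("red", "square", "small")
print(nominal_dissimilarity(obj_i, obj_j))  # one mismatch out of three attributes

print(one_hot("blue", ["red", "yellow", "blue", "green"]))
```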


Proximity Measure for Binary Attributes

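  • A contingency table for binary data: for objects i and j, q = number of attributes equal to 1 for both, r = number equal to 1 for i and 0 for j, s = number equal to 0 for i and 1 for j, t = number equal to 0 for both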

  • Distance measure for symmetric binary variables: 
\[ d(i,j)=\frac{r+s}{q+r+s+t} \]
  • Distance measure for asymmetric binary variables: 

\[ d(i,j)=\frac{r+s}{q+r+s} \]

  • Jaccard coefficient (similarity measure for asymmetric binary variables):

\[ sim_{Jaccard}(i,j)=\frac{q}{q+r+s} \]

  • Note: Jaccard coefficient is the same as “coherence”:

\[ coherence(i,j)=\frac{sup(i,j)}{sup(i)+sup(j)-sup(i,j)}=\frac{q}{(q+r)+(q+s)-q} \]
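
The measures above can be computed from the four contingency counts. In this sketch q counts attributes that are 1 in both objects, r those that are 1 in i and 0 in j, s those that are 0 in i and 1 in j, and t those that are 0 in both. The two records are hypothetical 0/1 vectors invented for illustration.

```python
def contingency(i, j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_distance(i, j):
    """(r + s) / (q + r + s + t): both outcomes carry equal weight."""
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)


def asymmetric_distance(i, j):
    """(r + s) / (q + r + s): joint absences (t) are ignored."""
    q, r, s, _ = contingency(i, j)
    return (r + s) / (q + r + s)


def jaccard_similarity(i, j):
    """q / (q + r + s), i.e. 1 minus the asymmetric distance."""
    q, r, s, _ = contingency(i, j)
    return q / (q + r + s)


# Hypothetical records over six asymmetric binary attributes.
record_a = [1, 0, 1, 0, 0, 0]
record_b = [1, 0, 1, 0, 1, 0]

print(contingency(record_a, record_b))
print(symmetric_distance(record_a, record_b))
print(asymmetric_distance(record_a, record_b))
print(jaccard_similarity(record_a, record_b))
```

The asymmetric measures assume at least one attribute is 1 in either object; otherwise the denominator q + r + s is zero and the functions raise `ZeroDivisionError`.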
 


Dissimilarity between Binary Variables

  • Example

    • Gender is a symmetric attribute
    • The remaining attributes are asymmetric binary
    • Let the values Y and P be 1, and the value N 0



Standardizing Numeric Data

  • Z-score: 

\[ z=\frac{x-\mu}{\sigma } \]

    • x: raw score to be standardized, μ: mean of the population, σ: standard deviation
    • the distance between the raw score and the population mean in units of the standard deviation
    • negative when the raw score is below the mean, “+” when above
  • An alternative way: Calculate the mean absolute deviation

\[ s_{f}=\frac{1}{n} (|x_{1f}-m_{f}|+|x_{2f}-m_{f}|+...+|x_{nf}-m_{f}|) \]

    • where

\[ m_{f}= \frac{1}{n}(x_{1f}+x_{2f}+...+x_{nf}) \]

    • standardized measure (z-score):

\[ z_{if}=\frac{x_{if}-m_{f}}{s_{f}} \]

  • Using mean absolute deviation is more robust than using standard deviation
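
Both standardizations can be compared on one attribute. This is a minimal sketch with illustrative function names and data; the first function divides by the population standard deviation, the second by the mean absolute deviation.

```python
import math


def z_scores_std(values):
    """Standardize with the mean and (population) standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("all values are identical")
    return [(x - mu) / sigma for x in values]


def z_scores_mad(values):
    """Standardize with the mean m_f and the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    if s_f == 0:
        raise ValueError("all values are identical")
    return [(x - m_f) / s_f for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative attribute values, mean 5

print(z_scores_std(data))
print(z_scores_mad(data))
```

Because deviations are not squared in the mean absolute deviation, extreme values inflate the scale less, so their standardized scores stay larger in magnitude and outliers remain easier to detect.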


Example: Data Matrix and Dissimilarity Matrix




Distance on Numeric Data: Minkowski Distance

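  • Minkowski distance: A popular distance measure

\[ d(i,j)=\sqrt[h]{|x_{i1}-x_{j1}|^{h}+|x_{i2}-x_{j2}|^{h}+...+|x_{ip}-x_{jp}|^{h}} \]

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L_h norm)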

  • Properties
    • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
    • d(i, j) = d(j, i) (Symmetry)
    • d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
  • A distance that satisfies these properties is a metric


Special Cases of Minkowski Distance

  • h = 1: Manhattan (city block, L1 norm) distance
    • E.g., the Hamming distance: the number of bits that are different between two binary vectors
\[ d(i,j)=|x_{i1}-x_{j1}|+|x_{i2}-x_{j2}|+...+|x_{ip}-x_{jp}| \]

  • h = 2: (L2 norm) Euclidean distance

\[ d(i,j)=\sqrt{(|x_{i1}-x_{j1}|^2+|x_{i2}-x_{j2}|^2+...+|x_{ip}-x_{jp}|^2)} \]

  • h → ∞: “supremum” (L_max norm, L_∞ norm) distance
    • This is the maximum difference between any component (attribute) of the vectors
\[ d(i,j)=\lim_{h\rightarrow \infty }\left(\sum_{f=1}^{p}|x_{if}-x_{jf}|^{h}\right)^\frac{1}{h} =\max_{f=1}^{p}|x_{if}-x_{jf}| \]
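
The three special cases can be produced by one function. This is a sketch with an assumed function name and two illustrative points; `math.inf` is used to request the supremum distance.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric vectors.

    h = 1 gives Manhattan, h = 2 gives Euclidean, h = math.inf gives supremum distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == math.inf:
        return max(diffs)
    if h < 1:
        raise ValueError("h must be >= 1 for the result to be a metric")
    return sum(d ** h for d in diffs) ** (1 / h)


x1, x2 = (1, 2), (3, 5)  # illustrative 2-dimensional objects

print(minkowski(x1, x2, 1))         # Manhattan: |1-3| + |2-5|
print(minkowski(x1, x2, 2))         # Euclidean: sqrt(2^2 + 3^2)
print(minkowski(x1, x2, math.inf))  # supremum: max(2, 3)
print(minkowski(x1, x2, 50))        # a large h already comes close to the supremum
```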


Example: Minkowski Distance