Agenda
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
Types of Data Sets
 Record
 Relational records
 Data matrix, e.g., numerical matrix, crosstabs
 Document data: text documents: term-frequency vector
 Transaction data
 Graph and network
 World Wide Web
 Social or information networks
 Molecular Structures
 Ordered
 Video data: sequence of images
 Temporal data: time series
 Sequential Data: transaction sequences
 Genetic sequence data
 Spatial, image and multimedia:
 Spatial data: maps
 Image data:
 Video data:
Important Characteristics of Structured Data
 Dimensionality
 Curse of dimensionality
 Sparsity
 Only presence counts
 Resolution
 Patterns depend on the scale
 Distribution
 Centrality and dispersion
Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows → data objects; columns → attributes.
Attributes
 Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object.
 E.g., customer_ID, name, address
 Types:
 Nominal
 Binary
 Ordinal
 Numeric: quantitative
 Interval-scaled
 Ratio-scaled
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between successive values is not known.
 Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in °C or °F, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of a value as being a multiple of another value (10 K is twice as high as 5 K).
 e.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a collection of documents
 Sometimes, represented as integer variables
 Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight
 Practically, real values can only be measured and represented using a finite number of digits
 Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency, variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
\[ {\bar{x}}=\frac{1}{n} \sum_{i=1}^{n}x_{i} \]
\[ \mu = \frac{\sum x}{N} \]
 Note: n is sample size and N is population size.
 Weighted arithmetic mean:
\[ {\bar{x}}=\frac{\sum_{i=1}^{n}w_{i}x_{i} }{\sum_{i=1}^{n}w_{i}} \]
 Trimmed mean: the mean computed after chopping off extreme values
 Median:
 Middle value if odd number of values, or average of the middle two values otherwise
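 Estimated by interpolation (for grouped data):
\[ median = L_{1} + \left( \frac{\frac{n}{2} - (\sum freq)_{l}}{freq_{median}} \right) \times width \]
 where L_1 is the lower boundary of the median interval, n is the number of values, (Σ freq)_l is the sum of the frequencies of all intervals below the median interval, freq_median is the frequency of the median interval, and width is the interval width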
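The measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`) and the sample values are assumptions made for the example.

```python
from statistics import mean, median

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the sorted values off each end.

    Assumes fraction < 0.5 so that at least one value remains.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)      # values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def grouped_median(lower, n, freq_below, freq_median, width):
    """Interpolated median for grouped data.

    lower       : lower boundary of the median interval (L1)
    n           : total number of values
    freq_below  : summed frequency of all intervals below the median interval
    freq_median : frequency of the median interval
    width       : width of the median interval
    """
    return lower + ((n / 2 - freq_below) / freq_median) * width

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # made-up sample
print("mean:        ", mean(values))
print("median:      ", median(values))
print("trimmed mean:", trimmed_mean(values, 0.1))
print("grouped med.:", grouped_median(20, 100, 40, 20, 10))
```

Note how the single large value (110) pulls the mean above the median, while the trimmed mean sits between the two.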
Measuring the Central Tendency (cont'd)
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Empirical formula (for unimodal, moderately skewed data):
\[ mean - mode \approx 3 \times (mean - median) \]
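A short sketch of mode finding and of the empirical relation, rearranged as mode ≈ 3 × median − 2 × mean. The function name `empirical_mode` and the data are assumptions for illustration; the estimate is only rough and is meant for unimodal, moderately skewed data.

```python
from statistics import mean, median, multimode

def empirical_mode(values):
    """Estimate the mode from mean - mode ≈ 3 * (mean - median)."""
    return 3 * median(values) - 2 * mean(values)

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]   # made-up, mildly right-skewed sample

modes = multimode(data)                  # all values with the highest count
kind = {1: "unimodal", 2: "bimodal", 3: "trimodal"}.get(len(modes), "multimodal")

print("modes:", modes, "->", kind)
print("empirical estimate of the mode:", empirical_mode(data))
```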
Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively and negatively skewed data
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Interquartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
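 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
\[ s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} \]
\[ \sigma^{2}=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu)^{2} \]
 Standard deviation s (or σ) is the square root of variance s² (or σ²)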
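The dispersion measures above can be computed with Python's standard `statistics` module. This is a sketch under stated assumptions: quartiles are taken with the "inclusive" method (other conventions give slightly different quartiles), outliers are values more than 1.5 × IQR beyond the quartiles, and the helper names and data are made up for the example.

```python
from statistics import quantiles, variance, pvariance, stdev, pstdev

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 40]      # made-up sample with one extreme value

print("five-number summary:", five_number_summary(data))
print("outliers:           ", iqr_outliers(data))
print("sample variance s^2:", variance(data), " s:", stdev(data))
print("population sigma^2: ", pvariance(data), " sigma:", pstdev(data))
```

`variance` and `stdev` divide by n − 1 (sample), while `pvariance` and `pstdev` divide by N (population).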
Boxplot Analysis
 Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended to Minimum and Maximum
 Outliers: points beyond a specified outlier threshold, plotted individually
Visualization of Data Dispersion: 3D Boxplots
Properties of Normal Distribution Curve
 The normal (distribution) curve
 From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
 From μ–2σ to μ+2σ: contains about 95% of it
 From μ–3σ to μ+3σ: contains about 99.7% of it
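The three percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2); the function name `normal_coverage` below is just a label chosen for this sketch.

```python
import math

def normal_coverage(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.4%}")
```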
Graphic Displays of Basic Statistical Descriptions
 Boxplot: graphic display of five-number summary
 Histogram: x-axis shows values, y-axis represents frequencies
 Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Histogram Analysis
 Histogram: Graph display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several categories
 Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
 The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
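As a rough illustration of tabulating frequencies over adjacent, non-overlapping intervals, the sketch below counts values in equal-width bins. Equal-width binning is an assumption made here for simplicity (with equal widths, bar height and bar area are proportional); the function name and data are hypothetical.

```python
def equal_width_histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:                          # all values identical: one occupied bin
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in values:
        # The maximum value would index one past the end, so clamp it
        # into the last interval, which is closed on the right.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 5, 9, 10]             # made-up sample
for i, count in enumerate(equal_width_histogram(data, 3)):
    print(f"bin {i}: {'#' * count}")
```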
Histograms Often Tell More than Boxplots
Quantile Plot
 Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
 Plots quantile information
 For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
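A minimal sketch of the (f_i, x_i) pairs behind a quantile plot. The slide does not say how f_i is computed; the code assumes the common choice f_i = (i − 0.5) / n for the i-th smallest value, and the function name is hypothetical.

```python
def quantile_pairs(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n  (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

unit_prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]   # made-up data
for f, x in quantile_pairs(unit_prices):
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f gives the quantile plot; the middle pair (f = 0.5) corresponds to the median.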
Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 View: Is there a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.
Scatter plot
 Provides a first look at bivariate data to see clusters of points, outliers, etc.
 Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
Data Visualization
 Why data visualization?
 Gain insight into an information space by mapping data onto graphical primitives
 Provide qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities, relationships among data
 Help find interesting regions and suitable parameters for further quantitative analysis
 Provide a visual proof of computer representations derived
 Categorization of visualization methods:
 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations
Pixel-Oriented Visualization Techniques
 For a data set of m dimensions, create m windows on the screen, one for each dimension
 The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows
 The colors of the pixels reflect the corresponding values
Laying Out Pixels in Circle Segments
 To save space and show the connections among multiple dimensions, space filling is often done in a circle segment
Geometric Projection Visualization Techniques
 Visualization of geometric transformations and projections of the data
 Methods
 Direct visualization
 Scatterplot and scatterplot matrices
 Landscapes
 Projection pursuit technique: helps users find meaningful projections of multidimensional data
 Prosection views
 Hyperslice
 Parallel coordinates
Direct Data Visualization
 Ribbons with Twists Based on Vorticity
Scatterplot Matrices
 Matrix of scatterplots (x-y diagrams) of the k-dimensional data [a total of (k² − k)/2 distinct scatterplots]
Landscapes
 Visualization of the data as perspective landscape
 The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data
Parallel Coordinates
 n equidistant axes which are parallel to one of the screen axes and correspond to the attributes
 The axes are scaled to the [minimum, maximum] range of the corresponding attribute
 Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
Parallel Coordinates of a Data Set
Icon-Based Visualization Techniques
 Visualization of the data values as features of icons
 Typical visualization methods
 Chernoff Faces
 Stick Figures
 General techniques
 Shape coding: Use shape to represent certain information encoding
 Color icons: Use color icons to encode more information
 Tile bars: Use small icons to represent the relevant feature vectors in document retrieval
Chernoff Faces
 A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
 The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
 REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993
 Weisstein, Eric W. "Chernoff Face." From MathWorld, A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html
Stick Figure
Hierarchical Visualization Techniques
 Visualization of the data using a hierarchical partitioning into subspaces
 Methods
 Dimensional Stacking
 Worlds-within-Worlds
 TreeMap
 Cone Trees
 InfoCube
Dimensional Stacking
 Partitioning of the n-dimensional attribute space in 2-D subspaces, which are ‘stacked’ into each other
 Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.
 Adequate for data with ordinal attributes of low cardinality
 But, difficult to display more than nine dimensions
 Important to map dimensions appropriately
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
 Visualization of oil mining data with longitude and latitude mapped to the outer x- and y-axes and ore grade and depth mapped to the inner x- and y-axes
Worlds-within-Worlds
 Assign the function and two most important parameters to innermost world
 Fix all other parameters at constant values, then draw other (1-, 2-, or 3-dimensional) worlds choosing these as the axes
 Software that uses this paradigm
 N–vision: Dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)
 Auto Visual: Static interaction by means of queries
TreeMap
 Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
 The x- and y-dimensions of the screen are partitioned alternately according to the attribute values (classes)
InfoCube
 A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes
 The outermost cubes correspond to the top level data, while the subnodes or the lower level data are represented as smaller cubes inside the outermost cubes, and so on
Three-D Cone Trees
Visualizing Complex Data and Relations
 Visualizing non-numerical data: text and social networks
 Tag cloud: visualizing user-generated tags
 The importance of a tag is represented by font size/color
 Besides text data, there are also methods to visualize relationships, such as visualizing social networks
Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
 Dissimilarity (e.g., distance)
 Numerical measure of how different two data objects are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
 Proximity refers to a similarity or dissimilarity
Data Matrix and Dissimilarity Matrix
Proximity Measure for Nominal Attributes
 Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
 Method 1: Simple matching
 m : # of matches, p : total # of variables
\[ d(i,j)=\frac{p-m}{p} \]
 Method 2: Use a large number of binary attributes
 creating a new binary attribute for each of the M nominal states
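Both methods can be sketched briefly. The first function computes the simple-matching dissimilarity (p − m) / p; the second shows the binary encoding of Method 2, one 0/1 attribute per nominal state. Function names and example objects are assumptions made for illustration.

```python
def simple_matching_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two objects described by p nominal attributes."""
    p = len(a)                                   # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matching attributes
    return (p - m) / p

def one_hot(value, states):
    """Encode one nominal value as len(states) binary attributes."""
    return [1 if value == s else 0 for s in states]

obj_i = ("red", "single", "clerk")     # made-up objects with 3 nominal attributes
obj_j = ("red", "married", "clerk")

print(simple_matching_dissimilarity(obj_i, obj_j))
print(one_hot("blue", ["red", "yellow", "blue", "green"]))
```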
Proximity Measure for Binary Attributes
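 A contingency table for binary data: for objects i and j, q = number of attributes equal to 1 for both, r = number equal to 1 for i and 0 for j, s = number equal to 0 for i and 1 for j, t = number equal to 0 for both
 Distance measure for symmetric binary variables:
\[ d(i,j)=\frac{r+s}{q+r+s+t} \]
 Distance measure for asymmetric binary variables:
\[ d(i,j)=\frac{r+s}{q+r+s} \]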
 Jaccard coefficient (similarity measure for asymmetric binary variables):
\[ sim_{Jaccard}(i,j)=\frac{q}{q+r+s} \]
 Note: the Jaccard coefficient is the same as “coherence”
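A sketch of the binary proximity measures, assuming the usual contingency counts for two 0/1 vectors i and j: q (both 1), r (i = 1, j = 0), s (i = 0, j = 1), t (both 0). The symmetric distance is taken as (r + s) / (q + r + s + t). The function names and test vectors are made up, and the asymmetric measures are undefined when q + r + s = 0, which this sketch does not guard against.

```python
def contingency(i, j):
    """Return the counts (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_binary_distance(i, j):
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_binary_distance(i, j):
    q, r, s, _ = contingency(i, j)      # negative matches (t) are ignored
    return (r + s) / (q + r + s)

def jaccard_similarity(i, j):
    q, r, s, _ = contingency(i, j)
    return q / (q + r + s)

x = [1, 0, 1, 0, 0, 0]   # made-up test results, 1 = positive
y = [1, 0, 1, 0, 1, 0]

print(contingency(x, y))
print(symmetric_binary_distance(x, y))
print(asymmetric_binary_distance(x, y))
print(jaccard_similarity(x, y))
```

For asymmetric attributes, the Jaccard similarity and the asymmetric distance always sum to 1.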
Dissimilarity between Binary Variables
 Example
 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N be 0
Standardizing Numeric Data
 Z-score:
\[ z=\frac{x-\mu}{\sigma } \]
 x: raw score to be standardized, μ: mean of the population, σ: standard deviation
 the distance between the raw score and the population mean in units of the standard deviation
 negative when the raw score is below the mean, positive when above
 An alternative way: Calculate the mean absolute deviation
\[ s_{f}= \frac{1}{n}(|x_{1f}-m_{f}|+|x_{2f}-m_{f}|+...+|x_{nf}-m_{f}|) \]
 where
\[ m_{f}= \frac{1}{n}(x_{1f}+x_{2f}+...+x_{nf}) \]
 standardized measure (z-score):
\[ z_{if}=\frac{x_{if}-m_{f}}{s_{f}} \]
 Using mean absolute deviation is more robust to outliers than using standard deviation
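The two standardizations can be compared side by side. This sketch assumes the mean absolute deviation s_f is the average of |x_if − m_f| over all n values of attribute f; function names and data are illustrative only, and both functions assume the values are not all identical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Classic z-score: (x - mu) / sigma, using the population std. deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

def z_scores_mad(values):
    """Standardize with the mean absolute deviation s_f instead of sigma."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    return [(x - m_f) / s_f for x in values]

data = [1, 2, 3, 4, 5]                    # made-up attribute values
print(z_scores(data))
print(z_scores_mad(data))
```

Because deviations are not squared when computing s_f, an extreme value inflates s_f less than it inflates σ, so its standardized score stays more clearly separated from the rest.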
Example: Data Matrix and Dissimilarity Matrix
Distance on Numeric Data: Minkowski Distance
 Minkowski distance: A popular distance measure
\[ d(i,j)=\sqrt[h]{|x_{i1}-x_{j1}|^{h}+|x_{i2}-x_{j2}|^{h}+...+|x_{ip}-x_{jp}|^{h}} \]
 where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L_h norm)
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
 h = 1: Manhattan (city block, L1 norm) distance
 E.g., the Hamming distance: the number of bits that are different between two binary vectors
\[ d(i,j)=|x_{i1}-x_{j1}|+|x_{i2}-x_{j2}|+...+|x_{ip}-x_{jp}| \]
 h = 2: (L2 norm) Euclidean distance
\[ d(i,j)=\sqrt{(x_{i1}-x_{j1})^{2}+(x_{i2}-x_{j2})^{2}+...+(x_{ip}-x_{jp})^{2}} \]
 h → ∞: “supremum” (Lmax norm, L∞ norm) distance
 This is the maximum difference between any component (attribute) of the vectors
\[ d(i,j)=\max_{f}|x_{if}-x_{jf}| \]
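A compact sketch of the Minkowski family, taking the order-h distance as the h-th root of the summed h-th powers of the absolute coordinate differences. The function names and the two sample points are assumptions for illustration.

```python
def minkowski(i, j, h):
    """Minkowski distance of order h between two equal-length vectors."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

def supremum(i, j):
    """Limit of the Minkowski distance as h -> infinity."""
    return max(abs(a - b) for a, b in zip(i, j))

x1, x2 = (1, 2), (3, 5)                       # made-up 2-D points

print("Manhattan (h=1):", minkowski(x1, x2, 1))
print("Euclidean (h=2):", minkowski(x1, x2, 2))
print("supremum       :", supremum(x1, x2))

# On binary vectors the Manhattan distance counts differing bits (Hamming).
print("Hamming        :", minkowski((1, 0, 1, 1), (0, 0, 1, 0), 1))
```

As h grows, `minkowski(x1, x2, h)` approaches `supremum(x1, x2)` from above.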
Example: Minkowski Distance
Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
 replace x_if by their rank
\[ r_{if} \in \left \{ 1,...,M_{f} \right \} \]
 map the range of each variable onto [0, 1] by replacing the ith object in the fth variable by
\[ z_{if}=\frac{r_{if}-1}{M_{f}-1} \]
 compute the dissimilarity using methods for interval-scaled variables
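The rank-and-rescale steps can be sketched as follows, assuming the usual mapping z = (r − 1) / (M − 1) for a value with rank r among M ordered states. The function name and the example states are hypothetical, and the sketch requires at least two states.

```python
def ordinal_to_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] via their rank among `ordered_states`."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m = len(ordered_states)                     # number of states, M_f
    return [(rank[v] - 1) / (m - 1) for v in values]

sizes = ["small", "large", "medium", "large"]   # made-up attribute column
z = ordinal_to_interval(sizes, ["small", "medium", "large"])
print(z)

# The rescaled values can then go into any numeric distance, e.g. |z_i - z_j|.
print(abs(z[0] - z[2]))
```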
Attributes of Mixed Type
 A database may contain all attribute types
 Nominal, symmetric binary, asymmetric binary, numeric, ordinal
 One may use a weighted formula to combine their effects
\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta _{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta _{ij}^{(f)}} \]
 f is binary or nominal:
 d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
 f is numeric: use the normalized distance
 f is ordinal
 Compute ranks r_if and
\[ z_{if}=\frac{r_{if}-1}{M_{f}-1} \]
 Treat z_if as interval-scaled
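A minimal sketch of the weighted combination for mixed attribute types. Several details are assumptions not stated on the slide: the indicator δ is set to 0 when either value is missing (`None`) or when an asymmetric binary attribute is 0 for both objects; the normalized numeric distance is |x_if − x_jf| divided by the attribute's range; and ordinal attributes are expected to be already rescaled to [0, 1], so they are passed as numeric with range 1. All names are hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Combine per-attribute dissimilarities d^(f), weighted by indicators delta^(f).

    kinds[f]  : "nominal", "binary", "asym_binary" or "numeric"
    ranges[f] : max - min of attribute f (used only for numeric attributes)
    """
    numerator = denominator = 0.0
    for f, kind in enumerate(kinds):
        x, y = a[f], b[f]
        if x is None or y is None:
            continue                        # delta = 0: missing value
        if kind == "asym_binary" and x == 0 and y == 0:
            continue                        # delta = 0: negative match ignored
        if kind == "numeric":
            d = abs(x - y) / ranges[f]      # normalized distance in [0, 1]
        else:
            d = 0.0 if x == y else 1.0      # nominal or binary mismatch
        numerator += d
        denominator += 1.0
    return numerator / denominator if denominator else 0.0

kinds = ("nominal", "asym_binary", "numeric")
ranges = (None, None, 40.0)
obj_i = ("red", 1, 30.0)                    # made-up objects
obj_j = ("blue", 1, 50.0)

print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))
```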
Cosine Similarity
 A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
 Other vector objects: gene features in microarrays, …
 Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
\[ \cos(d_{1},d_{2})=\frac{d_{1}\cdot d_{2}}{\|d_{1}\|\,\|d_{2}\|} \]
where ⋅ indicates vector dot product, ||d||: the length of vector d
Example: Cosine Similarity
 cos(d1, d2) = (d1 ⋅ d2) / (||d1|| ||d2||),
where ⋅ indicates vector dot product, ||d||: the length of vector d
 Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 ⋅ d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
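The worked example can be reproduced in a few lines. The vectors are the two term-frequency vectors from the slide; the function name is a label chosen for this sketch, and it assumes neither vector is all zeros.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))   # 0.94
```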
Summary
 Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
 Many types of data sets, e.g., numerical, text, graph, Web, image.
 Gain insight into the data by:
 Basic statistical data description: central tendency, dispersion, graphical displays
 Data visualization: map data onto graphical primitives
 Measure data similarity
 Above steps are the beginning of data preprocessing.
 Many methods have been developed but still an active area of research.