115411">
Agenda

Cluster Analysis: Basic Concepts

Partitioning Methods

Hierarchical Methods

DensityBased Methods

GridBased Methods

Evaluation of Clustering

Summary
What is Cluster Analysis?

Cluster: A collection of data objects

similar (or related) to one another within the same group

dissimilar (or unrelated) to the objects in other groups

Cluster analysis (or
clustering
,
data segmentation, …
)

Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

Unsupervised learning: no predefined classes (i.e.,
learning by observations
vs. learning by examples: supervised)

Typical applications

As a standalone tool to get insight into data distribution

As a preprocessing step for other algorithms
Applications of Cluster Analysis

Data reduction

Summarization: Preprocessing for regression, PCA, classification, and association analysis

Compression: Image processing: vector quantization

Hypothesis generation and testing

Prediction based on groups

Cluster & find characteristics/patterns for each group

Finding Knearest Neighbors

Localizing search to one or a small number of clusters

Outlier detection: Outliers are often viewed as those “far away” from any cluster
Clustering: Application Examples

Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species

Information retrieval: document clustering

Land use: Identification of areas of similar land use in an earth observation database

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Cityplanning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earth quake epicenters should be clustered along continent faults

Climate: understanding earth climate, find patterns of atmospheric and ocean

Economic Science: market resarch
Basic Steps to Develop a Clustering Task

Feature selection

Select info concerning the task of interest

Minimal information redundancy

Proximity measure

Similarity of two feature vectors

Clustering criterion

Expressed via a cost function or some rules

Clustering algorithms

Validation of the results

Validation test (also,
clustering tendency
test)

Interpretation of the results

Integration with applications
Quality: What Is Good Clustering?

A good clustering method will produce high quality clusters

high intraclass similarity: cohesive within clusters

low interclass similarity: distinctive between clusters

The quality of a clustering method depends on

the similarity measure used by the method

its implementation, and

Its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function, typically metric: d(i, j)
 The definitions of distance functions are usually rather different for intervalscaled, boolean, categorical, ordinal ratio, and vector variables
 Weights should be associated with different variables based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that measures the “goodness” of a cluster.
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective
Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multilevel hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs. nonexclusive (e.g., one document may belong to more than one class)
 Similarity measure
 Distancebased (e.g., Euclidian, road network, vector) vs. connectivitybased (e.g., density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in highdimensional clustering)
Requirements and Challenges
 Scalability
 Clustering all the data instead of only on samples
 Ability to deal with different types of attributes
 Numerical, binary, categorical, ordinal, linked, and mixture of these
 Constraintbased clustering
 User may give inputs on constraints
 Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
 Discovery of clusters with arbitrary shape
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 High dimensionality
Major Clustering Approaches
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
 Typical methods: kmeans, kmedoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: Diana, Agnes, BIRCH, CAMELEON
 Densitybased approach:
 Based on connectivity and density functions
 Typical methods: DBSACN, OPTICS, DenClue
 Gridbased approach:
 based on a multiplelevel granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (cont')
 Modelbased:
 A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other
 Typical methods: EM, SOM, COBWEB
 Frequent patternbased:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 Userguided or constraintbased:
 Clustering by considering userspecified or applicationspecific constraints
 Typical methods: COD (obstacles), constrained clustering
 Linkbased clustering:
 Objects are often linked together in various ways
 Massive links can be used to cluster objects: SimRank, LinkClus
Partitioning Algorithms: Basic Concept
 Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci)
\[E=\sum_{i=1}^{k}\sum_{p\epsilon C_{i}}(d(p,c_{i}))^{2}\]
 Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: kmeans and kmedoids algorithms
 kmeans (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster
 kmedoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
The KMeans Clustering Method
 Given k, the kmeans algorithm is implemented in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest seed point
 Go back to Step 2, stop when the assignment does not change
An Example of KMeans Clustering
Comments on the KMeans Method
 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
\[PAM: O(k(nk)^{2}), CLARA: O(ks^{2} + k(nk))\]
 Comment: Often terminates at a local optimal
 Weakness
 Applicable only to objects in a continuous ndimensional space
 Using the kmodes method for categorical data
 In comparison, kmedoids can be applied to a wide range of data
 Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k (see Hastie et al., 2009)
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with nonconvex shapes
Variations of the KMeans Method
 Most of the variants of the kmeans which differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: kmodes
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequencybased method to update modes of clusters
 A mixture of categorical and numerical data: kprototype method
What Is the Problem of the KMeans Method?
 The kmeans algorithm is sensitive to outliers !
 Since an object with an extremely large value may substantially distort the distribution of the data
 KMedoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster
The KMedoid Clustering Method
 KMedoids Clustering: Find representative objects (medoids) in clusters
 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
 Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the nonmedoids if it improves the total distance of the resulting clustering
 PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
 Efficiency improvement on PAM
 CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
 CLARANS (Ng & Han, 1994): Randomized resampling
PAM (Partitioning Around Medoids) (1987)
 PAM (Kaufman and Rousseeuw, 1987), built in Splus
 Use real object to represent the cluster
 Select k representative objects arbitrarily
 For each pair of nonselected object h and selected object i, calculate the total swapping cost TCih
 For each pair of i and h,
 If TCih < 0, i is replaced by h
 Then assign each nonselected object to the most similar representative object
 repeat steps 23 until there is no change
PAM: A Typical KMedoids Algorithm
PAM Clustering: Finding the Best Cluster Center
 Case 1: p currently belongs to oj. If oj is replaced by orandom as a representative object and p is the closest to one of the other representative object oi, then p is reassigned to oi
What Is the Problem with PAM?
 Pam is more robust than kmeans in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
 Pam works efficiently for small data sets but does not scale well for large data sets.
 O(k(nk)^2 ) for each iteration
where n is # of data,k is # of clusters
 Samplingbased method
CLARA(Clustering LARge Applications)
CLARA (Clustering Large Applications) (1990)
 CLARA (Kaufmann and Rousseeuw in 1990)
 Built in statistical analysis packages, such as SPlus
 It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
 Strength: deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
CLARANS (“Randomized” CLARA) (1994)
 CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)
 Draws sample of neighbors dynamically
 The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
 If the local optimum is found, it starts with new randomly selected node in search for a new local optimum
 Advantages: More efficient and scalable than both PAM and CLARA
 Further improvement: Focusing techniques and spatial access structures (Ester et al.’95)
ROCK: Clustering Categorical Data
 ROCK: RObust Clustering using linKs
 S. Guha, R. Rastogi & K. Shim, ICDE’99
 Major ideas
 Use links to measure similarity/proximity
 Not distancebased
 Algorithm: samplingbased clustering
 Draw random sample
 Cluster with links
 Label data in disk
 Experiments
 Congressional voting, mushroom data
Similarity Measure in ROCK
 Traditional measures for categorical data may not work well, e.g., Jaccard coefficient
 Example: Two groups (clusters) of transactions
 C1. : {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
 C2. : {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Jaccard coefficient may lead to wrong clustering result
 C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d})
 C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
 Jaccard coefficientbased similarity function:
\[Sim(T_{1},T_{2})=\frac{T_{1} \bigcap T_{2}}{T_{1} \bigcup T_{2}}\]
 Ex. Let T1 = {a, b, c}, T2 = {c, d, e}
\[Sim(T_{1},T_{2})=\frac{\left \{ c \right \}}{a,b,c,d,e}=\frac{1}{5}=0.2\]
Link Measure in ROCK
 Clusters
 C1:: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
 C2: : {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Neighbors
 Two transactions are neighbors if sim(T1,T2) > threshold
 Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
 T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}, {a,b,f}, {a,b,g}
 T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}
 T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}
 Link Similarity
 Link similarity between two transactions is the # of common neighbors
 link(T1, T2) = 4, since they have 4 common neighbors
 {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
 link(T1, T3) = 3, since they have 3 common neighbors
 {a, b, d}, {a, b, e}, {a, b, g}
Rock Algorithm
 Method
 Compute similarity matrix
 Run agglomerative hierarchical clustering
 When the data set is big
 Get sample of transactions
 Cluster sample
 Problems:
 Guarantee cluster interconnectivity
 any two transactions in a cluster are very well connected
 Ignores information about closeness of two clusters
 two separate clusters may still be quite connected
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the singlelink method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a nondescending fashion
 Eventually all nodes belong to the same cluster
Dendrogram: Shows How Clusters are Merged
 Decompose data objects into a several levels of nested partitioning (tree of clusters), called a dendrogram
 A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
Distance between Clusters
 Single link: smallest distance between an element in one cluster and an element in the other, i.e.,
\[dist(K_{i}, K_{j}) = min(t_{ip}, t_{jq})\]  
 Complete link: largest distance between an element in one cluster and an element in the other, i.e.,
\[dist(K_{i}, K_{j}) = max(t_{ip}, t_{jq})\]
 Average: avg distance between an element in one cluster and an element in the other, i.e.,
\[dist(K_{i}, K_{j}) = avg(t_{ip}, t_{jq})\]
 Centroid: distance between the centroids of two clusters, i.e.,
\[dist(K_{i}, K_{j}) = dist(C_{i}, C_{j})\]
 Medoid: distance between the medoids of two clusters, i.e.,
\[dist(K_{i}, K_{j}) = dist(M_{i}, M_{j})\]
 Medoid: a chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster
\[C_{m}=\frac{\sum_{i=1}^{N}(t_{ip})}{N}\]  Radius: square root of average distance from any point of the cluster to its centroid
\[R_{m}=\sqrt{\frac{\sum_{i=1}^{N}(t_{ip}c_{m})^{2}}{N}}\]  Diameter: square root of average mean squared distance between all pairs of points in the cluster
\[D_{m}=\sqrt{\frac{\sum_{i=1}^{N} \sum_{i=1}^{N}(t_{ip}t_{iq})^{2}}{N(N1)}}\]
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering methods
 Can never undo what was done previously
 Do not scale well: time complexity of at least O(n^2), where n is the number of total objects
 Integration of hierarchical & distancebased clustering
 BIRCH (1996): uses CFtree and incrementally adjusts the quality of subclusters
 CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial inmemory CF tree (a multilevel compression of the data that tries to preserve the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CFtree
 Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the order of the data record
Clustering Feature Vector in BIRCH
CFTree in BIRCH
 Clustering feature:
 Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
 Registers crucial measurements for computing cluster and utilizes storage efficiently
 A CF tree is a heightbalanced tree that stores the clustering features for a hierarchical clustering
 A nonleaf node in a tree has descendants or “children”
 The nonleaf nodes store sums of the CFs of their children
 A CF tree has two parameters
 Branching factor: max # of children
 Threshold: max diameter of subclusters stored at the leaf nodes
The CF Tree Structure
The Birch Algorithm
\[\sqrt{\frac{1}{n(n1)}\sum(x_{i}x_{j})^{2}}\]
 For each point in the input
 Find closest leaf entry
 Add point to leaf entry and update CF
 If entry diameter > max_diameter, then split leaf, and possibly parents
 Algorithm is O(n)
 Concerns
 Sensitive to insertion order of data points
 Since we fix the size of leaf nodes, so clusters may not be so natural
 Clusters tend to be spherical given the radius and diameter measures
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
 CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters
 Graphbased, and a twophase algorithm
 Use a graphpartitioning algorithm: cluster objects into a large number of relatively small subclusters
 Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these subclusters
KNN Graphs & Interconnectivity
 knearest graphs from an original data in 2D:
 EC{Ci ,Cj } :The absolute interconnectivity between Ci and Cj: the sum of the weight of the edges that connect vertices in Ci to vertices in Cj
 Internal interconnectivity of a cluster Ci : the size of its mincut bisector ECCi (i.e., the weighted sum of edges that partition the graph into two roughly equal parts)
 Relative Interconnectivity (RI):
\[RI(C_{i},C_{j})=\frac{EC_{C_{i},C_{j}}}{\frac{EC_{C_{i}}+EC_{C_{j}}}{2}}\]
Relative Closeness & Merge of SubClusters
 Relative closeness between a pair of clusters Ci and Cj : the absolute closeness between Ci and Cj normalized w.r.t. the internal closeness of the two clusters Ci and Cj
\[RC(C_{i},C_{j})=\frac{\bar{S}_{EC_{[C_{i},C_{j}}]}}{\frac{C_{i}}{C_{i}+C_{j}}\bar{S}_{EC_{C_{i}}}+\frac{C_{j}}{C_{i}+C_{j}}\bar{S}_{EC_{C_{j}}}}\]
\[\bar{S}_{EC_{C_{i}}} and \bar{S}_{EC_{C_{i}}}\]
are the average weights of the edges that belong in the mincut bisector of clusters Ci and Cj , respectively, and
\[\bar{S}_{EC_{C_{i},C_{j}}}\]
is the average weight of the edges that connect vertices in Ci to vertices in Cj  Merge SubClusters:
 Merges only those pairs of clusters whose RI and RC are both above some userspecified thresholds
 Merge those maximizing the function that combines RI and RC
Overall Framework of CHAMELEON
CHAMELEON (Clustering Complex Objects)
Probabilistic Hierarchical Clustering
 Algorithmic hierarchical clustering
 Nontrivial to choose a good distance measure
 Hard to handle missing attribute values
 Optimization goal not clear: heuristic, local search
 Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between clusters
 Generative model: Regard the set of data objects to be clustered as a sample of the underlying data generation mechanism to be analyzed
 Easy to understand, same efficiency as algorithmic agglomerative clustering method, can handle partially observed data
 In practice, assume the generative models adopt common distributions functions, e.g., Gaussian distribution or Bernoulli distribution, governed by parameters
Generative Model
 Given a set of 1D points X = {x1, …, xn} for clustering analysis & assuming they are generated by a Gaussian distribution:
\[N(\mu,\sigma^{2})=\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{\frac{(x\mu)^{2}}{2\sigma^{2}}}\]
 The probability that a point xi ∈ X is generated by the model
\[P(x_{i}\mu,\sigma^{2})=\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{\frac{(x_{i}\mu)^{2}}{2\sigma^{2}}}\]
 The likelihood that X is generated by the model:
\[L(N(\mu,\sigma^{2}):X)=P(X\mu,\sigma^{2})=\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{\frac{(x_{i}\mu)^{2}}{2\sigma^{2}}}\]
 The task of learning the generative model: find the parameters μ and σ2 such that
\[N(\mu_{o},\sigma_{o}^{2})=argmax{L(N(\mu,\sigma^{2}):X)}\]
A Probabilistic Hierarchical Clustering Algorithm
 For a set of objects partitioned into m clusters C1, . . . ,Cm, the quality can be measured by,
\[Q(\left \{ C_{1},…,C_{m} \right \})=\prod_{i=1}^{m} P(C_{i})\]
 where P() is the maximum likelihood
 If we merge two clusters Cj1 and Cj2 into a cluster Cj1∪Cj2, then, the change in quality of the overall clustering is
\[Q((\left \{ C_{1},…,C_{m} \right \}\left \{ C_{j1},C_{j2} \right \})U\left \{ C_{j1} U C_{j2} \right \})Q(\left \{ C_{1},…,C_{m} \right \})\]
\[=\frac{\prod_{i=1}^{m}P(C_{i}).P(C_{j1}UC_{j2})}{P(C_{j1}).P(C_{j2})}\prod_{i=1}^{m}P(C_{i})\]
\[=\prod_{i=1}^{m}P(C_{i})(\frac{P(C_{j1}UC_{j2})}{P(C_{j1})P(C_{j2})}1)\]
 Distance between clusters C1 and C2:
\[dist(C_{i},C_{j})=log\frac{P(C_{1}UC_{2})}{P(C_{1})P(C_{2})}\]
DensityBased Clustering Methods
 Clustering based on density (local cluster criterion), such as densityconnected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al (SIGMOD’99).
 DENCLUE: Hinneburg & D. Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98) (more gridbased)
DensityBased Clustering: Basic Concepts
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Epsneighbourhood of that point
 NEps(q): {p belongs to D  dist(p,q) ≤ Eps}
 Directly densityreachable: A point p is directly densityreachable from a point q w.r.t. Eps, MinPts if
 p belongs to NEps(q)
 core point condition:
NEps (q) ≥ MinPts
DensityReachable and DensityConnected
 Densityreachable:
A point p is densityreachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly densityreachable from pi
 
 Densityconnected
A point p is densityconnected to a point q w.r.t.
Eps, MinPts if there is a point o such that both, p and q are
densityreachable from o w.r.t. Eps and MinPts
 
DBSCAN: DensityBased Spatial Clustering of Applications with Noise

Relies on a
densitybased
notion of cluster: A
cluster
is defined as a maximal set of densityconnected points

Discovers clusters of arbitrary shape in spatial databases with noise
DBSCAN: The Algorithm
 Arbitrary select a point p
 Retrieve all points densityreachable from p w.r.t. Eps and MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are densityreachable from p and DBSCAN visits the next point of the database
 Continue the process until all of the points have been processed
 If a spatial index is used, the computational complexity of DBSCAN is O(nlogn), where n is the number of database objects. Otherwise, the complexity is O(n^2)
DBSCAN: Sensitive to Parameters
OPTICS: A ClusterOrdering Method (1999)
 OPTICS: Ordering Points To Identify the Clustering Structure
 Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
 Produces a special order of the database wrt its densitybased clustering structure
 This clusterordering contains info equiv to the densitybased clusterings corresponding to a broad range of parameter settings
 Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
 Can be represented graphically or using visualization techniques
OPTICS: Some Extension from DBSCAN
 Indexbased: k = # of dimensions, N: # of points
 Core Distance of an object p: the smallest value ε such that the εneighborhood of p has at least MinPts objects
 Let Nε(p): εneighborhood of p, ε is a distance value
Coredistanceε, MinPts(p) = Undefined if card(Nε(p)) < MinPts
MinPtsdistance(p), otherwise
 Reachability Distance of object p from core object q is the min radius value that makes p densityreachable from q
 Reachabilitydistanceε, MinPts(p, q) =
Undefined if q is not a core object
max(coredistance(q), distance (q, p)), otherwise
Core Distance & Reachability Distance
DensityBased Clustering: OPTICS & Applications
DENCLUE: Using Statistical Density Functions
 DENsitybased CLUstEring by Hinneburg & Keim (KDD’98)
 Using statistical density functions:
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily shaped clusters in highdimensional data sets
 Significant faster than existing algorithm (e.g., DBSCAN)
 But needs a large number of parameters
Denclue: Technical Essence
 Uses grid cells but only keeps information about grid cells that do actually contain data points and manages these cells in a treebased access structure
 Influence function: describes the impact of a data point within its neighborhood
 Overall density of the data space can be calculated as the sum of the influence function of all data points
 Clusters can be determined mathematically by identifying density attractors
 Density attractors are local maximal of the overall density function
 Center defined clusters: assign to each density attractor the points density attracted to it
 Arbitrary shaped cluster: merge density attractors that are connected through paths of high density (> threshold)
CenterDefined and Arbitrary
GridBased Clustering Method
 Using multiresolution grid data structure
 Several interesting methods
 STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
 WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
 A multiresolution clustering approach using wavelet method
 CLIQUE: Agrawal, et al. (SIGMOD’98)
 Both gridbased and subspace clustering
STING: A Statistical Information Grid Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different levels of resolution
The STING Clustering Method
 Each cell at a high level is partitioned into a number of smaller cells in the next lower level
 Statistical info of each cell is calculated and stored beforehand and is used to answer queries
 Parameters of higher level cells can be easily calculated from parameters of lower level cell
 count, mean, s, min, max
 type of distribution—normal, uniform, etc.
 Use a topdown approach to answer spatial data queries
 Start from a preselected layer—typically with a small number of cells
 For each cell in the current level compute the confidence interval
STING Algorithm and Its Analysis
 Remove the irrelevant cells from further consideration
 When finish examining the current layer, proceed to the next lower level
 Repeat this process until the bottom layer is reached
 Advantages:
 Queryindependent, easy to parallelize, incremental update
 O(K), where K is the number of grid cells at the lowest level
 Disadvantages:
 All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
CLIQUE (Clustering In QUEst)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)

Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space

CLIQUE can be considered as both densitybased and gridbased

It partitions each dimension into the same number of equal length interval

It partitions an mdimensional data space into nonoverlapping rectangular units

A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter

A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps

Partition the data space and find the number of points that lie inside each cell of the partition.

Identify the subspaces that contain clusters using the Apriori principle

Identify clusters

Determine dense units in all subspaces of interests

Determine connected dense units in all subspaces of interests.

Generate minimal description for the clusters

Determine maximal regions that cover a cluster of connected dense units for each cluster

Determination of minimal cover for each cluster
CLIQUE: The Major Steps (cont')
Strength and Weakness of CLIQUE

Strength

automatically
finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces

insensitive
to the order of records in input and does not presume some canonical data distribution

scales
linearly
with the size of input and has good scalability as the number of dimensions in the data increases

Weakness

The accuracy of the clustering result may be degraded at the expense of simplicity of the method
Assessing Clustering Tendency
 Assess if nonrandom structure exists in the data by measuring the probability that the data is generated by a uniform data distribution
 Test spatial randomness by statistic test: Hopkins Static
 Given a dataset D regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space
 Sample n points, p1, …, pn, uniformly from D. For each pi, find its nearest neighbor in D: xi = min{dist (pi, v)} where v in D
 Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D – {qi}: yi = min{dist (qi, v)} where v in D and v ≠ qi
 Calculate the Hopkins Statistic:
\[H=\frac{\sum_{i=1}^{n}y_{i}}{\sum_{i=1}^{n}x_{i}+\sum_{i=1}^{n}y_{i}}\]
 If D is uniformly distributed, ∑ xi and ∑ yi will be close to each other and H is close to 0.5. If D is clustered, H is close to 1
Determine the Number of Clusters
 Empirical method
 # of clusters ≈√n/2 for a dataset of n points
 Elbow method
 Use the turning point in the curve of sum of within cluster variance w.r.t the # of clusters
 Cross validation method
 Divide a given data set into m parts
 Use m – 1 parts to obtain a clustering model
 Use the remaining part to test the quality of the clustering
 E.g., For each point in the test set, find the closest centroid, and use the sum of squared distance between all points in the test set and the closest centroids to measure how well the model fits the test set
 For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k’s, and find # of clusters that fits the data the best
Measuring Clustering Quality
 Two methods: extrinsic vs. intrinsic
 Extrinsic: supervised, i.e., the ground truth is available
 Compare a clustering against the ground truth using certain clustering quality measure
 Ex. BCubed precision and recall metrics
 Intrinsic: unsupervised, i.e., the ground truth is unavailable
 Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are
 Ex. Silhouette coefficient
Measuring Clustering Quality: Extrinsic Methods
 Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
 Q is good if it satisfies the following 4 essential criteria
 Cluster homogeneity: the purer, the better
 Cluster completeness: should assign objects belong to the same category in the ground truth to the same cluster
 Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other” category)
 Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces
Summary
 Cluster analysis groups objects based on their similarity and has wide applications
 Measure of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods, hierarchical methods, densitybased methods, gridbased methods, and modelbased methods
 Kmeans and Kmedoids algorithms are popular partitioningbased clustering algorithms
 Birch and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
 DBSCAN, OPTICS, and DENCLU are interesting densitybased algorithms
 STING and CLIQUE are gridbased methods, where CLIQUE is also a subspace clustering algorithm
 Quality of clustering results can be evaluated in various ways
References
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure, SIGMOD’99.
 Beil F., Ester M., Xu X.: "Frequent TermBased Text Clustering", KDD'02
 M. M. Breunig, H.P. Kriegel, R. Ng, J. Sander. LOF: Identifying DensityBased Local Outliers. SIGMOD 2000.
 M. Ester, H.P. Kriegel, J. Sander, and X. Xu. A densitybased algorithm for discovering clusters in large spatial databases. KDD'96.
 M. Ester, H.P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB’98.
 V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data Using Summaries. KDD'99.
References (cont')
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98.
 S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE'99, pp. 512521, Sydney, Australia, March 1999.
 A. Hinneburg, D.l A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD’98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Printice Hall, 1988.
 G. Karypis, E.H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 6875, 1999.
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distancebased outliers in large datasets. VLDB’98.
References (cont')
 G. J. McLachlan and K.E. Bkasford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
 L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A Review, SIGKDD Explorations, 6(1), June 2004
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition,.
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multiresolution clustering approach for very large spatial databases. VLDB’98.
 A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. ConstraintBased Clustering in Large Databases, ICDT'01.
 A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles, ICDE'01
 H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets, SIGMOD’ 02.
 W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial Data Mining, VLDB’97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method for very large databases. SIGMOD'96.
 Xiaoxin Yin, Jiawei Han, and Philip Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic Links”, in Proc. 2006 Int. Conf. on Very Large Data Bases (VLDB'06), Seoul, Korea, Sept. 2006.