Agenda
 Probability Model-Based Clustering
 Each object may take a probability of belonging to a cluster
 Clustering High-Dimensional Data
 Curse of dimensionality: difficulty of distance measures in high-D space
 Clustering Graphs and Network Data
 Similarity measurement and clustering methods for graphs and networks
 Clustering with Constraints
 Cluster analysis under different kinds of constraints, e.g., those raised from background knowledge or the spatial distribution of the objects
Fuzzy Set and Fuzzy Cluster
 Clustering methods discussed so far
 Every data object is assigned to exactly one cluster
 Some applications may need fuzzy or soft cluster assignment
 Ex. An e-game could belong to both entertainment and software
 Methods: fuzzy clusters and probabilistic model-based clusters
 Fuzzy cluster: A fuzzy set S: FS : X → [0, 1] (value between 0 and 1)
 Example: The popularity of cameras is defined as a fuzzy mapping
 Then, A(0.05), B(1), C(0.86), D(0.27)
Fuzzy (Soft) Clustering
 Fuzzy clustering:
 k fuzzy clusters C1, …, Ck, represented as a partition matrix M = [wij]
 P1: For each object oi and cluster Cj, 0 ≤ wij ≤ 1 (fuzzy set)
 P2: For each object oi,
\[\sum_{j=1}^{k} w_{ij}=1\]
(equal participation in the clustering)
 P3: For each cluster Cj,
\[0<\sum_{i=1}^{n} w_{ij}<n\]
(ensures no empty cluster)
 Let c1, …, ck be the centers of the k clusters
 For each object oi, the sum of squared error (SSE), where p (p ≥ 1) is a parameter:
\[SSE(o_{i})=\sum_{j=1}^{k} w_{ij}^{p}\,dist(o_{i},c_{j})^2\]
 For each cluster Cj, the sum of squared error:
\[SSE(C_{j})=\sum_{i=1}^{n} w_{ij}^{p}\,dist(o_{i},c_{j})^2\]
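The two SSE formulas above can be sketched in a few lines of Python (a minimal sketch; the helper names are my own, not from the slides):

```python
import math

def dist(a, b):
    # Euclidean distance between two points given as coordinate tuples
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sse_object(o, centers, w_row, p=2):
    # SSE(o_i) = sum_j w_ij^p * dist(o_i, c_j)^2, with w_row = [w_i1, ..., w_ik]
    return sum(w_row[j] ** p * dist(o, c) ** 2 for j, c in enumerate(centers))

def sse_cluster(objects, c, w_col, p=2):
    # SSE(C_j) = sum_i w_ij^p * dist(o_i, c_j)^2, with w_col = [w_1j, ..., w_nj]
    return sum(w ** p * dist(o, c) ** 2 for o, w in zip(objects, w_col))
```

With p = 2, memberships are squared before weighting the squared distances, so low-membership objects contribute little to a cluster's SSE.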
Probabilistic Model-Based Clustering
 Cluster analysis aims to find hidden categories.
 A hidden category (i.e., probabilistic cluster) is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function).
 A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently
 Our task: infer a set of k probabilistic clusters that is most likely to generate D using the above data generation process
Model-Based Clustering
 A set C of k probabilistic clusters C1, …, Ck with probability density functions f1, …, fk, respectively, and their probabilities ω1, …, ωk
 Probability of an object o generated by cluster Cj:
\[P(o\mid C_{j})=\omega_{j}f_{j}(o)\]
 Probability of o generated by the set of clusters C:
\[P(o\mid C)=\sum_{j=1}^{k}\omega_{j}f_{j}(o)\]
 Since objects are assumed to be generated independently, for a data set D = {o1, …, on}, we have
\[P(D\mid C)=\prod_{i=1}^{n}P(o_{i}\mid C)=\prod_{i=1}^{n}\sum_{j=1}^{k}\omega_{j}f_{j}(o_{i})\]
 Task: Find a set C of k probabilistic clusters s.t. P(D|C) is maximized
 However, maximizing P(D|C) is often intractable, since the probability density function of a cluster can take an arbitrarily complicated form
 To make it computationally feasible (as a compromise), assume the probability density functions are some parameterized distributions
Univariate Gaussian Mixture Model
 O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k distributions), and Pj(oi | θj) is the probability that oi is generated from the j-th distribution using parameter θj; we have
\[P(o_{i}\mid \Theta)=\sum_{j=1}^{k}\omega_{j}P_{j}(o_{i}\mid \theta_{j})\]
\[P(O\mid \Theta)=\prod_{i=1}^{n}\sum_{j=1}^{k}\omega_{j}P_{j}(o_{i}\mid \theta_{j})\]
 Univariate Gaussian mixture model
 Assume the probability density function of each cluster follows a 1-d Gaussian distribution. Suppose there are k clusters.
 The probability density function of cluster j is centered at μj with standard deviation σj; with θj = (μj, σj), we have
\[P(o_{i}\mid \theta_{j})=\frac{1}{\sqrt{2\pi}\,\sigma_{j}}e^{-\frac{(o_{i}-\mu_{j})^{2}}{2\sigma_{j}^{2}}}\]
\[P(o_{i}\mid \Theta)=\sum_{j=1}^{k}\frac{1}{\sqrt{2\pi}\,\sigma_{j}}e^{-\frac{(o_{i}-\mu_{j})^{2}}{2\sigma_{j}^{2}}}\]
\[P(O\mid \Theta)=\prod_{i=1}^{n}\sum_{j=1}^{k}\frac{1}{\sqrt{2\pi}\,\sigma_{j}}e^{-\frac{(o_{i}-\mu_{j})^{2}}{2\sigma_{j}^{2}}}\]
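These densities can be evaluated directly; a minimal Python sketch with explicit mixture weights ωj (the last two slide formulas assume equal weights, so pass `weights=[1/k]*k` to match them):

```python
import math

def gaussian_pdf(o, mu, sigma):
    # 1/(sqrt(2*pi)*sigma) * exp(-(o - mu)^2 / (2*sigma^2))
    return math.exp(-(o - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mixture_prob(o, params, weights):
    # P(o_i | Theta) = sum_j w_j * P_j(o_i | theta_j), params = [(mu_j, sigma_j), ...]
    return sum(w * gaussian_pdf(o, mu, s) for w, (mu, s) in zip(weights, params))

def data_likelihood(O, params, weights):
    # P(O | Theta) = prod_i P(o_i | Theta), assuming independent generation
    prob = 1.0
    for o in O:
        prob *= mixture_prob(o, params, weights)
    return prob
```

For long data sets one would sum log-probabilities instead of multiplying raw probabilities, to avoid numeric underflow.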
The EM (Expectation Maximization) Algorithm
 The k-means algorithm has two steps at each iteration:
 Expectation Step (E-step): Given the current cluster centers, each object is assigned to the cluster whose center is closest to it: an object is expected to belong to the closest cluster
 Maximization Step (M-step): Given the cluster assignment, for each cluster, the algorithm adjusts the center so that the sum of distances from the objects assigned to this cluster to the new center is minimized
 The EM algorithm: a framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models
 E-step: assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters
 M-step: finds the new clustering or parameters that minimize the sum of squared error (SSE) or maximize the expected likelihood
Fuzzy Clustering Using the EM Algorithm
 Initially, let c1 = a and c2 = b
 1st E-step: assign each object a membership weight in each cluster; e.g., for object o and cluster c1,
\[w_{o,c_{1}}=\frac{\frac{1}{dist(o,c_{1})^2}}{\frac{1}{dist(o,c_{1})^2}+\frac{1}{dist(o,c_{2})^2}}=\frac{dist(o,c_{2})^2}{dist(o,c_{1})^2+dist(o,c_{2})^2}\]
 E.g., for object c: \[w_{c,c_{1}}=\frac{41}{45+41}=0.48\]
 1st M-step: recalculate the centroids according to the partition matrix, minimizing the sum of squared error (SSE)
\[c_{j}=\frac{\sum_{each\ point\ o}w_{o,c_{j}}^{2}\,o}{\sum_{each\ point\ o}w_{o,c_{j}}^{2}}\]
\[c_{1}=\left(\frac{1^{2}\times 3+ 0^{2}\times4+ 0.48^{2}\times9+ 0.42^{2}\times14+ 0.41^{2}\times18+ 0.47^{2}\times21}{1^{2}+0^{2}+0.48^{2}+0.42^{2}+0.41^{2}+0.47^{2}},\frac{1^{2}\times3+ 0^{2}\times10+ 0.48^{2}\times6+ 0.42^{2}\times8+ 0.41^{2}\times11+ 0.47^{2}\times7}{1^{2}+0^{2}+0.48^{2}+0.42^{2}+0.41^{2}+0.47^{2}}\right)=(8.47,5.12)\]
 Iterate until the cluster centers converge or the change is small enough
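The example above can be reproduced in Python. The six points are read off the M-step computation on the slide (a = (3,3), b = (4,10), c = (9,6), d = (14,8), e = (18,11), f = (21,7)); weights are rounded to 2 decimals before the M-step, as the slide does:

```python
def dist2(a, b):
    # squared Euclidean distance
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def e_step(points, centers):
    # membership weight of each point in each cluster, w proportional to 1/dist^2
    W = []
    for o in points:
        d2 = [dist2(o, c) for c in centers]
        if 0.0 in map(float, d2):  # a point coinciding with a center belongs fully to it
            W.append([1.0 if d == 0 else 0.0 for d in d2])
        else:
            inv = [1.0 / d for d in d2]
            s = sum(inv)
            W.append([v / s for v in inv])
    return W

def m_step(points, W, k):
    # c_j = sum_o w_oj^2 * o / sum_o w_oj^2
    centers = []
    for j in range(k):
        den = sum(W[i][j] ** 2 for i in range(len(points)))
        centers.append(tuple(sum(W[i][j] ** 2 * points[i][d] for i in range(len(points))) / den
                             for d in (0, 1)))
    return centers

points = [(3, 3), (4, 10), (9, 6), (14, 8), (18, 11), (21, 7)]
W = e_step(points, [(3, 3), (4, 10)])                 # c1 = a, c2 = b
W = [[round(w, 2) for w in row] for row in W]         # round as on the slide
c1, c2 = m_step(points, W, 2)                         # c1 comes out as (8.47, 5.12)
```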
Computing Mixture Models with EM
 Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ = {θ1, …, θk} s.t. P(O|Θ) is maximized, where θj = (μj, σj) are the mean and standard deviation of the j-th univariate Gaussian distribution
 We initially assign random values to the parameters θj, then iteratively conduct the E- and M-steps until convergence or the change is sufficiently small
 At the E-step, for each object oi, calculate the probability that oi belongs to each distribution:
\[P(\theta_{j}\mid o_{i},\Theta)=\frac{P(o_{i}\mid \theta_{j})}{\sum_{l=1}^{k}P(o_{i}\mid \theta_{l})}\]
 At the M-step, adjust the parameters θj = (μj, σj) so that the expected likelihood P(O|Θ) is maximized:
\[\mu_{j}=\frac{\sum_{i=1}^{n}o_{i}\,P(\theta_{j}\mid o_{i},\Theta)}{\sum_{i=1}^{n}P(\theta_{j}\mid o_{i},\Theta)}\]
\[\sigma_{j}=\sqrt{\frac{\sum_{i=1}^{n}P(\theta_{j}\mid o_{i},\Theta)(o_{i}-\mu_{j})^{2}}{\sum_{i=1}^{n}P(\theta_{j}\mid o_{i},\Theta)}}\]
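A compact EM loop for a 1-D Gaussian mixture, following the E- and M-step formulas above. The initialization (means spread over the data range, unit standard deviations) and the mixture-weight update are my own additions for a full sketch; the slide's M-step updates only μj and σj:

```python
import math

def em_gmm_1d(data, k=2, iters=50):
    # Hypothetical initialization (my choice): means spread across the data range
    lo, hi = min(data), max(data)
    mu = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    sigma = [1.0] * k
    w = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility P(theta_j | o_i, Theta) for each object o_i
        resp = []
        for o in data:
            p = [w[j] * math.exp(-(o - mu[j]) ** 2 / (2 * sigma[j] ** 2))
                 / (math.sqrt(2 * math.pi) * sigma[j]) for j in range(k)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and standard deviations
        for j in range(k):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(data)
            mu[j] = sum(r[j] * o for r, o in zip(resp, data)) / nj
            var = sum(r[j] * (o - mu[j]) ** 2 for r, o in zip(resp, data)) / nj
            sigma[j] = max(math.sqrt(var), 1e-6)  # guard against variance collapse
    return mu, sigma, w

# two well-separated groups around 1.0 and 9.0
mu, sigma, w = em_gmm_1d([1.0, 1.2, 0.8, 1.1, 0.9, 9.0, 9.2, 8.8, 9.1, 8.9])
```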
Advantages and Disadvantages of Mixture Models
 Strengths
 Mixture models are more general than partitioning and fuzzy clustering
 Clusters can be characterized by a small number of parameters
 The results may satisfy the statistical assumptions of the generative models
 Weaknesses
 May converge to a local optimum (remedy: run multiple times with random initialization)
 Computationally expensive if the number of distributions is large, or the data set contains very few observed data points
 Needs large data sets
 Hard to estimate the number of clusters
Clustering High-Dimensional Data
 Clustering high-dimensional data (How high is high-D in clustering?)
 Many applications: text documents, DNA microarray data
 Major challenges:
 Many irrelevant dimensions may mask clusters
 Distance measures become meaningless due to equidistance
 Clusters may exist only in some subspaces
 Methods
 Subspace clustering: search for clusters existing in subspaces of the given high-dimensional data space
 CLIQUE, ProClus, and biclustering approaches
 Dimensionality reduction approaches: construct a much lower-dimensional space and search for clusters there (may construct new dimensions by combining some dimensions of the original data)
 Dimensionality reduction methods and spectral clustering
Traditional Distance Measures May Not Be Effective on High-D Data
 Traditional distance measures can be dominated by noise in many dimensions
 Ex. Which pairs of customers are more similar?
 By Euclidean distance, we get
\[dist(Ada,Bob)=dist(Bob,Cathy)=dist(Ada,Cathy)=\sqrt{2}\]
 even though Ada and Bob look less similar
 Clustering should not only consider dimensions but also attributes (features)
 Feature transformation: effective if most dimensions are relevant (PCA & SVD are useful when features are highly correlated/redundant)
 Feature selection: useful for finding a subspace where the data have nice clusters
The Curse of Dimensionality
 Data in only one dimension is relatively packed
 Adding a dimension "stretches" the points across that dimension, pushing them further apart
 Adding more dimensions pushes the points even further apart: high-dimensional data is extremely sparse
 Distance measures become meaningless due to equidistance
Why Subspace Clustering?
 Clusters may exist only in some subspaces
 Subspace clustering: find clusters in all the subspaces
Subspace Clustering Methods
 Subspace search methods: search various subspaces to find clusters
 Bottom-up approaches
 Top-down approaches
 Correlation-based clustering methods
 E.g., PCA-based approaches
 Biclustering methods
 Optimization-based methods
 Enumeration methods
Subspace Clustering Method (I): Subspace Search Methods
 Search various subspaces to find clusters
 Bottom-up approaches
 Start from low-D subspaces and search higher-D subspaces only when there may be clusters in such subspaces
 Various pruning techniques reduce the number of higher-D subspaces to be searched
 Ex. CLIQUE (Agrawal et al., 1998)
 Top-down approaches
 Start from the full space and search smaller subspaces recursively
 Effective only if the locality assumption holds: the subspace of a cluster can be determined by the local neighborhood
 Ex. PROCLUS (Aggarwal et al., 1999): a k-medoid-like method
CLIQUE: Subspace Clustering with Apriori Pruning
Subspace Clustering Method (II): Correlation-Based Methods
 Subspace search methods: similarity based on distance or density
 Correlation-based methods: based on advanced correlation models
 Ex. PCA-based approach:
 Apply PCA (Principal Component Analysis) to derive a set of new, uncorrelated dimensions,
 then mine clusters in the new space or its subspaces
 Other space transformations:
 Hough transform
 Fractal dimensions
Subspace Clustering Method (III): Bi-Clustering Methods
 Biclustering: cluster both objects and attributes simultaneously (treat objects and attributes in a symmetric way)
 Four requirements:
 Only a small set of objects participate in a cluster
 A cluster only involves a small number of attributes
 An object may participate in multiple clusters, or not participate in any cluster at all
 An attribute may be involved in multiple clusters, or not be involved in any cluster at all
 Ex. 1. Gene expression or microarray data: a gene-sample/condition matrix
 Each element in the matrix, a real number, records the expression level of a gene under a specific condition
 Ex. 2. Clustering customers and products
 Another biclustering problem
Types of Biclusters
 Let A = {a1, ..., an} be a set of genes and B = {b1, …, bm} a set of conditions
 A bicluster: a submatrix I × J where the genes and conditions follow some consistent patterns
 4 types of biclusters (ideal cases)
 Biclusters with constant values:
 for any i in I and j in J, eij = c
 Biclusters with constant values on rows:
 eij = c + αi
 Also, there can be constant values on columns
 Biclusters with coherent values (a.k.a. pattern-based clusters)
 eij = c + αi + βj
 Biclusters with coherent evolutions on rows
 (ei1j1 − ei1j2)(ei2j1 − ei2j2) ≥ 0
 i.e., only interested in the up- or down-regulated changes across genes or conditions, without constraining the exact values
BiClustering Methods
 Real-world data is noisy: try to find approximate biclusters
 Methods: optimization-based methods vs. enumeration methods
 Optimization-based methods
 Try to find a submatrix at a time that achieves the best significance as a bicluster
 Due to the cost of computation, greedy search is employed to find local optimal biclusters
 Ex. δ-Cluster algorithm (Cheng and Church, ISMB'2000)
 Enumeration methods
 Use a tolerance threshold to specify the degree of noise allowed in the biclusters to be mined
 Then try to enumerate all submatrices as biclusters that satisfy the requirements
 Ex. δ-pCluster algorithm (H. Wang et al., SIGMOD'2002; MaPle: Pei et al., ICDM'2003)
Bi-Clustering for Micro-Array Data Analysis
 Left figure: Microarray "raw" data shows 3 genes and their values in a multi-D space: difficult to find their patterns
 Right two: some subsets of dimensions form nice shift and scaling patterns
 No globally defined similarity/distance measure
 Clusters may not be exclusive
 An object can appear in multiple clusters
Bi-Clustering (I): δ-Bi-Cluster
 For a submatrix I × J
 The mean of the i-th row:
\[e_{iJ}=\frac{1}{|J|}\sum_{j\in J}e_{ij}\]
 The mean of the j-th column:
\[e_{Ij}=\frac{1}{|I|}\sum_{i\in I}e_{ij}\]
 The mean of all elements in the submatrix:
\[e_{IJ}=\frac{1}{|I||J|}\sum_{i\in I,j\in J}e_{ij}=\frac{1}{|I|}\sum_{i\in I}e_{iJ}=\frac{1}{|J|}\sum_{j\in J}e_{Ij}\]
 The quality of the submatrix as a bicluster can be measured by the mean squared residue value, where residue(eij) = eij − eiJ − eIj + eIJ:
\[H(I\times J)=\frac{1}{|I||J|}\sum_{i\in I,j\in J}(e_{ij}-e_{iJ}-e_{Ij}+e_{IJ})^{2}\]
 A submatrix I × J is a δ-bicluster if H(I × J) ≤ δ, where δ ≥ 0 is a threshold. When δ = 0, I × J is a perfect bicluster with coherent values. By setting δ > 0, a user can specify the tolerance of average noise per element against a perfect bicluster
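The mean squared residue H(I × J) is straightforward to compute; a minimal Python sketch (function name is my own):

```python
def mean_squared_residue(M):
    # H(I x J) for a submatrix M given as a list of rows
    nI, nJ = len(M), len(M[0])
    row_mean = [sum(row) / nJ for row in M]                               # e_iJ
    col_mean = [sum(M[i][j] for i in range(nI)) / nI for j in range(nJ)]  # e_Ij
    all_mean = sum(row_mean) / nI                                         # e_IJ
    return sum((M[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
               for i in range(nI) for j in range(nJ)) / (nI * nJ)
```

A matrix of the form eij = c + αi + βj (coherent values) gives H = 0, matching the δ = 0 case above.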
Bi-Clustering (I): The δ-Cluster Algorithm
 A maximal δ-bicluster is a δ-bicluster I × J such that there does not exist another δ-bicluster I′ × J′ containing I × J
 Computing it is costly: use heuristic greedy search to obtain locally optimal clusters
 Two-phase computation: deletion phase and addition phase
 Deletion phase: start from the whole matrix; iteratively remove rows and columns while the mean squared residue of the matrix is over δ
 At each iteration, for each row/column, compute the mean squared residue:
\[d(i)=\frac{1}{|J|}\sum_{j\in J}(e_{ij}-e_{iJ}-e_{Ij}+e_{IJ})^{2}\]
\[d(j)=\frac{1}{|I|}\sum_{i\in I}(e_{ij}-e_{iJ}-e_{Ij}+e_{IJ})^{2}\]
 Remove the row or column with the largest mean squared residue
 Addition phase:
 Iteratively expand the δ-bicluster I × J obtained in the deletion phase as long as the δ-bicluster requirement is maintained
 Consider all the rows/columns not involved in the current bicluster I × J by calculating their mean squared residues
 The row/column with the smallest mean squared residue is added into the current δ-bicluster
 It finds only one δ-bicluster, so it needs to run multiple times: replace the elements in the output bicluster with random numbers
Bi-Clustering (II): δ-pCluster
 Enumerating all biclusters (δ-pClusters) [H. Wang et al., Clustering by pattern similarity in large data sets. SIGMOD'02]
 A submatrix I × J is a bicluster with (perfect) coherent values iff ei1j1 − ei2j1 = ei1j2 − ei2j2. For any 2 × 2 submatrix of I × J, define the p-score
\[pscore\begin{pmatrix} e_{i_1j_1} & e_{i_1j_2}\\ e_{i_2j_1} & e_{i_2j_2}\end{pmatrix}=\left|(e_{i_1j_1}-e_{i_2j_1})-(e_{i_1j_2}-e_{i_2j_2})\right|\]
 A submatrix I × J is a δ-pCluster (pattern-based cluster) if the p-score of every 2 × 2 submatrix of I × J is at most δ, where δ ≥ 0 is a threshold specifying a user's tolerance of noise against a perfect bicluster
 The p-score controls the noise on every element in a bicluster, while the mean squared residue captures the average noise
 Monotonicity: if I × J is a δ-pCluster, every x × y (x, y ≥ 2) submatrix of I × J is also a δ-pCluster
 A δ-pCluster is maximal if no more rows or columns can be added into the cluster while retaining the δ-pCluster property: we only need to compute all maximal δ-pClusters
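The p-score check over every 2 × 2 submatrix can be sketched directly (function names are my own):

```python
from itertools import combinations

def pscore(e11, e12, e21, e22):
    # p-score of a 2x2 submatrix: |(e11 - e21) - (e12 - e22)|
    return abs((e11 - e21) - (e12 - e22))

def is_delta_pcluster(M, delta):
    # M: the submatrix I x J as a list of rows; tests every 2x2 submatrix of M
    rows, cols = range(len(M)), range(len(M[0]))
    return all(pscore(M[i1][j1], M[i1][j2], M[i2][j1], M[i2][j2]) <= delta
               for i1, i2 in combinations(rows, 2)
               for j1, j2 in combinations(cols, 2))
```

A coherent-values matrix eij = c + αi + βj passes with δ = 0, while perturbing a single element violates the per-element bound, illustrating how the p-score is stricter than the averaged residue.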
MaPle: Efficient Enumeration of δ-pClusters
 Pei et al., MaPle: Efficiently enumerating all maximal δ-pClusters. ICDM'03
 Framework: same as pattern-growth in frequent pattern mining (based on the downward closure property)
 For each condition combination J, find the maximal subsets of genes I such that I × J is a δ-pCluster
 If I × J is not a submatrix of another δ-pCluster,
 then I × J is a maximal δ-pCluster
 The algorithm is very similar to mining frequent closed itemsets
 Additional advantages of δ-pClusters:
 Due to the averaging in a δ-cluster, it may contain outliers yet still be within the δ-threshold
 To compute biclusters for scaling patterns, taking the logarithm of
\[\frac{d_{xa}/d_{ya}}{d_{xb}/d_{yb}}<\delta\]
 leads to the p-score form
DimensionalityReduction Methods
 Dimensionality reduction: in some situations, it is more effective to construct a new space instead of using some subspace of the original data
 Ex. To cluster the points in the figure, no subspace of the original space (the X and Y axes) can help, since all three clusters would be projected onto overlapping areas of the X and Y axes
 If we construct a new dimension along the dashed line, the three clusters become apparent when the points are projected onto it
 Dimensionality reduction methods
 Feature selection and extraction: but these may not focus on finding the clustering structure
 Spectral clustering: combines feature extraction and clustering (i.e., uses the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions)
 Normalized Cuts (Shi and Malik, CVPR'97 / PAMI'2000)
 The Ng-Jordan-Weiss algorithm (NIPS'01)
Spectral Clustering: The Ng-Jordan-Weiss (NJW) Algorithm
 Given a set of objects o1, …, on, the distance between each pair of objects dist(oi, oj), and the desired number k of clusters
 Calculate an affinity matrix W, where σ is a scaling parameter that controls how fast the affinity Wij decreases as dist(oi, oj) increases:
\[W_{ij}=e^{-\frac{dist(o_{i},o_{j})}{\sigma^{2}}}\]
 In NJW, set Wii = 0
 Derive a matrix A = f(W). NJW defines a diagonal matrix D s.t. Dii is the sum of the i-th row of W, i.e.,
\[D_{ii}=\sum_{j=1}^{n}W_{ij}\]
 Then, A is set to
\[A=D^{-\frac{1}{2}}WD^{-\frac{1}{2}}\]
 A spectral clustering method finds the k leading eigenvectors of A
 A vector v is an eigenvector of matrix A if Av = λv, where λ is the corresponding eigenvalue
 Project the original data into the new space defined by the k leading eigenvectors, and run a clustering algorithm, such as k-means, to find k clusters
 Assign the original data points to clusters according to how the transformed points are assigned in the clusters obtained
Spectral Clustering: Illustration and Comments
 Spectral clustering: Effective in tasks like image processing
 Scalability challenge: Computing eigenvectors on a large matrix is costly
 Can be combined with other clustering methods, such as biclustering
Clustering Graphs and Network Data
 Applications
 Bipartite graphs, e.g., customers and products, authors and conferences
 Web search engines, e.g., click-through graphs and Web graphs
 Social networks, friendship/co-author graphs
 Similarity measures
 Geodesic distances
 Distance based on random walk (SimRank)
 Graph clustering methods
 Minimum cuts: FastModularity (Clauset, Newman & Moore, 2004)
 Density-based clustering: SCAN (Xu et al., KDD'2007)
Similarity Measure (I): Geodesic Distance
 Geodesic distance (A, B): length (i.e., # of edges) of the shortest path between A and B (defined as infinite if they are not connected)
 Eccentricity of v, eccen(v): the largest geodesic distance between v and any other vertex u ∈ V − {v}
 E.g., eccen(a) = eccen(b) = 2; eccen(c) = eccen(d) = eccen(e) = 3
 Radius of graph G: the minimum eccentricity over all vertices, i.e., the distance between the "most central point" and the "farthest border"
 r = min_{v∈V} eccen(v)
 E.g., radius(G) = 2
 Diameter of graph G: the maximum eccentricity over all vertices, i.e., the largest distance between any pair of vertices in G
 d = max_{v∈V} eccen(v)
 E.g., diameter(G) = 3
 A peripheral vertex is a vertex whose eccentricity achieves the diameter
 E.g., vertices c, d, and e are peripheral vertices
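Eccentricity, radius, and diameter reduce to breadth-first search from each vertex. A minimal Python sketch; the example edge list is a hypothetical reconstruction chosen to match the slide's numbers (the actual figure is not in the text):

```python
from collections import deque

def eccentricity(adj, v):
    # largest geodesic (shortest-path) distance from v to any other vertex, via BFS
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return max(dist.values())

def radius(adj):
    return min(eccentricity(adj, v) for v in adj)

def diameter(adj):
    return max(eccentricity(adj, v) for v in adj)

# hypothetical graph with edges a-b, a-c, b-d, b-e (assumption, not the slide's figure)
g = {'a': ['b', 'c'], 'b': ['a', 'd', 'e'], 'c': ['a'], 'd': ['b'], 'e': ['b']}
```

On this graph eccen(a) = eccen(b) = 2, eccen(c) = eccen(d) = eccen(e) = 3, so radius(g) = 2 and diameter(g) = 3, as on the slide.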
SimRank: Similarity Based on Random Walk and Structural Context
 SimRank: structural-context similarity, i.e., two objects are similar if they are referenced by similar objects
 In a directed graph G = (V, E),
 individual in-neighborhood of v: I(v) = {u | (u, v) ∈ E}
 individual out-neighborhood of v: O(v) = {w | (v, w) ∈ E}
 Similarity in SimRank:
\[s(u,v)=\frac{C}{|I(u)||I(v)|}\sum_{x\in I(u)}\sum_{y\in I(v)}s(x,y)\]
 Initialization:
\[s_{0}(u,v)=\left\{\begin{matrix}0, & \text{if } u\neq v\\ 1, & \text{if } u=v\end{matrix}\right.\]
 Then we can compute si+1 from si based on the definition
 Similarity based on random walk: in a strongly connected component
 Expected distance:
\[d(u,v)=\sum_{t:(u\rightarrow v)}P[t]\,l(t)\]
SimRank: Similarity Based on Random Walk and Structural Context (cont'd)
 Expected meeting distance:
\[m(u,v)=\sum_{t:(u,v)\rightarrow (x,x)}P[t]\,l(t)\]
 Expected meeting probability:
\[p(u,v)=\sum_{t:(u,v)\rightarrow (x,x)}P[t]\,C^{l(t)}\]
 where P[t] is the probability of the tour t and l(t) is its length
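The SimRank recurrence above can be iterated to a fixed point; a minimal Python sketch (function name and the tiny example graph are my own):

```python
def simrank(nodes, edges, C=0.8, iters=10):
    # Iterative SimRank: s(u,v) = C/(|I(u)||I(v)|) * sum over in-neighbor pairs,
    # starting from s0(u,v) = 1 if u == v else 0
    in_nb = {n: [u for (u, v) in edges if v == n] for n in nodes}
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        new_s = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new_s[(u, v)] = 1.0
                elif in_nb[u] and in_nb[v]:
                    total = sum(s[(x, y)] for x in in_nb[u] for y in in_nb[v])
                    new_s[(u, v)] = C * total / (len(in_nb[u]) * len(in_nb[v]))
                else:
                    new_s[(u, v)] = 0.0  # a node with no in-neighbors has similarity 0
        s = new_s
    return s
```

For example, with edges a→b and a→c, nodes b and c share their only in-neighbor a, so s(b, c) converges to C · s(a, a) = 0.8.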
Graph Clustering: Sparsest Cut
 G = (V, E). The cut set of a cut C = (S, T) is the set of edges {(u, v) ∈ E | u ∈ S, v ∈ T}, where S and T form a partition of V
 Size of the cut: # of edges in the cut set
 Min-cut (e.g., C1) is not a good partition
 A better measure, sparsity:
\[\varphi=\frac{\text{the size of the cut}}{\min\{|S|,|T|\}}\]
 A cut is sparsest if its sparsity is not greater than that of any other cut
 Ex. Cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cut
 For k clusters, the modularity of a clustering assesses its quality:
\[Q=\sum_{i=1}^{k}\left(\frac{l_{i}}{|E|}-\left(\frac{d_{i}}{2|E|}\right)^{2}\right)\]
 li: # of edges between vertices in the i-th cluster
 di: the sum of the degrees of the vertices in the i-th cluster
 The modularity of a clustering of a graph is the difference between the fraction of all edges that fall into individual clusters and the fraction that would do so if the graph vertices were randomly connected
 The optimal clustering of a graph maximizes the modularity
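The modularity formula above is easy to compute directly; a minimal Python sketch (the two-triangles-plus-bridge example graph is my own illustration):

```python
def modularity(edges, clusters):
    # Q = sum_i ( l_i/|E| - (d_i / (2|E|))^2 ) over the clusters
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    Q = 0.0
    for c in clusters:
        l_i = sum(1 for u, v in edges if u in c and v in c)  # intra-cluster edges
        d_i = sum(deg[v] for v in c)                         # total degree in cluster
        Q += l_i / m - (d_i / (2 * m)) ** 2
    return Q
```

On two triangles joined by one bridge edge, splitting along the bridge yields Q = 2·(3/7 − (7/14)²) ≈ 0.357, while the trivial one-cluster partition yields Q = 0, illustrating why maximizing Q favors dense, well-separated clusters.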
Graph Clustering: Challenges of Finding Good Cuts
 High computational cost
 Many graph cut problems are computationally expensive
 The sparsest cut problem is NP-hard
 Need to trade off between efficiency/scalability and quality
 Sophisticated graphs
 May involve weights and/or cycles.
 High dimensionality
 A graph can have many vertices. In a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graph
 Sparsity
 A large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices
 A similarity matrix from a large sparse graph can also be sparse
Two Approaches for Graph Clustering
 Two approaches for clustering graph data
 Use generic clustering methods for highdimensional data
 Designed specifically for clustering graphs
 Using clustering methods for highdimensional data
 Extract a similarity matrix from a graph using a similarity measure
 A generic clustering method can then be applied on the similarity matrix to discover clusters
 Ex. Spectral clustering: approximates optimal graph cut solutions
 Methods specific to graphs
 Search the graph to find wellconnected components as clusters
 Ex. SCAN (Structural Clustering Algorithm for Networks)
 X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A Structural Clustering Algorithm for Networks”, KDD'07
SCAN: Density-Based Clustering of Networks
 How many clusters?
 What size should they be?
 What is the best partitioning?
 Should some points be segregated?
An Example Network
 Application: Given only information about who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?
A Social Network Model
 Cliques, hubs and outliers
 Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group
 Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups
 Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group
 The Neighborhood of a Vertex
 Define Γ(v) as the immediate neighborhood of a vertex (i.e., the set of people that an individual knows)
Structural Similarity
 The desired features tend to be captured by a measure called structural similarity:
\[\sigma(v,w)=\frac{|\Gamma(v)\cap \Gamma(w)|}{\sqrt{|\Gamma(v)||\Gamma(w)|}}\]
 Structural similarity is large for members of a clique and small for hubs and outliers
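The structural similarity σ(v, w) can be computed directly from adjacency lists. In this sketch I take Γ(x) to be the closed neighborhood (neighbors plus x itself), following the SCAN paper's convention; the function name is my own:

```python
import math

def structural_similarity(adj, v, w):
    # sigma(v, w) = |G(v) ∩ G(w)| / sqrt(|G(v)| * |G(w)|),
    # where G(x) is the closed neighborhood of x (its neighbors plus x itself)
    gv = set(adj[v]) | {v}
    gw = set(adj[w]) | {w}
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))
```

In a triangle every pair has σ = 1 (clique members share all neighbors); at the ends of a path u-v-w, σ(u, w) drops to 0.5.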
Structural Connectivity [1]
 ε-Neighborhood:
\[N_{\varepsilon}(v)=\left\{w\in \Gamma(v)\mid \sigma(v,w)\geq \varepsilon\right\}\]
 Core:
\[CORE_{\varepsilon,\mu}(v)\Leftrightarrow |N_{\varepsilon}(v)|\geq \mu\]
 Direct structure reachable:
\[DirREACH_{\varepsilon,\mu}(v,w)\Leftrightarrow CORE_{\varepsilon,\mu}(v)\wedge w\in N_{\varepsilon}(v)\]
 Structure reachable: transitive closure of direct structure reachability
 Structure connected:
\[CONNECT_{\varepsilon,\mu}(v,w)\Leftrightarrow \exists u\in V:REACH_{\varepsilon,\mu}(u,v)\wedge REACH_{\varepsilon,\mu}(u,w)\]
StructureConnected Clusters
 Structure-connected cluster C
 Connectivity:
\[\forall v,w\in C:CONNECT_{\varepsilon,\mu}(v,w)\]
 Maximality:
\[\forall v,w\in V:v\in C\wedge REACH_{\varepsilon,\mu}(v,w)\Rightarrow w\in C\]
 Hubs:
 Do not belong to any cluster
 Bridge to many clusters
 Outliers:
 Do not belong to any cluster
 Connect to fewer clusters