Agenda

  • Probability Model-Based Clustering
    • Each object may take a probability to belong to a cluster
  • Clustering High-Dimensional Data
    • Curse of dimensionality: Difficulty of distance measure in high-D space
  • Clustering Graphs and Network Data
    • Similarity measurement and clustering methods for graph and networks
  • Clustering with Constraints
    • Cluster analysis under different kinds of constraints, e.g., those raised by background knowledge or by the spatial distribution of the objects

Fuzzy Set and Fuzzy Cluster

  • Clustering methods discussed so far
    • Every data object is assigned to exactly one cluster
  • Some applications may call for fuzzy or soft cluster assignment
    • Ex. An e-game could belong to both entertainment and software
  • Methods: fuzzy clusters and probabilistic model-based clusters
  • Fuzzy cluster: a cluster defined as a fuzzy set S with membership function FS : X → [0, 1] (each object gets a membership value between 0 and 1)
  • Example: Popularity of cameras is defined as a fuzzy mapping

  • Then, A(0.05), B(1), C(0.86), D(0.27)

Fuzzy (Soft) Clustering

Example: Let cluster feature be

  • C1: “digital camera” & “lense”
  • C2: “computer”
 

  • Fuzzy clustering:
  • k fuzzy clusters C1, …, Ck, represented as a partition matrix M = [wij]
  • P1: For each object oi and cluster Cj, 0 ≤ wij ≤ 1 (fuzzy set)
  • P2: For each object oi,
    \[\sum_{j=1}^{k} w_{ij}=1\]
    (equal participation in the clustering)
  • P3: For each cluster Cj,
    \[0<\sum_{i=1}^{n} w_{ij}<n\]
    (ensures no empty cluster)
  • Let c1, …, ck be the centers of the k clusters
  • For each object oi, the sum of squared error (SSE), where p is a parameter, is

\[SSE(o_{i})=\sum_{j=1}^{k} w_{ij}^{p}dist(o_{i},c_{j})^2\]

  • For each cluster Cj,

\[SSE(C_{j})=\sum_{i=1}^{n} w_{ij}^{p}dist(o_{i},c_{j})^2\]

  • For the whole clustering C, the total SSE (computed in the sketch below) is

\[SSE(C)=\sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}^{p}dist(o_{i},c_{j})^2\]
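
  • A minimal NumPy sketch of the total SSE above; the array names X (objects), C (centers), W (partition matrix) and the fuzzifier p are illustrative:

```python
import numpy as np

def fuzzy_sse(X, C, W, p=2):
    """Total SSE of a fuzzy clustering: sum_i sum_j w_ij^p * dist(o_i, c_j)^2.

    X: (n, d) objects; C: (k, d) cluster centers;
    W: (n, k) partition matrix whose rows sum to 1; p: the fuzzifier parameter.
    """
    # squared Euclidean distances dist(o_i, c_j)^2, shape (n, k)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return float((W ** p * d2).sum())
```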

Probabilistic Model-Based Clustering

  • Cluster analysis is to find hidden categories.
  • A hidden category (i.e., probabilistic cluster) is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function).
   
  • Ex. 2 categories for digital cameras sold
    • consumer line vs. professional line
    • density functions f1, f2 for C1, C2
    • obtained by probabilistic clustering
  • A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently
  • Our task: infer a set of k probabilistic clusters that is most likely to generate D using the above data-generation process

Model-Based Clustering

  • A set C of k probabilistic clusters C1, …,Ck with probability density functions f1, …, fk, respectively, and their probabilities ω1, …, ωk.
  • Probability of an object o generated by cluster Cj is

\[P(o|C_{j})=w_{j}f_{j}(o)\]

  • Probability of o generated by the set of cluster C is

\[P(o|C)=\sum_{j=1}^{k}w_{j}f_{j}(o)\]

  • Since objects are assumed to be generated independently, for a data set D = {o1, …, on}, we have,

\[P(D|C)=\prod_{i=1}^{n}P(o_{i}|C)=\prod_{i=1}^{n}\sum_{j=1}^{k}w_{j}f_{j}(o_{i})\]

  • Task: Find a set C of k probabilistic clusters s.t. P(D|C) is maximized
  • However, maximizing P(D|C) is often intractable since the probability density function of a cluster can take an arbitrarily complicated form
  • To make it computationally feasible (as a compromise), assume the probability density functions being some parameterized distributions

Univariate Gaussian Mixture Model

  • O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k distributions), and Pj(oi| θj) is the probability that oi is generated from the j-th distribution using parameter θj, we have

\[P(o_{i}|\theta)=\sum_{j=1}^{k}w_{j}P_{j}(o_{i}|\theta_{j} )\]

\[P(O|\theta)=\prod_{i=1}^{n}\sum_{j=1}^{k}w_{j}P_{j}(o_{i}|\theta_{j} )\]

  • Univariate Gaussian mixture model
    • Assume the probability density function of each cluster follows a 1-d Gaussian distribution. Suppose that there are k clusters.
    • The probability density function of cluster j is centered at μj with standard deviation σj, i.e., θj = (μj, σj). Then we have

\[P(o_{i}|\theta_{j})=\frac{1}{\sqrt{2\pi}\sigma_{j}}e^{-\frac{(o_{i}-\mu_{j})^2 }{2\sigma_{j}^{2} }}\]

\[P(o_{i}|\theta)=\sum_{j=1}^{k}\frac{1}{\sqrt{2\pi}\sigma_{j}}e^{-\frac{(o_{i}-\mu_{j})^2 }{2\sigma_{j}^{2} }}\]

\[P(O|\theta)=\prod_{i=1}^{n}\sum_{j=1}^{k}\frac{1}{\sqrt{2\pi}\sigma_{j}}e^{-\frac{(o_{i}-\mu_{j})^2 }{2\sigma_{j}^{2} }}\]
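
  • A small sketch of evaluating the mixture likelihood above in Python (names are illustrative; if the mixture weights are omitted, the components are simply summed, as in the simplified formulas):

```python
import numpy as np

def gmm_log_likelihood(o, mu, sigma, w=None):
    """log P(O | Theta) for a univariate Gaussian mixture.

    o: (n,) observations; mu, sigma: (k,) per-cluster means and standard deviations;
    w: optional (k,) mixture weights.
    """
    o = np.asarray(o, dtype=float)[:, None]          # shape (n, 1)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    if w is None:
        w = np.ones(len(mu))
    # P_j(o_i | theta_j) for every object/cluster pair, shape (n, k)
    pij = np.exp(-(o - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return float(np.log((w * pij).sum(axis=1)).sum())
```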

The EM (Expectation Maximization) Algorithm

  • The k-means algorithm has two steps at each iteration:
    • Expectation Step (E-step): Given the current cluster centers, each object is assigned to the cluster whose center is closest to the object: An object is expected to belong to the closest cluster
    • Maximization Step (M-step): Given the cluster assignment, for each cluster, the algorithm adjusts the center so that the sum of the distances from the objects assigned to this cluster to the new center is minimized
  • The (EM) algorithm: A framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.
    • E-step assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters
    • M-step finds the new clustering or parameters that minimize the sum of squared error (SSE) or maximize the expected likelihood
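
  • A minimal sketch of one k-means iteration written as an explicit E-step and M-step (illustrative Python, not part of the original material):

```python
import numpy as np

def kmeans_step(X, centers):
    """One k-means iteration phrased as an E-step and an M-step.

    X: (n, d) data objects; centers: (k, d) current cluster centers (float array).
    """
    # E-step: assign each object to the cluster whose center is closest
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
    assign = d2.argmin(axis=1)
    # M-step: move each center to the mean of the objects assigned to it
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = X[assign == j]
        if len(members) > 0:
            new_centers[j] = members.mean(axis=0)
    return assign, new_centers
```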

Fuzzy Clustering Using the EM Algorithm

  • Example (six points a(3, 3), b(4, 10), c(9, 6), d(14, 8), e(18, 11), f(21, 7), as used in the M-step computation below): initially, let c1 = a and c2 = b
  • 1st E-step: assign each object o to each cluster with a weight, e.g., for c1:
    \[w_{o,c_{1}} =\frac{\frac{1}{dist(o,c_{1})^2}}{\frac{1}{dist(o,c_{1})^2}+\frac{1}{dist(o,c_{2})^2}}=\frac{dist(o,c_{2})^2}{dist(o,c_{1})^2+dist(o,c_{2})^2}\]
    For example, for point c:
    \[w_{c,c_{1}}=\frac{41}{45+41}=0.48\]

  • 1st M-step:  recalculate the centroids according to the partition matrix, minimizing the sum of squared error (SSE)

\[ c_{j}=\frac{\sum_{\text{each point } o}w_{o,c_{j}}^{2}o }{\sum_{\text{each point } o}w_{o,c_{j}}^{2}} \]

\[c_{1}=(\frac{1^{2}\times 3+ 0^{2}\times4+ 0.48^{2}\times9+ 0.42^{2}\times14+ 0.41^{2}\times18+ 0.47^{2}\times21}{1^{2}+0^{2}+0.48^{2}+0.42^{2}+0.41^{2}+0.47^{2}},\frac{1^{2}\times3+ 0^{2}\times10+ 0.48^{2}\times6+ 0.42^{2}\times8+ 0.41^{2}\times11+ 0.47^{2}\times7}{1^{2}+0^{2}+0.48^{2}+0.42^{2}+0.41^{2}+0.47^{2}}) =(8.47,5.12) \]

  • Repeat the E- and M-steps until the cluster centers converge or the change is small enough (see the sketch below)
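
  • A runnable sketch of this example; the six points and the initial centers are taken from the worked example above, and the number of iterations is arbitrary:

```python
import numpy as np

# Six example points a, b, c, d, e, f and initial centers c1 = a, c2 = b
X = np.array([[3, 3], [4, 10], [9, 6], [14, 8], [18, 11], [21, 7]], dtype=float)
c = np.array([X[0], X[1]])                                        # c1 = a, c2 = b

for _ in range(10):                                               # iterate E- and M-steps
    d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)       # dist(o, c_j)^2, shape (6, 2)
    w = np.zeros_like(d2)
    for i in range(len(X)):
        if (d2[i] == 0).any():                                    # object coincides with a center
            w[i, d2[i].argmin()] = 1.0
        else:
            w[i] = (1.0 / d2[i]) / (1.0 / d2[i]).sum()            # E-step weights, as in the formula above
    c = (w.T ** 2 @ X) / (w.T ** 2).sum(axis=1, keepdims=True)    # M-step centroids (p = 2)

print(np.round(c, 2))
```

  • After the first iteration, c1 comes out close to the (8.47, 5.12) computed above; the small difference is due to the weights being rounded to two decimals in the worked example.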

Computing Mixture Models with EM

  • Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ = {θ1, …, θk} s.t. P(O|Θ) is maximized, where θj = (μj, σj) are the mean and standard deviation of the j-th univariate Gaussian distribution
  • We initially assign random values to the parameters θj, then iteratively conduct the E- and M-steps until the parameters converge or the change is sufficiently small
  • At the E-step, for each object oi, calculate the probability that oi belongs to each distribution,

\[P(\theta_{j} |o_{i},\theta)=\frac{P(o_{i}|\theta_{j})}{\sum_{l=1}^{k}P(o_{i} |\theta_{l})}\]

  • At the M-step, adjust the parameters θj = (μj, σj) so that the expected likelihood P(O|Θ) is maximized

\[\mu_{j}=\sum_{i=1}^{n}o_{i}\frac{P(\theta_{j} |o_{i},\theta)}{\sum_{l=1}^{n}P(\theta_{j} |o_{l},\theta)}=\frac{{\sum_{i=1}^{n}o_{i}P(\theta_{j} |o_{i},\theta)}}{{\sum_{i=1}^{n}P(\theta_{j} |o_{i},\theta)}}\]

\[\sigma_{j}=\sqrt{\frac{\sum_{i=1}^{n}P(\theta_{j} |o_{i},\theta)(o_{i}-\mu_ {j})^{2}}{\sum_{i=1}^{n}P(\theta_{j} |o_{i},\theta)}}\]
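
  • A sketch of one E-/M-step for the univariate Gaussian mixture, following the update formulas above (equal cluster weights are assumed, as in the E-step formula; names are illustrative):

```python
import numpy as np

def gmm_em_step(o, mu, sigma):
    """One E-step and M-step for a univariate Gaussian mixture with equal cluster weights.

    o: (n,) observations; mu, sigma: (k,) current means and standard deviations.
    """
    o = np.asarray(o, dtype=float)[:, None]                       # shape (n, 1)
    # E-step: P(theta_j | o_i, Theta) proportional to P(o_i | theta_j)
    pij = np.exp(-(o - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    resp = pij / pij.sum(axis=1, keepdims=True)                   # (n, k) responsibilities
    # M-step: responsibility-weighted mean and standard deviation per cluster
    mu_new = (resp * o).sum(axis=0) / resp.sum(axis=0)
    sigma_new = np.sqrt((resp * (o - mu_new) ** 2).sum(axis=0) / resp.sum(axis=0))
    return mu_new, sigma_new
```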

Advantages and Disadvantages of Mixture Models

  • Strength
    • Mixture models are more general than partitioning and fuzzy clustering
    • Clusters can be characterized by a small number of parameters
    • The results may satisfy the statistical assumptions of the generative models
  • Weakness
    • May converge to a local optimum (workaround: run multiple times with random initialization)
    • Computationally expensive if the number of distributions is large, or the data set contains very few observed data points
    • Need large data sets
    • Hard to estimate the number of clusters

Clustering High-Dimensional Data

  •  Clustering high-dimensional data (How high is high-D in clustering?)
    • Many applications: text documents, DNA micro-array data
    • Major challenges: 
      • Many irrelevant dimensions may mask clusters
      • Distance measure becomes meaningless—due to equi-distance
      • Clusters may exist only in some subspaces
  • Methods
    • Subspace-clustering: Search for clusters existing in subspaces of the given high dimensional data space
      • CLIQUE, ProClus, and bi-clustering approaches
    • Dimensionality reduction approaches: Construct a much lower-dimensional space and search for clusters there (may construct new dimensions by combining some dimensions in the original data)
      • Dimensionality reduction methods and spectral clustering

Traditional Distance Measures May Not Be Effective on High-D Data

  • Traditional distance measures can be dominated by noise in many dimensions
  • Ex. Which pairs of customers are more similar?
  • By Euclidean distance, we get, 

\[dist(Ada,Bob)=dist(Bob,Cathy)=dist(Ada,Cathy)=\sqrt{2}\]

    • even though Ada and Bob look less similar
  • Clustering should not only consider dimensions but also attributes (features)
    • Feature transformation: effective if most dimensions are relevant (PCA & SVD useful when features are highly correlated/redundant)
    • Feature selection: useful to find a subspace where the data have nice clusters

The Curse of Dimensionality

  • Data in only one dimension is relatively packed
  • Adding a dimension “stretches” the points across that dimension, pushing them further apart
  • Adding more dimensions makes the points still further apart: high-dimensional data is extremely sparse
  • Distance measures become meaningless due to equi-distance

Why Subspace Clustering?

  • Clusters may exist only in some subspaces
  • Subspace-clustering: find clusters in all the subspaces  

Subspace Clustering Methods

  • Subspace search methods: Search various subspaces to find clusters
    • Bottom-up approaches
    • Top-down approaches
  • Correlation-based clustering methods
    • E.g., PCA based approaches
  • Bi-clustering methods
    • Optimization-based methods
    • Enumeration methods

Subspace Clustering Method (I): Subspace Search Methods

  • Search various subspaces to find clusters
  • Bottom-up approaches
    • Start from low-D subspaces and search higher-D subspaces only when there may be clusters in such subspaces
    • Various pruning techniques to reduce the number of higher-D subspaces to be searched
    • Ex. CLIQUE (Agrawal et al. 1998)
  • Top-down approaches
    • Start from full space and search smaller subspaces recursively
    • Effective only if the locality assumption holds: restricts that the subspace of a cluster can be determined by the local neighborhood
    • Ex. PROCLUS (Aggarwal et al. 1999): a k-medoid-like method

CLIQUE: Subspace Clustering with Apriori Pruning

Subspace Clustering Method (II): Correlation-Based Methods

  • Subspace search method: similarity based on distance or density
  • Correlation-based method: based on advanced correlation models
  • Ex. PCA-based approach:
    • Apply PCA (for Principal Component Analysis) to derive a set of new, uncorrelated dimensions,
    • then mine clusters in the new space or its subspaces
  • Other space transformations:
    • Hough transform
    • Fractal dimensions

Subspace Clustering Method (III): Bi-Clustering Methods

  • Bi-clustering: Cluster both objects and attributes simultaneously (treat objects and attributes symmetrically)
  • Four requirements:
    • Only a small set of objects participate in a cluster
    • A cluster only involves a small number of attributes
    • An object may participate in multiple clusters, or does not participate in any cluster at all
    • An attribute may be involved in multiple clusters, or is not involved in any cluster at all
  • Ex 1. Gene expression or microarray data: a gene sample/condition matrix.
    • Each element in the matrix, a real number, records the expression level of a gene under a specific condition
  • Ex. 2. Clustering customers and products
    • Another bi-clustering problem

Types of Bi-clusters

  • Let A = {a1, ..., an} be a set of genes, B = {b1, …, bn} a set of conditions
  • A bi-cluster: A submatrix where genes and conditions follow some consistent patterns
  • 4 types of bi-clusters (ideal cases)
    • Bi-clusters with constant values:
      • for any i in I and j in J, eij = c
    • Bi-clusters with constant values on rows:

      • eij = c + αi
      • Also, it can be constant values on columns

    • Bi-clusters with coherent values (aka. pattern-based clusters)
      • eij = c + αi + βj
    • Bi-clusters with coherent evolutions on rows

      • (ei1j1 − ei1j2)(ei2j1 − ei2j2) ≥ 0
      • i.e., we are only interested in the up- or down-regulated changes across genes or conditions, without constraining the exact values

Bi-Clustering Methods

  • Real-world data is noisy: Try to find approximate bi-clusters
  • Methods: Optimization-based methods vs. enumeration methods
  • Optimization-based methods
    • Try to find a submatrix at a time that achieves the best significance as a bi-cluster
    • Due to the cost in computation, greedy search is employed to find local optimal bi-clusters
    • Ex. δ-Cluster Algorithm (Cheng and Church, ISMB’2000)
  • Enumeration methods
    • Use a tolerance threshold to specify the degree of noise allowed in the bi-clusters to be mined
    • Then try to enumerate all submatrices as bi-clusters that satisfy the requirements
    • Ex. δ-pCluster Algorithm (H. Wang et al.’ SIGMOD’2002, MaPle: Pei et al., ICDM’2003)

Bi-Clustering for Micro-Array Data Analysis

  • Left figure: micro-array “raw” data shows 3 genes and their values in a multi-dimensional space; it is difficult to find their patterns
  • Right two figures: some subsets of dimensions form nice shift and scaling patterns
  • No globally defined similarity/distance measure
  • Clusters may not be exclusive
    • An object can appear in multiple clusters

Bi-Clustering (I): δ-Bi-Cluster

  • For a submatrix I x J
    • The mean of the i-th row:

\[e_{iJ}=\frac{1}{|J|} \sum_{j\epsilon J}e_{ij}\]

    • The mean of the j-th column:

\[e_{Ij}=\frac{1}{|I|} \sum_{i\epsilon I}e_{ij}\]

    • The mean of all elements in the submatrix:

\[e_{IJ}=\frac{1}{|I||J|} \sum_{i\epsilon I,j\epsilon J}e_{ij}=\frac{1}{|I|} \sum_{i\epsilon I}e_{iJ}=\frac{1}{|J|} \sum_{j\epsilon J}e_{Ij}\]

    • The quality of the submatrix as a bi-cluster can be measured by the mean squared residue value

\[H(I\times J)=\frac{1}{|I||J|} \sum_{i\epsilon I,j\epsilon J}(e_{ij}-e_{iJ}-e_{Ij}+e_{IJ})^{2}\]

  • A submatrix I x J is a δ-bi-cluster if H(I x J) ≤ δ, where δ ≥ 0 is a threshold. When δ = 0, I x J is a perfect bi-cluster with coherent values. By setting δ > 0, a user specifies the tolerance of average noise per element against a perfect bi-cluster, since
    • residue(eij) = eij − eiJ − eIj + eIJ
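
  • A small NumPy sketch of the mean squared residue H(I x J) for a submatrix given as an array E (an illustrative helper, not from the original material):

```python
import numpy as np

def mean_squared_residue(E):
    """H(I x J) for a submatrix E (rows = selected genes I, columns = conditions J)."""
    e_iJ = E.mean(axis=1, keepdims=True)     # row means e_iJ
    e_Ij = E.mean(axis=0, keepdims=True)     # column means e_Ij
    e_IJ = E.mean()                          # overall mean e_IJ
    residue = E - e_iJ - e_Ij + e_IJ         # residue(e_ij)
    return float((residue ** 2).mean())
```

  • For a perfect bi-cluster with coherent values, e.g., E[i, j] = c + alpha[i] + beta[j], every residue is 0 and hence H = 0.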

Bi-Clustering (I): The δ-Cluster Algorithm

  • Maximal δ-bi-cluster is a δ-bi-cluster I x J such that there does not exist another δ-bi-cluster I′ x J′ which contains I x J
  • Computing is costly: Use heuristic greedy search to obtain local optimal clusters
  • Two-phase computation: a deletion phase and an addition phase
  • Deletion phase: Start from the whole matrix, iteratively remove rows and columns while the mean squared residue of the matrix is over δ
    • At each iteration, for each row/column, compute the mean squared residue:

\[d(i)=\frac{1}{|J|}\sum_{j\epsilon J}(e_{ij}-e_{iJ}-e_{Ij}+e_{IJ})^{2}\]

\[d(j)=\frac{1}{|I|}\sum_{i\epsilon I}(e_{ij}-e_{iJ}-e_{Ij}+e_{IJ})^{2}\]

    • Remove the row or column of the largest mean squared residue
  • Addition phase:
    • Expand iteratively the δ-bi-cluster I x J obtained in the deletion phase as long as the δ-bi-cluster requirement is maintained
    • Consider all the rows/columns not involved in the current bi-cluster I x J by calculating their mean squared residues
    • A row/column of the smallest mean squared residue is added into the current δ-bi-cluster
  • The algorithm finds only one δ-bi-cluster at a time; to find more, it is run multiple times, each time replacing the elements of the bi-cluster just found with random numbers
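
  • A sketch of the deletion phase only (the addition phase and the multiple-run loop are omitted); row/column indices are tracked as plain Python lists:

```python
import numpy as np

def delta_cluster_deletion(E, delta):
    """Single-node deletion phase of the delta-cluster heuristic (illustrative sketch).

    Starting from the whole matrix E, repeatedly drop the row or column with the
    largest mean squared residue until H(I x J) <= delta.
    """
    rows = list(range(E.shape[0]))
    cols = list(range(E.shape[1]))
    while len(rows) > 1 and len(cols) > 1:
        sub = E[np.ix_(rows, cols)]
        res = sub - sub.mean(axis=1, keepdims=True) - sub.mean(axis=0, keepdims=True) + sub.mean()
        if (res ** 2).mean() <= delta:                # H(I x J) is small enough
            break
        d_i = (res ** 2).mean(axis=1)                 # d(i) for each remaining row
        d_j = (res ** 2).mean(axis=0)                 # d(j) for each remaining column
        if d_i.max() >= d_j.max():
            rows.pop(int(d_i.argmax()))               # drop the worst row
        else:
            cols.pop(int(d_j.argmax()))               # drop the worst column
    return rows, cols
```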

Bi-Clustering (II): δ-pCluster

  • Enumerating all bi-clusters (δ-pClusters) [H. Wang, et al., Clustering by pattern similarity in large data sets. SIGMOD’02]
  • A submatrix I x J is a bi-cluster with (perfect) coherent values iff ei1j1 − ei2j1 = ei1j2 − ei2j2. For any 2 x 2 submatrix of I x J, define the p-score

\[p-score \begin{pmatrix} e_{i1j1} &e_{i1j2} \\ e_{i2j1}&e_{i2j2} \end{pmatrix} =|(e_{i1j1}-e_{i2j1})-(e_{i1j2}-e_{i2j2})|\]

  • A submatrix I x J is a δ-pCluster (pattern-based cluster) if the p-score of every 2 x 2 submatrix of I x J is at most δ, where δ ≥ 0 is a threshold specifying a user's tolerance of noise against a perfect bi-cluster
  • The p-score controls the noise on every element in a bi-cluster, while the mean squared residue captures the average noise
  • Monotonicity: If I x J is a δ-pCluster, every x x y (x, y ≥ 2) submatrix of I x J is also a δ-pCluster.
  • A δ-pCluster is maximal if no further row or column can be added while keeping the δ-pCluster property: we only need to compute all maximal δ-pClusters.
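
  • A brute-force sketch that checks the δ-pCluster condition by enumerating all 2 x 2 submatrices (fine for illustration, not for large matrices):

```python
import numpy as np
from itertools import combinations

def is_delta_pcluster(E, delta):
    """Check whether submatrix E is a delta-pCluster: every 2 x 2 submatrix must have
    p-score |(e_i1j1 - e_i2j1) - (e_i1j2 - e_i2j2)| <= delta."""
    n, m = E.shape
    for i1, i2 in combinations(range(n), 2):
        for j1, j2 in combinations(range(m), 2):
            p_score = abs((E[i1, j1] - E[i2, j1]) - (E[i1, j2] - E[i2, j2]))
            if p_score > delta:
                return False
    return True
```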

MaPle: Efficient Enumeration of δ-pClusters

  • Pei et al., MaPle: Efficient enumerating all maximal δ-pClusters. ICDM'03
  • Framework: Same as pattern-growth in frequent pattern mining (based on the downward closure property)
  • For each condition combination J, find the maximal subsets of genes I such that I x J is a δ-pCluster
    • If I x J is not a submatrix of another δ-pCluster,
    • then I x J is a maximal δ-pCluster.
  • Algorithm is very similar to mining frequent closed itemsets
  • Additional advantages of δ-pClusters:
    • Because the δ-cluster model averages residues, a δ-cluster may contain outliers yet still stay within the δ threshold; the p-score avoids this
    • Bi-clusters with scaling patterns can also be computed: taking the logarithm of

\[\frac{d_{xa}/d_{ya}}{d_{xb}/d_{yb}}<\delta \]

      leads to the p-score (shift-pattern) form

Dimensionality-Reduction Methods

  • Dimensionality reduction: In some situations, it is more effective to construct a new space instead of using some subspaces of the original data

  • Ex. To cluster the points in the figure below, no subspace of the original space (the X or Y axis alone) helps, since all three clusters would be projected onto overlapping regions of the X and Y axes.

    • If we construct a new dimension (the dashed line), the three clusters become apparent when the points are projected onto it
  • Dimensionality reduction methods
    • Feature selection and extraction: may not, however, focus on finding the clustering structure
    • Spectral clustering: Combining feature extraction and clustering (i.e., use the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions)
      • Normalized Cuts (Shi and Malik, CVPR’97 or PAMI’2000)
      • The Ng-Jordan-Weiss algorithm (NIPS’01)

Spectral Clustering: The Ng-Jordan-Weiss (NJW) Algorithm

  • Given a set of objects o1, …, on, and the distance between each pair of objects, dist(oi, oj), find the desired number k of clusters
  • Calculate an affinity matrix W, where σ is a scaling parameter that controls how fast the affinity Wij decreases as dist(oi, oj) increases. In NJW, set Wii = 0:

\[W_{ij}=e^{-\frac{dist(o_{i},o_{j})^{2}}{\sigma^{2}}}\]

  • Derive a matrix A = f(W). NJW defines a matrix D to be a diagonal matrix s.t. Dii is the sum of the i-th row of W, i.e.,

\[D_{ii}=\sum_{j=1}^{n}W_{ij}\]

    • Then, A is set to

\[A=D^{-\frac{1}{2}}WD^{-\frac{1}{2}}\]

  • A spectral clustering method finds the k leading eigenvectors of A
    • A vector v is an eigenvector of matrix A if Av = λv, where λ is the corresponding eigenvalue
  • Using the k leading eigenvectors, project the original data into the new space defined by the k leading eigenvectors, and run a clustering algorithm, such as k-means, to find k clusters
  • Assign the original data points to clusters according to how the transformed points are assigned in the clusters obtained
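
  • A compact sketch of the NJW pipeline above; it assumes NumPy and scikit-learn's KMeans are available, and the function and parameter names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(dist, k, sigma=1.0):
    """Sketch of the NJW algorithm, given a symmetric pairwise distance matrix `dist`."""
    W = np.exp(-dist ** 2 / sigma ** 2)                     # affinity matrix (formula above)
    np.fill_diagonal(W, 0.0)                                # NJW sets W_ii = 0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    A = D_inv_sqrt @ W @ D_inv_sqrt                         # A = D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(A)                             # eigenvalues in ascending order
    U = vecs[:, -k:]                                        # k leading eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)        # row-normalize (NJW step)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)   # cluster in the new space
```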

Spectral Clustering: Illustration and Comments

  • Spectral clustering: Effective in tasks like image processing 
  • Scalability challenge: Computing eigenvectors on a large matrix is costly
  • Can be combined with other clustering methods, such as bi-clustering

Clustering Graphs and Network Data

  • Applications
    • Bi-partite graphs, e.g., customers and products, authors and conferences
    • Web search engines, e.g., click through graphs and Web graphs
    • Social networks, friendship/coauthor graphs
  • Similarity measures
    • Geodesic distances
    • Distance based on random walk (SimRank)
  • Graph clustering methods
    • Minimum cuts: FastModularity (Clauset, Newman & Moore, 2004)
    • Density-based clustering: SCAN (Xu et al., KDD’2007)

Similarity Measure (I): Geodesic Distance

  • Geodesic distance (A, B): length (i.e., # of edges) of the shortest path between A and B (if not connected, defined as infinite)
  • Eccentricity of v, eccen(v): The largest geodesic distance between v and any other vertex u ∈ V − {v}.
    • E.g., eccen(a) = eccen(b) = 2; eccen(c) = eccen(d) = eccen(e) = 3
  • Radius of graph G: The minimum eccentricity of all vertices, i.e., the distance between the “most central point” and the “farthest border”
    • r = min v∈V eccen(v)
    • E.g., radius (g) = 2
  • Diameter of graph G: The maximum eccentricity of all vertices, i.e., the largest distance between any pair of vertices in G
    • d = max v∈V eccen(v)
    • E.g., diameter (g) = 3
  • A peripheral vertex is a vertex that achieves the diameter.
    • E.g., Vertices c, d, and e are peripheral vertices
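
  • A small sketch computing the eccentricity of every vertex (and hence the radius and diameter) by breadth-first search on an unweighted graph; the adjacency-dict representation is an assumption for illustration:

```python
from collections import deque

def eccentricities(adj):
    """Eccentricity of every vertex in an unweighted graph given as {v: set_of_neighbors};
    geodesic distances are computed by BFS."""
    ecc = {}
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        # a vertex unreachable from src makes the eccentricity infinite
        ecc[src] = max(dist.values()) if len(dist) == len(adj) else float("inf")
    return ecc

# radius = min(ecc.values()), diameter = max(ecc.values())
```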

SimRank: Similarity Based on Random Walk and Structural Context

  • SimRank: structural-context similarity, i.e., two vertices are considered similar if their (in-)neighbors are similar
  • In a directed graph G = (V,E),
    • individual in-neighborhood of v: I(v) = {u | (u, v) ∈ E}
    • individual out-neighborhood of v: O(v) = {w | (v, w) ∈ E}
  • Similarity in SimRank: 

\[s(u,v)=\frac{C}{|I(u)||I(v)|}\sum_{x\epsilon I(u)}\sum_{y\epsilon I(v) }s(x,y) \]

    • Initialization:

\[ s_{0}(u,v)=\left\{\begin{matrix}0, if u\neq v \\ 1,if u=v \end{matrix}\right. \]

    • Then we can compute si+1 from si based on the definition
  • Similarity based on random walk: in a strongly connected component
    • Expected distance:

\[d(u,v)=\sum_{t:(u\rightarrow v)}P[t]l(t)\]


SimRank: Similarity Based on Random Walk and Structural Context (cont')

    • Expected meeting distance:

\[m(u,v)=\sum_{t:(u,v)\rightarrow (x,x)}P[t]l(t)\]

    • Expected meeting probability:

\[p(u,v)=\sum_{t:(u,v)\rightarrow (x,x)}P[t]C^{l(t)}\]

P[t] is the probability of tour t and l(t) is its length; for a tour t = w1 → w2 → … → wk,

\[ P[t]=\left\{\begin{matrix}\prod_{i=1}^{k-1}\frac{1}{|O(w_{i})|}, if l(t)>0 \\0,if l(t)=0 \end{matrix}\right. \]
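
  • A sketch of the fixed-point iteration for SimRank based on the definition above; the graph is given by its in-neighborhoods, and C is the decay constant:

```python
def simrank(in_nbrs, C=0.8, iters=10):
    """Iterative SimRank on a directed graph.

    in_nbrs: {v: set of u with edge (u, v)}, covering every vertex as a key.
    """
    nodes = list(in_nbrs)
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}   # s_0
    for _ in range(iters):
        s_new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    s_new[(u, v)] = 1.0
                elif in_nbrs[u] and in_nbrs[v]:
                    total = sum(s[(x, y)] for x in in_nbrs[u] for y in in_nbrs[v])
                    s_new[(u, v)] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
                else:
                    s_new[(u, v)] = 0.0                                   # no in-neighbors
        s = s_new
    return s
```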

Graph Clustering: Sparsest Cut

  • G = (V, E). A cut partitions V into two sets S and T; its cut set is the set of edges {(u, v) ∈ E | u ∈ S, v ∈ T}

  • Size of the cut: # of edges in the cut set
  • Min-cut (e.g., C1) is not a good partition
  • A better measure: Sparsity:

\[\varphi =\frac{\text{the size of the cut}}{\min\{|S|,|T|\}}\]

  • A cut is sparsest if its sparsity is not greater than that of any other cut
  • Ex. Cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cut
  • For k clusters, the modularity of a clustering assesses the quality of the clustering:

\[Q=\sum_{i=1}^{k}(\frac{l_{i}}{|E|}-(\frac{d_{i}}{2|E|})^{2})\]

li: # edges between vertices in the i-th cluster

di: the sum of the degrees of the vertices in the i-th cluster

  • The modularity of a clustering of a graph is the difference between the fraction of all edges that fall into individual clusters and the fraction that would do so if the graph vertices were randomly connected
  • The optimal clustering of graphs maximizes the modularity
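
  • A small sketch computing the modularity Q of a given clustering of an undirected graph (the edge-list and cluster-assignment representations are assumptions for illustration):

```python
def modularity(edges, cluster_of):
    """Q = sum_i ( l_i/|E| - (d_i / 2|E|)^2 ) for an undirected graph.

    edges: list of (u, v) pairs; cluster_of: dict mapping vertex -> cluster id.
    """
    m = len(edges)
    l = {}   # l_i: number of edges inside each cluster
    d = {}   # d_i: sum of the degrees of the vertices in each cluster
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        d[cu] = d.get(cu, 0) + 1
        d[cv] = d.get(cv, 0) + 1
        if cu == cv:
            l[cu] = l.get(cu, 0) + 1
    return sum(l.get(c, 0) / m - (d[c] / (2 * m)) ** 2 for c in d)
```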

Graph Clustering: Challenges of Finding Good Cuts

  • High computational cost
    • Many graph cut problems are computationally expensive
    • The sparsest cut problem is NP-hard
    • Need to tradeoff between efficiency/scalability and quality
  • Sophisticated graphs
    • May involve weights and/or cycles.
  • High dimensionality
    • A graph can have many vertices. In a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graph
  • Sparsity
    • A large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices
    • A similarity matrix from a large sparse graph can also be sparse

Two Approaches for Graph Clustering

  • Two approaches for clustering graph data
    • Use generic clustering methods for high-dimensional data
    • Designed specifically for clustering graphs
  • Using clustering methods for high-dimensional data
    • Extract a similarity matrix from a graph using a similarity measure
    • A generic clustering method can then be applied on the similarity matrix to discover clusters
    • Ex. Spectral clustering: approximate optimal graph cut solutions
  • Methods specific to graphs
    • Search the graph to find well-connected components as clusters
    • Ex. SCAN (Structural Clustering Algorithm for Networks)
      • X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A Structural Clustering Algorithm for Networks”, KDD'07

SCAN: Density-Based Clustering of Networks

  • How many clusters?
  • What size should they be?
  • What is the best partitioning?
  • Should some points be segregated?

An Example Network

  • Application: Given only information about who associates with whom, can one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?

A Social Network Model

  • Cliques, hubs and outliers
    • Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group
    • Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups
    • Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group
  • The Neighborhood of a Vertex
    • Define Γ(v) as the immediate neighborhood of a vertex (i.e., the set of people that an individual knows)
  • Structure Similarity

    • The desired features tend to be captured by a measure we call structural similarity
      \[\sigma(v,w)=\frac{|\Gamma (v)\cap \Gamma (w)|}{\sqrt{|\Gamma (v)|| \Gamma (w)|}} \]

    • Structural similarity is large for members of a clique and small for hubs and outliers

Structural Connectivity [1]

  • ε-Neighborhood:

\[ N_{\varepsilon }(v)=\left \{ w\epsilon \Gamma (v)|\sigma (v,w)\geq \varepsilon \right \} \]

  • Core:

\[CORE_{\varepsilon,\mu  }(v)\Leftrightarrow |N_{\varepsilon } (v)|\geq \mu \]

  • Directly structure-reachable:

\[DirREACH_{\varepsilon,\mu  }(v,w)\Leftrightarrow CORE_{\varepsilon,\mu  }(v)\wedge w\epsilon N_{\varepsilon } (v)\]

  • Structure-reachable: transitive closure of direct structure reachability
  • Structure-connected:

\[CONNECT_{\varepsilon,\mu  }(v,w)\Leftrightarrow \exists u\epsilon V:REACH_{\varepsilon,\mu  }(u,v)\wedge REACH_{\varepsilon,\mu  }(u,w)\]

Structure-Connected Clusters

  • Structure-connected cluster C
    • Connectivity:

\[\forall v,w\epsilon C:CONNECT_{\varepsilon ,\mu}(v,w)\]

    • Maximality:

\[\forall v,w\epsilon V:v\epsilon C\wedge REACH_{\varepsilon ,\mu}(v,w)\Rightarrow w\epsilon C\]

  • Hubs:
    • Do not belong to any cluster
    • Bridge many clusters
  • Outliers:
    • Do not belong to any cluster
    • Connect to fewer clusters
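
  • A minimal sketch of the structural similarity and the core test; here Γ(v) is taken to include v itself (the closed neighborhood), as in the SCAN paper:

```python
import math

def structural_similarity(adj, v, w):
    """sigma(v, w) = |Gamma(v) ∩ Gamma(w)| / sqrt(|Gamma(v)| |Gamma(w)|),
    with adj a dict {vertex: set of neighbors} and Gamma(x) = neighbors(x) ∪ {x}."""
    gv = adj[v] | {v}
    gw = adj[w] | {w}
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

def is_core(adj, v, eps, mu):
    """CORE_{eps,mu}(v): v has at least mu members in its eps-neighborhood."""
    n_eps = [w for w in adj[v] | {v} if structural_similarity(adj, v, w) >= eps]
    return len(n_eps) >= mu
```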

Algorithm

  • (Step-by-step illustration of SCAN on an example network: figure-only slides.)