Partitioning Algorithms: Basic Concept

  • Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci)

\[E=\sum_{i=1}^{k}\sum_{p\epsilon C_{i}}(d(p,c_{i}))^{2}\]

  • Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
    • Global optimal: exhaustively enumerate all partitions
    • Heuristic methods: k-means and k-medoids algorithms
    • k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster
    • k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

  • Given k, the k-means algorithm is implemented in four steps:
    • Partition objects into k nonempty subsets
    • Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
    • Assign each object to the cluster with the nearest seed point
    • Go back to Step 2, stop when the assignment does not change

An Example of K-Means Clustering

Comments on the K-Means Method

  • Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
      • Comparing: 

\[PAM: O(k(n-k)^{2}), CLARA: O(ks^{2} + k(n-k))\]

  • Comment: Often terminates at a local optimal
  • Weakness
    • Applicable only to objects in a continuous n-dimensional space
      • Using the k-modes method for categorical data
      • In comparison, k-medoids can be applied to a wide range of data
    • Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k (see Hastie et al., 2009)
    • Sensitive to noisy data and outliers
    • Not suitable to discover clusters with non-convex shapes

Variations of the K-Means Method

  • Most of the variants of the k-means which differ in
    • Selection of the initial k means
    • Dissimilarity calculations
    • Strategies to calculate cluster means
  • Handling categorical data: k-modes
    • Replacing means of clusters with modes
    • Using new dissimilarity measures to deal with categorical objects
    • Using a frequency-based method to update modes of clusters
    • A mixture of categorical and numerical data: k-prototype method

What Is the Problem of the K-Means Method?

  • The k-means algorithm is sensitive to outliers !
    • Since an object with an extremely large value may substantially distort the distribution of the data
  • K-Medoids:  Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster

The K-Medoid Clustering Method

  • K-Medoids Clustering: Find representative objects (medoids) in clusters
    • PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
      • Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
      • PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
  • Efficiency improvement on PAM
    • CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
    • CLARANS (Ng & Han, 1994): Randomized re-sampling

PAM (Partitioning Around Medoids) (1987)

  • PAM (Kaufman and Rousseeuw, 1987), built in Splus
  • Use real object to represent the cluster
    • Select k representative objects arbitrarily
    • For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih
    • For each pair of i and h,
      • If TCih < 0, i is replaced by h
      • Then assign each non-selected object to the most similar representative object
    • repeat steps 2-3 until there is no change

PAM: A Typical K-Medoids Algorithm

PAM Clustering: Finding the Best Cluster Center

  • Case 1: p currently belongs to oj. If oj is replaced by orandom as a representative object and p is the closest to one of the other representative object oi, then p is reassigned to oi

What Is the Problem with PAM?

  • Pam is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
  • Pam works efficiently for small data sets but does not scale well for large data sets.
    • O(k(n-k)^2 ) for each iteration
      where n is # of data,k is # of clusters
  • Sampling-based method
    CLARA(Clustering LARge Applications)

CLARA (Clustering Large Applications) (1990)

  • CLARA (Kaufmann and Rousseeuw in 1990)
    • Built in statistical analysis packages, such as SPlus
    • It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
  • Strength: deals with larger data sets than PAM
  • Weakness:
    • Efficiency depends on the sample size
    • A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

CLARANS (“Randomized” CLARA) (1994)

  • CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)
    • Draws sample of neighbors dynamically
    • The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
    • If the local optimum is found, it starts with new randomly selected node in search for a new local optimum
  • Advantages: More efficient and scalable than both PAM and CLARA
  • Further improvement: Focusing techniques and spatial access structures (Ester et al.’95)

ROCK: Clustering Categorical Data

  • ROCK: RObust Clustering using linKs
    • S. Guha, R. Rastogi & K. Shim, ICDE’99
  • Major ideas
    • Use links to measure similarity/proximity
    • Not distance-based
  • Algorithm: sampling-based clustering
    • Draw random sample
    • Cluster with links
    • Label data in disk
  • Experiments
    • Congressional voting, mushroom data

Similarity Measure in ROCK

  • Traditional measures for categorical data may not work well, e.g., Jaccard coefficient
  • Example: Two groups (clusters) of transactions
    • C1. : {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
    • C2. : {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
  • Jaccard co-efficient may lead to wrong clustering result
    • C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d})
    • C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
  • Jaccard co-efficient-based similarity function:

\[Sim(T_{1},T_{2})=\frac{|T_{1} \bigcap T_{2}|}{|T_{1} \bigcup T_{2}|}\]

    • Ex. Let T1 = {a, b, c}, T2 = {c, d, e}

\[Sim(T_{1},T_{2})=\frac{|\left \{ c \right \}|}{|a,b,c,d,e|}=\frac{1}{5}=0.2\]

Link Measure in ROCK

  • Clusters
    • C1:: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
    • C2: : {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
  • Neighbors
    • Two transactions are neighbors if sim(T1,T2) > threshold
    • Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
      • T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}, {a,b,f}, {a,b,g}
      • T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}
      • T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}
  • Link Similarity
    • Link similarity between two transactions is the # of common neighbors
    • link(T1, T2) = 4, since they have 4 common neighbors
      • {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
    • link(T1, T3) = 3, since they have 3 common neighbors
      • {a, b, d}, {a, b, e}, {a, b, g}

Rock Algorithm

  • Method
    • Compute similarity matrix
      • Use link similarity
    • Run agglomerative hierarchical clustering
    • When the data set is big
      • Get sample of transactions
      • Cluster sample
  • Problems:
    • Guarantee cluster interconnectivity
      • any two transactions in a cluster are very well connected
    • Ignores information about closeness of two clusters
      • two separate clusters may still be quite connected