### Partitioning Algorithms: Basic Concept

• Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci)

$E=\sum_{i=1}^{k}\sum_{p\epsilon C_{i}}(d(p,c_{i}))^{2}$

• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster

### The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
• Partition objects into k nonempty subsets
• Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
• Assign each object to the cluster with the nearest seed point
• Go back to Step 2, stop when the assignment does not change

### Comments on the K-Means Method

• Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
• Comparing:

$PAM: O(k(n-k)^{2}), CLARA: O(ks^{2} + k(n-k))$

• Comment: Often terminates at a local optimal
• Weakness
• Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
• Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k (see Hastie et al., 2009)
• Sensitive to noisy data and outliers
• Not suitable to discover clusters with non-convex shapes

### Variations of the K-Means Method

• Most of the variants of the k-means which differ in
• Selection of the initial k means
• Dissimilarity calculations
• Strategies to calculate cluster means
• Handling categorical data: k-modes
• Replacing means of clusters with modes
• Using new dissimilarity measures to deal with categorical objects
• Using a frequency-based method to update modes of clusters
• A mixture of categorical and numerical data: k-prototype method

### What Is the Problem of the K-Means Method?

• The k-means algorithm is sensitive to outliers !
• Since an object with an extremely large value may substantially distort the distribution of the data
• K-Medoids:  Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster

### The K-Medoid Clustering Method

• K-Medoids Clustering: Find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
• Efficiency improvement on PAM
• CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
• CLARANS (Ng & Han, 1994): Randomized re-sampling

### PAM (Partitioning Around Medoids) (1987)

• PAM (Kaufman and Rousseeuw, 1987), built in Splus
• Use real object to represent the cluster
• Select k representative objects arbitrarily
• For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih
• For each pair of i and h,
• If TCih < 0, i is replaced by h
• Then assign each non-selected object to the most similar representative object
• repeat steps 2-3 until there is no change

### PAM Clustering: Finding the Best Cluster Center

• Case 1: p currently belongs to oj. If oj is replaced by orandom as a representative object and p is the closest to one of the other representative object oi, then p is reassigned to oi

### What Is the Problem with PAM?

• Pam is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
• Pam works efficiently for small data sets but does not scale well for large data sets.
• O(k(n-k)^2 ) for each iteration
where n is # of data,k is # of clusters
• Sampling-based method
CLARA(Clustering LARge Applications)

### CLARA (Clustering Large Applications) (1990)

• CLARA (Kaufmann and Rousseeuw in 1990)
• Built in statistical analysis packages, such as SPlus
• It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weakness:
• Efficiency depends on the sample size
• A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

### CLARANS (“Randomized” CLARA) (1994)

• CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)
• Draws sample of neighbors dynamically
• The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
• If the local optimum is found, it starts with new randomly selected node in search for a new local optimum
• Advantages: More efficient and scalable than both PAM and CLARA
• Further improvement: Focusing techniques and spatial access structures (Ester et al.’95)

### ROCK: Clustering Categorical Data

• ROCK: RObust Clustering using linKs
• S. Guha, R. Rastogi & K. Shim, ICDE’99
• Major ideas
• Use links to measure similarity/proximity
• Not distance-based
• Algorithm: sampling-based clustering
• Draw random sample
• Label data in disk
• Experiments
• Congressional voting, mushroom data

### Similarity Measure in ROCK

• Traditional measures for categorical data may not work well, e.g., Jaccard coefficient
• Example: Two groups (clusters) of transactions
• C1. : {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
• C2. : {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
• Jaccard co-efficient may lead to wrong clustering result
• C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d})
• C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
• Jaccard co-efficient-based similarity function:

$Sim(T_{1},T_{2})=\frac{|T_{1} \bigcap T_{2}|}{|T_{1} \bigcup T_{2}|}$

• Ex. Let T1 = {a, b, c}, T2 = {c, d, e}

$Sim(T_{1},T_{2})=\frac{|\left \{ c \right \}|}{|a,b,c,d,e|}=\frac{1}{5}=0.2$

• Clusters
• C1:: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
• C2: : {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
• Neighbors
• Two transactions are neighbors if sim(T1,T2) > threshold
• Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
• T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}, {a,b,f}, {a,b,g}
• T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}
• T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}
• Link similarity between two transactions is the # of common neighbors
• link(T1, T2) = 4, since they have 4 common neighbors
• {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
• link(T1, T3) = 3, since they have 3 common neighbors
• {a, b, d}, {a, b, e}, {a, b, g}

### Rock Algorithm

• Method
• Compute similarity matrix