### Partitioning Algorithms: Basic Concept

- Partitioning method: partition a database *D* of *n* objects into a set of *k* clusters, such that the sum of squared distances is minimized (where *c_i* is the centroid or medoid of cluster *C_i*)

\[E=\sum_{i=1}^{k}\sum_{p\in C_{i}}(d(p,c_{i}))^{2}\]

- Given *k*, find a partition of *k clusters* that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the *k-means* and *k-medoids* algorithms
    - *k-means* (MacQueen '67, Lloyd '57/'82): each cluster is represented by the center of the cluster
    - *k-medoids* or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster

### The *K-Means* Clustering Method

- Given *k*, the *k-means* algorithm is implemented in four steps:
  - Partition objects into *k* nonempty subsets
  - Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., *mean point*, of the cluster)
  - Assign each object to the cluster with the nearest seed point
  - Go back to Step 2; stop when the assignment no longer changes
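The four steps above can be sketched in Python. This is a minimal illustration of Lloyd's algorithm for small lists of 2-D points, not a production implementation; the helper names `dist2` and `mean` are my own:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Centroid (mean point) of a non-empty cluster."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)       # Step 1: arbitrary initial seeds
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]      # Step 3: assign to nearest seed point
        for p in points:
            i = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[i].append(p)
        new = [mean(c) if c else centroids[i]  # Step 2: recompute centroids
               for i, c in enumerate(clusters)]
        if new == centroids:                   # Step 4: stop when nothing changes
            return centroids, clusters
        centroids = new
    return centroids, clusters
```

On two well-separated groups of points, this recovers the obvious partition regardless of which objects are drawn as initial seeds.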

### An Example of K-Means Clustering

### Comments on the K-Means Method

- Strength: *efficient*: *O*(*tkn*), where *n* is # objects, *k* is # clusters, and *t* is # iterations. Normally, *k*, *t* << *n*.
- Comparison:

\[PAM: O(k(n-k)^{2}),\quad CLARA: O(ks^{2}+k(n-k))\]

- Comment: often terminates at a *local optimum*
- Weaknesses
  - Applicable only to objects in a continuous n-dimensional space
    - Use the k-modes method for categorical data
    - In comparison, k-medoids can be applied to a wider range of data
  - Need to specify *k*, the *number* of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
  - Sensitive to noisy data and *outliers*
  - Not suitable for discovering clusters with *non-convex shapes*

### Variations of the K-Means Method

- Most variants of *k-means* differ in
  - Selection of the initial *k* means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: *k-modes*
  - Replaces cluster means with modes
  - Uses new dissimilarity measures to deal with categorical objects
  - Uses a frequency-based method to update the modes of clusters
- A mixture of categorical and numerical data: the *k-prototype* method

### What Is the Problem of the K-Means Method?

- The k-means algorithm is sensitive to outliers!
  - An object with an extremely large value may substantially distort the distribution of the data
- K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster
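A tiny numeric illustration of this difference, on hypothetical 1-D data (the medoid here is computed as the object minimizing total distance to all others):

```python
points = [1, 2, 3, 4, 100]  # 100 is an extreme outlier

# The mean is dragged far toward the outlier
mean = sum(points) / len(points)  # 22.0

# The medoid stays in the middle of the bulk of the data:
# it is the actual object with the smallest total distance to the rest
medoid = min(points, key=lambda c: sum(abs(p - c) for p in points))  # 3
```

A reference point of 22 represents none of the first four objects well, while the medoid 3 is barely affected by the outlier.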

### The K-Medoid Clustering Method

- *K-medoids* clustering: find *representative* objects (medoids) in clusters
- *PAM* (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids with one of the non-medoids if it improves the total distance of the resulting clustering
  - *PAM* works effectively for small data sets but does not scale well to large data sets (due to its computational complexity)
- Efficiency improvements on PAM
  - *CLARA* (Kaufmann & Rousseeuw, 1990): PAM on samples
  - *CLARANS* (Ng & Han, 1994): randomized re-sampling

### PAM (Partitioning Around Medoids) (1987)

- PAM (Kaufman and Rousseeuw, 1987), built into S-PLUS
- Uses real objects to represent the clusters
  - Select *k* representative objects arbitrarily
  - For each pair of a non-selected object *h* and a selected object *i*, calculate the total swapping cost *TC_ih*
  - For each pair of *i* and *h*,
    - If *TC_ih* < 0, *i* is replaced by *h*
    - Then assign each non-selected object to the most similar representative object
  - Repeat steps 2-3 until there is no change
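The swap loop above can be sketched as follows. This naive version recomputes the full cost for each candidate swap instead of using the incremental *TC_ih* bookkeeping of real PAM; the distance function is supplied by the caller:

```python
def total_cost(points, medoids, dist):
    """Sum of each object's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    medoids = list(points[:k])        # arbitrary initial representatives
    improved = True
    while improved:
        improved = False
        for m in list(medoids):       # each selected object i
            for h in points:          # each non-selected object h
                if h in medoids:
                    continue
                candidate = [h if x == m else x for x in medoids]
                # perform the swap only if it lowers total cost (TC_ih < 0)
                if total_cost(points, candidate, dist) < total_cost(points, medoids, dist):
                    medoids = candidate
                    improved = True
    return medoids
```

With Manhattan distance on two small, well-separated groups, the search settles on one medoid per group.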

### PAM: A Typical K-Medoids Algorithm

### PAM Clustering: Finding the Best Cluster Center

- Case 1: p currently belongs to *o_j*. If *o_j* is replaced by *o_random* as a representative object and p is closest to one of the other representative objects *o_i*, then p is reassigned to *o_i*

### What Is the Problem with PAM?

- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not **scale well** to large data sets
  - O(k(n-k)^2) for each iteration, where n is # of data points and k is # of clusters
- Sampling-based method: CLARA (Clustering LARge Applications)

### *CLARA* (Clustering LARge Applications) (1990)

- *CLARA* (Kaufmann and Rousseeuw, 1990), built into statistical analysis packages such as S-PLUS
- It draws *multiple samples* of the data set, applies *PAM* on each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than *PAM*
- Weaknesses:
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
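A minimal sketch of the CLARA idea follows. For brevity, an exhaustive medoid search over each small sample stands in for PAM, and the data are hypothetical 1-D values; note that each candidate medoid set is scored on the *full* data set, which is what lets the best sample win:

```python
import random
from itertools import combinations

def cost(points, medoids):
    """Total distance of every object to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def best_medoids(sample, k):
    # stand-in for PAM: exhaustive search is feasible on a small sample
    return min(combinations(sample, k), key=lambda ms: cost(sample, ms))

def clara(points, k, n_samples=5, sample_size=10):
    best = None
    for _ in range(n_samples):
        sample = random.sample(points, min(sample_size, len(points)))
        medoids = best_medoids(sample, k)  # "PAM" on the sample
        # keep the medoid set that scores best on the FULL data set
        if best is None or cost(points, medoids) < cost(points, best):
            best = medoids
    return best
```

If every sample happens to miss one of the true clusters (a biased sample), the returned medoids misrepresent the whole data set, which is exactly the weakness noted above.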

### *CLARANS* ("Randomized" CLARA) (1994)

- *CLARANS* (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
  - Draws a sample of neighbors dynamically
  - The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of *k* medoids
  - If a local optimum is found, it starts from a new randomly selected node in search of a new local optimum
- Advantages: more efficient and scalable than both *PAM* and *CLARA*
- Further improvement: focusing techniques and spatial access structures (Ester et al. '95)

### ROCK: Clustering Categorical Data

- ROCK: RObust Clustering using linKs
- S. Guha, R. Rastogi & K. Shim, ICDE’99
- Major ideas
- Use links to measure similarity/proximity
- Not distance-based
- Algorithm: sampling-based clustering
- Draw random sample
- Cluster with links
- Label data on disk
- Experiments
- Congressional voting, mushroom data

### Similarity Measure in ROCK

- Traditional measures for categorical data may not work well, e.g., the Jaccard coefficient
- Example: two groups (clusters) of transactions
  - C1: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
  - C2: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
- The Jaccard coefficient may lead to a wrong clustering result
  - Within C1: ranges from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
  - Between C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
- Jaccard-coefficient-based similarity function:

\[Sim(T_{1},T_{2})=\frac{|T_{1}\cap T_{2}|}{|T_{1}\cup T_{2}|}\]

- Ex. Let *T1* = {a, b, c}, *T2* = {c, d, e}

\[Sim(T_{1},T_{2})=\frac{|\left \{ c \right \}|}{|\left \{ a,b,c,d,e \right \}|}=\frac{1}{5}=0.2\]
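The same computation in Python, using set operations:

```python
def jaccard(t1, t2):
    """Jaccard coefficient: |intersection| / |union|."""
    return len(t1 & t2) / len(t1 | t2)

T1, T2 = {"a", "b", "c"}, {"c", "d", "e"}
print(jaccard(T1, T2))  # 0.2  -- only {c} is shared among 5 distinct items
```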

### Link Measure in ROCK

- Clusters
  - C1: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
  - C2: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
- Neighbors
  - Two transactions are neighbors if sim(T1, T2) > threshold
  - Let *T1* = {a, b, c}, *T2* = {c, d, e}, *T3* = {a, b, f}
    - *T1* connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}, {a,b,f}, {a,b,g}
    - *T2* connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}
    - *T3* connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}
- Link similarity
  - The link similarity between two transactions is the # of common neighbors
  - link(*T1*, *T2*) = 4, since they have 4 common neighbors: {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
  - link(*T1*, *T3*) = 3, since they have 3 common neighbors: {a, b, d}, {a, b, e}, {a, b, g}

### The ROCK Algorithm

- Method
- Compute similarity matrix
- Use link similarity
- Run agglomerative hierarchical clustering
- When the data set is big
- Get sample of transactions
- Cluster sample
- Properties and problems:
  - Guarantees cluster interconnectivity: any two transactions in a cluster are very well connected
  - Ignores information about the closeness of two clusters: two separate clusters may still be quite connected