### Why Constraint-Based Cluster Analysis?

• Need user feedback: Users know their applications best
• Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacles & desired clusters

### Categorization of Constraints

• Constraints on instances: specify how a pair or a set of instances should be grouped in the cluster analysis (see the sketch after this list)
• must-link(x, y): x and y should be grouped into one cluster
• Constraints can be defined using variables, e.g.,
• cannot-link(x, y) if dist(x, y) > d
• Constraints on clusters: specify a requirement on the clusters
• E.g., specify the min # of objects in a cluster, the max diameter of a cluster, the shape of a cluster (e.g., convex), the # of clusters (e.g., k)
• Constraints on similarity measurements: specify a requirement that the similarity calculation must respect
• E.g., distance measured by driving along roads or around obstacles (e.g., rivers, lakes)
• Issues: Hard vs. soft constraints; conflicting or redundant constraints
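
For concreteness, here is a minimal sketch (not from the slides) of representing instance-level constraints and deriving cannot-link constraints from a distance rule; the names and the threshold value are purely illustrative.

```python
from itertools import combinations
from math import dist

def derive_cannot_links(points, d):
    """Pairs of objects farther apart than d become cannot-link constraints."""
    return {(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) > d}

# Objects as 2-D coordinates; constraints stored as sets of index pairs
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
must_link = {(0, 1)}                               # user: objects 0 and 1 belong together
cannot_link = derive_cannot_links(points, d=3.0)   # rule: cannot-link(x, y) if dist(x, y) > 3
```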

### Constraint-Based Clustering Methods (I): Handling Hard Constraints

• Handling hard constraints: Strictly respect the constraints in cluster assignments
• Example: The COP-k-means algorithm (a sketch follows this list)
• Generate super-instances for must-link constraints
• Compute the transitive closure of the must-link constraints
• To represent such a subset, replace all objects in the subset by their mean
• The super-instance also carries a weight, which is the number of objects it represents
• Conduct modified k-means clustering to respect cannot-link constraints
• Modify the center-assignment process in k-means to a nearest feasible center assignment
• An object is assigned to the nearest center so that the assignment respects all cannot-link constraints
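
A compact sketch of this hard-constraint scheme, assuming numpy and Euclidean distance; it illustrates the super-instance and nearest-feasible-center ideas rather than reproducing the reference COP-k-means implementation (the function names are made up).

```python
import numpy as np

def must_link_groups(n, must_link):
    """Transitive closure of must-link pairs via union-find; returns groups of object indices."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a, b in must_link:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def constrained_kmeans(points, k, must_link, cannot_link, n_iter=20, seed=0):
    points = np.asarray(points, dtype=float)
    groups = must_link_groups(len(points), must_link)
    # Super-instances: each must-link group is replaced by its mean, weighted by its size
    super_pts = np.array([points[g].mean(axis=0) for g in groups])
    weights = np.array([len(g) for g in groups], dtype=float)
    # Lift cannot-link constraints to the super-instance level
    gid = {i: s for s, g in enumerate(groups) for i in g}
    cl = {(gid[a], gid[b]) for a, b in cannot_link}
    cl |= {(b, a) for a, b in cl}
    rng = np.random.default_rng(seed)
    centers = super_pts[rng.choice(len(super_pts), size=k, replace=False)]
    assign = np.zeros(len(super_pts), dtype=int)
    for _ in range(n_iter):
        for i in range(len(super_pts)):
            # Nearest *feasible* center: skip centers that would violate a cannot-link
            # with a super-instance already assigned in this pass
            for c in np.argsort(np.linalg.norm(centers - super_pts[i], axis=1)):
                if all(not (assign[j] == c and (i, j) in cl) for j in range(i)):
                    assign[i] = c
                    break
            # A full implementation would report failure if no feasible center exists
        for c in range(k):  # recompute centers as weighted means of super-instances
            mask = assign == c
            if mask.any():
                centers[c] = np.average(super_pts[mask], axis=0, weights=weights[mask])
    return assign, centers
```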

### Constraint-Based Clustering Methods (II): Handling Soft Constraints

• Treated as an optimization problem: When a clustering violates a soft constraint, a penalty is imposed on the clustering
• Overall objective: Optimizing the clustering quality, and minimizing the constraint violation penalty
• Ex. CVQE (Constrained Vector Quantization Error) algorithm: Conduct k-means clustering while enforcing constraint violation penalties
• Objective function: Sum of distance used in k-means, adjusted by the constraint violation penalties
• Penalty of a must-link violation
• If objects x and y must-be-linked but they are assigned to two different centers, c1 and c2, dist(c1, c2) is added to the objective function as the penalty
• Penalty of a cannot-link violation
• If objects x and y cannot-be-linked but they are assigned to a common center c, dist(c, c′) is added to the objective function as the penalty, where c′ is the cluster center closest to c that can accommodate x or y
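
Under those definitions, a hedged sketch of the CVQE-style objective could look as follows (the use of plain Euclidean distance and the function name are assumptions; the original CVQE formulation uses squared distances and also folds the penalty terms into the center updates).

```python
import numpy as np

def cvqe_objective(points, assign, centers, must_link, cannot_link):
    # Base k-means term: distance of each object to its assigned center
    obj = sum(np.linalg.norm(points[i] - centers[assign[i]]) for i in range(len(points)))
    # Must-link violation: objects split across centers c1, c2 add dist(c1, c2)
    for x, y in must_link:
        if assign[x] != assign[y]:
            obj += np.linalg.norm(centers[assign[x]] - centers[assign[y]])
    # Cannot-link violation: objects sharing center c add dist(c, c'), c' = nearest other center
    for x, y in cannot_link:
        if assign[x] == assign[y]:
            c = assign[x]
            obj += min(np.linalg.norm(centers[c] - centers[j])
                       for j in range(len(centers)) if j != c)
    return obj
```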

### Speeding Up Constrained Clustering

• It is costly to compute some constrained clustering
• Ex. Clustering with obstacle objects: Tung, Hou, and Han. Spatial clustering in the presence of obstacles, ICDE'01
• K-medoids is preferable since k-means may locate an ATM center in the middle of a lake
• Visibility graph and shortest path
• Triangulation and micro-clustering
• Two kinds of join indices (pre-computed shortest paths) are worth building
• VV index: indices for any pair of obstacle vertices
• MV index: indices for any pair of micro-cluster and obstacle vertex
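
As an illustration of the VV index, the sketch below pre-computes shortest obstructed distances between all pairs of obstacle vertices with Dijkstra's algorithm, assuming the visibility graph is given as an adjacency dict `{vertex: {neighbor: edge_length}}`; construction of the visibility graph and of the MV index is analogous but not shown.

```python
import heapq

def dijkstra(graph, src):
    """Shortest-path distances from src over a weighted adjacency-dict graph."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def build_vv_index(visibility_graph):
    """VV index: shortest obstructed distance between every pair of obstacle vertices."""
    return {v: dijkstra(visibility_graph, v) for v in visibility_graph}
```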

### User-Guided Clustering: A Special Kind of Constraints

• X. Yin, J. Han, P. S. Yu, “Cross-Relational Clustering with User's Guidance”, KDD'05
• User usually has a goal of clustering, e.g., clustering students by research area
• User specifies the clustering goal to CrossClus

### Comparing with Classification

• User-specified feature (in the form of an attribute) is used as a hint, not as class labels
• The attribute may contain too many or too few distinct values, e.g., a user may want to cluster students into 20 clusters instead of 3
• Additional features need to be included in cluster analysis

### Comparing with Semi-Supervised Clustering

• Semi-supervised clustering: User provides a training set consisting of “similar” (“must-link”) and “dissimilar” (“cannot-link”) pairs of objects
• User-guided clustering: User specifies an attribute as a hint, and more relevant features are found for clustering

### Why Not Semi-Supervised Clustering?

• Much information (in multiple relations) is needed to judge whether two tuples are similar
• A user may not be able to provide a good training set
• It is much easier for a user to specify an attribute as a hint, such as a student’s research area

### CrossClus: An Overview

• Measure similarity between features by how they group objects into clusters
• Use a heuristic method to search for pertinent features
• Start from user-specified feature and gradually expand search range
• Use tuple ID propagation to create feature values
• Features can be easily created during the expansion of search range, by propagating IDs
• Explore three clustering algorithms: k-means, k-medoids, and hierarchical clustering

### Multi-Relational Features

• A multi-relational feature is defined by:
• A join path, e.g., Student → Register → OpenCourse → Course
• An attribute, e.g., Course.area
• (For numerical feature) an aggregation operator, e.g., sum or average
• Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null]
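
A minimal sketch of how such a feature definition might be represented in code (the field names are illustrative, not CrossClus's actual data structures):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MultiRelationalFeature:
    join_path: List[str]        # e.g., ["Student", "Register", "OpenCourse", "Course"]
    attribute: str              # e.g., "Course.area"
    aggregation: Optional[str]  # e.g., "sum" or "average" for numerical features, else None

f = MultiRelationalFeature(
    join_path=["Student", "Register", "OpenCourse", "Course"],
    attribute="Course.area",
    aggregation=None,           # categorical feature: no aggregation operator
)
```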

### Representing Features

• Similarity between tuples $t_1$ and $t_2$ w.r.t. categorical feature f
• Cosine similarity between vectors $f(t_1)$ and $f(t_2)$

$sim_f(t_{1},t_{2})=\frac{\sum_{k=1}^{L}f(t_{1}).P_{k}\cdot f(t_{2}).P_{k}}{\sqrt{\sum_{k=1}^{L}f(t_{1}).P_{k}^{2}}\cdot\sqrt{\sum_{k=1}^{L}f(t_{2}).P_{k}^{2}}}$

• Most important information of a feature f is how f groups tuples into clusters
• f is represented by similarities between every pair of tuples indicated by f
• Visualized with tuple indices on the horizontal axes and similarity on the vertical axis
• This can be viewed as a vector of N × N dimensions
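
A small sketch of this similarity, assuming $f(t)$ is the vector of proportions $P_1, …, P_L$ of tuple t over the attribute's L values:

```python
import numpy as np

def sim_f(ft1, ft2):
    """Cosine similarity between the feature vectors f(t1) and f(t2)."""
    ft1, ft2 = np.asarray(ft1, dtype=float), np.asarray(ft2, dtype=float)
    denom = np.linalg.norm(ft1) * np.linalg.norm(ft2)
    return float(ft1 @ ft2 / denom) if denom else 0.0

# Example: one student's courses are 50% DB, 50% AI; another's are 100% DB
print(sim_f([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))  # ≈ 0.707
```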

### Searching for Pertinent Features

• Different features convey different aspects of information

• Features conveying the same aspect of information usually cluster tuples in more similar ways
• Research group areas vs. conferences of publications
• Given user specified feature
• Find pertinent features by computing feature similarity
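
The slides do not spell out the feature-similarity formula; given that each feature is represented by its N × N vector of pairwise tuple similarities, one natural (assumed) choice is the cosine of those vectors, sketched below.

```python
import numpy as np

def feature_similarity(S_f, S_g):
    """S_f, S_g: N x N arrays where entry (i, j) is sim_f(t_i, t_j) / sim_g(t_i, t_j)."""
    vf, vg = S_f.ravel(), S_g.ravel()
    return float(vf @ vg / (np.linalg.norm(vf) * np.linalg.norm(vg)))
```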

### Heuristic Search for Pertinent Features

• Overall procedure

1. Start from the user-specified feature

2. Search in neighborhood of existing pertinent features

• Tuple ID propagation is used to create multi-relational features
• IDs of target tuples can be propagated along any join path, from which we can find tuples joinable with each target tuple
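
An illustrative sketch of tuple ID propagation along one join step (relation and key names are made up): each tuple in the joined relation accumulates the set of target-tuple IDs it joins with.

```python
from collections import defaultdict

def propagate_ids(source_ids, join_pairs):
    """source_ids: {source_key: set of target IDs}; join_pairs: (source_key, dest_key) rows."""
    dest_ids = defaultdict(set)
    for src, dst in join_pairs:
        dest_ids[dst] |= source_ids.get(src, set())
    return dest_ids

# Student IDs start attached to Student tuples, then flow Student -> Register -> OpenCourse
student_ids = {"s1": {"s1"}, "s2": {"s2"}}
register_ids = propagate_ids(student_ids, [("s1", "r1"), ("s2", "r2"), ("s1", "r3")])
course_ids = propagate_ids(register_ids, [("r1", "c1"), ("r2", "c1"), ("r3", "c2")])
# course c1 is now joinable with students {s1, s2}, course c2 with {s1}
```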

### Clustering with Multi-Relational Features

• Given a set of L pertinent features $f_1, …, f_L$, the similarity between two tuples is

$sim(t_{1},t_{2})=\sum_{i=1}^{L}sim_{f_{i}}(t_{1},t_{2})\cdot f_{i}.weight$

• Weight of a feature is determined during feature search by its similarity with other pertinent features (a weighted-similarity sketch follows this list)
• Clustering methods
• CLARANS [Ng & Han 94], a scalable clustering algorithm for non-Euclidean space
• K-means
• Agglomerative hierarchical clustering
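
A small sketch of the weighted similarity above; the per-feature vectors and the feature weights are assumed to be supplied by the earlier steps.

```python
import numpy as np

def sim_f(v1, v2):
    """Cosine similarity of one feature's vectors for two tuples (as in the earlier section)."""
    v1, v2 = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0

def sim(t1_vectors, t2_vectors, weights):
    """Weighted sum over pertinent features: t?_vectors[i] is f_i's vector, weights[i] is f_i.weight."""
    return sum(w * sim_f(v1, v2) for v1, v2, w in zip(t1_vectors, t2_vectors, weights))

# Example with two pertinent features (e.g., course area and research-group area)
print(sim([[0.5, 0.5], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], weights=[0.7, 0.3]))
```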

### Experiments: Compare CrossClus with

• Baseline: Only use the user-specified feature
• PROCLUS [Aggarwal, et al. 99]: a state-of-the-art subspace clustering algorithm
• Use a subset of features for each cluster
• We convert the relational database to a table by propositionalization
• User-specified feature is forced to be used in every cluster
• RDBC [Kirsten and Wrobel’00]
• A representative ILP clustering algorithm
• Use neighbor information of objects for clustering
• User-specified feature is forced to be used

### Measure of Clustering Accuracy

•  Accuracy
• Measured by manually labeled data
• We manually assign tuples into clusters according to their properties (e.g., professors in different research areas)
• Accuracy of clustering: Percentage of pairs of tuples in the same cluster that share a common label (see the sketch below)
• This measure favors many small clusters
• We let each approach generate the same number of clusters
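
A short sketch of this pairwise accuracy measure, assuming `clusters` and `labels` are parallel lists of assigned cluster IDs and manual labels:

```python
from itertools import combinations

def pairwise_accuracy(clusters, labels):
    """Fraction of same-cluster tuple pairs whose manual labels agree."""
    same_cluster = [(i, j) for i, j in combinations(range(len(clusters)), 2)
                    if clusters[i] == clusters[j]]
    if not same_cluster:
        return 0.0
    agree = sum(labels[i] == labels[j] for i, j in same_cluster)
    return agree / len(same_cluster)

# Example: 4 tuples, 2 clusters, one impure pair
print(pairwise_accuracy(clusters=[0, 0, 1, 1], labels=["DB", "DB", "DB", "AI"]))  # 0.5
```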