Why Constraint-Based Cluster Analysis?

  • Need user feedback: Users know their applications the best
  • Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem with obstacles and desired clusters

Categorization of Constraints

  • Constraints on instances: specifies how a pair or a set of instances should be grouped in the cluster analysis
    • Must-link vs. cannot-link constraints
      • must-link(x, y): x and y should be grouped into one cluster
      • cannot-link(x, y): x and y should not be grouped into the same cluster
    • Constraints can be defined using variables, e.g.,
      • cannot-link(x, y) if dist(x, y) > d
  • Constraints on clusters: specifies a requirement on the clusters
    • E.g., specify the min # of objects in a cluster, the max diameter of a cluster, the shape of a cluster (e.g., convex), # of clusters (e.g., k)
  • Constraints on similarity measurements: specifies a requirement that the similarity calculation must respect
    • E.g., driving on roads, obstacles (e.g., rivers, lakes)
  • Issues: Hard vs. soft constraints; conflicting or redundant constraints

Constraint-Based Clustering Methods (I): Handling Hard Constraints

  • Handling hard constraints: Strictly respect the constraints in cluster assignments
  • Example: The COP-k-means algorithm
    • Generate super-instances for must-link constraints
      • Compute the transitive closure of the must-link constraints
      • To represent such a subset, replace all those objects in the subset by the mean.
      • The super-instance also carries a weight, which is the number of objects it represents
    • Conduct modified k-means clustering to respect cannot-link constraints
      • Modify the center-assignment process in k-means to a nearest feasible center assignment
      • An object is assigned to the nearest center so that the assignment respects all cannot-link constraints
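The nearest-feasible-center step above can be sketched as follows. This is a minimal illustration, not the full COP-k-means: the must-link/super-instance step is omitted, and the function and variable names are ours.

```python
import numpy as np

def cop_assign(points, centers, cannot_link):
    """Nearest *feasible* center assignment: each point goes to the
    closest center whose current members do not violate any
    cannot-link constraint with it (a sketch of one COP-k-means step)."""
    assign = {}
    for i, p in enumerate(points):
        # centers ordered by distance to p
        order = np.argsort([np.linalg.norm(p - c) for c in centers])
        for k in order:
            # feasible if no cannot-link partner of i already sits in cluster k
            if all(assign.get(j) != k
                   for a, b in cannot_link
                   for j in ((b,) if a == i else (a,) if b == i else ())):
                assign[i] = int(k)
                break
        else:
            # COP-k-means fails (returns no clustering) in this case
            raise ValueError(f"no feasible center for point {i}")
    return assign
```

Note that the assignment is order-dependent: a point near one center may be pushed to a farther center if a cannot-link partner got there first, which is exactly the strict behavior of hard constraints.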

Constraint-Based Clustering Methods (II): Handling Soft Constraints

  • Treated as an optimization problem: When a clustering violates a soft constraint, a penalty is imposed on the clustering
  • Overall objective: Optimizing the clustering quality, and minimizing the constraint violation penalty
  • Ex. CVQE (Constrained Vector Quantization Error) algorithm: Conduct k-means clustering while enforcing constraint violation penalties
  • Objective function: Sum of distance used in k-means, adjusted by the constraint violation penalties
    • Penalty of a must-link violation
      • If objects x and y must-be-linked but they are assigned to two different centers, c1 and c2, dist(c1, c2) is added to the objective function as the penalty
    • Penalty of a cannot-link violation
      • If objects x and y cannot-be-linked but they are assigned to a common center c, the distance dist(c, c′) between c and c′ is added to the objective function as the penalty, where c′ is the cluster center closest to c that can accommodate x or y
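The two penalty terms can be sketched as a scoring function. This is a simplified illustration of the CVQE objective on a fixed assignment; the actual algorithm folds these penalties into its assignment and update steps, and the helper names here are ours.

```python
import numpy as np

def cvqe_objective(points, assign, centers, must_link, cannot_link):
    """K-means distortion plus the constraint-violation penalties
    described above (a sketch, scoring a given assignment)."""
    pts = np.asarray(points, dtype=float)
    ctr = np.asarray(centers, dtype=float)
    # standard k-means term: distance of each point to its center
    obj = sum(np.linalg.norm(pts[i] - ctr[k]) for i, k in enumerate(assign))
    # must-link violation: distance between the two offending centers
    for x, y in must_link:
        if assign[x] != assign[y]:
            obj += np.linalg.norm(ctr[assign[x]] - ctr[assign[y]])
    # cannot-link violation: distance from the shared center c to the
    # nearest other center c' that could host x or y
    for x, y in cannot_link:
        if assign[x] == assign[y]:
            c = assign[x]
            obj += min(np.linalg.norm(ctr[c] - ctr[k])
                       for k in range(len(ctr)) if k != c)
    return obj
```

For example, two cannot-linked points both assigned to a center at the origin, with the only other center at (5, 0), incur a penalty of 5 on top of their ordinary distances.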

Speeding Up Constrained Clustering

  • It is costly to compute some constrained clustering
  • Ex. Clustering with obstacle objects: Tung, Hou, and Han. Spatial clustering in the presence of obstacles, ICDE'01
  • K-medoids is preferable, since k-means may place the ATM center in the middle of a lake
  • Visibility graph and shortest path
  • Triangulation and micro-clustering
  • Two kinds of join indices (shortest paths) are worth pre-computing
    • VV index: indices for any pair of obstacle vertices
    • MV index: indices for any pair of micro-cluster and obstacle vertex
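Once a visibility graph is built, the obstructed distance between two points is just a shortest path over it. A minimal Dijkstra sketch, assuming an adjacency-list representation `graph[u] = [(v, edge_length), ...]` (this schema and the node names are our assumptions, not from the paper):

```python
import heapq
import math

def obstructed_distances(graph, src):
    """Shortest-path distances from src over a visibility graph,
    where edges connect mutually visible points/obstacle vertices
    weighted by straight-line length (plain Dijkstra)."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

The VV/MV indices above amount to caching the results of such computations for obstacle-vertex pairs and micro-cluster-to-vertex pairs, so they need not be recomputed during clustering.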

An Example: Clustering With Obstacle Objects

User-Guided Clustering: A Special Kind of Constraints

  • X. Yin, J. Han, P. S. Yu, “Cross-Relational Clustering with User's Guidance”, KDD'05
  • User usually has a goal of clustering, e.g., clustering students by research area
  • User specifies his clustering goal to CrossClus

Comparing with Classification

  • User-specified feature (in the form of an attribute) is used as a hint, not as class labels
    • The attribute may contain too many or too few distinct values, e.g., a user may want to cluster students into 20 clusters instead of 3
    • Additional features need to be included in cluster analysis

Comparing with Semi-Supervised Clustering

  • Semi-supervised clustering: User provides a training set consisting of “similar” (“must-link”) and “dissimilar” (“cannot-link”) pairs of objects
  • User-guided clustering: User specifies an attribute as a hint, and more relevant features are found for clustering

Why Not Semi-Supervised Clustering?

  • Much information (in multiple relations) is needed to judge whether two tuples are similar
  • A user may not be able to provide a good training set
  • It is much easier for a user to specify an attribute as a hint, such as a student’s research area

CrossClus: An Overview

  • Measure similarity between features by how they group objects into clusters
  • Use a heuristic method to search for pertinent features
    • Start from user-specified feature and gradually expand search range
  • Use tuple ID propagation to create feature values
    • Features can be easily created during the expansion of search range, by propagating IDs
  • Explore three clustering algorithms: k-means, k-medoids, and hierarchical clustering

Multi-Relational Features

  • A multi-relational feature is defined by:
    • A join path, e.g., Student → Register → OpenCourse → Course
    • An attribute, e.g., Course.area
    • (For numerical feature) an aggregation operator, e.g., sum or average
  • Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null]

Representing Features

  • Similarity between tuples t1 and t2 w.r.t. a categorical feature f
    • Cosine similarity between the vectors f(t1) and f(t2)


  • The most important information of a feature f is how f groups tuples into clusters
  • f is represented by the similarities between every pair of tuples indicated by f
  • (In the accompanying figure) the horizontal axes are the tuple indices, and the vertical axis is the similarity
  • This can be considered as a vector of N × N dimensions
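As a sketch of the two steps above: a categorical feature maps each tuple to a distribution over the attribute's values (e.g., the areas of the courses a student registered for), and tuple similarity is the cosine of those vectors. The function names are ours, not CrossClus APIs.

```python
import math
from collections import Counter

def feature_vector(values):
    """Normalized distribution over the categorical values joinable
    with one tuple, e.g. the areas of a student's courses."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def cosine_sim(f1, f2):
    """Cosine similarity between two sparse feature vectors
    (dicts mapping value -> weight)."""
    dot = sum(f1[k] * f2.get(k, 0.0) for k in f1)
    n1 = math.sqrt(sum(v * v for v in f1.values()))
    n2 = math.sqrt(sum(v * v for v in f2.values()))
    return dot / (n1 * n2)
```

Two students taking courses in the same mix of areas get similarity 1, and students with disjoint areas get similarity 0, which is exactly the pairwise-similarity structure the N × N representation records.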

Similarity Between Features

Computing Feature Similarity

Searching for Pertinent Features

  • Different features convey different aspects of information

  • Features conveying the same aspect of information usually cluster tuples in more similar ways
    • Research group areas vs. conferences of publications
  • Given a user-specified feature
    • Find pertinent features by computing feature similarity

Heuristic Search for Pertinent Features

  • Overall procedure

1. Start from the user-specified feature

2. Search in neighborhood of existing pertinent features

3. Expand search range gradually

  • Tuple ID propagation is used to create multi-relational features
    • IDs of target tuples can be propagated along any join path, from which we can find tuples joinable with each target tuple
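One join step of tuple ID propagation can be sketched as below. This is a minimal illustration with hypothetical keys; CrossClus propagates the ID sets along full join paths (e.g., Student → Register → OpenCourse → Course) so that each non-target tuple knows which target tuples it joins with.

```python
def propagate_ids(id_sets, join_pairs):
    """Propagate target-tuple ID sets across one join step.
    id_sets: {left_key: set of target IDs} attached to the left relation;
    join_pairs: (left_key, right_key) pairs produced by the join.
    Returns the ID sets attached to the right relation's keys."""
    out = {}
    for lk, rk in join_pairs:
        out.setdefault(rk, set()).update(id_sets.get(lk, set()))
    return out
```

Because only ID sets travel along the join path, feature values (e.g., how many of a course's registrants are a given student) can be computed without materializing the full join.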

Clustering with Multi-Relational Features

  • Given a set of L pertinent features f1, …, fL, similarity between two tuples

\[ sim(t_1, t_2) = \sum_{i=1}^{L} sim_{f_i}(t_1, t_2) \cdot f_i.weight \]

    • Weight of a feature is determined in feature search by its similarity with other pertinent features
  • Clustering methods
    • CLARANS [Ng & Han 94], a scalable clustering algorithm for non-Euclidean space
    • K-means
    • Agglomerative hierarchical clustering
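The weighted combination in the formula above can be sketched as follows; representing each pertinent feature as a (similarity-function, weight) pair is our choice for illustration.

```python
def weighted_sim(t1, t2, features):
    """Combined tuple similarity: the weighted sum of per-feature
    similarities, sim(t1,t2) = sum_i sim_{f_i}(t1,t2) * f_i.weight.
    features is a list of (sim_f, weight) pairs."""
    return sum(sim_f(t1, t2) * w for sim_f, w in features)
```

Any of the three clustering methods listed above can then be run on this combined similarity (or the induced distance) without further changes.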

Experiments: Compare CrossClus with

  • Baseline: Only use the user-specified feature
  • PROCLUS [Aggarwal, et al. 99]: a state-of-the-art subspace clustering algorithm
    • Use a subset of features for each cluster
    • We convert the relational database to a single table by propositionalization
    • User-specified feature is forced to be used in every cluster
  • RDBC [Kirsten and Wrobel’00]
    • A representative ILP clustering algorithm
    • Use neighbor information of objects for clustering
    • User-specified feature is forced to be used

Measure of Clustering Accuracy

  • Accuracy
    • Measured against manually labeled data
      • We manually assign tuples to clusters according to their properties (e.g., professors in different research areas)
    • Accuracy of clustering: Percentage of pairs of tuples in the same cluster that share a common label
      • This measure favors many small clusters
      • We let each approach generate the same number of clusters
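The pairwise measure just described can be sketched as below; the dictionary-based representation of clusters and labels is our assumption.

```python
from itertools import combinations

def pairwise_accuracy(cluster_of, label_of):
    """Fraction of tuple pairs placed in the same cluster that also
    share a manual label (the accuracy measure described above)."""
    same_cluster = [(a, b)
                    for a, b in combinations(sorted(cluster_of), 2)
                    if cluster_of[a] == cluster_of[b]]
    if not same_cluster:
        return 0.0
    agree = sum(label_of[a] == label_of[b] for a, b in same_cluster)
    return agree / len(same_cluster)
```

Splitting everything into singleton clusters yields no same-cluster pairs at all, which is why the measure favors many small clusters and why the number of clusters is fixed across the compared approaches.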

DBLP Dataset