Anticlustering and Multi-objective Optimization
In data science and optimization we often encounter problems that require grouping similar items together. The task of finding these groups while maximizing within-group homogeneity is known as clustering. There are, however, scenarios where we need the opposite: groups that are as heterogeneous as possible, which leads to the emergence of anticlustering techniques.
The fundamental principle of anticlustering is the maximization of intra-group dissimilarity, in sharp contrast with conventional clustering techniques that minimize within-group differences. Although this may seem like a clustering task with an inverted objective function, anticlustering presents its own set of difficulties that make it more computationally demanding. Intuitively, it is easier to find groups of points in Euclidean space that lie close together than to find groups whose points lie as far apart as possible: the former admits a clean spatial partition of the space, while the latter rarely does.
Combinatorial Approaches
The anticlustering problem can be framed in a variety of ways; one notable framing is as the partitioning of a fully connected graph. In this context, we look for a partition of the graph into subgraphs that maximizes the dissimilarity within each part. To take this combinatorial approach, we must define edge weights between the entries in our dataset so that they can be represented as nodes in this fully connected graph.
Given a dataset:
$D = \{(i_1, \mathbf{x}_1), (i_2, \mathbf{x}_2), \ldots, (i_n, \mathbf{x}_n)\}$ where:
- $i_k \in \mathbb{N}$ is a unique identifier for each data point
- $\mathbf{x}_k \in \mathbb{R}^d$ is a $d$-dimensional vector of auxiliary features
We require a distance measure $d: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}_{\geq 0}$ operating on pairs of feature vectors where:
- $d(\mathbf{x}, \mathbf{y}) \geq 0$
- $d(\mathbf{x}, \mathbf{y}) = 0$ if and only if $\mathbf{x} = \mathbf{y}$
- $d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})$
For example, $\mathbf{x}_k$ could represent attributes such as age, income, or geographical coordinates, or even categorical variables like occupation or industry. Common choices for $d$ include Euclidean distance, cosine dissimilarity for numerical variables, or metrics such as overlap distance or Jensen-Shannon divergence for categorical variables.
With such a metric defined, we can formally specify a fully connected weighted graph $G = (V, E, w)$ where:
- $V = \{1, \ldots, n\}$ is the set of vertices (corresponding to our data points)
- $E = \{\{i, j\} : i, j \in V, i \neq j\}$ is the set of all possible edges
- $w: E \rightarrow \mathbb{R}_{\geq 0}$ is the weight function defined as $w(\{i, j\}) = d(\mathbf{x}_i, \mathbf{x}_j)$ for all $\{i, j\} \in E$
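As a concrete illustration, the following is a minimal Python sketch (not part of the original formulation) that builds the weight matrix of $G$ from a small, made-up feature matrix, using Euclidean distance as $d$. The feature values and variable names are purely illustrative.

```python
import numpy as np

# Hypothetical feature matrix: one row per data point, e.g. (age, income).
X = np.array([
    [34.0, 52_000.0],
    [29.0, 61_000.0],
    [45.0, 48_000.0],
    [38.0, 75_000.0],
])

def euclidean_distance_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances; entry (i, j) is the edge weight w({i, j})."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

W = euclidean_distance_matrix(X)
# W is symmetric with a zero diagonal, matching the properties of d listed above.
print(np.round(W, 2))
```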
A Motivating Example
Suppose we're conducting a mass hiring operation for a company. We have many interviewers and many more applicants. The interviewers have varying capacities and can only interview a certain number of applicants this hiring season. We also have some high-level data on the applicants; consider the following dataset:
| Name | University | Year of Study | Major | Grade Average | Total Credits | Attendance Rate |
|---|---|---|---|---|---|---|
| Lisa Dodson | University of Montreal | Fourth Year | Psychology | 0.5328 | 270.3 | 0.666 |
| Joshua Cox | University of Calgary | Second Year | Mathematics | 0.9212 | 263 | 0.949 |
| Danny Le | Dalhousie University | First Year | Physics | 0.2476 | 9.9 | 0.635 |
We may now want to form groupings of applicants to assign to each interviewer. To reduce systematic bias, we want to avoid any interviewer seeing too many applicants from similar backgrounds, ensuring each interviewer is exposed to a variety of experiences. This is where anticlustering comes in.
We now require a distance metric to form our dissimilarity matrix. In this case we can apply Gower's distance to handle the combination of numerical and categorical data. Applying the metric to this dataset yields the following dissimilarity matrix:
| Student | Lisa Dodson | Joshua Cox | Danny Le |
|---|---|---|---|
| Lisa Dodson | 0.0000 | 0.7866 | 0.7889 |
| Joshua Cox | 0.7866 | 0.0000 | 0.9960 |
| Danny Le | 0.7889 | 0.9960 | 0.0000 |
Such dissimilarity matrices, which represent our graph $G$, will be the ideal input for our anticlustering algorithm.
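For transparency, here is a rough Python sketch of how such a matrix could be computed. It implements the basic form of Gower's distance, where categorical features contribute a 0/1 mismatch and numerical features a range-normalized absolute difference, averaged over all features. Depending on how fields such as year of study are encoded (ordinal versus categorical), the resulting values may differ slightly from the table above.

```python
import numpy as np
import pandas as pd

# The applicant data from the table above.
applicants = pd.DataFrame({
    "university": ["University of Montreal", "University of Calgary", "Dalhousie University"],
    "year": ["Fourth Year", "Second Year", "First Year"],
    "major": ["Psychology", "Mathematics", "Physics"],
    "grade_avg": [0.5328, 0.9212, 0.2476],
    "credits": [270.3, 263.0, 9.9],
    "attendance": [0.666, 0.949, 0.635],
}, index=["Lisa Dodson", "Joshua Cox", "Danny Le"])

categorical = ["university", "year", "major"]
numerical = ["grade_avg", "credits", "attendance"]

def gower_distance(df, categorical, numerical):
    """Pairwise Gower distances for a mixed categorical/numerical DataFrame."""
    n = len(df)
    D = np.zeros((n, n))
    ranges = df[numerical].max() - df[numerical].min()
    for i in range(n):
        for j in range(i + 1, n):
            # Categorical features: 0 on a match, 1 on a mismatch.
            cat = (df[categorical].iloc[i] != df[categorical].iloc[j]).astype(float)
            # Numerical features: range-normalized absolute difference.
            num = (df[numerical].iloc[i] - df[numerical].iloc[j]).abs() / ranges
            D[i, j] = D[j, i] = pd.concat([cat, num]).mean()
    return pd.DataFrame(D, index=df.index, columns=df.index)

print(gower_distance(applicants, categorical, numerical).round(4))
```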
Multi-objective Optimization
In optimization problems, we're occasionally concerned with more than one objective function. Commonly, these objectives are conflicting, meaning that there is generally no single solution that optimizes them all simultaneously. In such cases, we need to find a set of trade-off solutions that collectively represent the best possible outcomes.
Borrowing from ideas in game theory, we often use the concept of Pareto optimality. A solution is Pareto optimal if no objective can be improved without degrading at least one other objective. In general there is more than one Pareto-optimal solution; this collection of solutions forms what is known as the Pareto set.
The solutions in the Pareto set are non-dominated, meaning no solution in this set can be improved in any objective without degrading at least one other objective. Formally, we define the Pareto set $P$ for a $k$-objective optimization problem as:
$$ P = \{ \chi_i = \langle \theta_1, \ldots, \theta_k \rangle \mid \theta_j \in \mathbb{R} \} $$
where $\chi_i$ is a Pareto-optimal solution and $\theta_j \in \mathbb{R}$ is a parameter of interest for the $j$-th objective function $f_j$. Membership in the Pareto set requires that for any two solutions $\chi_a$ and $\chi_b$ in $P$, neither dominates the other. In other words:
$$ \nexists\, \chi_a, \chi_b \in P \text{ such that } \forall j \in \{1, \ldots, k\}: f_j(\chi_a) \geq f_j(\chi_b) \;\text{ and }\; \exists j \in \{1, \ldots, k\}: f_j(\chi_a) > f_j(\chi_b) $$
That is, for any pair of distinct solutions in the Pareto set, each is strictly better than the other in at least one objective, ensuring that all solutions in the set are non-dominated with respect to one another. We can represent the Pareto set graphically with a frontier demonstrating the trade-offs between the objectives, called the Pareto front.
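The non-dominance condition translates directly into code. Below is a small sketch, with my own function names `dominates` and `pareto_set`, that checks dominance between objective vectors and filters a collection down to its non-dominated members, assuming all objectives are to be maximized.

```python
from typing import Sequence

def dominates(fa: Sequence[float], fb: Sequence[float]) -> bool:
    """True if objective vector fa dominates fb (maximization in every objective)."""
    return all(a >= b for a, b in zip(fa, fb)) and any(a > b for a, b in zip(fa, fb))

def pareto_set(objective_vectors: list[tuple[float, ...]]) -> list[tuple[float, ...]]:
    """Keep only the non-dominated objective vectors."""
    return [
        fa for fa in objective_vectors
        if not any(dominates(fb, fa) for fb in objective_vectors if fb != fa)
    ]

# Two objectives to maximize; (3.0, 3.0) is dominated by (4.0, 3.0) and is removed.
points = [(1.0, 5.0), (4.0, 3.0), (3.0, 3.0), (2.0, 4.5)]
print(pareto_set(points))  # [(1.0, 5.0), (4.0, 3.0), (2.0, 4.5)]
```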
In the case of combinatorial optimization for anticlustering, the solutions $\chi_i$ correspond to partitions $\pi_i$ of the graph. Moving forward we will denote:
- $n \in \mathbb{N}$ as the number of data points in the dataset
- $\langle N_1, \ldots, N_k \rangle$, with $N_i \in \mathbb{N}$, as the required group sizes, where $k \in \mathbb{N}$ is the number of groups
- $\pi = \{S_1, \ldots, S_k\}$ as a partition of the data points into $k$ groups, where $S_i$ is the $i$-th group/sub-graph
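In the code sketches that follow, a partition is represented simply as a list of sets of vertex indices; this is my own choice of representation, not one prescribed by the algorithm.

```python
# A partition of n = 6 data points (vertices 0..5) into k = 2 groups of size 3.
# Each group S_i is a set of vertex indices of the graph G.
partition = [{0, 2, 5}, {1, 3, 4}]

k = len(partition)                                  # number of groups
group_sizes = [len(group) for group in partition]   # <N_1, ..., N_k> = [3, 3]
n = sum(group_sizes)                                # total number of data points
```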
Multi-restart Bicriterion Pairwise Interchange
The MBPI algorithm, presented by Brusco, Cradit, and Steinley (2020), is a multi-objective combinatorial optimization algorithm for anticlustering. It belongs to the class of stochastic search algorithms, meaning that it uses a source of randomness to facilitate exploration of the solution space. The algorithm is a bicriterion optimizer: it seeks to maximize two metrics over a partition of the graph, the diversity and the dispersion of the groups.
We define the diversity of a partition as the sum of the distance values within each group:
$$ F_1(\pi) = \sum_{m=1}^{k} \; \sum_{\substack{u, v \in S_m \\ u < v}} d_{uv} $$
This measure effectively sums the upper-triangular elements of the sub-distance matrices formed by a partition, capturing the pairwise dissimilarities within each group. The dispersion of a partition is defined as the minimum dissimilarity value occurring within any group:
$$ F_2(\pi) = \min_{1 \leq m \leq k} \; \min_{\substack{u, v \in S_m \\ u < v}} d_{uv} $$
The algorithm uses a weighted combination of these two objectives to evaluate the progress of a local search procedure, known as a single restart.
$$ Z(\pi) = \omega_1 F_1(\pi) + \omega_2 F_2(\pi) $$
where $\omega_1, \omega_2 \in \mathbb{R}_{\geq 0} \text{ s.t. } \omega_1 + \omega_2 = 1$ are weighting factors supplied by the user. The algorithm thus attempts to maximize the above while evaluating the dominance of candidate solutions with respect to the Pareto set.
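A direct translation of $F_1$, $F_2$, and $Z$ into Python might look as follows. This is only a sketch built on the list-of-sets partition representation from earlier; it recomputes everything from scratch rather than using the incremental updates a serious implementation would employ.

```python
import numpy as np
from itertools import combinations

def diversity(D: np.ndarray, partition: list[set[int]]) -> float:
    """F1: sum of pairwise distances within each group, summed over groups."""
    return sum(D[u, v] for group in partition
               for u, v in combinations(sorted(group), 2))

def dispersion(D: np.ndarray, partition: list[set[int]]) -> float:
    """F2: the smallest pairwise distance occurring inside any group."""
    return min(D[u, v] for group in partition
               for u, v in combinations(sorted(group), 2))

def weighted_objective(D, partition, w1, w2):
    """Z = w1 * F1 + w2 * F2, with w1 + w2 = 1."""
    return w1 * diversity(D, partition) + w2 * dispersion(D, partition)

# Small self-contained example with random points and the partition from above.
rng = np.random.default_rng(0)
points = rng.random((6, 2))
D = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
partition = [{0, 2, 5}, {1, 3, 4}]
print(diversity(D, partition), dispersion(D, partition),
      weighted_objective(D, partition, w1=0.5, w2=0.5))
```

Note that $F_1$ aggregates many pairwise terms while $F_2$ is a single distance, so the two objectives live on rather different scales; any fixed choice of $\omega_1$ and $\omega_2$ implicitly trades off these scales, which is one more reason to explore several weight combinations.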
Typically, anticlustering algorithms focus solely on the diversity objective, i.e. $\omega_2 = 0$. However, maximizing diversity alone can still leave pairs of very similar items within the same group, since the minimum within-group distance is never penalized; this is the motivation for the dispersion objective. It is especially important in the hiring problem, where we may want every group to be balanced in terms of applicant dissimilarity.
If we optimized the dispersion metric alone, we would find a significant number of ties among candidate solutions that can otherwise look vastly different from one another. The diversity metric, on the other hand, produces far fewer ties. Having both criteria present also allows us to evaluate the concordance of the two measures, providing a more nuanced set of insights about the graph in question.
The algorithm can be described as follows:
Input:
- $A$: the distance matrix
- $R$: the number of restarts
- $n$: the number of data points
- $\langle N_1, \ldots, N_k \rangle$: the desired group sizes for all $1 \leq i \leq k$

Initialization:
- $P$: the Pareto set, initially empty; it will contain the non-dominated partitions
- $Z^*$: the best weighted objective value found so far in the current restart

Algorithm, for each restart $r$ from 1 to $R$:
- Set up random weights: select $\omega_1$ and $\omega_2$ such that $0 \leq \omega_1 \leq 1$ and $\omega_2 = 1 - \omega_1$. Weights are chosen randomly from a set of predefined options to ensure diverse trade-offs across restarts.
- Generate an initial partition $\pi$: randomly assign the $n$ data points to $k$ groups, preserving the group sizes $\langle N_1, \ldots, N_k \rangle$. This partition serves as the starting point for the pairwise interchange heuristic.
- Pairwise interchange local search, repeated until no improvement is found. For each pair of data points $u \in S_i$ and $v \in S_j$ belonging to different groups:
  - Temporarily swap $u$ and $v$ between groups $S_i$ and $S_j$, giving a candidate partition $\pi'$.
  - Recalculate the objective functions $F_1(\pi')$ and $F_2(\pi')$ (diversity and dispersion).
  - Compute the weighted sum of the objectives $Z(\pi') = \omega_1 F_1(\pi') + \omega_2 F_2(\pi')$.
  - Check Pareto dominance: if $\pi'$ improves upon $Z^*$ and is not dominated by any partition in the current Pareto set $P$, add $\pi'$ to $P$, remove from $P$ any partitions that $\pi'$ dominates, update $Z^*$, and flag that an improvement was made.
  - If $\pi'$ is worse or dominated, revert the swap and retain the previous partition.
- Convergence and restart: once no further pairwise swap yields an improvement, increment the restart counter $r$ and repeat the process with new random weights and a new initial partition.

Termination: after completing $R$ restarts, return the Pareto set $P$.

Output:
- The Pareto set $P$, containing all non-dominated partitions found across all restarts. Each partition in $P$ represents a different trade-off between diversity and dispersion.
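To make the outline concrete, below is a simplified, self-contained Python sketch of the procedure. It is not the reference implementation from Brusco et al.: the function names (`mbpi`, `random_partition`, `update_archive`) are my own, weights are drawn uniformly rather than from predefined options, objectives are recomputed from scratch after every swap, and the Pareto archive is updated for every candidate considered while a swap is kept only when the weighted objective improves.

```python
import numpy as np
from itertools import combinations

def random_partition(n, group_sizes, rng):
    """Randomly assign n points to groups with the requested sizes."""
    order = rng.permutation(n)
    partition, start = [], 0
    for size in group_sizes:
        partition.append(set(order[start:start + size].tolist()))
        start += size
    return partition

def objectives(D, partition):
    """Return (diversity, dispersion) of a partition."""
    within = [D[u, v] for g in partition for u, v in combinations(sorted(g), 2)]
    return sum(within), min(within)

def dominates(a, b):
    """True if objective vector a dominates b (maximization in every objective)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def mbpi(D, group_sizes, restarts=20, seed=0):
    """Simplified multi-restart bicriterion pairwise interchange heuristic."""
    rng = np.random.default_rng(seed)
    archive = []  # list of (objective_vector, partition) pairs, kept non-dominated

    def update_archive(obj, part):
        if any(o == obj or dominates(o, obj) for o, _ in archive):
            return
        archive[:] = [(o, p) for o, p in archive if not dominates(obj, o)]
        archive.append((obj, [set(g) for g in part]))

    for _ in range(restarts):
        w1 = rng.random()            # random weights for this restart
        w2 = 1.0 - w1
        partition = random_partition(len(D), group_sizes, rng)
        f1, f2 = objectives(D, partition)
        best_z = w1 * f1 + w2 * f2
        update_archive((f1, f2), partition)

        improved = True
        while improved:              # local search over pairwise interchanges
            improved = False
            for gi, gj in combinations(range(len(partition)), 2):
                for u in list(partition[gi]):
                    for v in list(partition[gj]):
                        # Tentatively swap u and v between the two groups.
                        partition[gi].remove(u); partition[gi].add(v)
                        partition[gj].remove(v); partition[gj].add(u)
                        f1, f2 = objectives(D, partition)
                        update_archive((f1, f2), partition)
                        z = w1 * f1 + w2 * f2
                        if z > best_z:
                            best_z = z
                            improved = True
                            break    # keep the swap; u has moved, try the next u
                        # Otherwise revert the swap.
                        partition[gi].remove(v); partition[gi].add(u)
                        partition[gj].remove(u); partition[gj].add(v)
    return archive

# Example: 12 random 2-D points split into three groups of four.
rng = np.random.default_rng(1)
points = rng.random((12, 2))
D = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
for (f1, f2), part in mbpi(D, group_sizes=[4, 4, 4], restarts=10):
    print(round(f1, 3), round(f2, 3), part)
```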
The restart is thus the key mechanism for escaping local maxima. Each restart begins with a new random partition, on which the local search is highly dependent, since pairwise interchanges are minuscule steps that do not drastically change the Kendall-Tau distance from a given partition.
To find an implementation of the MBPI algorithm in Python, see my own.
Limitations
While the MBPI algorithm offers a powerful approach to multi-objective partitioning, it's important to acknowledge its limitations:
- Approximation: since this is a heuristic algorithm, there is no guarantee of convergence to the true Pareto set.
- Sensitivity to Initial Conditions: the quality of solutions can be influenced by the initial random partitions and weight selections, potentially leading to variability in results across runs.
- Restart Sensitivity and Local Optima: as with many local search algorithms, MBPI does not guarantee finding the global optimum. The pairwise interchange and iterative local search may get trapped in local optima, potentially missing the true Pareto-efficient set.
- Computational Intensity: the cost of MBPI grows with the number of restarts $R$ and, more sharply, with the number of nodes $n$ in the graph, since each pass of the local search examines on the order of $n^2$ candidate swaps. For large datasets or complex problems, many restarts are typically needed to explore the solution space sufficiently, making the algorithm intractable at certain scales.
Given that the problem at hand is NP-hard, we likely cannot expect to move away from heuristic-based approaches for the time being. Potential ways to address the other shortcomings include parallelized implementations, adaptive restarts, intermediate heuristics, or hybridization with other optimization techniques.
References
- Brusco, M. J., Cradit, J. D., & Steinley, D. (2020). Combining diversity and dispersion criteria for anticlustering: A bicriterion approach. British Journal of Mathematical and Statistical Psychology, 73, 375–396.