Shortcomings of silhouette in single-cell integration benchmarking

Integrating single-cell data remains a key challenge because of increasing dataset complexity and volume. These datasets contain batch effects arising from technical factors (for example, assays and protocols) alongside meaningful biological variation (for example, distinct tissue sampling regions), requiring rigorous evaluation of integration methods to ensure accurate integration and interpretation. We focus on methods for horizontal integration (a term coined by Argelaguet et al.1), defined as integrating datasets using shared features (for example, genes) with the aim of removing batch effects while preserving biological variation. Although our considerations are relevant to distinct output types, we focus on integrated embeddings: low-dimensional data representations derived from integration methods.

Silhouette-based evaluation metrics, which we find are unreliable for horizontal integration, have become widely adopted to address this challenge. The metric ‘silhouette’ scores clustering quality by comparing within-cluster cohesion to between-cluster separation2 and was developed for evaluating unsupervised clustering results of unlabeled data (internal evaluation). In line with its original intent, silhouette was taken up for determining the optimal number of clusters in single-cell datasets for a given embedding3,4. More recently, silhouette has been adapted for evaluating horizontal data integration, for instance, to score bio-conservation by assessing how well cell type annotations (based on labeled data; that is, external evaluation) from distinct batches cocluster in distinct embeddings5,6,7. From 2017 onward, silhouette-based metrics have also been used for scoring batch effect removal5,7,8,9. Here, researchers attempt to invert the silhouette concept to score how well cells from distinct batches (external labels) mix. Silhouette-based metrics for both bio-conservation and batch removal have been widely adopted across the field, as evidenced by their application in multiple large-scale benchmarks10,11,12. In Nature Portfolio journals alone, we found evidence for their use in 66 publications for evaluating batch removal (Extended Data Fig. 1 and Supplementary Table 1). Notably, these studies extend beyond single-cell sequencing data, encompassing spatial transcriptomics and image-based single-cell modalities.

Silhouette-based metrics suffer from fundamental, largely overlooked limitations for evaluating horizontal data integration. To expose these issues, we first formalize the silhouette score and its adaptations for single-cell integration tasks. Using simple simulations, we demonstrate how the metric’s assumptions are violated under basic conditions, misleadingly rewarding poor integration. We then validate these findings in real-world datasets, showing that these issues persist beyond theoretical scenarios.

The silhouette coefficient for a cell i assigned to a cluster \({C}_{k}\), denoted \({s}_{i}\), is defined as follows. Given \({a}_{i}\) (the mean distance between a cell i and all other cells in the same cluster \({C}_{k}\)) and \({b}_{i}\) (the mean distance between a cell i and all other cells in the nearest (neighboring) other cluster \({C}_{l}\), where \(l\ne k\)), \({s}_{i}\) is given by

$${s}_{i}=\frac{{b}_{i}-{a}_{i}}{\max ({a}_{i},{b}_{i})}$$

(1)

Conventionally and if not stated otherwise, Euclidean distance is used. Note that \({s}_{i}\) is only defined for \(2\le {n}_{{\rm{clusters}}}\le {n}_{{\rm{cells}}}-1\) and ranges between −1 and 1, with 1 indicating good cluster separation (\({a}_{i}\ll {b}_{i}\)), values near 0 indicating cluster overlap (\({a}_{i}\approx {b}_{i}\)) and −1 indicating wrong cluster assignment (\({a}_{i}\gg {b}_{i}\)). In contrast to silhouette’s original use for internal clustering evaluation (unsupervised clustering), when scoring data integration in the single-cell field, cells are not assigned to clusters in a data-driven manner (for example, by a clustering algorithm) but by external information, such as cell type or batch labels.
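
For illustration, Eq. (1) matches the per-cell silhouette values computed by scikit-learn’s silhouette_samples; the minimal sketch below (toy 2D data and labels, not drawn from any of the cited benchmarks) shows the computation with the conventional Euclidean distance.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 2D embedding with two externally assigned labels.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(100, 2)),  # cluster C_1
    rng.normal(4, 1, size=(100, 2)),  # cluster C_2
])
labels = np.repeat([0, 1], 100)

# One silhouette coefficient s_i per cell, as in Eq. (1), Euclidean distance.
s = silhouette_samples(X, labels, metric="euclidean")
print(s.min(), s.mean(), s.max())  # values lie in [-1, 1]
```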

For scoring bio-conservation, cell type labels serve as cluster assignments. First, the average silhouette width (ASW) is calculated across all cells (unscaled cell type ASW). Following common practice, we use a rescaled version:

$${\rm{Cell}}\;{\rm{type}}\;{\rm{ASW}}=({\rm{unscaled}}\;{\rm{cell}}\;{\rm{type}}\;{\rm{ASW}}+1)/2$$

(2)

Notably, a score of 0.5 corresponds to an unscaled ASW of 0, indicating overlaps between cell types, an undesirable outcome. Higher values indicate better performance.
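
Assuming an embedding matrix X (cells × dimensions) and a vector of cell type labels, a minimal sketch of Eq. (2) could look as follows; it is illustrative rather than the implementation used in the cited benchmarks.

```python
from sklearn.metrics import silhouette_score

def cell_type_asw(X, cell_type_labels):
    """Sketch of the rescaled cell type ASW of Eq. (2); higher is better."""
    unscaled = silhouette_score(X, cell_type_labels, metric="euclidean")
    return (unscaled + 1) / 2
```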

For scoring batch effect removal, batch labels serve as cluster assignments. Here, the goal is to measure cluster overlap rather than separation. Considering this context, researchers made the assumption that silhouette values \({s}_{i}\) around 0 indicate a high level of batch overlap. Two approaches exist.

Early adoptions, which remain in use, rely on a simple formulation in which all cells from a given batch are assigned to a single cluster; we refer to this as batch ASW (global). This approach often computes 1 − batch ASW (global) or 1 − |batch ASW (global)|, with higher scores interpreted as better performance.
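
A hedged sketch of this global variant is given below; whether the absolute value is taken before inversion varies between studies, so both forms are exposed via a flag (the function name is ours, for illustration only).

```python
from sklearn.metrics import silhouette_score

def batch_asw_global(X, batch_labels, absolute=True):
    """Sketch of batch ASW (global): silhouette over batch labels, inverted so
    that higher values suggest better mixing. Whether the absolute value is
    taken before inversion differs between studies."""
    asw = silhouette_score(X, batch_labels, metric="euclidean")
    return 1 - abs(asw) if absolute else 1 - asw
```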

Luecken et al.11 acknowledged problems with differences in cell type composition between batches and thus introduced a modified version of batch ASW computed separately for each cell type. For a given cell type label \(j\) with \(|{C}_{j}|\) cells, the score is calculated as:

$${\rm{Batch}}\;{\rm{ASW}}_{j}\;({\rm{cell}}\;{\rm{type}})=\frac{1}{|{C}_{j}|}\sum _{i\in {C}_{j}}\left(1-|{s}_{i}|\right)$$

(3)

The final batch ASW (cell type) score (batch ASW from here on) is obtained by averaging across the scores for all cell type labels.
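
The following sketch spells out Eq. (3) and the final averaging step; it is an illustrative reimplementation rather than the scib-metrics code, and it assumes each scored cell type contains cells from at least two batches.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def batch_asw_cell_type(X, batch_labels, cell_type_labels):
    """Illustrative sketch of Eq. (3): per cell type, average 1 - |s_i| of the
    batch-label silhouette, then average across cell type labels."""
    per_cell_type = []
    for ct in np.unique(cell_type_labels):
        mask = cell_type_labels == ct
        # The silhouette is only defined when this cell type spans >= 2 batches
        # (and has more cells than batches).
        if len(np.unique(batch_labels[mask])) < 2:
            continue
        s = silhouette_samples(X[mask], batch_labels[mask], metric="euclidean")
        per_cell_type.append(np.mean(1 - np.abs(s)))
    return float(np.mean(per_cell_type))
```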

When repurposing the silhouette metric for evaluating horizontal data integration, researchers make two key changes compared to its original application. First, they use label-based rather than algorithmic cluster assignment. Second, they compare silhouette scores across the outputs of different methods (across embeddings) instead of relative to the output of a single method. We demonstrate how these and other conceptual changes inherently constrain the silhouette metric’s effectiveness for assessing horizontal integration using two-dimensional (2D) simulated data (Fig. 1).

Fig. 1: Silhouette’s assumptions are not met in data integration contexts.

a, Silhouette was designed to select a suitable cluster number for a single embedding, with cluster membership resulting from unsupervised algorithms2. b–d, In data integration, we compare distinct embeddings and assign cluster membership by external labels: cell type (b,c) or batch (d). b, Silhouette’s bias for compact, spherical clusters does not reflect integration quality. c, Label-based clusters can have irregular shapes, violating silhouette’s assumptions and yielding unreliable scores. d, Silhouette’s focus on nearest neighboring clusters misses remaining batch effects if samples are partially integrated, limiting its sensitivity. All data shown are 2D simulated examples.

Concerning bio-conservation evaluation, when comparing silhouette scores across distinct methods’ outputs, silhouette’s inherent preference for compact, spherical, well-separated clusters conflicts with biological reality, where such geometric properties bear no meaningful relationship to cellular state. As a result, the metric assigns different scores to distinct but biologically equally valid embeddings (Fig. 1b). Additionally, label-based assignments can produce irregular cluster geometries that would never emerge from algorithmic clustering (for example, batch-induced distortions), violating the metric’s assumption about cluster shapes. Silhouette’s behavior then becomes unreliable, as demonstrated by identical silhouette scores representing radically different scenarios (Fig. 1c).
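
This geometric bias can be reproduced with a toy example in the spirit of Fig. 1b (simulated data, not the figure’s coordinates): two embeddings separate the same two cell types equally well along one axis, yet the embedding with elongated clusters receives a markedly lower cell type ASW.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
n = 500
cell_type = np.repeat([0, 1], n)

# Embedding 1: two compact, spherical cell type clusters separated along x.
spherical = np.vstack([
    rng.normal([0, 0], 0.5, size=(n, 2)),
    rng.normal([5, 0], 0.5, size=(n, 2)),
])

# Embedding 2: identical separation along x, but clusters elongated along y.
elongated = np.vstack([
    rng.normal([0, 0], [0.5, 5.0], size=(n, 2)),
    rng.normal([5, 0], [0.5, 5.0], size=(n, 2)),
])

for name, X in [("spherical", spherical), ("elongated", elongated)]:
    asw = (silhouette_score(X, cell_type) + 1) / 2  # rescaled cell type ASW
    print(name, round(float(asw), 3))
# The elongated embedding scores markedly lower even though both embeddings
# separate the two cell types equally well along the informative axis.
```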

Concerning batch effect removal, irregular cluster geometries are the default for batch ASW (global), where all cells from a given batch are forced into a single cluster regardless of cell type diversity, producing erratic scores that fail to reflect integration quality (Extended Data Fig. 2), which is why we generally discourage its use. Additionally, the definition of \({b}_{i}\) in silhouette (Eq. (1)) as the mean distance between a cell i and all other cells in the nearest (neighboring) other cluster \({C}_{l}\) is problematic in the batch removal context, affecting both batch ASW (global) and the cell type-adjusted batch ASW. For simplicity, consider integrating multiple datasets (samples) with a single cell type, where the aim is to score cluster overlap rather than separation. A value of \({s}_{i}\) around 0 is attainable if a given cluster overlaps with just one other cluster while remaining very distinct from all remaining ones. Thus, silhouette-based batch removal metrics can yield maximal scores when every sample is integrated with only a subset of the other samples despite strong remaining batch effects (Fig. 1d), an issue we call the ‘nearest-cluster issue’.
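
The nearest-cluster issue can likewise be reproduced in a toy simulation in the spirit of Fig. 1d (a single cell type, four batches): each batch overlaps with exactly one other batch, yet the batch-label silhouette rewards the embedding with a near-maximal mixing score.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
n = 200

# Single cell type, four batches: A and B overlap around (0, 0),
# C and D overlap around (10, 0); a strong residual batch effect
# separates the two pairs.
X = np.vstack([
    rng.normal([0, 0], 1.0, size=(n, 2)),   # batch A
    rng.normal([0, 0], 1.0, size=(n, 2)),   # batch B
    rng.normal([10, 0], 1.0, size=(n, 2)),  # batch C
    rng.normal([10, 0], 1.0, size=(n, 2)),  # batch D
])
batch = np.repeat(["A", "B", "C", "D"], n)

s = silhouette_samples(X, batch, metric="euclidean")
# Close to 1 ('perfect' mixing) although A/B and C/D are not integrated,
# because b_i only considers each cell's nearest other batch.
print(np.mean(1 - np.abs(s)))
```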

These limitations are also readily apparent in real datasets. For simplicity, we limit our analyses to healthy samples and treat interdonor variation as negligible noise. A discussion of strategies for evaluating heterogeneous sample integration can be found in Supplementary Note 2. We discovered the nearest-cluster issue for batch ASW in the context of the NeurIPS 2021 challenge13, where the benchmark data have a nested experimental design and intersite technical variation is larger than intrasite variation between samples of distinct donors. Choosing a single-cell RNA sequencing (scRNA-seq) subset (‘minimal example’) of these data with four batches nested into two groups (sites), we compare metric performance on unintegrated data, suboptimally integrated data and data effectively integrated and optimized with respect to batch removal using liam14 (Fig. 2a). Batch ASW fails to rank embeddings accurately and even favors worse embeddings with stronger batch effects (Fig. 2b), with the same observations applying to the full dataset (Extended Data Fig. 3b). Cell type ASW assigns almost identical scores to unintegrated and suboptimally integrated embeddings of the minimal example and the full data (Fig. 2b and Extended Data Fig. 3b), reflecting fundamental limitations in its discriminative power.

Fig. 2: Silhouette-based metrics are unreliable for assessing bio-conservation and batch effect removal.

a, Uniform manifold approximation and projection (UMAP) plots of NeurIPS minimal example embeddings integrated with increasing success, colored by cell type and sample. b,d, Batch removal metrics: batch ASW, BRAS and an alternative cell-type-adjusted diversity score, CiLISI. Bio-conservation metrics: cell type ASW and adjusted Rand index (ARI). c, UMAP plots of healthy HLCA embeddings integrated with increasing success, colored by cell type and dataset, shown for a consistent random 10% data subset. Suboptimal embeddings were obtained through batch-aware highly variable gene (HVG) selection for specified batch variables.

The violation of silhouette’s assumptions and the resulting unreliability are not limited to datasets with controlled nested experimental designs. We demonstrate this by extending our analysis to two recent atlas-level studies, which differ in batch effect severity, cell type complexity and granularity of provided annotations: the healthy subset of the Human Lung Cell Atlas (HLCA)15 and the genetically diverse Human Breast Cell Atlas (HBCA)16. We compare author-provided integrated embeddings to unintegrated and naively integrated embeddings (Fig. 2c and Extended Data Fig. 4a). For HLCA, the batch ASW metric shows limited discriminative power but ranks embeddings correctly (Fig. 2d), whereas, for HBCA, it inversely ranks embeddings, favoring the worst integration (Extended Data Fig. 4b). Regarding bio-conservation, cell type ASW indicates comparable performance for naive and integrated embeddings in HLCA (Fig. 2d). However, in HBCA, which has well-separated cell types and limited batch effects, cell type ASW retrieves the expected ranking (Extended Data Fig. 4b).

Single-cell integration benchmarking is an area of active research that has seen large-scale coordinated efforts and typically includes a multitude of metrics extending beyond silhouette-based metrics10,11,12,17,18. It has been unanimously suggested that two classes of metrics should be considered to score horizontal data integration: batch removal and bio-conservation metrics10,11,18, which we introduce in detail in Supplementary Note 1.

Concerning alternatives to silhouette for evaluating batch effect removal that are robust to the nearest-cluster issue, we find that combining a cell-type-adjusted local-mixing batch removal metric with bio-conservation metrics evaluated at the cell type level is a successful strategy. For example, applying CiLISI (cell type integration local inverse Simpson’s index)19 together with the adjusted Rand index (ARI) leads to accurate rankings across datasets, with the bio-conservation metric flagging overcorrection (Extended Data Fig. 5). It is also possible to ‘fix’ the silhouette-based metric batch ASW to be robust to the nearest-cluster issue by redefining \({b}_{i}\) as the mean distance between a cell i and all other cells in any other cluster \({C}_{l}\) with \(l\ne k\) (a sketch of this redefinition is shown below). Changing Euclidean to cosine distance results in higher discriminative power. We call this metric batch-removal-adapted silhouette (BRAS; available through the scib-metrics package as of version 0.5.5; further details in Extended Data Figs. 3–6 and Methods, including a BRAS variant considering the furthest other cluster). Like CiLISI, the BRAS metric accurately ranks all real and simulated scRNA-seq data (Fig. 2b,d and Extended Data Figs. 3b–5b and 6). The notable BRAS–CiLISI score divergence in HLCA embeddings (Fig. 2d) reflects their distinct emphases; while CiLISI evaluates (cell-type-adjusted) local batch mixing, BRAS is less sensitive to local compositional differences. Metric selection and weighting should align with integration objectives, as discussed in Supplementary Note 2; a discussion of how the other identified silhouette limitations affect BRAS is provided in Supplementary Note 4.

Searching for alternatives to the unreliable silhouette for evaluating bio-conservation at the cell type annotation level, we find that cLISI exhibits low discriminative power, whereas the external clustering metrics ARI and normalized mutual information (NMI) reliably rank embeddings as anticipated (Fig. 2b,d and Extended Data Figs. 3b–5b and 6). Details on how clustering strategies influence ARI and NMI can be found in Supplementary Note 3; additional metrics scoring other aspects of horizontal integration are presented in Supplementary Note 1.
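
As a hedged illustration of the \({b}_{i}\) redefinition referenced above, the sketch below computes a BRAS-like score under one reading of that definition, pooling all cells outside a cell’s batch when computing \({b}_{i}\), stratifying by cell type as for batch ASW and using cosine distance; the reference implementation in scib-metrics (version 0.5.5 and later) may differ in details such as scaling and the handling of edge cases.

```python
import numpy as np
from scipy.spatial.distance import cdist

def bras_like(X, batch_labels, cell_type_labels, metric="cosine"):
    """Sketch of a BRAS-like batch removal score (illustrative only).

    Per cell type, a batch silhouette is computed in which b_i is the mean
    distance from cell i to the cells of *all* other batches pooled together,
    rather than only the nearest one; 1 - |s_i| is then averaged per batch and
    across cell types. See scib-metrics (>= 0.5.5) for the actual metric.
    """
    per_cell_type = []
    for ct in np.unique(cell_type_labels):
        ct_mask = cell_type_labels == ct
        Xc, bc = X[ct_mask], batch_labels[ct_mask]
        if len(np.unique(bc)) < 2:
            continue  # batch mixing is undefined for a single batch
        batch_scores = []
        for k in np.unique(bc):
            own, other = Xc[bc == k], Xc[bc != k]
            if len(own) < 2:
                continue
            # a_i: mean distance to the other cells of the same batch
            a = cdist(own, own, metric=metric).sum(axis=1) / (len(own) - 1)
            # b_i: mean distance to all cells of any other batch (pooled)
            b = cdist(own, other, metric=metric).mean(axis=1)
            s = (b - a) / np.maximum(a, b)
            batch_scores.append(np.mean(1 - np.abs(s)))
        per_cell_type.append(np.mean(batch_scores))
    return float(np.mean(per_cell_type))
```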

Our investigation reveals the inadequacy of currently prevalent silhouette-based evaluation metrics for assessing data integration, caused by the violation of silhouette’s underlying assumptions. Silhouette’s inability to handle biologically realistic, nonconvex clusters persists across bio-conservation and batch removal evaluation, with the nearest-cluster issue further compounding batch removal evaluation. We outline robust alternatives, including a batch removal metric that adjusts silhouette to be more robust to the discussed limitations, and urge discontinuing unadjusted silhouette-based metrics in data integration benchmarking. This is required to ensure reliable method assessment, as method choice impacts downstream analyses.
