Unlocking the secrets of cellular similarity: how SCimilarity transforms single-cell data into insights on disease, development, and tissue biology.
In a recent study published in the journal Nature, researchers in Canada and the United States developed Single-Cell Similarity (SCimilarity), a framework for rapid, interpretable searches of single-cell or single-nucleus ribonucleic acid sequencing (sc/snRNA-seq) data. This framework enables the discovery of similar cell states across the Human Cell Atlas.
Background
Over 100 million cells have been profiled using sc/snRNA-seq across various conditions, providing unprecedented opportunities to link cell states across development, tissues, and diseases. However, large-scale analyses remain limited by the difficulty of harmonizing datasets and defining shared representations, and by a lack of robust similarity metrics and scalable search methods.
Current approaches often fail to generalize across datasets and cannot efficiently query massive atlases for similar cell profiles. Further research is needed to develop foundational models that enable accurate, scalable, and interpretable searches, unlocking the full potential of single-cell atlases to advance biological discovery.
About the study
scRNA-seq has profiled millions of individual cells across various tissues, conditions, and diseases, offering transformative opportunities to link cellular states across contexts.
Effective comparisons between datasets, however, remain limited due to challenges in harmonizing diverse data, defining common representations, and developing accurate metrics to quantify cellular similarity.
Existing models preserve dataset-specific information but often fail to generalize across datasets or to search large atlases efficiently for comparable cell states.
Metric learning, a technique successfully applied in fields like image processing, offers a promising solution. By embedding cell profiles into a shared low-dimensional space, it becomes possible to identify biologically similar cells across vast datasets. Such representations could enable scalable, interpretable searches for cells in diverse contexts, facilitating cross-dataset comparisons and biological discovery.
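To make the idea of a shared embedding space concrete, here is a minimal sketch in Python. It assumes a triplet loss on cosine distance, which is one common metric-learning objective; the article does not state which objective SCimilarity uses. The function names, the margin, and the three-dimensional toy embeddings are all hypothetical and are not taken from the SCimilarity software.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.05):
    """Hinge-style triplet loss on cosine distance (1 - similarity).

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin` (an assumed value).
    """
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def search(query, reference, k=2):
    """Return the ids of the k reference cells most similar to the query."""
    ranked = sorted(
        reference,
        key=lambda cell_id: cosine_similarity(query, reference[cell_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (hypothetical values, not real data).
reference = {
    "macrophage_1": [0.9, 0.1, 0.0],
    "macrophage_2": [0.8, 0.2, 0.1],
    "t_cell_1": [0.0, 0.1, 0.9],
}
query = [1.0, 0.1, 0.0]

top_hits = search(query, reference, k=2)
# A well-separated triplet incurs no loss; a mislabeled one is penalized.
loss_good = triplet_loss(query, reference["macrophage_1"], reference["t_cell_1"])
loss_bad = triplet_loss(query, reference["t_cell_1"], reference["macrophage_1"])
print(top_hits, loss_good, loss_bad)
```

In this toy setup, training would push the loss toward zero for every triplet, so that a simple similarity ranking retrieves cells of the same state. At atlas scale, the exhaustive sort would typically be replaced by an approximate nearest-neighbor index.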
Study results
SCimilarity demonstrated generalization across diverse single-cell profiling platforms. Although trained primarily on 10x Genomics Chromium data, it effectively embedded and annotated cell profiles from multiple platforms, including scRNA-seq and snRNA-seq datasets.
For example, human peripheral blood mononuclear cell (PBMC) samples profiled across seven platforms exhibited consistent cross-platform annotation precision, except for rare cell types like conventional dendritic cells (cDCs) and plasmacytoid dendritic cells (pDCs).
While minor differences in embedding distances were observed, particularly for non-10x platforms such as Switching Mechanism At 5' End of RNA Template sequencing (SMART-Seq2), SCimilarity maintained high performance, showcasing its adaptability to diverse data sources.
A key advantage of SCimilarity is its ability to integrate datasets without explicit batch correction. By quantifying representation confidence for individual cells, the model identifies outliers and assesses its generalization to new data. For example, low-confidence annotations were associated with poorly represented tissues in training data, such as the stomach and bladder. This capability enabled the construction of an atlas spanning 30 human tissues and facilitated pan-tissue comparisons.
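The article does not describe how representation confidence is computed. As a rough illustration only, the sketch below assumes that confidence can be approximated by how far a cell's embedding lies from its nearest neighbors in the training data. The function name, the threshold, and the two-dimensional toy embeddings are hypothetical.

```python
import math

def mean_knn_distance(cell, training_embeddings, k=2):
    """Mean Euclidean distance from a cell to its k nearest training cells.

    A large value suggests the cell is poorly represented in training data.
    """
    dists = sorted(math.dist(cell, ref) for ref in training_embeddings)
    return sum(dists[:k]) / k

# Toy training embeddings clustered near the origin (hypothetical values).
training = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]

new_cells = {
    "well_represented": [0.05, 0.05],
    "poorly_represented": [3.0, 4.0],
}

THRESHOLD = 0.5  # assumed cutoff; in practice this would need calibration

scores = {name: mean_knn_distance(emb, training) for name, emb in new_cells.items()}
flagged = [name for name, score in scores.items() if score > THRESHOLD]
print(scores, flagged)
```

Under this assumption, cells from tissues that are sparse in the training data would tend to receive high distance scores and could be flagged for manual review.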
The model also excelled in annotating cell types through its embedding-based similarity measure. SCimilarity annotated individual cells independently, circumventing the need for clustering and retrieving the most similar cells efficiently. It achieved competitive accuracy with existing methods like single-cell ANnotation using Variational Inference (scANVI) and CellTypist, even matching fine-grained annotations supported by protein markers. For example, SCimilarity annotated 86.5% of cells in healthy kidney samples correctly when compared to author-provided labels, performing on par with tissue-specific models.
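Annotating each cell independently from its nearest neighbors can be sketched as a k-nearest-neighbor majority vote. This is an illustrative assumption about how an embedding-based annotator might work, not SCimilarity's actual procedure. The reference embeddings, labels, and the `annotate` helper are hypothetical.

```python
import math
from collections import Counter

def annotate(cell, reference, k=3):
    """Label one cell by majority vote among its k nearest reference cells.

    `reference` is a list of (embedding, label) pairs. No clustering is
    needed: each query cell is handled on its own.
    """
    nearest = sorted(reference, key=lambda item: math.dist(cell, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy labeled reference embeddings (hypothetical values).
reference = [
    ([0.0, 0.0], "T cell"),
    ([0.1, 0.1], "T cell"),
    ([0.2, 0.0], "T cell"),
    ([1.0, 1.0], "B cell"),
    ([1.1, 0.9], "B cell"),
    ([0.9, 1.1], "B cell"),
]

queries = [[0.05, 0.05], [1.0, 0.95]]
labels = [annotate(q, reference) for q in queries]
print(labels)
```

Because each query is scored independently, a small or rare population is not forced to share a label with a larger cluster it happens to sit next to.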
SCimilarity’s interpretability was validated using Integrated Gradients, which identified critical gene contributions to cell type annotations. These gene attributions aligned well with known markers for major cell types, such as surfactant genes distinguishing lung alveolar type 2 (AT2) cells. This demonstrates SCimilarity's capacity to capture biologically meaningful features without prior knowledge of cell type-specific signatures.
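Integrated Gradients attributes a model's output to each input feature by accumulating gradients along a straight path from a baseline input to the actual input. The sketch below applies it to a toy logistic scorer standing in for a neural network. The gene weights, expression values, and all-zero baseline are hypothetical choices for illustration and do not come from the study.

```python
import math

# Hypothetical weights for a toy "AT2 cell" scorer. SFTPC and SFTPB are
# surfactant genes; CD3E is a T cell marker.
WEIGHTS = {"SFTPC": 2.0, "SFTPB": 1.5, "CD3E": -1.0}

def score(expr):
    """Logistic score of a weighted sum of gene expression values."""
    z = sum(WEIGHTS[g] * expr[g] for g in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def gradient(expr):
    """Analytic gradient of `score` with respect to each gene."""
    s = score(expr)
    return {g: s * (1.0 - s) * WEIGHTS[g] for g in WEIGHTS}

def integrated_gradients(expr, baseline, steps=200):
    """Midpoint Riemann approximation of the Integrated Gradients integral."""
    avg_grad = {g: 0.0 for g in expr}
    for i in range(steps):
        alpha = (i + 0.5) / steps
        point = {g: baseline[g] + alpha * (expr[g] - baseline[g]) for g in expr}
        grad = gradient(point)
        for g in expr:
            avg_grad[g] += grad[g] / steps
    # Scale the path-averaged gradient by the input's offset from baseline.
    return {g: (expr[g] - baseline[g]) * avg_grad[g] for g in expr}

cell = {"SFTPC": 1.2, "SFTPB": 0.8, "CD3E": 0.0}
baseline = {g: 0.0 for g in cell}

attr = integrated_gradients(cell, baseline)
top_gene = max(attr, key=attr.get)
print(attr, top_gene)
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the score at the input and the score at the baseline.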
The model’s query capabilities were tested using fibrosis-associated macrophages (FMΦs) and myofibroblasts in interstitial lung disease (ILD). SCimilarity identified FMΦ-like cells across ILD datasets, cancers, and other fibrotic diseases, revealing shared cellular states. Notably, it uncovered FMΦs in rare contexts, such as pancreatic ductal adenocarcinoma (PDAC), suggesting their broader relevance in fibrosis.
To further explore its utility, the researchers used SCimilarity to search for FMΦ-like cells among in vitro samples. Surprisingly, it identified cells cultured in a 3D hydrogel system as transcriptionally similar to FMΦs. Experimental validation confirmed this prediction, demonstrating the framework's potential to identify novel experimental conditions and to model disease-relevant cell states in vitro.
Conclusions
To summarize, SCimilarity advances single-cell analysis by enabling scalable and efficient searches across diverse scRNA-seq and snRNA-seq datasets.
Built on metric learning, it provides annotation and querying of cell profiles, leveraging full expression profiles to reduce biases from curated gene signatures. SCimilarity excels in identifying transcriptionally similar cells, facilitating discoveries of novel states like FMΦs and myofibroblasts across diseases.
Its ability to generalize to unseen datasets and its open-source availability make it a foundational tool for exploring the Human Cell Atlas, supporting diverse biological investigations, and uncovering insights into human biology and disease mechanisms.