By integrating DNA sequence and epigenetic context, CpGPT sets new standards for predicting aging-related outcomes, offering unprecedented accuracy in assessing mortality and disease risk across various datasets.
Study: CpGPT: a Foundation Model for DNA Methylation.
*Important notice: bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.
In a recent preprint* posted to the bioRxiv server, a team of researchers introduced the Cytosine-phosphate-Guanine Pretrained Transformer (CpGPT), a transformer-based foundation model for deoxyribonucleic acid (DNA) methylation designed to improve analysis and prediction across diverse tissues and conditions.
Background
Since the advent of the transformer architecture, artificial intelligence has progressed rapidly, especially through foundation models and large language models (LLMs) that use self-attention to capture complex patterns. Transformers have also had a significant impact on biology and medicine, advancing single-cell transcriptomics and revealing previously unknown biology with models like single-cell GPT (scGPT) and Geneformer. Despite progress in aging research, many epigenetic aging clocks still rely on simple linear models of CpG DNA methylation data, which often overlook sequence context and complex interactions between sites. Only a few predictors, such as AltumAge and DeepMAge, employ deep neural networks. Further research is needed to develop advanced models that better capture the intricate mechanisms of aging.
About the study
To develop the CpGPT model, the researchers curated a comprehensive DNA methylation dataset named "CpGCorpus," aggregating data from more than 1,500 studies and over 100,000 human samples available in the Gene Expression Omnibus. The dataset spanned several Illumina methylation array platforms and represented a rich diversity of tissue types, developmental stages, disease conditions, and demographic backgrounds. Raw data were processed using the SeSAMe pipeline, while normalized beta value matrices were used for data that had already been processed. Quality control measures and probe harmonization were applied to ensure consistency across the dataset. The data were split into training, validation, and test sets, with no sample or study appearing in more than one set.
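A split with no overlapping studies can be illustrated by assigning whole studies, rather than individual samples, to each partition. The following is a minimal sketch under stated assumptions: the function name, the 80/10/10 fractions, and the toy sample and study identifiers are hypothetical and are not taken from the preprint.

```python
import random

def split_by_study(sample_to_study, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole studies to train/val/test so no study spans two splits.

    `fractions` holds the (train, val) shares; the remainder goes to test.
    """
    studies = sorted(set(sample_to_study.values()))
    rng = random.Random(seed)
    rng.shuffle(studies)

    n_train = int(round(fractions[0] * len(studies)))
    n_val = int(round(fractions[1] * len(studies)))

    split_of_study = {}
    for i, study in enumerate(studies):
        if i < n_train:
            split_of_study[study] = "train"
        elif i < n_train + n_val:
            split_of_study[study] = "val"
        else:
            split_of_study[study] = "test"

    # Every sample follows the split of the study it came from.
    splits = {"train": [], "val": [], "test": []}
    for sample, study in sample_to_study.items():
        splits[split_of_study[study]].append(sample)
    return splits

# Toy example: 10 hypothetical studies with 3 samples each.
toy = {f"sample_{s}_{i}": f"study_{s}" for s in range(10) for i in range(3)}
splits = split_by_study(toy)
```

Splitting at the study level matters because samples from the same study share batch effects; a sample-level split could let the model score well on the test set simply by recognizing a study it has already seen.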
The CpGPT model integrated sequence, positional, and epigenetic information. Input representations included embeddings of the nucleotide sequences obtained from a pre-trained DNA language model, methylation beta values representing the methylation state of each site, and genomic positional encodings capturing each CpG site's location within the genome. A dual positional encoding strategy was employed, combining absolute and relative positional encodings to capture multi-scale genomic information. Specialized decoders were designed for beta value prediction, condition prediction, and uncertainty estimation.
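One way to picture how three kinds of information become a single input per CpG site is to sum a sequence embedding, a projection of the beta value, and a positional encoding. The sketch below is illustrative only: it uses a generic sinusoidal encoding of absolute position, random vectors in place of the DNA language model output and the learned projection, and it does not reproduce the preprint's dual absolute and relative scheme. All function and variable names are assumptions.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Standard sinusoidal encoding of integer positions (dim must be even)."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)[None, :]
    enc = np.empty((positions.shape[0], dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

def build_inputs(seq_emb, betas, positions, w_beta):
    """Sum sequence, methylation-state, and positional embeddings per CpG site."""
    dim = seq_emb.shape[1]
    # Lift each scalar beta value into the embedding space with one vector.
    beta_emb = np.asarray(betas)[:, None] * w_beta[None, :]
    pos_emb = sinusoidal_encoding(positions, dim)
    return seq_emb + beta_emb + pos_emb

rng = np.random.default_rng(0)
n_sites, dim = 5, 8
seq_emb = rng.normal(size=(n_sites, dim))   # stand-in for DNA language model output
betas = rng.uniform(0, 1, size=n_sites)     # methylation beta values in [0, 1]
positions = np.array([10_000, 10_250, 48_000, 1_200_000, 1_200_130])
w_beta = rng.normal(size=dim)               # stand-in for a learned projection
tokens = build_inputs(seq_emb, betas, positions, w_beta)  # shape (n_sites, dim)
```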
Pretraining was conducted using a multi-task learning approach with tailored loss functions, optimizing the model's ability to reconstruct missing data and learn meaningful sample representations. For fine-tuning, CpG sites associated with mortality were selected based on intra-class correlation coefficients and z-score thresholds. The model was then trained using a modified Cox proportional hazards loss. Predictive performance for mortality and morbidity was evaluated across multiple cohorts using Cox regression models, receiver operating characteristic analyses, and survival analyses, with adjustment for age.
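For readers unfamiliar with Cox-based training, the sketch below computes the standard negative Cox partial log-likelihood for a set of predicted risk scores. The preprint's modification to this loss is not described here, so this is the textbook form (with Breslow-style handling of tied times), not the authors' version; the function name and toy values are assumptions.

```python
import numpy as np

def cox_neg_partial_log_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk  : predicted log-hazard score per subject (higher = riskier)
    time  : follow-up time per subject
    event : 1 if death was observed, 0 if the subject was censored
    """
    risk = np.asarray(risk, dtype=np.float64)
    time = np.asarray(time, dtype=np.float64)
    event = np.asarray(event, dtype=bool)

    # at_risk[i, j] is 1 when subject j is still under follow-up at time[i].
    at_risk = (time[None, :] >= time[:, None]).astype(np.float64)

    # Log-sum-exp of risk scores over each risk set, shifted for stability.
    shift = risk.max()
    log_denominator = np.log(at_risk @ np.exp(risk - shift)) + shift

    # Only subjects with an observed event contribute a term.
    log_likelihood = (risk - log_denominator)[event].sum()
    return float(-log_likelihood / max(event.sum(), 1))

# Toy cohort: the third subject is censored at time 8.
loss = cox_neg_partial_log_likelihood(
    risk=[2.0, 0.5, -1.0, 0.0], time=[2, 5, 8, 6], event=[1, 1, 0, 1]
)
```

The loss falls when subjects who die earlier receive higher risk scores than those still at risk, which is what lets a model be trained on censored survival data without predicting exact times of death.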
Study results
The researchers pretrained CpGPT on CpGCorpus, which includes over 100,000 human DNA methylation samples from more than 1,500 studies covering a diverse range of tissue types, developmental stages, and disease conditions. The data were thoroughly preprocessed and harmonized to ensure consistency across various Illumina methylation array platforms, such as the HumanMethylation450 BeadChip (450k), HumanMethylation27 BeadChip (27k), Infinium MethylationEPIC BeadChip (EPIC), EPIC+, and EPICv2 arrays.
CpGPT integrates three key types of contextual information: sequence context based on the DNA nucleotides near each CpG site, positional context covering local and global information, and epigenetic state. Sequence context is encoded using embeddings of the nucleotide sequences surrounding each CpG site, derived from a pre-trained DNA language model. To capture positional context, the model orders sequence embeddings by genomic position, groups them by chromosome, and applies stochastic shuffling to prevent positional biases. Each CpG site's methylation state is transformed into an embedding representing its epigenetic status, and these embeddings are combined to form the model's input.
The core architecture of CpGPT is based on the Transformer++ model, an enhanced version of the transformer architecture with modifications for increased training stability and accuracy. The model is trained in an unsupervised manner to predict methylation states (beta values) and their uncertainties, enabling it to generate meaningful sample-level embeddings that encapsulate comprehensive methylation profiles. The training process employs multiple loss functions to optimize various performance aspects and is designed to handle missing data effectively.
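Predicting a value together with its uncertainty while skipping missing entries can be illustrated with a masked Gaussian negative log-likelihood. This is a generic sketch and an assumption on our part: the specific loss functions used in the preprint are not described in this article, and the function name and toy numbers are hypothetical.

```python
import numpy as np

def masked_gaussian_nll(beta_true, beta_pred, log_var):
    """Gaussian negative log-likelihood over observed entries only.

    beta_true : measured beta values, with NaN marking missing probes
    beta_pred : predicted beta values
    log_var   : predicted log-variance (the model's uncertainty) per probe
    The constant 0.5 * log(2 * pi) term is omitted.
    """
    beta_true = np.asarray(beta_true, dtype=np.float64)
    beta_pred = np.asarray(beta_pred, dtype=np.float64)
    log_var = np.asarray(log_var, dtype=np.float64)

    observed = ~np.isnan(beta_true)          # missing probes add no loss
    diff = beta_true[observed] - beta_pred[observed]
    lv = log_var[observed]
    return float(np.mean(0.5 * (lv + diff**2 / np.exp(lv))))

# Toy sample with one missing probe (NaN) that is ignored by the loss.
loss = masked_gaussian_nll(
    beta_true=[0.2, np.nan, 0.9],
    beta_pred=[0.2, 0.5, 0.8],
    log_var=[0.0, 0.0, 0.0],
)
```

Under a loss of this shape, a model is penalized both for large errors and for claiming low uncertainty where its errors are large, so the predicted variance becomes a usable confidence signal.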
Evaluations using dimensionality reduction techniques revealed that CpGPT's locus embeddings naturally reflect functional genomic annotations, with CpG sites clustering according to features like island status and chromatin states. Sample embeddings effectively captured biological variations, clustering samples according to tissue types and cell lines. The model demonstrated the ability to perform zero-shot reference mapping, which allows it to transfer labels from reference datasets with known annotations to new target datasets without additional training.
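Zero-shot reference mapping of this kind can be approximated by a nearest-neighbor vote in embedding space: each unlabeled sample takes the majority label of its closest reference samples. The sketch below is a minimal illustration, assuming cosine similarity and k = 3; the preprint's exact procedure may differ, and the function name and toy embeddings are hypothetical.

```python
import numpy as np
from collections import Counter

def transfer_labels(ref_emb, ref_labels, query_emb, k=3):
    """Label each query sample by majority vote among its k nearest references."""
    # Normalize rows so that a dot product equals cosine similarity.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    qry = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = qry @ ref.T                           # shape (n_query, n_ref)
    nearest = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar
    return [
        Counter(ref_labels[j] for j in row).most_common(1)[0][0]
        for row in nearest
    ]

# Toy 2-D sample embeddings standing in for model output.
ref_emb = np.array([[1.0, 0.1], [0.9, 0.0], [1.0, -0.1],
                    [0.1, 1.0], [0.0, 0.9], [-0.1, 1.0]])
ref_labels = ["blood", "blood", "blood", "brain", "brain", "brain"]
query_emb = np.array([[0.8, 0.05], [0.05, 0.7]])
predicted = transfer_labels(ref_emb, ref_labels, query_emb)
```

No model weights are updated in this step, which is what makes the mapping "zero-shot": the quality of the transferred labels depends entirely on how well the pretrained embeddings separate the classes.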
CpGPT showed strong performance in imputing missing methylation data, accurately reconstructing beta values for missing probes, and improving the performance of various epigenetic clocks. Through its attention mechanism, CpGPT dynamically weights features, allowing sample-specific interpretation by assigning importance scores to each CpG site. This highlighted biologically relevant genes important for tissue-specific epigenetic regulation.
When fine-tuned for mortality prediction, CpGPT was predictive of mortality across multiple cohorts, effectively stratifying individuals based on their biological aging profiles. Its predictions were significantly associated with mortality and morbidity outcomes, including the risk of neurodegenerative and cardiovascular disease, as well as with measures of physical function.
Conclusions
To summarize, CpGPT effectively integrates sequence context, positional information, and epigenetic state to learn rich embeddings at both the CpG site and sample levels. The model excels in tasks such as imputing missing methylation values, array conversion, zero-shot reference mapping, and predicting age and mortality. By capturing complex dependencies among CpG sites, CpGPT overcomes the limitations of traditional linear models, enhancing predictive capabilities for aging-related outcomes and disease risks across various datasets.
Journal reference:
- Preliminary scientific report.
CpGPT: a Foundation Model for DNA Methylation, Lucas Paulo de Lima Camillo, Raghav Sehgal, Jenel Armstrong, Albert T. Higgins-Chen, Steve Horvath, Bo Wang, bioRxiv 2024.10.24.619766; doi: 10.1101/2024.10.24.619766, https://www.biorxiv.org/content/10.1101/2024.10.24.619766v1