RhoFold+ delivers a leap in RNA 3D structure prediction, combining speed and accuracy to tackle data scarcity and unlock new possibilities in drug development and synthetic biology.
Study: Accurate RNA 3D structure prediction using a language model-based deep learning approach. Image Credit: Christoph Burgstedt / Shutterstock
In a recent study published in the journal Nature Methods, researchers developed Ribonucleic Acid (RNA) High-Order Folding Prediction Plus (RhoFold+), a deep learning-based method that uses an RNA language model to accurately predict RNA 3D structures. The method addresses the challenges posed by RNA's intrinsic structural flexibility and the scarcity of experimentally determined data.
Background
RNA molecules play a central role in molecular biology, influencing gene regulation and serving as promising targets for drug development and synthetic biology. Despite the importance of RNA structure in understanding function, most RNA molecules remain structurally uncharacterized: as of December 2023, RNA-only structures accounted for less than 1% of the entries in the Protein Data Bank (PDB). Experimental techniques like X-ray crystallography, Nuclear Magnetic Resonance (NMR), and Cryogenic Electron Microscopy (cryo-EM) are constrained by their specialized requirements, while computational methods, both template-based and de novo, are limited by data scarcity and computational cost. RhoFold+ was developed to address these challenges and to balance speed, accuracy, and accessibility in RNA structure prediction.
About the study
The RhoFold+ platform integrates advanced methodologies for RNA 3D structure prediction, combining Multiple Sequence Alignment (MSA)-based and deep learning approaches to improve accuracy and efficiency. MSAs, generated using the Infernal and Recursive MSA (rMSA) tools, capture conserved secondary structures from databases like the RNA Families Database (Rfam) and the RNA Central Database (RNAcentral). To manage memory constraints, 256 aligned sequences were selected from the MSA, either randomly or via clustering, and used as input for standard predictions or for the optimized high-confidence models referred to as RhoFold+ (TopK).
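The random selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `subsample_msa`, the choice to always keep the query sequence as the first row, and the fixed random seed are all assumptions made for the example, and the clustering-based alternative is not shown.

```python
import random


def subsample_msa(msa, max_seqs=256, seed=0):
    """Return at most `max_seqs` rows of an alignment.

    `msa` is a list of aligned sequences whose first row is assumed to be
    the query. The query is always kept (an assumption of this sketch);
    the remaining rows are drawn uniformly at random without replacement.
    """
    if not msa:
        raise ValueError("MSA must contain at least the query sequence")
    if len(msa) <= max_seqs:
        return list(msa)  # small alignments are used unchanged
    rng = random.Random(seed)  # seeded so the selection is reproducible
    query, others = msa[0], msa[1:]
    return [query] + rng.sample(others, max_seqs - 1)
```

Capping the number of rows bounds the memory the model needs for the alignment, at the cost of discarding some evolutionary signal in very deep alignments.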
Central to RhoFold+ is the RNA Foundation Model (RNA-FM), built on a transformer architecture inspired by Bidirectional Encoder Representations from Transformers (BERT). Pretrained on ~23.7 million non-coding (nc)RNA sequences from RNAcentral, RNA-FM effectively captured sequence dependencies through masked token prediction. A self-distillation dataset, combining experimental annotations with pseudo-structural labels, further enhanced the model's accuracy. Postprocessing with tools like Assisted Model Building with Energy Refinement (AMBER) resolved structural inaccuracies, ensuring biologically valid predictions.
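To make masked token prediction concrete, the sketch below hides a fraction of the nucleotides in a sequence and records the original bases that a language model would then be trained to recover. It is a simplified illustration under stated assumptions: the 15% masking rate follows the original BERT convention rather than anything reported in this article, the `<mask>` token and the function name `mask_tokens` are hypothetical, and BERT's additional trick of sometimes substituting random or unchanged tokens is omitted.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol used only in this sketch


def mask_tokens(sequence, mask_rate=0.15, seed=0):
    """Mask a random subset of nucleotides in an RNA sequence.

    Returns the masked token list and a dict mapping each masked position
    to the original nucleotide, i.e. the training targets.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    rng = random.Random(seed)
    tokens = list(sequence)
    # Mask at least one position so every sequence yields a training signal.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets
```

During pretraining, the model sees only the masked tokens and is scored on how well it predicts the entries in `targets`, which pushes it to learn dependencies between positions in the sequence.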
RhoFold+'s structure module uses geometric modeling and iterative recycling to predict 3D coordinates while enforcing biological constraints. Multi-level loss functions optimize the structural predictions across multiple dimensions, further refining accuracy. Benchmarking against methods like DeepFoldRNA and AlphaFold3 on targets from the 15th Critical Assessment of Structure Prediction (CASP15) demonstrated RhoFold+'s superior performance and rapid predictions: using only the RNA sequence as input, it achieved accurate results across diverse RNA structures.
Study results
The development of RhoFold+ represents a significant advancement in RNA 3D structure prediction by addressing the limitations of existing datasets and computational approaches. A curated dataset of single-chain RNA sequences was prepared using representative RNA structures from the PDB, clustered at 80% sequence similarity. This process resulted in 782 unique sequence clusters from 5,583 RNA chains, which were processed through RhoFold+. The RNA-FM language model was employed to extract evolutionary and structural embeddings, while MSAs generated from extensive sequence databases were incorporated into Rhoformer for iterative refinement. Key structural constraints, including secondary structure and base pairing, ensured the generation of biologically accurate models.
RhoFold+ underwent rigorous benchmarking against established methods on community challenges like RNA-Puzzles and CASP15. In RNA-Puzzles, RhoFold+ outperformed all other approaches on most targets, achieving an average root-mean-square deviation (r.m.s.d.) of 4.02 Å, a substantial improvement over the second-best method. Template Modeling (TM) scores also demonstrated superior global structural alignment, confirming the model's accuracy. Notably, RhoFold+ performed consistently well even when tested on datasets with minimal sequence and structural overlap with the training data, underscoring its robustness and generalization capabilities. Comparisons with the best single templates further validated RhoFold+'s capacity to produce predictions exceeding those derived from the most structurally similar training models.
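For readers unfamiliar with the metric, r.m.s.d. is the square root of the mean squared distance between corresponding atoms in the predicted and reference structures, so lower values mean a closer match. The sketch below computes it for two coordinate lists and assumes, for simplicity, that the structures have already been optimally superimposed; real evaluations perform that alignment first. The function name `rmsd` is chosen for this example only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) coordinates, in the same units as the input (e.g. angstroms).

    Assumes the two structures are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

Because every atom contributes its squared displacement, a few badly placed regions can dominate the value, which is one reason benchmarks also report alignment-based scores such as the TM score.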
On CASP15 natural RNA targets, RhoFold+ surpassed other leading methods, including expert-driven approaches, achieving notable accuracy improvements. Its predictions consistently exhibited lower r.m.s.d. values and higher Z-scores for structural alignment metrics like TM score and Global Distance Test-Total Score (GDT-TS). Even in challenging scenarios, such as predicting complex secondary and tertiary interactions, RhoFold+ demonstrated strong performance.
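A Z-score expresses how far one method's score on a target lies from the average of all methods, measured in standard deviations. The sketch below shows the basic calculation for a single target. It is an illustration only: the function name `z_scores` is hypothetical, the use of the population standard deviation is an assumption, and any outlier handling applied in the official CASP assessment is not reproduced here.

```python
import statistics


def z_scores(scores):
    """Convert a {method: score} mapping for one target into Z-scores.

    Each Z-score is (score - mean) / standard deviation across methods.
    If all methods score identically, every Z-score is defined as 0.0.
    """
    values = list(scores.values())
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return {name: 0.0 for name in scores}
    return {name: (value - mean) / spread for name, value in scores.items()}
```

For metrics where higher is better, such as TM score and GDT-TS, a larger Z-score indicates a stronger result relative to the field; for r.m.s.d., where lower is better, the sign is read the other way around.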
A comprehensive cross-validation across all experimentally determined RNA structures showed that RhoFold+ performs strongly. It maintained consistent accuracy across different data splits and generalized to unseen RNA structures, including new PDB entries. However, challenges remain in predicting RNA junctions and pseudoknots, which exhibit significant conformational flexibility.
RhoFold+ extends its utility beyond 3D structure prediction by accurately predicting RNA secondary structures and Interhelical Angles (IHAs). This expanded functionality highlights its potential applications in RNA engineering and functional studies, such as in synthetic biology.
Conclusions
To summarize, RhoFold+ integrates an RNA language model pretrained on ~23.7 million RNA sequences and incorporates strategies to augment limited training data. It outperforms other RNA structure prediction methods on RNA-Puzzles, where it achieves an average r.m.s.d. of 4.02 Å, and on CASP15 natural RNA targets. It is fast, efficient, and does not require expert knowledge. Additionally, the model handles diverse RNA types and families well, supporting its potential for broad application.
The model generalizes well across different training sets and accurately predicts unseen RNA structures in cross-family and cross-type validations. While challenges remain in predicting large, complex RNA structures, RhoFold+ represents a transformative step in RNA 3D structure prediction, bridging the gap between accuracy, speed, and accessibility.