Deep generative models to generate hypothetical SARS-CoV-2 spike sequences

Scientists at the University of Illinois at Urbana-Champaign have developed deep generative models to predict undiscovered sequences of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike protein. These hypothetical sequences could be useful for future pandemic preparedness. The study is currently available on the bioRxiv preprint server.

Study: PandoGen: Generating complete instances of future SARS-CoV2 sequences using Deep Learning. Image Credit: TimeStopper69 / Shutterstock

Background

Deep generative models are used to generate complete and realistic samples of different kinds of objects, such as images, text, and computer code. Among these models, Large Language Models (LLMs) have recently gained immense popularity because of their ability to follow human instructions and perform competitive programming at a human level.

Protein Language Models (PLMs) are based on LLM designs and can model biological sequences and generate samples with interesting properties.

In the current study, scientists explored novel methods to train a PLM to generate complete, self-contained, realistic, and not-yet-known samples of SARS-CoV-2 spike sequences. In general, LLMs are trained using a known data set to parameterize the probability distribution of the targeted data.
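To make the idea of "parameterizing the probability distribution" of sequence data concrete, the following is a minimal sketch, not the PandoGen method. It swaps the deep neural network for a simple bigram model that estimates the probability of each amino acid given the previous one. The function names, the smoothing constant, and the short toy fragments used as training data are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

START, END = "^", "$"              # hypothetical start/end markers
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Toy training fragments (illustrative only, not real training data).
TRAIN = ["MFVFLVLLPLVSSQ", "MFVFLVLLPLVSSQ", "MFVFFVLLPLVSSQ"]


def fit_bigram(sequences, alphabet=ALPHABET, alpha=1.0):
    """Estimate P(next | current) from counts, with add-alpha smoothing."""
    symbols = list(alphabet) + [END]
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        prev = START
        for ch in list(seq) + [END]:
            counts[prev][ch] += 1
            prev = ch
    model = {}
    for prev in [START] + list(alphabet):
        total = sum(counts[prev].values()) + alpha * len(symbols)
        model[prev] = {s: (counts[prev][s] + alpha) / total for s in symbols}
    return model


def log_likelihood(model, seq):
    """Log-probability the model assigns to a complete sequence."""
    prev, total = START, 0.0
    for ch in list(seq) + [END]:
        total += math.log(model[prev][ch])
        prev = ch
    return total


def sample(model, rng, max_len=50):
    """Draw one sequence, one residue at a time, until END or max_len."""
    prev, out = START, []
    while len(out) < max_len:
        symbols, probs = zip(*model[prev].items())
        ch = rng.choices(symbols, weights=probs, k=1)[0]
        if ch == END:
            break
        out.append(ch)
        prev = ch
    return "".join(out)


model = fit_bigram(TRAIN)
print(log_likelihood(model, TRAIN[0]))
print(sample(model, random.Random(0)))
```

A model of this kind assigns higher probability to sequences that resemble its training data, which is also why a generative model needs extra incentives if it is meant to produce sequences that differ from what it has already seen.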

The scientists primarily focused on the SARS-CoV-2 spike protein because of its significant involvement in the viral entry process and ability to induce host immune responses. The spike protein initiates SARS-CoV-2 entry into host cells by interacting with the host cell membrane receptor angiotensin-converting enzyme 2 (ACE2).

Many therapeutic and preventive interventions targeting the spike protein have been developed during the coronavirus disease 2019 (COVID-19) pandemic, including therapeutic monoclonal antibodies and COVID-19 vaccines. Thus, advance knowledge of future spike protein sequences would be helpful for developing novel variant-specific vaccines and monoclonal antibodies.

Important observations

The scientists developed a deep generative model, PandoGen, and trained it using spike sequences that were deposited in the GISAID (Global Initiative on Sharing All Influenza Data) database on or before June 15, 2021. The model's output was benchmarked against sequences reported after this date.
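The date-based split described above can be sketched as follows. This is an illustrative example under stated assumptions, not the study's pipeline: the record format (a sequence paired with a reporting date), the function name, and the toy sequence labels are hypothetical. Only the cutoff date comes from the article.

```python
from datetime import date

CUTOFF = date(2021, 6, 15)  # training cutoff reported in the article

# Toy records: (sequence label, date reported). Labels are placeholders.
RECORDS = [
    ("SEQ_A", date(2020, 12, 1)),
    ("SEQ_B", date(2021, 6, 15)),
    ("SEQ_B", date(2021, 8, 1)),
    ("SEQ_C", date(2021, 7, 2)),
    ("SEQ_A", date(2021, 9, 9)),
]


def temporal_split(records, cutoff=CUTOFF):
    """Split unique sequences by the date each was first reported.

    Sequences first reported on or before the cutoff go to the training
    set; sequences first reported afterwards form the held-out set.
    """
    first_seen = {}
    for seq, reported in records:
        if seq not in first_seen or reported < first_seen[seq]:
            first_seen[seq] = reported
    train = {s for s, d in first_seen.items() if d <= cutoff}
    future = {s for s, d in first_seen.items() if d > cutoff}
    return train, future


train, future = temporal_split(RECORDS)
print(sorted(train), sorted(future))
```

Splitting on the first reporting date keeps a sequence that was already known before the cutoff out of the held-out set, even if it was reported again later.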

The model's functional validation revealed that PandoGen can generate high-quality sample sequences of the spike protein that are significantly different from the training sequences. This could be because the model has explicit training constructs that prevent it from regenerating the training sequences and force it to generate sample sequences with significant differences.

The comparison of model-generated sample sequences with GISAID-derived sequences revealed PandoGen is capable of generating a high fraction of real sequences. The model also showed proficiency in generating novel sequences associated with GISAID cases.

Study significance

The study describes the development of a new method for training deep generative models to generate hypothetical SARS-CoV-2 spike sequences that have not yet been discovered but have the potential to drive future pandemics. The training pipeline used in the study relies on information that is available in GISAID and does not require any additional laboratory experiments for sequence characterization.

A comparison of PandoGen with a standard model shows that the new model is more proficient at generating a high fraction of real, salient, and novel sequences. Specifically, the new model outperforms the standard model by a factor of four in the number of novel sequences and by almost a factor of ten in the case counts of the generated corpus. Moreover, the study finds that about 70% of the higher-ranked sequences generated by the model were subsequently discovered, that is, reported after the training cutoff.
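As a rough illustration of how such comparisons might be scored, the sketch below counts novel sequences, checks how many of them appear in later data, and sums their case counts. The metric definitions here are assumptions made for this example and may differ from those used in the preprint; the function name and all toy values are hypothetical.

```python
def evaluate(generated, train, future_case_counts):
    """Score a generated corpus against held-out data.

    generated: list of generated sequences (may contain repeats)
    train: set of sequences seen during training
    future_case_counts: dict mapping each sequence reported after the
        training cutoff to its number of reported cases
    """
    unique = set(generated)
    novel = unique - set(train)  # not a copy of the training data
    novel_real = {s for s in novel if s in future_case_counts}
    return {
        "novel": len(novel),
        "novel_real": len(novel_real),
        "real_fraction": len(novel_real) / len(novel) if novel else 0.0,
        "case_count": sum(future_case_counts[s] for s in novel_real),
    }


# Toy inputs (illustrative placeholders, not study data).
train = {"AAA", "AAB"}
future_case_counts = {"AAC": 120, "ABC": 5}
generated = ["AAA", "AAC", "AAC", "ABD", "ABC"]

print(evaluate(generated, train, future_case_counts))
```

Running the same function on the output of two different models and comparing the resulting counts gives ratios of the kind quoted above.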

As mentioned by the scientists, the study model can be used as a promising platform for generating hypothetical SARS-CoV-2 spike sequences using publicly available resources. In addition, the information obtained from the model could be useful for advance preparation against future pandemic situations.

Written by

Dr. Sanchari Sinha Dutta

Dr. Sanchari Sinha Dutta is a science communicator who believes in spreading the power of science to every corner of the world. She has a Bachelor of Science (B.Sc.) degree and a Master of Science (M.Sc.) degree in biology and human physiology. Following her Master's degree, Sanchari went on to pursue a Ph.D. in human physiology. She has authored more than 10 original research articles, all of which have been published in world-renowned international journals.

Citation

Dutta, Sanchari Sinha. (2023, May 14). Deep generative models to generate hypothetical SARS-CoV-2 spike sequences. News-Medical. Retrieved on November 14, 2024 from https://www.news-medical.net/news/20230514/Deep-generative-models-to-generate-hypothetical-SARS-CoV-2-spike-sequences.aspx.
