Scientists reveal global catalog of microbial small proteins, unlocking microbiome secrets

Study: A catalog of small proteins from the global microbiome. Image Credit: Pakpoom Nunjui / Shutterstock


Mapping the hidden world: Discover how this groundbreaking catalog of nearly one billion small proteins is set to transform our understanding of microbial life.

In a recent study published in the journal Nature Communications, researchers analyzed more than 63,000 metagenomes and almost 88,000 isolate genomes to construct a global microbial small open reading frame (smORF) catalog (GMSC). Using proteogenomics and comparative genomics techniques, the catalog annotates more than 964 million non-redundant smORFs across 75 habitats, roughly 20-fold more than any previous smORF resource.

The researchers also released a publicly available identification and annotation tool, ‘GMSC-mapper,’ which lets future studies characterize their own microbial metagenomic datasets more rapidly and accurately than was previously possible. Finally, the study finds that archaea contain a significantly higher proportion of smORFs than bacteria, suggesting a more complex role for small proteins in archaeal biology and highlighting the substantial small protein diversity in microbiome ecology.

Background

Small open reading frames (smORFs) are short (<100 codons) stretches of DNA that occur frequently across genomes and may encode small proteins. They are found across all three domains of life and are estimated to constitute between 5 and 10% of all annotated genes. Although they were previously dismissed as non-functional ‘junk’ DNA, early prediction models and a growing body of recent studies point to extensive biological roles in stress responses, gene expression, housekeeping functions, signaling pathways, antimicrobial activity, and photosynthesis, particularly in microorganisms.
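To make the length cutoff concrete, here is a minimal Python sketch that scans the forward strand of a DNA string for open reading frames shorter than 100 codons. It is a naive illustration, not the method used in the study: the function name `find_smorfs`, the restriction to ATG start codons, and the choice to count the start codon but not the stop codon are all assumptions made for this example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
MAX_CODONS = 100  # smORF cutoff quoted in the article (<100 codons)


def find_smorfs(dna: str, max_codons: int = MAX_CODONS) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of forward-strand ORFs below max_codons.

    An ORF here runs from an ATG to the first in-frame stop codon.
    Assumption: the codon count includes the start codon, not the stop codon.
    """
    dna = dna.upper()
    hits = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == START:
                # Walk codon by codon until a stop codon or the end of the sequence.
                j = i + 3
                while j + 3 <= len(dna) and dna[j:j + 3] not in STOPS:
                    j += 3
                if j + 3 <= len(dna):  # an in-frame stop codon was found
                    n_codons = (j - i) // 3
                    if n_codons < max_codons:
                        hits.append((i, j + 3))
                    i = j + 3  # continue scanning after this ORF
                    continue
            i += 3
    return hits


print(find_smorfs("CCATGAAATTTTAGCC"))
```

A real gene finder also scans the reverse strand, allows alternative start codons, and scores candidates statistically, which is why short ORFs are hard to call reliably.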

Unfortunately, conventional protein discovery techniques struggle to reliably identify and characterize smORFs from genomic data, so smORFs have been widely neglected in microbiome metagenomic investigations. Recent advances in high-throughput comparative genomics, Ribo-Seq, and proteogenomics have addressed the technical aspects of these challenges. Still, the sheer number of potential smORFs and the risk of false-positive predictions have so far prevented the development of a global smORF database, hampering microbiome-associated research.

“…most of the studies focusing on smORFs approach isolated microorganisms and specific environments. The functional and ecological understanding of microbial smORFs at a global scale across different habitats is still very limited.”

About the study

The present study applies the principle of ‘repeated independent observations’ of highly similar smORF-derived putative peptides to theoretically minimize false-positive smORF predictions, allowing for the development of a global microbial smORF catalog (GMSC). Data for the study was derived from the SPIRE database (63,410 assembled metagenomes) and the ProGenomes2 database (87,920 isolate genomes).
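The idea behind ‘repeated independent observations’ can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the study's pipeline: it matches peptides by exact sequence rather than by high similarity, and the function name `repeatedly_observed` and the threshold of two samples are hypothetical.

```python
from collections import defaultdict


def repeatedly_observed(observations, min_samples=2):
    """Keep peptides seen in at least `min_samples` distinct samples.

    `observations` is an iterable of (sample_id, peptide_sequence) pairs.
    Assumption: exact sequence match stands in for "highly similar".
    """
    samples_by_peptide = defaultdict(set)
    for sample_id, peptide in observations:
        samples_by_peptide[peptide].add(sample_id)
    return {p for p, seen_in in samples_by_peptide.items() if len(seen_in) >= min_samples}


# Made-up toy data: "MKT" appears in two samples, "MAA" twice in the same sample.
toy = [("s1", "MKT"), ("s2", "MKT"), ("s1", "MAA"), ("s1", "MAA")]
print(repeatedly_observed(toy))
```

Counting distinct samples rather than raw occurrences is the point: a spurious prediction is unlikely to recur independently, whereas a duplicate within one sample adds no evidence.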

Reads of ≥60 base pairs (bp) were assembled into contigs using the MEGAHIT 1.2.9 software. These contigs were then passed through a modified Prodigal algorithm to identify smORFs. Putative smORFs were tagged with their habitat microontology (8 categories) using the SPIRE database and with their geographic ranges using the GeoPandas library.

The heuristic Linclust algorithm was then used to build a non-redundant smORF catalog through hierarchical clustering, which also identified single-sequence clusters (singletons). To validate these clusters and prevent smORF duplications, the researchers estimated the rate of false-negative singletons, allowing for those that had biologically meaningful homologous sequences. Finally, to test the quality of the identified smORFs, the researchers carried out extensive in silico quality control (QC) and cross-referenced the results with existing protein sequence databases (RefSeq and human microbiome small protein family datasets). smORFs that passed all QC tests were labeled ‘high quality’.
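The notions of a non-redundant catalog and of singletons can be illustrated with a small greedy clustering sketch in Python. This is not Linclust, which uses a far more scalable k-mer based heuristic. The ungapped identity measure, the 0.9 threshold, and the function names below are assumptions chosen to keep the example short.

```python
def identity(a: str, b: str) -> float:
    """Ungapped, position-wise identity relative to the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(peptides, threshold=0.9):
    """Assign each peptide to the first representative it matches, else open a new cluster."""
    clusters = {}  # representative sequence -> list of member sequences
    # set() removes exact duplicates; longest sequences become representatives first.
    for pep in sorted(set(peptides), key=lambda p: (-len(p), p)):
        for rep in clusters:
            if identity(pep, rep) >= threshold:
                clusters[rep].append(pep)
                break
        else:
            clusters[pep] = [pep]
    return clusters


def singleton_fraction(clusters) -> float:
    """Share of clusters that contain exactly one sequence."""
    singles = sum(1 for members in clusters.values() if len(members) == 1)
    return singles / len(clusters)


# Made-up toy peptides: two near-identical sequences, one unrelated, one exact duplicate.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "MSSSSSSSSS", "MKTAYIAKQR"]
clusters = greedy_cluster(toy)
print(len(clusters), singleton_fraction(clusters))
```

In this toy run the unrelated peptide ends up alone in its cluster. At catalog scale, a singleton may be either a genuine rare sequence or a prediction artifact, which is why the study estimates how many singletons have real homologs.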

To enhance the utility and user-friendliness of the catalog, researchers developed a characterization and annotation tool named ‘GMSC-mapper.’ The tool can scan a supplied (meta)genome and automatically identify and annotate the small proteins (putative peptides) it contains. To validate and demonstrate the utility of the resulting catalog and tool, researchers analyzed archaeal and bacterial genomes from RefSeq and used the tool to compare smORF densities across these two domains of life.

Study findings

Initial results from the Prodigal algorithm identified 2.72 billion potential smORFs, of which 84.7% were classified as ‘singletons.’ Subsequent false-positive screening reduced these putative smORFs to 964,970,496, which make up the GMSC catalog.

Notably, although this catalog of nearly one billion smORFs is ~20-fold larger than any previously assembled, rarefaction analysis suggests that it represents only a fraction of global smORF diversity.
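For readers unfamiliar with rarefaction, the following Python sketch shows the general technique: repeatedly shuffle the samples and record how many distinct smORF families have been seen after each additional sample. The data layout (one set of family IDs per sample), the function name `rarefaction_curve`, and the number of permutations are assumptions for illustration and do not reproduce the study's analysis.

```python
import random


def rarefaction_curve(samples, steps, n_permutations=10, seed=0):
    """Mean number of distinct families seen after drawing k samples.

    `samples` is a list of sets of family IDs; `steps` lists the k values to report.
    """
    rng = random.Random(seed)
    totals = {k: 0 for k in steps}
    for _ in range(n_permutations):
        order = samples[:]
        rng.shuffle(order)
        seen = set()
        for k, sample in enumerate(order, start=1):
            seen |= sample  # accumulate families observed so far
            if k in totals:
                totals[k] += len(seen)
    return {k: totals[k] / n_permutations for k in steps}


# Made-up toy data: four samples, five distinct families in total.
toy_samples = [{1, 2}, {2, 3}, {3, 4}, {5}]
print(rarefaction_curve(toy_samples, steps=[1, 2, 4]))
```

If the curve is still climbing when all samples have been included, more sampling would be expected to reveal further families, which is the pattern the article describes for the GMSC.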

In silico QC and additional database genomic prediction matching revealed 43,642,695 (4.5%) of the GMSC database as ‘high quality.’ Each high-quality prediction was labeled with comprehensive annotations such as taxonomy, habitats, and (if available) biological function.
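The reported share can be checked with a short calculation. The two counts are taken from the article; the variable names are arbitrary.

```python
total_smorfs = 964_970_496   # size of the GMSC reported in the article
high_quality = 43_642_695    # smORFs labeled 'high quality'

share = 100 * high_quality / total_smorfs
print(f"High-quality share: {share:.1f}%")  # prints "High-quality share: 4.5%"
```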

“To assess the comprehensiveness of our catalog, we matched small proteins encoded by GMSC smORFs to the RefSeq database and previously published human microbiome small protein family datasets. Only 5.3% of smORFs in our catalog are homologous to these previously reported small proteins. On the other hand, our catalog contains more than 80% of these reference datasets.”

GMSC-mapper-based smORF density comparisons revealed that archaea contain substantially higher proportions of smORFs than bacteria despite significantly lower sampling (18 archaeal phyla versus 131 bacterial phyla). This discovery raises intriguing questions about small proteins' functional diversity and evolutionary significance in archaea. Unfortunately, given the limitations of the current archaeal metagenomic literature, predictions of the biological functions of smORFs in these lifeforms could not be sufficiently verified.
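A density comparison of this kind can be sketched as follows. The article does not state how density was normalized, so smORFs per megabase pair (Mbp) is an assumption here, as are the function names and the made-up genome figures, which are not results from the study.

```python
from statistics import mean


def smorf_density(n_smorfs: int, genome_length_bp: int) -> float:
    """smORFs per megabase pair (Mbp) of genome; the unit is an assumption."""
    return n_smorfs / (genome_length_bp / 1_000_000)


def mean_density_by_domain(genomes):
    """Average density per domain from (domain, n_smorfs, genome_length_bp) tuples."""
    by_domain = {}
    for domain, n_smorfs, length_bp in genomes:
        by_domain.setdefault(domain, []).append(smorf_density(n_smorfs, length_bp))
    return {domain: mean(values) for domain, values in by_domain.items()}


# Hypothetical numbers for illustration only.
toy_genomes = [
    ("Archaea", 300, 2_000_000),
    ("Bacteria", 400, 4_000_000),
    ("Bacteria", 200, 4_000_000),
]
print(mean_density_by_domain(toy_genomes))
```

Normalizing by genome length matters because archaeal and bacterial genomes differ in size, so raw smORF counts alone would not be comparable.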

Conclusions

The study presents the first global microbial small open reading frame catalog, GMSC version 1 (GMSCv1). The catalog comprises almost 1 billion predicted smORFs, a ~20-fold increase over previous resources. Of these, 43 million smORFs passed QC as ‘high quality,’ all of which have been comprehensively annotated with their respective taxon, potential biological function, geography, and habitat.

Researchers additionally developed and validated an automated annotation tool (GMSC-mapper) capable of screening a (meta)genomic dataset and efficiently characterizing the diversity of smORFs within.

Together, this study's publicly available outcomes provide microbiome researchers with unprecedented data access, allowing for a new era in the severely underexplored field of small protein discovery.

Written by

Hugo Francisco de Souza


Citations

Francisco de Souza, Hugo. (2024, September 02). Scientists reveal global catalog of microbial small proteins, unlocking microbiome secrets. News-Medical. https://www.news-medical.net/news/20240902/Scientists-reveal-global-catalog-of-microbial-small-proteins-unlocking-microbiome-secrets.aspx.
