Study: A catalog of small proteins from the global microbiome. Image Credit: Pakpoom Nunjui / Shutterstock
Mapping the hidden world: Discover how this groundbreaking catalog of nearly one billion small proteins is set to transform our understanding of microbial life.
In a recent study published in the journal Nature Communications, researchers analyzed data from more than 63,000 metagenomes and almost 88,000 isolate genomes to construct a novel global microbial small open reading frame (smORF) catalog (GMSC). The catalog leverages cutting-edge proteogenomics and comparative genomics techniques to comprehensively annotate more than 964 million non-redundant smORFs across 75 habitats, a scale roughly 20-fold greater than any previous smORF work.
Researchers further developed and published a publicly available identification and annotation tool named ‘GMSC-mapper,’ enabling future studies to characterize their microbial metagenomic datasets more rapidly and with substantially greater accuracy than previously possible. Finally, the study finds that archaea contain a significantly higher proportion of smORFs than bacteria, suggesting a more complex role of small proteins in archaeal biology and highlighting the substantial small protein diversity in microbiome ecology.
Background
Small open reading frames (smORFs) are short (<100 codons) stretches of DNA that occur frequently across genomes and may encode putative peptides. They are found across all three domains of life and are estimated to constitute between 5 and 10% of all annotated genes. Previously dismissed as non-functional ‘junk’ DNA, they are now linked by a growing body of early prediction models and recent studies to extensive biological roles in stress responses, gene expression, housekeeping functions, signal pathways, antimicrobial activities, and photosynthesis, particularly in microorganisms.
Unfortunately, conventional protein discovery techniques face substantial challenges in harnessing genomic data to reliably identify and characterize smORFs, resulting in their widespread neglect in microbiome metagenomic investigations. Recent advances in high-throughput comparative genomics, Ribo-Seq, and proteogenomics have addressed the technical aspects of these challenges. Still, the sheer number of potential smORFs and the risk of false-positive smORF predictions have so far restricted the development of a global smORF database, hampering microbiome-associated research efforts.
“…most of the studies focusing on smORFs approach isolated microorganisms and specific environments. The functional and ecological understanding of microbial smORFs at a global scale across different habitats is still very limited.”
About the study
The present study applies the principle of ‘repeated independent observations’ of highly similar smORF-derived putative peptides to theoretically minimize false-positive smORF predictions, allowing for the development of a global microbial smORF catalog (GMSC). Data for the study was derived from the SPIRE database (63,410 assembled metagenomes) and the ProGenomes2 database (87,920 isolate genomes).
Identified reads ≥60 base pairs (bp) were assembled into contigs using the MEGAHIT 1.2.9 software. These contigs were subsequently passed through a modified Prodigal algorithm to identify smORFs. Putative smORFs were tagged with their habitat microontology (8 categories) using the SPIRE database and their geographic ranges using the GeoPandas platform.
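To make the length criterion concrete, the following is a minimal sketch of a naive forward-strand ORF scan that keeps only open reading frames shorter than 100 codons. It is not the modified Prodigal algorithm used in the study, which relies on far richer gene models. The function name, the `min_codons` lower bound, and the toy sequence are assumptions made purely for illustration.

```python
# Naive smORF scan: an illustration of the <100-codon criterion only.
# Real gene finders such as Prodigal also model alternative start codons,
# ribosome binding sites, both strands, and coding-potential scores.

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
MAX_CODONS = 100  # smORF cutoff quoted in the article (<100 codons)


def find_smorfs(seq, min_codons=10):
    """Return (start, end, codon_count) tuples for short forward-strand ORFs.

    codon_count includes the start codon and excludes the stop codon.
    min_codons is a hypothetical lower bound chosen for this sketch.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != START:
                i += 3
                continue
            # Walk codon by codon until an in-frame stop codon appears.
            j = i
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOPS:
                j += 3
            if j + 3 > len(seq):
                break  # no stop codon before the end of the sequence
            codons = (j - i) // 3
            if min_codons <= codons < MAX_CODONS:
                hits.append((i, j + 3, codons))
            i = j + 3  # resume scanning after the stop codon
    return hits


# Toy sequence (hypothetical): one 12-codon ORF starting at offset 2.
toy = "CC" + "ATG" + "GCT" * 11 + "TAA" + "GG"
print(find_smorfs(toy))
```

An ORF of 100 or more codons is skipped by the same function, which is the only property this sketch is meant to show.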
The heuristic Linclust algorithm was then used to construct a non-redundant smORF catalog using a hierarchical clustering approach, thereby identifying single-sequence clusters (singletons). To validate these clusters and prevent smORF duplications, researchers carefully estimated rates of false negative singletons, allowing for those that comprised biologically meaningful homologous sequences. Finally, to test the quality of the identified smORFs, researchers carried out extensive in silico quality control (QC) and cross-referenced the results with preexisting protein sequence databases (RefSeq and human microbiome small protein family datasets). smORFs that passed all QC checks were labeled ‘high quality’.
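As a rough illustration of what identifying singletons involves once clustering is done, the sketch below counts cluster sizes from a list of (representative, member) pairs and reports how many clusters contain a single sequence. The two-column layout is an assumption about how a clustering result might be represented, and the sequence IDs are invented; nothing here reproduces the study's actual pipeline or data.

```python
from collections import Counter


def summarize_clusters(pairs):
    """Summarize (representative, member) pairs from a clustering run.

    Returns (n_clusters, n_singletons, singleton_fraction), where a
    singleton is a cluster whose only member is its representative.
    """
    sizes = Counter(rep for rep, _member in pairs)
    n_clusters = len(sizes)
    n_singletons = sum(1 for size in sizes.values() if size == 1)
    fraction = n_singletons / n_clusters if n_clusters else 0.0
    return n_clusters, n_singletons, fraction


# Toy input with hypothetical sequence IDs (not data from the study).
toy_pairs = [
    ("s1", "s1"), ("s1", "s2"), ("s1", "s3"),  # cluster of three
    ("s4", "s4"),                              # singleton
    ("s5", "s5"), ("s5", "s6"),                # cluster of two
    ("s7", "s7"),                              # singleton
]
print(summarize_clusters(toy_pairs))
```

Under the ‘repeated independent observations’ principle described above, sequences seen only once are the ones that need extra scrutiny, which is why this count matters.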
To enhance the utility and user-friendliness of the catalog, researchers developed a characterization and annotation tool named ‘GMSC-mapper.’ The tool can scan a presented metagenome and automatically identify and annotate small proteins (putative peptides) from within the metagenomic dataset. To validate and demonstrate the utility of the resultant catalog and tool, researchers analyzed archaeal and bacterial metagenomes from RefSeq. They used their novel tool to compare the densities of smORFs across these two domains of life.
Study findings
Initial results from the Prodigal algorithm identified 2.72 billion potential smORFs, of which 84.7% were classified as ‘singletons.’ Subsequent false-positive screening analysis curtailed these putative smORFs to 964,970,496 smORFs, comprising the GMSC catalog.
Notably, although this catalog of nearly one billion smORFs is ~20-fold larger than previous smORF collections, rarefaction analysis suggests that it represents only a fraction of globally available smORF diversity.
In silico QC and matching against existing database predictions classified 43,642,695 (4.5%) of the smORFs in the GMSC database as ‘high quality.’ Each high-quality prediction was labeled with comprehensive annotations such as taxonomy, habitats, and (if available) biological function.
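The reported percentage can be checked directly from the two counts given in this article. The short calculation below uses only those figures; the variable names are illustrative.

```python
# Counts as reported in the article.
total_smorfs = 964_970_496   # non-redundant smORFs in the GMSC catalog
high_quality = 43_642_695    # smORFs that passed all QC checks

share = high_quality / total_smorfs * 100
print(f"High-quality share: {share:.1f}%")
```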
“To assess the comprehensiveness of our catalog, we matched small proteins encoded by GMSC smORFs to the RefSeq database and previously published human microbiome small protein family datasets. Only 5.3% of smORFs in our catalog are homologous to these previously reported small proteins. On the other hand, our catalog contains more than 80% of these reference datasets.”
GMSC-mapper-based smORF density comparisons revealed that archaea contain substantially higher proportions of smORFs than bacteria despite significantly lower sampling (18 archaeal phyla versus 131 bacterial phyla). This discovery raises intriguing questions about small proteins' functional diversity and evolutionary significance in archaea. Unfortunately, given the limitations of the current archaeal metagenomic literature, predictions of the biological functions of smORFs in these lifeforms could not be sufficiently verified.
Conclusions
This study describes the development of the first global microbial small open reading frame catalog, named GMSC version 1 (GMSCv1). The catalog comprises almost 1 billion predicted smORFs, a ~20-fold increase over previous collections. Of these, 43 million smORFs passed QC as ‘high quality,’ all of which have been comprehensively annotated with their respective taxon, potential biological function, geography, and habitat.
Researchers additionally developed and validated an automated annotation tool (GMSC-mapper) capable of screening a (meta)genomic dataset and efficiently characterizing the diversity of smORFs within.
Together, this study's publicly available outcomes provide microbiome researchers with unprecedented data access, allowing for a new era in the severely underexplored field of small protein discovery.