Georgia Tech researchers, working with colleagues at the National Center for Biotechnology Information (NCBI), have released a new version of a genome annotation system capable of analyzing more than 2,000 prokaryotic genomes per day, helping researchers worldwide accelerate prokaryotic genomics-based studies.
In biology, the term prokaryote generally describes a microorganism that lacks a distinct membrane-bound nucleus and has its genetic material contained in a single molecule of DNA. Prokaryotes include bacteria and archaea.
The NCBI operates the Prokaryotic Genome Annotation Pipeline (PGAP), a high-performance software system designed to analyze gene sequences of these microorganisms. As more high-quality genomes become available, and as the cost of sequencing continues to fall, the need for high-throughput analysis and annotation pipelines continues to grow.
The latest advance comes as the NCBI incorporates Georgia Tech's GeneMarkS+ into the PGAP system. Developed by Mark Borodovsky's team at Georgia Tech, GeneMarkS+ is a self-training machine learning tool for novel gene identification that can combine intrinsic evidence revealed by genomic sequence patterns with extrinsic evidence derived from already annotated genomes.
"The new system enables researchers to get critically important analysis that consistently integrates information of all sources of evidence nearly in real time instead of days and weeks," said Borodovsky, a Regents' professor with a joint appointment in the School of Computational Science and Engineering and the Coulter Department of Biomedical Engineering. "Our group is excited to be a part of the whole team working on this project with high international visibility."
Before GeneMarkS+ was integrated into the pipeline, the system could handle only 20 annotations per day, about one-hundredth of the more than 2,000 it can process now.
"Dr. Borodovsky worked closely with Tatiana Tatusova's team at NCBI to incorporate and refine GeneMarkS+ in the context of the NCBI annotation pipeline," said Jim Ostell, chief of NCBI's Information Engineering Branch. "It provides a critical core infrastructure to NCBI and to users of NCBI resources."
PGAP uses GeneMarkS+ in conjunction with proteomic evidence obtained from large groups of orthologous gene clusters representing the core protein complement for well-annotated species. As new organisms are sequenced, PGAP adjusts by mining the existing protein information to build new core protein clusters, iteratively improving its annotation based on the ever-increasing wealth of available evidence from submitted bacterial genomes.
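The idea of weighing sequence-based evidence together with protein evidence can be illustrated with a toy sketch. This is not the GeneMarkS+ or PGAP algorithm; the `Candidate` fields, the linear weighting, and the threshold are all hypothetical choices made only to show how a candidate gene with a weak intrinsic signal might still be called when a known protein supports it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A candidate gene call (hypothetical structure, for illustration only)."""
    name: str
    intrinsic_score: float                  # assumed 0..1 score from sequence patterns
    protein_hit_identity: Optional[float]   # assumed 0..1 identity to a known protein, or None


def combined_score(c: Candidate, extrinsic_weight: float = 0.5) -> float:
    """Blend intrinsic and extrinsic evidence with a simple weighted average.

    With no protein hit, the intrinsic score is used unchanged.
    """
    if c.protein_hit_identity is None:
        return c.intrinsic_score
    return ((1.0 - extrinsic_weight) * c.intrinsic_score
            + extrinsic_weight * c.protein_hit_identity)


def call_genes(candidates: List[Candidate], threshold: float = 0.6) -> List[str]:
    """Return the names of candidates whose combined score meets the threshold."""
    return [c.name for c in candidates if combined_score(c) >= threshold]
```

In this sketch, a candidate with an intrinsic score of 0.4 and no protein hit would be dropped, while the same candidate with a 0.95-identity protein hit would be kept.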
The new system offers a modular structure, permitting easy extension with new algorithms. PGAP also provides extensive tracking of execution and decision making, and thus permits an easy trace-back to understand the evidence behind key algorithmic decisions. The PGAP process is described at http://www.ncbi.nlm.nih.gov/genome/annotation_prok/process/
PGAP produces high-quality annotation designed to meet INSDC standards for sequence submission and follows UniProt naming guidelines. PGAP is available at NCBI for bacterial genomes as part of GenBank sequence submission, making it a valuable resource to researchers worldwide.