A recent review published in Nature assessed the progress and challenges in annotating the human genome, including protein-coding genes, isoforms, and non-coding ribonucleic acids (RNAs), and advocated a universal annotation standard for clinical use.
Background
Initiated in 1990, the Human Genome Project sought to map human deoxyribonucleic acid (DNA) and identify all genes. Although a full DNA sequence was obtained, interpreting it has proved difficult. The genome was originally perceived primarily as a repository of genes, but it is now recognized as a complex web of alternative transcripts, non-protein-coding entities, and regulatory elements. Some RNA molecules even take on roles distinct from their initial function. Further research is needed, because cataloging the genome's many functions and elements remains an open challenge.
Understanding protein-coding genes
Work that began with the Human Genome Project has made significant progress in annotating protein-coding genes. Databases like GENCODE and the Reference Sequence Database (RefSeq) provide evidence for the translation and function of these genes. Resources such as high-quality genome sequences from various species and mass spectrometry data bolster confidence in the accuracy of many protein-coding gene annotations.
Evolving estimates on gene count
Once the DNA had been sequenced, the original goal was to document every protein-coding gene; initial estimates ranged from 50,000 to 100,000 genes. This number has gradually narrowed to just under 20,000 today, with some databases suggesting even fewer. The continuous refinement of the count is attributed to technological advances, rigorous review, and improved data quality. A collaboration known as Matched Annotation from NCBI and EMBL-EBI (MANE) has been instrumental in bringing clarity, with its most recent release listing 19,062 gene loci.
Future directions for gene annotation
Improving gene annotation involves investigating gene transcripts, protein structures, and transcription sites. Challenges arise from RNA-sequencing limitations and genetic variation, which make accurate protein isoform counts elusive. Beyond gene identification, distinguishing pseudogenes (defective gene copies) is another hurdle. Over 14,000 pseudogenes are annotated, varying in their origins and functionalities. Although pseudogenes are defined as defective, recent technological advances suggest that some may be functional, which underscores how nuanced genomic research is.
Overview of non-coding RNA (ncRNA) genes
ncRNA genes encode RNA molecules that are transcribed from DNA and are not translated into proteins but serve essential functions within cells. These ncRNAs can be broadly categorized into long ncRNAs (lncRNAs), which are at least 200 nucleotides long, and shorter ncRNAs, including microRNAs, small nucleolar RNAs, and others. Crucially, an RNA sequence is regarded as an ncRNA gene only if it has a discernible function.
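The length-based split described above can be illustrated with a minimal Python sketch. The function name and the example sequences are hypothetical and chosen only for illustration; the 200-nucleotide threshold is the one stated in the text. Length alone does not make a transcript an ncRNA gene, since a discernible function is also required.

```python
# Minimum length, in nucleotides, for a long ncRNA (threshold taken from the text).
LNCRNA_MIN_LENGTH = 200


def classify_ncrna_by_length(sequence: str) -> str:
    """Return a coarse length class for a non-coding RNA sequence.

    Hypothetical helper: it looks only at length and says nothing
    about whether the transcript has a demonstrated function.
    """
    if len(sequence) >= LNCRNA_MIN_LENGTH:
        return "lncRNA"
    return "short ncRNA"


if __name__ == "__main__":
    # Made-up sequences used only to exercise both branches.
    short_example = "UGAGGUAGUAGGUUGUAUAGUU"  # 22 nt, a microRNA-sized sequence
    long_example = "ACGU" * 100               # 400 nt
    print(classify_ncrna_by_length(short_example))
    print(classify_ncrna_by_length(long_example))
```

A real annotation pipeline would combine this kind of length check with expression and functional evidence rather than rely on it alone.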
Functional determination and challenges
While the roles of protein-coding genes are more readily understood, defining the functions of lncRNAs requires experimental evidence, often obtained from studies that perturb these lncRNAs and observe the resulting molecular phenotypes. Delineating lncRNA function is also harder because of their complex mechanisms and their association with retrotransposons. High-throughput RNA-seq experiments have been pivotal in identifying ncRNA genes, but many such genes are expressed at low abundance, leading to debate over whether they are functionally relevant or merely transcriptional noise.
ncRNA roles and annotation challenges
ncRNAs perform diverse functions, including gene regulation and DNA repair. Yet, their full scope is unclear due to limited database overlap. Annotating ncRNAs is challenging because of restricted dataset sources, overlooked RNA types, and their intricate expression patterns.
Blurring boundaries: coding vs. non-coding
The boundaries between coding and non-coding RNAs are becoming increasingly blurred. Some transcripts initially identified as lncRNAs have been found to encode small peptides, while some protein-coding genes produce non-coding transcript isoforms with demonstrated functionality. Moreover, long-read RNA sequencing reveals that many neighboring genes are connected by read-through transcription events, challenging traditional gene definitions.
Towards functional annotation of ncRNAs
While protein-coding genes benefit from extensive functional evidence and predictive computational methods, ncRNAs remain largely enigmatic. Current goals include documenting evidence supporting ncRNA presence, even if their function remains uncertain.
Although many ncRNAs have been briefly studied, comprehensive functional assays for the growing number of ncRNAs are needed. Unfortunately, the nomenclature of some ncRNAs, often based on adjacent protein-coding genes, can lead to misunderstandings about their actual functions.
Medical importance of gene annotation
Gene annotation is crucial for diagnosing and treating genetic diseases, with the Online Mendelian Inheritance in Man (OMIM) catalog documenting over 5,000 genes associated with single-gene disorders. For instance, the BRCA Exchange database alone identifies over 34,000 variants in the BRCA1 gene, with 2,228 labeled pathogenic. Accurate gene and transcript models are vital in a clinical setting to assess variant pathogenicity. Errors in annotation can lead to misdiagnosis, as when exons missing from the annotation of Cyclin-Dependent Kinase-Like 5 (CDKL5) resulted in a false-negative diagnosis.
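To make the false-negative mechanism concrete, the following minimal Python sketch checks whether a variant position falls inside any exon of a transcript model. All coordinates, names, and models here are hypothetical and do not correspond to CDKL5 or any real gene; the point is only that a variant in an exon absent from the annotation is not reported as exonic.

```python
from typing import List, Tuple

# An exon as a (start, end) pair of 1-based, inclusive coordinates.
Exon = Tuple[int, int]


def variant_in_exons(position: int, exons: List[Exon]) -> bool:
    """Return True if the variant position lies within any annotated exon."""
    return any(start <= position <= end for start, end in exons)


# Hypothetical transcript models for the same gene.
complete_model: List[Exon] = [(100, 200), (300, 400), (500, 600)]
incomplete_model: List[Exon] = [(100, 200), (300, 400)]  # last exon missing

# Hypothetical variant located in the exon that the second model lacks.
variant_position = 550

if __name__ == "__main__":
    print(variant_in_exons(variant_position, complete_model))
    print(variant_in_exons(variant_position, incomplete_model))
```

With the complete model the variant is flagged as exonic; with the incomplete model the same variant is silently skipped, which is how an annotation gap can become a missed diagnosis.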
Clinical annotation standards
Clinical labs often use RefSeq transcripts as references for reporting disease-linked gene variants, with the transcript usually chosen on the basis of the literature. This approach is inconsistent and might not best represent clinical diagnostic needs. The MANE collaboration aimed to address this by launching a universal transcript reference for every protein-coding gene. Still, there is a pressing need to include clinically important ncRNA annotations and regulatory elements in MANE. Furthermore, standardizing genetic variant descriptions ensures clearer mapping to reference genomes.
Transition to new genome references
The older hg19 (GRCh37) genome was superseded by GRCh38 in 2014. These versions differ significantly in terms of gene structure and coordinates. The recently introduced T2T-CHM13 human genome sequence offers more stability in gene coordinates. A promising approach involves creating a pan-genome that represents all human populations, enhancing consistency.
Innovations in gene analysis technologies
Innovative technologies, including long-read sequencing from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), are vital for a comprehensive gene catalog, offering deeper insights into isoform expression despite their error rates. As these technologies advance, precise transcript isoform mapping at cellular resolution becomes feasible. Additionally, capture sequencing provides enhanced coverage for specific RNAs. This makes it easier to study low-expressed transcripts, particularly lncRNAs, and improves our understanding of gene regulation.