In a recent study published in Nature, researchers in the United States aggregated and processed 76,156 human genomes to construct a genomic constraint map named "genomic non-coding constraint of haploinsufficient variation" (Gnocchi) for the whole genome. They found that non-coding constrained regions in the genome were rich in known regulatory elements and variants linked to human traits and diseases. The map could be helpful in improving our understanding of functional genetic variation in the human genome.
Study: A genomic mutational constraint map using variation in 76,156 human genomes.
Background
Advancements in human genomic sequencing provide insights into variation patterns in genes, allowing the direct assessment of negative selection on missense and loss-of-function (LOF) variation through constraint modeling. Here, constraint is defined as the reduction of variation in a gene relative to an expectation based on the gene's mutability. Previous efforts focused on coding regions that represent less than 2% of the genome. As a result, the extensive non-coding genome remains less explored despite its recognized significance in complex human diseases. Applying the gene constraint model to non-coding regions faces challenges due to limited whole-genome data, lack of nucleotide-specific models, overrepresentation of coding regions in mutation analyses, and the complex, heterogeneous mutation rate influenced by local and larger-scale genomic features.
The current methods for evaluating non-coding region constraints include context-dependent mutational models, machine learning classifiers, and phylogenetic conservation scores. However, they have limitations: they overlook regional genomic features, depend on well-characterized mutations, and have reduced power to detect recently selected regions with functional effects on human-specific diseases or traits. To address these limitations, researchers in the present study developed a genome-wide constraint map to identify functional genomic elements (especially in the non-coding space) that are depleted of variation and have potential clinical implications. The map also offers insights into the impact of natural selection on human genetic variation.
About the study
The present study aggregated and reprocessed 153,030 whole genomes from the Genome Aggregation Database (gnomAD) and aligned them to the human genome reference build GRCh38. Ultimately, 76,156 high-quality samples were retained from healthy, unrelated individuals with diverse ancestries. The study identified and used 390,393,900 low-frequency, high-quality single nucleotide variants to construct the genome-wide constraint map. The genome was segmented into contiguous, non-overlapping 1 kb windows, and constraint was quantified for each window by comparing the observed variation with the expected variation. The expected level of variation under neutrality was predicted with a refined mutational model that combined trinucleotide sequence context, regional genomic features, and base-level methylation. The deviation of the observed variation from this expectation was quantified as a "Gnocchi score." For validation, the correlation between the Gnocchi metric and various annotations of functional non-coding sequences was determined. The ability of the Gnocchi score to prioritize non-coding variants was compared with that of other population genetics-based metrics, including Orion, the context-dependent tolerance score (CDTS), the genome-wide residual variation intolerance score (gwRVIS), and depletion rank, using the area under the curve statistic. Further, the constraint for enhancers linked to specific genes was analyzed.
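To make the observed-versus-expected comparison concrete, here is a minimal Python sketch of a per-window constraint score. It assumes a simple Z-like form, (expected - observed) / sqrt(expected), which may differ from the exact formula used in the study; the function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def constraint_z(observed: int, expected: float) -> float:
    """Return a Z-like constraint score for one window.

    Assumed form (not confirmed by the article):
    (expected - observed) / sqrt(expected).
    Positive values mean fewer variants were observed than expected,
    i.e. the window looks more constrained.
    """
    if expected <= 0:
        raise ValueError("expected variant count must be positive")
    return (expected - observed) / math.sqrt(expected)


def score_windows(windows):
    """Score a list of (observed, expected) pairs, one pair per 1 kb window."""
    return [constraint_z(obs, exp) for obs, exp in windows]


# Hypothetical (observed, expected) counts for three windows.
example_windows = [(100, 100.0), (64, 100.0), (121, 100.0)]
scores = score_windows(example_windows)
```

Under this assumed form, a window with exactly as many variants as expected scores zero, a depleted window scores positive, and a window with excess variation scores negative.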
Results and discussion
The Gnocchi score was found to be close to zero for non-coding regions and significantly higher for windows containing coding sequences. About 3.12% and 0.05% of the non-coding windows showed constraint as strong as the 50th and 90th percentile of exonic regions, respectively. A significant positive correlation was found between constraint and functional non-coding annotations, demonstrating the utility of the Gnocchi score in characterizing non-coding regions and providing additional insights. The Gnocchi score was found to perform well against other non-coding metrics, effectively identifying functional variants in the non-coding genome. However, the researchers suggest a combination of metrics would be ideal for prioritizing functional variation. The Gnocchi metric was also found to be useful in prioritizing copy-number variants (CNVs), aiding the interpretation of non-coding risk factors in studies that associate CNVs with diseases. As per the study, enhancers linked to constrained genes were found to be significantly more constrained than those linked to presumably less constrained genes. Further, the study emphasizes the value of non-coding constraint as a complementary metric to gene constraint for identifying functionally important genes.
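The percentile comparison reported above can be illustrated with a short Python sketch: take a percentile of the exonic window scores as a threshold, then count the share of non-coding windows that meet or exceed it. The helper names, the linear-interpolation percentile, and the toy scores are assumptions made for illustration, not the study's data or procedure.

```python
def percentile(values, q):
    """Percentile q (0-100) of values, using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fraction_at_least(noncoding_scores, exonic_scores, q):
    """Share of non-coding windows scoring at or above the q-th exonic percentile."""
    if not noncoding_scores:
        raise ValueError("noncoding_scores must not be empty")
    threshold = percentile(exonic_scores, q)
    hits = sum(1 for s in noncoding_scores if s >= threshold)
    return hits / len(noncoding_scores)


# Toy scores, invented for illustration only.
exonic = [0.0, 1.0, 2.0, 3.0, 4.0]
noncoding = [-1.0, 0.0, 1.0, 2.0, 5.0]
share_50 = fraction_at_least(noncoding, exonic, 50)
share_90 = fraction_at_least(noncoding, exonic, 90)
```

Applied to real per-window scores, the same calculation would yield shares like the 3.12% and 0.05% reported in the study.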
Although the biological impact of mutations in enhancers is less well understood, the researchers suggest that an extended model could provide biologically informed insights into non-coding variation and the molecular mechanisms of selection. While the study uses one of the most extensive datasets of human genomes for the analysis of non-coding constraint, the power and resolution of the approach may improve significantly as sample sizes increase.
Conclusion
In summary, the present study highlights the significance of the genome-wide constraint map in analyzing non-coding regions and protein-coding genes. It marks a crucial advancement towards developing an inclusive catalog of functional elements in the human genome, prompting further research in the area.