In a recent article published in Nature, researchers analyzed 150,119 genome sequences from the United Kingdom Biobank (UKB).
Background
A detailed understanding of how variation in the human genome sequence influences phenotypic diversity requires thorough and accurate characterization of both sequence and phenotypic variation. Over the past ten years, insights into this association have come from whole-genome sequencing (WGS) or whole-exome sequencing (WES) of sizable cohorts with rich phenotypic information.
The UKB records the phenotypic variation of 500,000 people across the UK, with a bias toward healthy participants. The UKB WGS consortium is sequencing each participant's whole genome to a median depth of at least 23.5×.
About the study
In the present study, the researchers describe the WGS analysis of 150,119 UKB participants. Individuals were chosen pseudo-randomly from the pool of UKB volunteers and split across two sequencing sites. The authors stated that this extensive database of variants makes it possible to assess selection from sequence diversity within a population, using a depletion rank (DR) score computed for windows spanning the genome.
Overall, the study reports on the initial data release, which contains a sizable collection of sequence variants from the WGS of 150,119 people, including single-nucleotide polymorphisms (SNPs), short insertions or deletions (indels), structural variants (SVs), and microsatellites.
Variant calling was performed jointly across all participants so that the data could be compared accurately. The resulting dataset offered a rare opportunity to study human sequence diversity and how it affects phenotypic variation.
In addition, the team outlines some of the discoveries made possible by this enormous new WGS data resource that would be difficult or impossible to make using WES and SNP array datasets.
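The idea behind a depletion rank can be illustrated with a minimal sketch. It assumes that each genomic window already has an observed variant count and an expected count from some background model; the function name `depletion_rank` and the way expected counts are supplied are hypothetical, and this is a simplification for illustration, not the procedure used in the paper.

```python
from typing import List


def depletion_rank(observed: List[int], expected: List[float]) -> List[float]:
    """Return a rank in [0, 1] per window; 0 marks the most variant-depleted window.

    Hypothetical simplification: windows are ordered by the ratio of observed
    to expected variant counts. Ties keep their input order.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")

    ratios = [o / e for o, e in zip(observed, expected)]
    # Indices of windows sorted from most depleted (lowest ratio) to least.
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])

    ranks = [0.0] * len(ratios)
    denominator = max(len(ratios) - 1, 1)
    for position, window_index in enumerate(order):
        ranks[window_index] = position / denominator
    return ranks


# Four illustrative windows with the same expected count (made-up numbers).
ranks = depletion_rank(observed=[2, 10, 5, 9], expected=[10.0, 10.0, 10.0, 10.0])
print(ranks)
```

Under this simplified view, windows with a rank close to 0 carry far fewer variants than expected, which is the pattern the authors interpret as a sign of strong sequence conservation.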
Results
The researchers noted that the dataset generated by sequencing the entire genomes of over 150,000 UKB participants was unmatched in scale, offering the most thorough analysis of the sequence heterogeneity in a single population's germline genomes thus far.
The team identified a large number of sequence variants among the WGS participants and reported them in two groups of variant classes: 1) SNPs and indels, and 2) SVs and microsatellites, with the second group often not examined in genome-wide association studies (GWAS). The first group comprises a set of high-quality variants consisting of 585,040,410 SNPs, which make up 7% of all possible human SNPs, and 58,707,036 indels.
The DR analysis reveals that coding exons make up only a small portion of the genomic regions under strong sequence conservation. The authors identified three cohorts within the UKB: a sizable British Irish cohort, a smaller African cohort, and a South Asian cohort.
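The 7% figure can be put in context with a short back-of-the-envelope calculation. The sketch below assumes that every genomic position has three possible alternative bases, so the number of possible SNPs is three times the number of positions considered. The function name is hypothetical, and because the reported fraction is rounded, the result is only approximate.

```python
ALT_ALLELES_PER_SITE = 3  # each reference base can change to one of three others


def implied_genomic_sites(observed_snps: int, fraction_of_possible: float) -> float:
    """Estimate how many positions the denominator covers, given the reported fraction."""
    possible_snps = observed_snps / fraction_of_possible
    return possible_snps / ALT_ALLELES_PER_SITE


# Figures as reported in the article; 0.07 is a rounded value.
sites = implied_genomic_sites(observed_snps=585_040_410, fraction_of_possible=0.07)
print(f"Implied positions considered: {sites:,.0f}")
```

The result, a little under 2.8 billion positions, is of the same order as the size of the human reference genome, which is consistent with the reported fraction.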
The study provides a haplotype reference panel that enables accurate imputation of most variants carried by three or more sequenced individuals. The team also discovered two types of variants ordinarily left out of large-scale WGS analyses: 2,536,688 microsatellites and 895,055 SVs.
Compared with WES of the same individuals, WGS yielded 40-fold more indels and SNPs. Even within annotated coding exons, WES missed 10.7% of the variants discovered by WGS. Most of the remaining genome was not covered by WES, including untranslated regions (UTRs), functionally significant promoter regions, and unannotated exons. The significance of these regions is illustrated by the identification of rare non-coding sequence variants with larger effects on menarche and height than any variants reported in GWAS to date.
Conclusions
Using this powerful new WGS resource, the current research offers numerous examples of trait associations for rare variants with large effects that were not discovered previously in investigations based on WES or imputation.
The scientists anticipate that the DR score discussed in the paper will be a valuable tool for recognizing genomic regions of functional significance. Nevertheless, additional studies are warranted to fully understand its characteristics, its implications, and how it compares with other metrics of conservation and sequence constraint.
While coding exons are subject to strong purifying selection, as indicated by a low DR value, they constitute only a small portion of the regions with a low DR value. The authors noted that the large-scale sequencing described in the present research, together with ongoing efforts to sequence the whole UKB, is expected to significantly advance knowledge of the role and relevance of the non-coding genome.
The current findings should considerably improve the comprehension of the connection between phenotypic variety and human genome variability when coupled with the in-depth analysis of phenotypic variation throughout the UKB.