In a recent study published in Nature Microbiology, researchers used shotgun sequencing to extract human reads from deoxyribonucleic acid (DNA) in the fecal samples of 343 Japanese individuals, who made up the main dataset of the study.
They used this gut metagenome data to reconstruct personal information. Some study participants also provided whole genome sequencing (WGS) data for ultra-deep metagenome shotgun sequencing analysis.
Study: Reconstruction of the personal information from human genome reads in gut metagenome sequencing data.
Background
Knowledge of the human microbiome, the microorganisms inhabiting the human body, has expanded considerably in the last ten years, thanks to rapid advancements in technologies like metagenome shotgun sequencing.
This technology also sequences the non-bacterial component of microbiome samples, including host DNA. For instance, in fecal samples, host DNA accounts for less than 10% of the total, but it is removed to protect the privacy of donors.
The human germline genotype information in metagenome data is substantial enough to enable the re-identification of individuals. Researchers and donors should therefore recognize that it is highly confidential, and sharing it with the community requires careful consideration.
Apart from the ethical concerns related to sharing this data, it is necessary to understand what kind of personal information (e.g., sex and ancestry) could be recovered if human reads in metagenome data are not removed before deposition.
In addition, human reads in gut metagenome data could be a good resource for stool-based forensics, robust variant calling, and polygenic risk score-based estimates of disease risk (e.g., type 2 diabetes).
Since this data could help quantitatively and precisely reconstruct genotype information, it could complement human WGS data.
About the study
In the present study, researchers used the small number of human reads in the gut metagenome data of the main study dataset to reconstruct personal information, including genetic sex and ancestry. To predict the genetic sex and ancestry of these 343 individuals, they used the sequencing depth of the sex chromosomes and a modified likelihood score-based method, respectively.
In addition, the researchers developed methods to re-identify a person from a genotype dataset. Furthermore, they combined two harmonized genotype-calling approaches, the direct calling of rare variants and the two-step imputation of common variants, to reconstruct genotypes.
The main dataset of the study included 343 Japanese participants, whereas the validation dataset for the genetic sex prediction analysis comprised 113 Japanese individuals.
The multi-ancestry dataset, which helped the researchers validate ancestry prediction analysis, comprised 73 individuals of various nationalities, including samples from individuals in New Delhi, India.
The numbers of female and male participants in the three datasets were 196 and 147, 65 and 48, and 25 and 48, respectively. Likewise, the age ranges for these three datasets were 20 to 88, 20 to 81, and 20 to 61 years, respectively.
Results and conclusion
Given that human reads in the gut metagenome data were derived consistently from all chromosomes, the read depth of the X chromosome was nearly double in females relative to males, whereas Y chromosome reads were largely specific to males.
In a logistic regression analysis, the researchers therefore applied a Y:X chromosome read-depth ratio threshold of 0.43 to the validation dataset, which correctly predicted the genetic sex of 97.3% of the samples.
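The read-depth rule can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, depth is approximated as read count divided by chromosome length (GRCh38 lengths are assumed), and the sketch assumes that samples above the 0.43 ratio are called male.

```python
# Illustrative sketch only; names and the depth definition are assumptions.
CHR_X_LEN = 156_040_895  # assumed GRCh38 length of chromosome X
CHR_Y_LEN = 57_227_415   # assumed GRCh38 length of chromosome Y


def yx_depth_ratio(y_reads, x_reads, y_len=CHR_Y_LEN, x_len=CHR_X_LEN):
    """Return the Y:X read-depth ratio, normalising counts by chromosome length."""
    if x_reads <= 0:
        raise ValueError("at least one X-chromosome read is required")
    y_depth = y_reads / y_len
    x_depth = x_reads / x_len
    return y_depth / x_depth


def predict_sex(y_reads, x_reads, threshold=0.43):
    """Call 'male' when the Y:X depth ratio exceeds the threshold, else 'female'."""
    ratio = yx_depth_ratio(y_reads, x_reads)
    return "male" if ratio > threshold else "female"


if __name__ == "__main__":
    print(predict_sex(y_reads=200, x_reads=1000))
    print(predict_sex(y_reads=5, x_reads=1000))
```

In practice, only part of the Y chromosome can be mapped uniquely, so a real analysis would likely restrict the counts to well-mappable regions before computing the ratio.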
In human microbiome and genetic research, the feasibility of sex prediction using human gut metagenome data could help remove mislabelled samples.
The analysis also correctly predicted ancestry in 98.3% of individuals using 1,000 Genomes Project (1KG) data as a reference.
However, the likelihood score-based method often misclassified South Asian (SAS) samples as Admixed American (AMR) or European (EUR), especially when the number of human reads was small. This is understandable because the genetic diversity of the SAS population is complex.
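To make the idea of a likelihood score concrete, here is a simplified Python sketch. It is a generic allele-frequency likelihood, not the modified method used in the study, and the SNP identifiers and allele frequencies are made-up values for illustration only.

```python
import math

# Hypothetical alternate-allele frequencies per reference population.
# These numbers are invented for illustration, not taken from 1KG.
ALT_FREQ = {
    "EAS": {"snp1": 0.90, "snp2": 0.10, "snp3": 0.70},
    "EUR": {"snp1": 0.20, "snp2": 0.60, "snp3": 0.30},
    "SAS": {"snp1": 0.40, "snp2": 0.50, "snp3": 0.40},
}


def log_likelihood(observed, freqs, eps=1e-3):
    """Sum log-probabilities of observed alleles given population frequencies.

    observed: list of (snp_id, allele) pairs, where allele is 'alt' or 'ref'.
    """
    total = 0.0
    for snp, allele in observed:
        # Clamp frequencies so that log(0) can never occur.
        p_alt = min(max(freqs[snp], eps), 1 - eps)
        total += math.log(p_alt if allele == "alt" else 1 - p_alt)
    return total


def predict_ancestry(observed, alt_freq=ALT_FREQ):
    """Return the best-scoring population and the score for every population."""
    scores = {pop: log_likelihood(observed, f) for pop, f in alt_freq.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    reads = [("snp1", "alt"), ("snp2", "ref"), ("snp3", "alt")]
    print(predict_ancestry(reads))
```

With only one or two informative reads, the scores of populations with similar allele frequencies lie close together in this toy model, which is consistent with the observation that misclassification was more common when few human reads were available.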
The likelihood score-based method also made efficient use of data from genomic regions with low coverage, demonstrating the quantitative power of gut metagenome data to re-identify individuals. It successfully re-identified 93.3% of individuals.
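Re-identification against a genotype dataset can be sketched in a similar way. The Python example below is an assumption-laden illustration rather than the study's method: each read is scored against every candidate's genotype with a hypothetical per-read error rate of 1%, and the candidate names and genotypes are invented.

```python
import math


def read_log_likelihood(reads, genotypes, error=0.01):
    """Log-likelihood of the observed read alleles given one person's genotypes.

    reads: list of (snp_id, allele) pairs, where allele is 'alt' or 'ref'.
    genotypes: dict mapping snp_id to the alternate-allele count (0, 1, or 2).
    error: assumed probability that a read reports the wrong allele.
    """
    total = 0.0
    for snp, allele in reads:
        dosage = genotypes.get(snp)
        if dosage is None:
            continue  # site not genotyped for this person, so it is skipped
        alt_fraction = dosage / 2
        p_alt = alt_fraction * (1 - error) + (1 - alt_fraction) * error
        total += math.log(p_alt if allele == "alt" else 1 - p_alt)
    return total


def best_match(reads, panel):
    """Return the candidate with the highest likelihood, plus all scores."""
    scores = {name: read_log_likelihood(reads, g) for name, g in panel.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Invented genotype panel with two candidates.
    panel = {
        "person_A": {"s1": 2, "s2": 0, "s3": 1},
        "person_B": {"s1": 0, "s2": 2, "s3": 1},
    }
    reads = [("s1", "alt"), ("s2", "ref"), ("s3", "alt")]
    print(best_match(reads, panel))
```

Because every read contributes independently to the score, even sites covered by a single read add evidence, which is one way low-coverage regions can still be informative.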
Despite ethical concerns, the re-identification method used in this study could help in the quality control of multi-omics datasets comprising gut metagenome and human germline genotype data.
In addition, the authors successfully reconstructed genome-wide common variants using genomic approaches. Historically, researchers have used stool samples as a source of germline genomes for wild and domestic animals but not for humans.
Thus, further development of suitable methodologies could help efficiently utilize the human genome in gut metagenome data and benefit animal research.
Overall, the study demonstrated that optimized methods could help reconstruct personal information from the human reads in gut metagenome data.
Moreover, the findings of this study could serve as a guiding resource to devise best practices for using the already accumulated gut metagenome data of humans.