In a recent study published in PLOS ONE, researchers developed an integrated approach combining next-generation sequencing (NGS), molecular barcoding, machine learning, and bioinformatics to enable high-throughput detection of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants.
Background
SARS-CoV-2 amino acid substitutions (mutations) give rise to different variants with increased virulence and/or resistance to coronavirus disease 2019 (COVID-19) vaccines. Reverse transcription-polymerase chain reaction (RT-PCR) has been the gold standard for molecular detection; however, the method does not enable the identification of sequence variations in specific genomic locations.
Identifying SARS-CoV-2 infections together with single nucleotide variants (SNVs) requires adapting existing diagnostic techniques so that SARS-CoV-2 mutations can be mapped in a rapid, reliable, real-time, and cost-effective manner. NGS enables variant analysis and lineage tracking; however, it has not yet been considered a standard method for mass screening. Integrating PCR with NGS offers several benefits, such as large-scale testing, lower cost, reduced reagent consumption, and the readout of SARS-CoV-2 sequence variations.
About the study
In the present study, researchers detected SARS-CoV-2 variants using a protocol based on integrating multiplexed PCR, deoxyribonucleic acid (DNA)-barcoding, sample pooling, NGS, machine learning, and bioinformatics analysis at a single nucleotide resolution. While PCR enables SARS-CoV-2 detection, NGS enables detection of sequence variations, machine learning improves the sensitivity and specificity of the technique, and bioinformatics enables data analysis.
Oropharyngeal and nasopharyngeal swabs were obtained from patients (n = 960 specimens). RNA was extracted from the swabs and subjected to RT-PCR analysis, generating amplified complementary DNA (cDNA) of SARS-CoV-2 with >1 SNV in the sequence reads. Subsequently, DNA barcoding, sample pooling, library preparation, NGS-based amplicon sequencing, and machine learning analyses were performed.
The method enabled individually barcoded samples to be pooled together in one well and multiple fragments to be amplified in parallel, so that thousands of samples could be processed simultaneously. A total of 2133 and 21,000 barcodes with 10 nitrogenous bases and 12 nitrogenous bases, respectively, were generated; however, only 96 distinct barcodes were selected for the analysis, and the viral reads were counted for every barcode.
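Counting viral reads per barcode can be illustrated with a minimal Python sketch. It is not the study's pipeline: it assumes, for illustration only, that the barcode sits at the very start of each read, that all barcodes have the same length, and that only exact matches are assigned. The function name and the toy sequences are hypothetical.

```python
from collections import Counter


def count_reads_per_barcode(reads, barcodes):
    """Assign each read to a sample by its leading barcode and count reads.

    Assumptions (not from the study): the barcode is the read prefix,
    all barcodes share one length, and matching is exact.
    """
    known = set(barcodes)
    barcode_len = len(next(iter(known)))
    counts = Counter({bc: 0 for bc in known})
    unassigned = 0
    for read in reads:
        prefix = read[:barcode_len]
        if prefix in known:
            counts[prefix] += 1
        else:
            unassigned += 1  # no known barcode at the start of the read
    return dict(counts), unassigned


# Toy example with made-up 10-base barcodes and reads.
toy_barcodes = ["ACGTACGTAA", "TTGGCCAATT"]
toy_reads = ["ACGTACGTAAGGG", "ACGTACGTAATTT", "TTGGCCAATTCCC", "NNNNNNNNNNAAA"]
per_sample, unmatched = count_reads_per_barcode(toy_reads, toy_barcodes)
```

A real demultiplexer would typically tolerate sequencing errors in the barcode, which is why barcodes are designed to differ from one another by several edits.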
For barcoding, patient-specific barcodes were generated with a sequence-Levenshtein distance of >3 between them and added to the DNA primers. The primer targets for the analysis were the SARS-CoV-2 nucleocapsid 1 (N1), N2, envelope (E), and open reading frame 1 (ORF1) genes. In addition, the human endogenous ribonuclease P (RNaseP) gene was used as an internal control.
In the analysis, 10 genetic libraries were prepared and sequenced using the Illumina NGS system to identify SARS-CoV-2-positive samples and their sequence variations.
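The idea of keeping barcodes a minimum edit distance apart can be sketched as follows. This is a simplified illustration, not the authors' generator: it uses the plain Levenshtein distance rather than the sequence-Levenshtein variant named in the study, and it draws random candidates and keeps them greedily. The function names, the seed, and the attempt limit are all assumptions.

```python
import random


def levenshtein(a, b):
    """Plain Levenshtein (edit) distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def pick_barcodes(length, n_wanted, min_distance, seed=0, max_attempts=10000):
    """Greedily collect random barcodes that are pairwise >= min_distance apart."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(max_attempts):
        if len(chosen) == n_wanted:
            break
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if all(levenshtein(candidate, bc) >= min_distance for bc in chosen):
            chosen.append(candidate)
    return chosen


# A distance of >3 corresponds to min_distance=4 in this sketch.
barcodes = pick_barcodes(length=10, n_wanted=8, min_distance=4)
```

The greedy approach may return fewer barcodes than requested if the attempt limit is reached, so a caller would want to check the length of the result.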
Results
Three viral fragments were sequenced for SARS-CoV-2 detection, and seven single nucleotide SARS-CoV-2 variants were detected after NGS-based sequencing. The observed mutations were compared against SARS-CoV-2 sequences in GenBank using the nucleotide basic local alignment search tool (BLASTn). On screening 960 samples, of which 27% (n=258) were SARS-CoV-2 positive, six known SARS-CoV-2 variants and one novel variant were identified.
Of 258 SARS-CoV-2-positive specimens, 30 contained a common N-gene missense mutation, whereas six specimens also contained a substitution in ORF1a. The number of viral reads in the sample pool negatively correlated with the cycle threshold (Ct) numbers of the PCR analysis.
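A negative relationship between read counts and Ct values could be checked with a correlation coefficient, as in the sketch below. The study's own statistical method is not described here, so the choices are assumptions: Pearson correlation, and a log transform of the read counts on the reasoning that each PCR cycle ideally doubles the template, which would make log read count roughly linear in Ct. The function names and example values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def reads_vs_ct_correlation(read_counts, ct_values):
    """Correlate log10(read count + 1) with Ct; +1 avoids log(0)."""
    log_reads = [math.log10(r + 1) for r in read_counts]
    return pearson(log_reads, ct_values)


# Made-up values: read counts halve with every additional cycle.
example_ct = [18, 22, 26, 30, 34]
example_reads = [2 ** (40 - ct) for ct in example_ct]
r = reads_vs_ct_correlation(example_reads, example_ct)  # strongly negative
```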
The protocol demonstrated 93.3% accuracy, 91.7% precision, 82.5% sensitivity, and 97.3% specificity. When only samples with Ct<30 (for the N-gene) were considered positive, the sensitivity and specificity increased to 100% and 98.5%, respectively, with a 94.7% positive predictive value (PPV). The findings indicated that the diagnostic protocol could accurately detect SARS-CoV-2 and its variants.
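For readers unfamiliar with these metrics, the sketch below shows how they are conventionally derived from a confusion matrix of true/false positives and negatives. The counts in the example are invented for illustration and are not the study's data; the function name is hypothetical, and the code assumes every denominator is non-zero.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts.

    tp/fp: true/false positives; tn/fn: true/false negatives.
    Assumes each denominator below is non-zero.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all calls that were correct
        "precision": tp / (tp + fp),     # same quantity as PPV
        "sensitivity": tp / (tp + fn),   # share of true positives detected
        "specificity": tn / (tn + fp),   # share of true negatives detected
    }


# Invented counts, for illustration only.
metrics = diagnostic_metrics(tp=8, fp=2, tn=85, fn=5)
```

Raising the bar for what counts as a reference positive, such as restricting it to low-Ct samples, changes the tp and fn counts and therefore the sensitivity.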
However, multiplexing the N1 and N2 genes together led to the generation of a non-specific 944 base pair (bp) DNA fragment, since the two corresponding amplicons were situated close to each other. The 944 bp fragment was an elongated product spanning from the forward primer of the N1 gene to the reverse primer of the N2 gene. Since the fragment was carried into the preparation of the genetic libraries and was sequenced alongside all amplicons, it could compete with them during NGS analysis and reduce the number of reads obtained from the DNA fragments under analysis.
Overall, the study findings showed that integrating multiplexed PCR assays, DNA barcoding, sample pooling, NGS, machine learning, and bioinformatics could be an effective diagnostic solution for high-throughput and accurate mass screening for SARS-CoV-2 variant and sequence variation detection.