Outlier detection of SARS-CoV-2 sequences

By Bhavana Kunkalikar. Reviewed by Aimee Molineux. May 19 2022

In a recent study posted to the bioRxiv preprint server, researchers used outlier detection on severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) nucleotide sequences to identify sequences belonging to newly emerging variants.

*Study: Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest. Image Credit: Lightspring/Shutterstock*

This news article was a review of a preliminary scientific report that had not undergone peer review at the time of publication. Since its initial publication, the scientific report has been peer reviewed and accepted for publication in a scientific journal. Links to the preliminary and peer-reviewed reports are available in the Journal references section at the bottom of this article.

The emergence of novel SARS-CoV-2 variants has raised concerns about the currently administered coronavirus disease 2019 (COVID-19) vaccines. Newly emerging variants therefore need to be identified and sequenced promptly.

About the study

In the present study, researchers applied outlier detection to SARS-CoV-2 nucleotide sequences collected before and after the emergence of a novel variant.

The team collected a total of 211,167 SARS-CoV-2 nucleotide sequences. The selected sequences satisfied the following criteria: (1) complete, with a length of at least 29,000 bp; (2) collection date complete, with a full year-month-day collection date; (3) high coverage, with less than 1% N-bases; (4) with patient status, meaning the metadata comprised the patient's age, gender, and clinical status; and (5) low coverage excluded, meaning sequences with more than 5% N-bases were removed. The team also recorded the timestamp of every nucleotide sequence.
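
The selection criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `Record` container, its field names, and the function names are hypothetical, and the thresholds are simply those quoted in the article.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    """Hypothetical container for one sequence and its metadata."""
    sequence: str
    collection_date: Optional[date]  # None if year-month-day is incomplete
    has_patient_status: bool         # age, gender, and clinical status present


def n_fraction(sequence: str) -> float:
    """Fraction of ambiguous 'N' bases in a sequence."""
    if not sequence:
        return 1.0
    return sequence.upper().count("N") / len(sequence)


def passes_filters(rec: Record, min_length: int = 29_000,
                   max_n: float = 0.01) -> bool:
    """Apply the criteria as summarized in the article (assumed reading)."""
    return (
        len(rec.sequence) >= min_length       # complete genome
        and rec.collection_date is not None   # full collection date
        and n_fraction(rec.sequence) < max_n  # high coverage (< 1% N)
        and rec.has_patient_status            # patient metadata available
    )
```

Because the high-coverage threshold (less than 1% N-bases) is stricter than the low-coverage exclusion (more than 5% N-bases), the sketch only needs the former.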

The team investigated whether the sequences of a novel variant could be detected for each of eight SARS-CoV-2 variants: Alpha (B.1.1.7), Beta (B.1.351), Delta (B.1.617.2), Gamma (P.1), GH (B.1.640), Lambda (C.37), Mu (B.1.621), and Omicron (B.1.1.529).

For each variant, the team determined the time point T₁ at which sequences of that variant first appeared in the Global Initiative on Sharing All Influenza Data (GISAID) database, and generated two reference datasets. The first reference dataset was produced from the GISAID sequences with a timestamp before T₁. The second dataset represented the emergence of the novel variant: a timestamp T₂ was determined at which 10% of the variant's sequences were present in GISAID, and the dataset was generated from the sequences with a timestamp up to T₂.
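
A minimal sketch of how the two reference datasets could be built is given below. It assumes each record is a hypothetical `(collection_date, lineage)` pair and reads T₂ as the date by which 10% of the variant's sequences have been collected; the preprint's exact definition may differ.

```python
import math
from datetime import date


def emergence_times(records, variant, fraction=0.10):
    """Return (T1, T2) for `variant`.

    T1 is the earliest timestamp of the variant; T2 is the timestamp of the
    sequence at which `fraction` of the variant's sequences is reached
    (an assumed reading of the article).
    """
    dates = sorted(d for d, lineage in records if lineage == variant)
    if not dates:
        raise ValueError(f"no sequences found for variant {variant!r}")
    t1 = dates[0]
    k = max(1, math.ceil(fraction * len(dates)))
    t2 = dates[k - 1]
    return t1, t2


def reference_datasets(records, variant, fraction=0.10):
    """Split records into the 'before T1' and 'up to T2' reference sets."""
    t1, t2 = emergence_times(records, variant, fraction)
    before_t1 = [r for r in records if r[0] < t1]
    up_to_t2 = [r for r in records if r[0] <= t2]
    return before_t1, up_to_t2
```

Outlier detection can then be run on both sets and the number of flagged sequences compared.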

The team used the alignment tool multiple alignment using fast Fourier transform (MAFFT) to align the sequences to the SARS-CoV-2 reference genome. Each aligned nucleotide sequence was then converted into a binary Hamming sequence that compares it with the viral reference genome. The team also used the Jaccard similarity measure to quantify the similarity between the sequences.
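
The encoding and similarity steps can be illustrated as follows. This is a sketch under assumptions: the binary Hamming sequence is taken to mark positions where an aligned sequence differs from the reference, the function names are hypothetical, and the treatment of gaps, N-bases, and all-zero vectors is a choice made here, not one described in the article.

```python
from typing import List, Sequence


def hamming_indicator(reference: str, aligned: str) -> List[int]:
    """1 where the aligned sequence differs from the reference, else 0."""
    if len(reference) != len(aligned):
        raise ValueError("sequences must be aligned to equal length")
    return [int(a != b) for a, b in zip(reference, aligned)]


def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard similarity of two binary vectors: |x AND y| / |x OR y|."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Convention assumed here: two mutation-free vectors are identical.
    return both / either if either else 1.0


def jaccard_matrix(vectors: Sequence[Sequence[int]]) -> List[List[float]]:
    """Pairwise Jaccard similarities between all binary vectors."""
    n = len(vectors)
    return [[jaccard(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
```

For example, two sequences that share one of two mismatch positions have a Jaccard similarity of 0.5.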

Outlier detection was performed by defining a local environment around every sequence in a principal component plot. The timestamp of the tested sequence was then compared with the distribution of timestamps in that local environment.
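
One way to read this procedure is sketched below. It assumes the principal component coordinates are already computed, uses numeric timestamps (for example, days since a fixed date), and flags a sequence when its timestamp lies more than f standard deviations from the mean timestamp of its neighbours within radius epsilon. The function name, the `min_neighbors` guard, and the exact flagging rule are assumptions, not details confirmed by the article.

```python
import math
from statistics import mean, pstdev


def local_outliers(points, timestamps, epsilon, f=1.2, min_neighbors=2):
    """Return indices of sequences whose timestamp is unusual locally.

    points:     list of (pc1, pc2) coordinates, one per sequence
    timestamps: numeric timestamps aligned with `points`
    epsilon:    radius of the local environment
    f:          number of standard deviations that defines an outlier
    """
    flagged = []
    for i, (p, t) in enumerate(zip(points, timestamps)):
        # Timestamps of all other sequences inside the epsilon environment.
        neighbor_times = [
            timestamps[j]
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= epsilon
        ]
        if len(neighbor_times) < min_neighbors:
            continue  # too few neighbours to estimate a distribution
        mu = mean(neighbor_times)
        sigma = pstdev(neighbor_times)
        if abs(t - mu) > f * sigma:
            flagged.append(i)
    return flagged
```

The pairwise distance scan is quadratic in the number of sequences; a spatial index would be needed at the scale of the full dataset.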

Results

The study results showed that viral genomes in GISAID displayed a specific progression pattern, with the older sequences clustering in the middle of the Jaccard matrix plot and the newer sequences in the bottom part of the plot. The progression ran from the early point cloud through the viral genomes with intermediate timestamps to the newer samples. The team also noted that the genomes of the SARS-CoV-2 Omicron variant were the most comparable to those found in the early stages of the pandemic.

Calibrating the outlier detection on the Omicron sequences produced a two-dimensional elbow plot showing the number of outliers as a function of the local environment and a factor f, which defined the number of standard deviations needed to classify a sequence as an outlier. The number of outliers decreased as f increased, with a sharp decline at f=1.2. The researchers found that f=1.2 was a consistent choice for all the variants.
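
The calibration over f can be illustrated with a small sweep. The sketch assumes each sequence has already been given a local z-score, that is, the absolute deviation of its timestamp from the local mean divided by the local standard deviation; the function names and the "largest drop" rule for locating the elbow are hypothetical simplifications.

```python
def outlier_counts(z_scores, factors):
    """Number of sequences with |z| > f, for each candidate factor f."""
    return {f: sum(1 for z in z_scores if abs(z) > f) for f in factors}


def sharpest_drop(counts):
    """Factor at which the outlier count falls most from the previous one."""
    fs = sorted(counts)
    drops = [(counts[a] - counts[b], b) for a, b in zip(fs, fs[1:])]
    return max(drops)[1]
```

In practice the sweep would be repeated for several sizes of the local environment, giving the two-dimensional picture described above.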

With local outlier detection in an epsilon environment, 19 of the 25 Omicron genomes were detected as outliers. The team also noted that many of the sequences detected in this calibration were not Omicron-related but belonged to the SARS-CoV-2 Delta variant. Moreover, for the SARS-CoV-2 Delta, Beta, GH, and Omicron variants, the number of outliers detected increased substantially after the emergence of the respective variant. For the other variants, the difference in the number of outliers was less substantial. Notably, for the SARS-CoV-2 Gamma variant, the number of outliers detected fell after the variant emerged.

Conclusion

Overall, the study findings showed that outlier detection, drawing on machine learning techniques as well as statistical methods, could serve as an important tool for recognizing newly emerging SARS-CoV-2 variants.

Journal references:

Preliminary scientific report. Hahn, Georg, Sanghun Lee, Dmitry Prokopenko, Jonathan Abraham, Tanya Novak, Julian Hecker, Michael Cho, Surender Khurana, Lindsey R. Baden, Adrienne G. Randolph, Scott T. Weiss, and Christoph Lange. 2022. “Unsupervised Outlier Detection Applied to SARS-CoV-2 Nucleotide Sequences Can Identify Sequences of Common Variants and Other Variants of Interest.” bioRxiv. https://doi.org/10.1101/2022.05.16.492178. https://www.biorxiv.org/content/10.1101/2022.05.16.492178v1.
Peer reviewed and published scientific report. Hahn, Georg, Sanghun Lee, Dmitry Prokopenko, Jonathan Abraham, Tanya Novak, Julian Hecker, Michael Cho, et al. 2022. “Unsupervised Outlier Detection Applied to SARS-CoV-2 Nucleotide Sequences Can Identify Sequences of Common Variants and Other Variants of Interest.” BMC Bioinformatics 23 (1). https://doi.org/10.1186/s12859-022-05105-y. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05105-y.

Article Revisions

May 13 2023 - The preprint on which this article was based was accepted for publication in a peer-reviewed scientific journal. This article was edited accordingly to include a link to the final peer-reviewed paper, now shown in the Journal references section.
