Outlier detection of SARS-CoV-2 sequences

In a recent study posted to the bioRxiv* preprint server, researchers used unsupervised outlier detection to identify severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) nucleotide sequences belonging to newly emerging variants.

Study: Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest. Image Credit: Lightspring/Shutterstock

This news article was a review of a preliminary scientific report that had not undergone peer review at the time of publication. Since its initial publication, the scientific report has been peer reviewed and accepted for publication in a scientific journal. Links to the preliminary and peer-reviewed reports are available in the Sources section at the bottom of this article.

The emergence of novel SARS-CoV-2 variants has raised concerns about the currently administered coronavirus disease 2019 (COVID-19) vaccines. Therefore, the identification and sequencing of newly emerging variants need timely attention.   

About the study

In the present study, researchers applied outlier detection to different SARS-CoV-2 nucleotide sequences before and after the emergence of a novel variant.

The team collected a total of 211,167 SARS-CoV-2 nucleotide sequences. The selected sequences satisfied the following criteria: (1) complete, with a length of at least 29,000 bp; (2) collection date complete, with a full year-month-day collection date; (3) high coverage, with less than 1% N-bases; (4) with patient status, with metadata comprising the patient's age, gender, and clinical status; and (5) low coverage excluded, with sequences having more than 5% N-bases removed. They also collected the timestamp for all the nucleotide sequences.

The team investigated the possibility of detecting the sequence of a novel SARS-CoV-2 variant among the eight SARS-CoV-2 variants, namely, SARS-CoV-2 Alpha (B.1.1.7), Beta (B.1.351), Delta (B.1.617.2), Gamma (P.1), GH (B.1.640), Lambda (C.37), Mu (B.1.621), and Omicron (B.1.1.529) variants.

For each variant, the team determined the time point T1 at which sequences of that variant first appeared in the Global Initiative on Sharing All Influenza Data (GISAID) database and generated two reference datasets. The first reference dataset was produced using the GISAID sequences with a timestamp before T1. The second dataset represented the emergence of the novel variant: a timestamp T2 was determined as the point at which 10% of the variant's sequences had been recorded in GISAID, and the dataset was generated using the sequences with a timestamp up to T2.

The team used an alignment tool called multiple alignment using fast Fourier transform (MAFFT) and the SARS-CoV-2 reference sequence to align the sequences to the reference genome. All the sequences were later converted into a binary Hamming sequence in order to compare the viral reference genome to each of the aligned nucleotide sequences. The team also used the Jaccard similarity measure to explore the similarity of all the sequences.
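The encoding and similarity steps described above can be illustrated with a minimal sketch. It assumes the sequences are already aligned to the same length (for example by MAFFT); the function names and the toy sequences are hypothetical, and the article does not say how gaps or N-bases were treated, so this sketch simply counts any differing character as a mismatch.

```python
from typing import List


def hamming_binary(reference: str, aligned: str) -> List[int]:
    """Return 1 at each position where the aligned sequence differs from the reference."""
    if len(reference) != len(aligned):
        raise ValueError("sequences must be aligned to the same length")
    return [int(r != s) for r, s in zip(reference, aligned)]


def jaccard_similarity(a: List[int], b: List[int]) -> float:
    """Jaccard similarity of two binary vectors: shared 1s divided by positions with any 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two vectors with no mismatches at all are treated as identical (assumption).
    return both / either if either else 1.0


# Toy 10-base sequences (illustrative only, not real SARS-CoV-2 data).
reference = "ACGTACGTAC"
seq_a = "ACGTTCGTAA"  # differs from the reference at positions 4 and 9
seq_b = "ACCTTCGTAC"  # differs from the reference at positions 2 and 4

vec_a = hamming_binary(reference, seq_a)
vec_b = hamming_binary(reference, seq_b)
similarity = jaccard_similarity(vec_a, vec_b)
print(vec_a, vec_b, similarity)
```

Here the two toy sequences share one mutated position out of three mutated positions overall, so their Jaccard similarity is one third.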

Outlier detection was performed by defining a local environment around every sequence present in a principal component plot. The timestamp of the tested sequence was subsequently compared to the distribution of that timestamp in the defined local environment.      
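One way to picture this local comparison is the following sketch. It is not the authors' implementation: it assumes two-dimensional principal component coordinates, a circular local environment of radius epsilon that includes the tested sequence itself, and a rule that flags a sequence when its timestamp lies more than f standard deviations from the local mean. The names `local_outliers`, `epsilon`, `f`, and `min_points` are hypothetical, as are the toy coordinates and timestamps.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def local_outliers(points: Sequence[Point], timestamps: Sequence[float],
                   epsilon: float, f: float, min_points: int = 3) -> List[int]:
    """Return indices whose timestamp deviates from the local mean by more than f std devs."""
    flagged = []
    for i, p in enumerate(points):
        # Local environment: every point within distance epsilon, the point itself included.
        local_times = [timestamps[j] for j, q in enumerate(points)
                       if math.dist(p, q) <= epsilon]
        if len(local_times) < min_points:
            continue  # too few neighbours to estimate a distribution (assumption)
        mu = mean(local_times)
        sigma = pstdev(local_times)
        if abs(timestamps[i] - mu) > f * sigma:
            flagged.append(i)
    return flagged


# Toy data: two well-separated clusters in the principal component plane.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
# Timestamps in days since an arbitrary start; index 4 looks "old" genetically but is recent.
timestamps = [10, 12, 11, 13, 400, 300, 310, 300, 310]

flagged = local_outliers(points, timestamps, epsilon=0.5, f=1.2)
print(flagged)
```

In this toy example only the sequence at index 4 is flagged: it sits among early sequences in the plot but carries a much later timestamp, which is the kind of mismatch the approach relies on.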

Results

The study results showed that viral genomes in GISAID displayed a specific progression pattern, with the older sequences clustering in the middle of the Jaccard matrix plot and the newer sequences in the bottom part of the plot. The progression ran from the early point cloud through the viral genomes with intermediate timestamps to the newer samples. The team also noted that the genomes of the SARS-CoV-2 Omicron variant were the most similar to those found in the early stages of the pandemic.

Calibrating the outlier detection on the Omicron sequences produced a two-dimensional elbow plot showing the number of outliers as a function of the size of the local environment and a factor f, which defined the number of standard deviations required for a sequence to be considered an outlier. The number of outliers fell as the factor f increased, with a sharp decline at f=1.2. The researchers found that f=1.2 was a consistent choice for all the variants.
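The calibration over f can be sketched in one dimension as follows, holding the local environment fixed. The sketch assumes that each sequence has already been assigned a deviation score (its timestamp's distance from the local mean, in standard deviations) and takes the elbow to be the largest drop in outlier count between consecutive values of f. The function names are hypothetical, and the scores are toy values constructed so that the drop lands at 1.2; they are not data from the study.

```python
from typing import Dict, Sequence


def outlier_counts(scores: Sequence[float], factors: Sequence[float]) -> Dict[float, int]:
    """Count, for each factor f, how many deviation scores exceed f."""
    return {f: sum(1 for z in scores if z > f) for f in factors}


def sharpest_drop(counts: Dict[float, int]) -> float:
    """Return the factor at which the count falls most relative to the previous factor."""
    fs = sorted(counts)
    drops = [(counts[prev] - counts[cur], cur) for prev, cur in zip(fs, fs[1:])]
    return max(drops)[1]


# Toy deviation scores in units of local standard deviations (illustrative only).
scores = [0.2, 0.5, 0.9, 1.05, 1.1, 1.15, 1.18, 1.19, 1.3, 2.5, 3.0]
factors = [0.8, 1.0, 1.2, 1.4, 1.6]

counts = outlier_counts(scores, factors)
elbow = sharpest_drop(counts)
print(counts, elbow)
```

The study's plot also varied the size of the local environment; repeating this sweep for several environment sizes would give the second dimension.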

Local outlier detection within an epsilon environment flagged 19 of the 25 Omicron genomes. The team also noted that many of the sequences flagged in this calibration were not Omicron sequences but belonged to the SARS-CoV-2 Delta variant. Moreover, for the SARS-CoV-2 Delta, Beta, GH, and Omicron variants, the number of outliers detected increased significantly after the emergence of that variant. For the other variants, the difference in the number of outliers was less substantial. Notably, for the SARS-CoV-2 Gamma variant, the number of outliers detected decreased after the variant emerged.

Conclusion

Overall, the study findings showed that outlier detection could serve as an important tool to recognize novel emerging SARS-CoV-2 variants using machine learning techniques as well as statistical methods.     

Journal references:

Article Revisions

  • May 13 2023 - The preprint preliminary research paper that this article was based upon was accepted for publication in a peer-reviewed Scientific Journal. This article was edited accordingly to include a link to the final peer-reviewed paper, now shown in the sources section.

Written by

Bhavana Kunkalikar

Bhavana Kunkalikar is a medical writer based in Goa, India. Her academic background is in Pharmaceutical sciences and she holds a Bachelor's degree in Pharmacy. Her educational background allowed her to foster an interest in anatomical and physiological sciences. Her college project work based on ‘The manifestations and causes of sickle cell anemia’ formed the stepping stone to a life-long fascination with human pathophysiology.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Kunkalikar, Bhavana. (2023, May 13). Outlier detection of SARS-CoV-2 sequences. News-Medical. Retrieved on November 24, 2024 from https://www.news-medical.net/news/20220519/Outlier-detection-of-SARS-CoV-2-sequences.aspx.

