In a recent study posted to the medRxiv* pre-print server, a team of researchers developed a phylogenetics-based website to quickly and efficiently identify new severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains in a region.
Global SARS-CoV-2 sequencing efforts have been held back by a lack of advanced phylogenetic and analytical tools. Existing methods for phylogenetic analysis could handle only small, static datasets. They were also too computationally expensive to identify clusters of closely related samples in the ever-expanding datasets of densely sampled pathogens, including SARS-CoV-2.
This news article was a review of a preliminary scientific report that had not undergone peer-review at the time of publication. Since its initial publication, the scientific report has now been peer reviewed and accepted for publication in a Scientific Journal. Links to the preliminary and peer-reviewed reports are available in the Sources section at the bottom of this article.
Even when results were available, they were not readily interpretable for an efficient public health response because intuitive visualization and data exploration tools were lacking. Overall, there is an unmet need for high-throughput tools that quickly interpret the available data so that public health officers can take well-informed action.
About the study
The regional index (C) was the core of the phylogenetically informed summary heuristic developed for the study. It is a weighted summary of the composition of the descendants of a node in a phylogenetic tree, roughly corresponding to whether the virus represented by that node was inside or outside a specific region.
When a descendant leaf is genetically identical to the internal node and lies inside the region of interest, C equals one; otherwise, C equals zero. The researchers applied additional rules to handle cases where C was undefined. The index calculation is not applicable to leaf nodes, for which accurate geographic location metadata is not available.
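The rule above can be illustrated with a small Python sketch. It covers only the one/zero case as summarized in this article; the weighting used in the general case and the rules for undefined values are not described here, so they are left out. The `Node` class, its `mutations` and `region` fields, and the toy tree are hypothetical and are not taken from the researchers' software.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical tree node; field names are assumptions for illustration."""
    name: str
    mutations: int = 0             # mutations on the branch leading to this node
    region: Optional[str] = None   # region label, set only on leaves in this toy example
    children: List["Node"] = field(default_factory=list)


def identical_leaves(node: Node) -> Iterator[Node]:
    """Yield descendant leaves reached through zero-mutation branches only,
    i.e. leaves genetically identical to `node`."""
    for child in node.children:
        if child.mutations != 0:
            continue  # any mutation on the path breaks genetic identity
        if child.children:
            yield from identical_leaves(child)
        else:
            yield child


def regional_index(node: Node, region: str) -> Optional[float]:
    """Simplified C: 1.0 if an identical descendant leaf is in `region`, else 0.0.
    Returns None for leaf nodes, where the index is not applicable."""
    if not node.children:
        return None
    in_region = any(leaf.region == region for leaf in identical_leaves(node))
    return 1.0 if in_region else 0.0


# Toy tree with made-up samples and regions.
leaf_a = Node("a", mutations=0, region="California")
leaf_b = Node("b", mutations=2, region="Texas")
leaf_c = Node("c", mutations=0, region="Texas")
inner = Node("inner", mutations=1, children=[leaf_c])
root = Node("root", children=[leaf_a, leaf_b, inner])

print(regional_index(root, "California"))  # 1.0
print(regional_index(root, "Texas"))       # 0.0
print(regional_index(inner, "Texas"))      # 1.0
```

In this toy tree, the root has an identical leaf in California, so its index for California is one, while both Texas samples sit behind at least one mutation and give the root an index of zero for Texas.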
Using this method, the researchers traced SARS-CoV-2 transmission clusters in 102 countries using the global parsimony phylogenetic tree, built from the 5,563,847 SARS-CoV-2 sequences available on GISAID, GenBank, and COG-UK on 28 November 2021. Cluster sizes were highly skewed, with ~20% of distinct regional clusters containing 89% of samples, suggesting that novel viral introductions do not necessarily lead to the establishment of a new locally circulating strain.
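A skew statistic of this kind can be computed as the share of all samples that fall in the largest fraction of clusters. The Python sketch below shows one way to do that; the function name `top_share` and the cluster sizes are made up for illustration and are not the study's data.

```python
import math
from typing import Sequence


def top_share(cluster_sizes: Sequence[int], fraction: float = 0.2) -> float:
    """Return the share of all samples contained in the largest
    `fraction` of clusters (at least one cluster is always counted)."""
    if not cluster_sizes:
        raise ValueError("cluster_sizes must not be empty")
    ordered = sorted(cluster_sizes, reverse=True)
    n_top = max(1, math.ceil(fraction * len(ordered)))
    return sum(ordered[:n_top]) / sum(ordered)


# Hypothetical cluster sizes: two large clusters and many small ones.
sizes = [500, 300, 50, 40, 30, 20, 20, 20, 10, 10]
print(top_share(sizes))  # 0.8
```

Here the two largest of ten clusters hold 800 of 1,000 samples, which mirrors the pattern reported above: most introductions stay small, and a minority of clusters account for most sequenced cases.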
Findings
Over 50% of the samples in the genome sequence repositories originated from the USA or the UK. This substantially restricted the global transmission analysis, because inferring a cluster’s origin depends on the robustness of sequencing at the origin. Therefore, the researchers focused on the US data, where sequencing across each state was relatively comprehensive and robust, and detailed state-level metadata was available for most samples.
As of November 2021, over 300,000 distinct state-level SARS-CoV-2 infection clusters had been found in the USA since the beginning of the pandemic. Of these, 84% of clusters had an assigned origin, and 7% of clusters had an international origin, with the majority reflecting transmission within the USA. As expected, Mexico and Canada were among the most common international origin regions, given their long land borders. England was also relatively common because it is well-sampled. These findings suggested that the sequencing effort in a given region biases the identification of the origin of new clusters.
The most significant achievement of this work was the development of Cluster-Tracker, an open-source, daily updated website. This website assisted the exploration and prioritization of the latest genome sequences from across the USA, quickly identifying the clusters most likely to be of interest for public health action. Any user could use this website and its flexible backend pipeline to construct a similar site for any set of regions (e.g., at the country level), allowing people to explore SARS-CoV-2 phylogenetic data.
Conclusions
The open-source tools, methodologies, and software package described in the study could prove immensely useful for researchers worldwide. Researchers could quickly draw inferences from vast sequence datasets and explore the geographic structure of specific areas within the global SARS-CoV-2 phylogeny to understand the spread of SARS-CoV-2, and potentially of other densely sampled pathogens. In addition, this analytical approach performed well on simulated data and was congruent with a more sophisticated analysis performed during the pandemic.
More importantly, the researchers presented an accessible open-source interactive interface for their results, which could automatically compute and display introductions and clusters with each update to the global phylogenetic tree.
To summarize, this work will empower public health officers to explore the spread of SARS-CoV-2 across the USA and even support public health groups globally to quickly understand and apply insights obtained from the most recent genomic data.
Article Revisions
- May 10 2023 - The preprint preliminary research paper that this article was based upon was accepted for publication in a peer-reviewed Scientific Journal. This article was edited accordingly to include a link to the final peer-reviewed paper, now shown in the sources section.