A team of researchers from the Istituto Superiore di Sanità (ISS), Italy, has reported an open-source, platform-independent tool for building severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes from raw sequencing reads. The tool requires no extra hardware or software and can be run in any browser on a desktop or mobile device.
*Important notice: bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.
SARS-CoV-2, the causative pathogen of coronavirus disease 2019 (COVID-19), has spread rapidly across the globe, resulting in more than two million deaths. Next-generation sequencing (NGS) technologies have allowed complete genome sequencing of different virus strains, providing estimates of how the virus spreads over time and across geographies.
NGS technologies can generate large numbers of sequences. However, processing and manipulating the data is challenging because of the size of the data sets and because many users lack bioinformatics skills.
Many companies have developed platforms to support different sequencing standards and have made them available to users to a limited extent. However, most sequencing data analysis is done using commercial software that requires licenses, or using in-house command-line pipelines that require bioinformatics skills.
Researchers from the ISS in Rome developed an all-in-one, platform-independent pipeline for reconstructing and analyzing the complete SARS-CoV-2 genome. They collected common command-line tools for SARS-CoV-2 genome reconstruction and analysis into a pipeline and implemented it on the open-source Galaxy instance ARIES.
Open-source tool for SARS-CoV-2 genome analysis
The pipeline, called REconstruction of COronaVirus gEnomes & Rapid analysis (RECoVERY), comprises the following steps: read quality analysis and trimming, subtraction of human sequences, read alignment and mapping against a reference SARS-CoV-2 sequence, variant calling, consensus sequence calling, de novo assembly, identification of open reading frames (ORFs), and variant annotation.
The authors used the genome sequence of the Wuhan-Hu-1 isolate as the reference to build two databases, one containing the complete virus genome and the other containing the ORF annotations. Then, they removed low-quality bases from the imported reads and excluded reads shorter than 30 base pairs.
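The trimming and length-filtering step can be illustrated with a minimal sketch. This is not the pipeline's own code: the function names are hypothetical, the Phred quality cutoff of 20 is an assumption (the report states only the 30-base-pair minimum length), and reads are represented simply as a sequence string paired with a list of per-base quality scores.

```python
MIN_LENGTH = 30   # from the report: reads shorter than 30 bp are excluded
MIN_QUALITY = 20  # assumption: the report does not state a quality cutoff


def trim_read(seq, quals, min_quality=MIN_QUALITY):
    """Trim low-quality bases from both ends of a single read."""
    start, end = 0, len(seq)
    # Advance past low-quality bases at the 5' end.
    while start < end and quals[start] < min_quality:
        start += 1
    # Step back past low-quality bases at the 3' end.
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    return seq[start:end], quals[start:end]


def filter_reads(reads, min_length=MIN_LENGTH, min_quality=MIN_QUALITY):
    """Trim every read, then keep only those still at least min_length long."""
    kept = []
    for seq, quals in reads:
        trimmed_seq, trimmed_quals = trim_read(seq, quals, min_quality)
        if len(trimmed_seq) >= min_length:
            kept.append((trimmed_seq, trimmed_quals))
    return kept
```

In practice this job is usually done by a dedicated read-trimming program; the sketch only shows the logic of trimming first and applying the length filter afterwards, so that a read can be discarded because trimming made it too short.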
After removing human genomic sequences, the team mapped the remaining unaligned reads to the reference SARS-CoV-2 sequence and reconstructed the complete genome sequence using tools developed in-house. When a nucleotide position is not covered by sequencing, or is covered fewer than 30 times, the tool inserts an “N.” They performed coverage analysis using Qualimap 2, annotated ORFs using BLASTn, and annotated variants using SnpEff.
Raw sequencing data generated on the Illumina, Nanopore, and Ion Torrent platforms were obtained from the Sequence Read Archive (SRA). The team then processed the raw data using the pipeline developed in this study and compared the results with those obtained from CLC Genomics Workbench 9.5 and the Genome Detective Virus Tool.
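The rule for inserting an “N” can be made concrete with a small sketch. The in-house tools are not described in the report, so everything here is an assumption made for illustration: the function name is hypothetical, each genome position is represented as a dictionary of base counts, and a simple majority rule picks the base at well-covered positions.

```python
MIN_DEPTH = 30  # from the report: positions covered fewer than 30 times become "N"


def call_consensus(pileup, min_depth=MIN_DEPTH):
    """Build a consensus string from per-position base counts.

    pileup is a list with one dict per reference position, mapping each
    observed base to the number of reads supporting it (assumed format).
    """
    consensus = []
    for counts in pileup:
        depth = sum(counts.values())
        if depth < min_depth:
            # Uncovered or poorly covered position: mask it.
            consensus.append("N")
        else:
            # Assumption: take the most frequently observed base.
            base, _ = max(counts.items(), key=lambda item: item[1])
            consensus.append(base)
    return "".join(consensus)
```

For example, a position supported by 40 reads is called normally, while positions with 10 reads or none at all are both written as “N,” so low-confidence bases never enter the reconstructed genome.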
Tool performs better than commercial software
The researchers found that the genomes built using the pipeline were, on average, about 54 nucleotides longer than those built using CLC and Genome Detective. They also showed fewer nucleotide differences than the genomes built using the other software. This is noteworthy because incorrect or missing nucleotide assignments would make it difficult to study the evolution and distribution of the virus, as most SARS-CoV-2 mutations are single-point mutations. Thus, the developed pipeline shows equal or better performance than available genome reconstruction software.
The pipeline reported in this study is freely accessible through the Galaxy instance ARIES. It provides a user-friendly interface and is fast, reconstructing the complete SARS-CoV-2 genome in less than an hour for data sets of up to 6 million reads. There is no need for separate hardware or software, and the analysis can be run using any desktop or mobile browser after registration on the ARIES homepage. Furthermore, ARIES does not access users’ data.
“The simplicity of use and the production of a comprehensive report with all the variations characterized, make this pipeline a valuable tool particularly for scientists with little or no skill in bioinformatics.”
The analysis is completely automated, and the user interface is designed to require little input from the user. According to the authors, offering the software as an open-source pipeline will help scientists work collaboratively toward crowdsourcing-based advances in understanding the virus.