Comprehensive annotations of the mutational spectra of SARS-CoV-2 spike protein: a fast and accurate pipeline
M. Shaminur Rahman; Md. Rafiul Islam; M. Nazmul Hoque; A. S. M. Rubayet Ul Alam; Masuda Akther; J. Akter Puspo; Salma Akter; Azraf Anwar; Munawar Sultana; Md. Anwar Hossain.
Affiliation
  • M. Shaminur Rahman; Department of Microbiology, University of Dhaka
  • Md. Rafiul Islam; University of Dhaka
  • M. Nazmul Hoque; Bangabandhu Sheikh Mujibur Rahman Agricultural University, Gazipur-1706, Bangladesh
  • A. S. M. Rubayet Ul Alam; Department of Microbiology, University of Dhaka
  • Masuda Akther; Department of Microbiology, University of Dhaka
  • J. Akter Puspo; Department of Microbiology, University of Dhaka
  • Salma Akter; Department of Microbiology, University of Dhaka
  • Azraf Anwar; British Columbia University, Vancouver, Canada
  • Munawar Sultana; Department of Microbiology, University of Dhaka
  • Md. Anwar Hossain; University of Dhaka
Preprint in English | bioRxiv | ID: ppbiorxiv-177238
ABSTRACT
To explore nonsynonymous mutations and deletions in the spike (S) protein of SARS-CoV-2, we comprehensively analyzed 35,750 complete S protein gene sequences from six continents and five climate zones, as documented in the GISAID database as of June 24th, 2020. Using a custom Python-based pipeline for analyzing mutations, we identified 27,801 mutated strains (77.77% of spike sequences) relative to the Wuhan-Hu-1 reference strain. Of these strains, 84.40% had only a single amino-acid (aa) substitution, but an outlier strain from Bosnia and Herzegovina (EPI_ISL_463893) was found to possess six aa substitutions. The D614G variant of the major G clade was predominant across circulating strains in all climates. We also identified 988 unique aa substitutions distributed across 660 positions within the spike protein, with eleven sites showing high variability: each of these sites had four types of aa variation. In addition, 17 in-frame deletions in four major regions (three in the N-terminal domain and one just downstream of the RBD) may have an impact on attenuation. Moreover, the mutational frequency differed significantly (p = 0.003, Kruskal-Wallis test) among the SARS-CoV-2 strains worldwide. This study presents a fast and accurate pipeline for identifying nonsynonymous mutations and deletions from large datasets for any particular protein-coding sequence, and presents the S protein data as a representative analysis. By using separate multiple sequence alignment with MAFFT, removing ambiguous sequences and sequences with in-frame stop codons, and utilizing pairwise alignment, this method can derive nonsynonymous mutations in the form reference residue, position, strain residue (for example, D614G). We believe this will aid in the surveillance of any protein encoded by SARS-CoV-2, and will prove crucial in tracking the ever-increasing variation of many other divergent RNA viruses in the future.
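The filtering and mutation-calling steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`translate`, `is_clean`, `call_mutations`) are hypothetical, the toy sequences are invented for illustration, and the MAFFT and pairwise alignment steps are assumed to have been run upstream, so `call_mutations` expects two already-aligned protein strings in which `-` marks a gap.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence codon by codon ('*' marks a stop)."""
    cds = cds.upper()
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def is_clean(cds):
    """Reject sequences with ambiguous bases, a broken reading frame,
    or a stop codon anywhere before the final codon."""
    cds = cds.upper()
    if len(cds) % 3 or set(cds) - set(BASES):
        return False
    return "*" not in translate(cds)[:-1]


def call_mutations(ref_aln, strain_aln):
    """Compare two aligned protein strings of equal length.

    Returns (substitutions, deletions): substitutions are labels such as
    'D614G' (reference residue, reference position, strain residue);
    deletions are 1-based reference positions missing in the strain.
    Insertions relative to the reference are skipped in this sketch.
    """
    if len(ref_aln) != len(strain_aln):
        raise ValueError("aligned sequences must have equal length")
    subs, dels = [], []
    pos = 0  # position in the ungapped reference
    for r, s in zip(ref_aln, strain_aln):
        if r == "-":
            continue  # gap in reference: insertion in the strain
        pos += 1
        if s == "-":
            dels.append(pos)
        elif s != r:
            subs.append(f"{r}{pos}{s}")
    return subs, dels


if __name__ == "__main__":
    ref_cds = "ATGGATTTTGGTTAA"     # toy reference, not real spike sequence
    strain_cds = "ATGGGTTTTGGTTAA"  # toy strain with one changed codon
    if is_clean(ref_cds) and is_clean(strain_cds):
        ref_prot = translate(ref_cds).rstrip("*")
        strain_prot = translate(strain_cds).rstrip("*")
        print(call_mutations(ref_prot, strain_prot))
```

Because positions are counted only over non-gap reference residues, the reported coordinates stay in the reference frame even when the strain carries deletions, which is what makes labels from different strains comparable.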
License
cc_by_nc
Full text: Available Collection: Preprints Database: bioRxiv Language: English Year: 2020 Document type: Preprint