
Molecular Evolution of the SARS-CoV-2 Virus (Part 2) – Methodology

A quick and effective calculation of Single Nucleotide Variants (SNVs) in SARS-CoV-2 genomes published in GenBank, each aligned to the earliest SARS-CoV-2 genome in the dataset (MN985325.1), can be achieved with some UNIX ingenuity.

The approach generated the graph of average SNVs by collection date (shown at the top of this post), which is discussed in another post.

I first tried Biopython’s Bio.AlignIO to calculate SNVs, but the total number of SNVs detected in the alignment object it returned seemed incorrect to me.

I came up with my own way.

This approach requires common UNIX commands available in the PATH. It also requires:

  • EMBOSS, for the stretcher command
  • GNU Parallel
  • jq
  • yq (version 3, which provides the ‘yq read’ syntax used below)

The data-generating script is written for a modern UNIX box; I have tested it on Linux. The script downloads NCBI’s ‘datasets’ command to collect sequences from GenBank, and UC Santa Cruz’s ‘faSplit’ command to split FASTA files. It can be run on macOS only if the GNU ‘head’ and ‘tail’ commands are available in the PATH; they can be installed with brew, as sketched below.
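
For example, a minimal macOS setup might look like the sketch below. It assumes Homebrew is installed; the gnubin trick is how Homebrew exposes the GNU coreutils under their unprefixed names. EMBOSS and yq version 3 are left out, since their packaging varies.

# Sketch only: GNU coreutils (for GNU head/tail), GNU Parallel and jq via Homebrew.
brew install coreutils parallel jq
# Put the unprefixed GNU tools (head, tail, ...) ahead of the BSD ones in PATH:
export PATH="$(brew --prefix coreutils)/libexec/gnubin:$PATH"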

#!/usr/bin/env bash

# Count the substitutions in one EMBOSS stretcher alignment (markx2 format):
# identities in the sequence-2 lines appear as dots, so deleting dots, gaps,
# Ns and whitespace leaves only the substituted bases.
count_subs() {
    INPUTFILE=${1}
    TRIM_END=50   # ignore the noisy ends of the alignment
    # Pull the two sequence names from the '# 1:' / '# 2:' header lines.
    SEQ1=`egrep '# [0-9]:' ${INPUTFILE} | sed 's/^# //' | yq read - 1`
    SEQ2=`egrep '# [0-9]:' ${INPUTFILE} | sed 's/^# //' | yq read - 2`

    SEQ2_NOVER=`echo ${SEQ2} | cut -d '.' -f1`   # drop the version suffix
    SEQ2_TRIM=${SEQ2_NOVER:0:6}                  # short prefix as printed on the alignment lines
    # Keep the sequence-2 alignment lines, extract the aligned text, trim both
    # ends, delete newlines, spaces, identity dots, Ns and gaps, and count the rest.
    SUBS=`grep ${SEQ2_TRIM} ${INPUTFILE} | egrep -v '^#' | cut -d" " -f2 | head -c -${TRIM_END}  | tail -c +${TRIM_END} | tr -d '\n .N-' | wc -c`

    echo ${SEQ1},${SEQ2},${SUBS}
}

# Fetch the NCBI 'datasets' CLI and download all human, US-collected SARS-CoV-2 genomes.
curl -q -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod a+x datasets
./datasets download virus genome taxon SARS2 --host human --geo-location USA --filename SARS2-hum-USA.zip
unzip SARS2-hum-USA.zip

# Fetch UCSC's faSplit and split the multi-sequence FASTA into one file per genome.
curl -q -o faSplit http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faSplit
chmod u+x faSplit

mkdir sequences

./faSplit byname ncbi_dataset/data/genomic.fna sequences/

# Record the per-genome FASTA file names and set aside the earliest genome as the reference.
pushd sequences ; ls *.fa > ../sequences.txt ; popd

cp sequences/MN985325.1.fa oldest.fasta

mkdir stretcher

# Align every genome to the reference with EMBOSS stretcher, four at a time,
# writing markx2-formatted output for count_subs() to parse.
parallel -j4 stretcher oldest.fasta sequences/{} -gapopen 10 -gapextend 1 -aformat markx2 -out stretcher/{}.stretcher < sequences.txt

echo "accession1,accession2,substitutions" > substitutions.csv

for seq in `cat sequences.txt`; do
    count_subs stretcher/${seq}.stretcher >> substitutions.csv
done

jq -c '{ "accession" : .accession, "collectionDate" : .isolate.collectionDate}' ncbi_dataset/data/data_report.jsonl | jq -s . | \
    jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' > collectionDates.csv

Briefly, the script defines a function, count_subs(), that counts the number of SNVs in the markx2 output generated by the EMBOSS stretcher command; the function is admittedly esoteric. The script then downloads all US-collected SARS-CoV-2 sequences published in GenBank, along with the faSplit command from UCSC. faSplit splits the multi-sequence FASTA file from NCBI into individual FASTA files in the ‘sequences’ directory.
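
Once the split has finished, a quick check along these lines (purely illustrative, not part of the script) confirms that one FASTA file per genome was produced and that the reference genome is among them:

# Illustrative sanity check; the exact count depends on the GenBank snapshot.
ls sequences/*.fa | wc -l          # number of per-genome FASTA files
ls sequences/MN985325.1.fa         # the genome the script copies to oldest.fasta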

The EMBOSS stretcher command is run to align MN985325.1 to each downloaded SARS-CoV-2 genome, and the markx2-formatted alignment output is saved into the stretcher directory. stretcher performs a fast global alignment, using a rapid variant of the Needleman-Wunsch algorithm. For very large genomes, the EMBOSS needle command (which performs a more rigorous alignment) would be more appropriate; for a small ~30 kb genome, stretcher is more than adequate, and it runs a couple of orders of magnitude faster than needle.
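
For a single genome, the alignment step boils down to something like this sketch (the accession is only a placeholder):

# One alignment, spelled out; MT106054.1 stands in for any downloaded genome.
stretcher oldest.fasta sequences/MT106054.1.fa \
    -gapopen 10 -gapextend 1 -aformat markx2 \
    -out stretcher/MT106054.1.fa.stretcher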

The count_subs() function is then run on every stretcher output, generating a CSV file containing the accession pairs and the number of SNVs.
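
The result is easy to inspect from the shell; for example, the following (illustrative) one-liners print the header written by the script and the five genomes with the highest substitution counts:

# Show the CSV header, then the five rows with the largest substitution counts.
head -1 substitutions.csv
tail -n +2 substitutions.csv | sort -t, -k3,3nr | head -5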

Some jq magic is used to generate a CSV file containing accession numbers and collection dates.
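
Before plotting, it is worth checking that every aligned accession also has a collection date. Something like the following (again illustrative; note that the jq @csv output quotes its values) counts the aligned accessions with no date record:

# Count aligned accessions that are missing from collectionDates.csv.
comm -23 \
    <(tail -n +2 substitutions.csv | cut -d, -f2 | sort -u) \
    <(tail -n +2 collectionDates.csv | cut -d, -f1 | tr -d '"' | sort -u) | wc -l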

Time required to run the script to completion:

  • 6.5 hours on an Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (4 cores, 8 threads)
  • 8 hours on an AMD A10-7700K Radeon R7 @ 3.40GHz (4 cores)

The following R program creates the graph at the top:

library(ggplot2)

# Per-genome substitution counts; drop the self-alignment of the reference genome.
subs = read.csv("substitutions.csv", stringsAsFactors=F)
subs = subs[subs$accession2 != "MN985325.1",]

# Collection dates; discard records without a parseable date.
cdates = read.csv("collectionDates.csv", stringsAsFactors=F)

cdates$collectionDate = as.POSIXct(cdates$collectionDate, format="%Y-%m-%d")
cdates = cdates[!is.na(cdates$collectionDate),]

# Join counts to dates and average the SNV count per collection date.
m = merge(subs, cdates, by.x="accession2", by.y='accession')

agg = aggregate( substitutions ~ collectionDate, m, mean)

ggplot(agg, aes(x=collectionDate, y=substitutions)) + geom_line() +
    geom_smooth() + ylab("Average SNV") + xlab("Virus Collection Date")
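
To render the figure non-interactively, the R code can be saved to a file and run from the shell; the file names below are placeholders, and a ggsave() call would need to be appended to write the plot to disk.

# Placeholder names: save the R code as plot_snv.R, append ggsave("snv_vs_date.png"),
# then run it from the shell.
Rscript plot_snv.R
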
  1. O. Tange, GNU Parallel: The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011.
Last modified: November 22, 2020