Written by valerio• November 23, 2020• 9:20 am• Covid-19, SARS-CoV-2

Molecular Evolution of the SARS-CoV-2 Virus – Methodology (Part 1)

Similarity calculations in SARS-CoV-2 genomes published in GenBank (aligned to the oldest SARS-CoV-2 genome published in GenBank) can be achieved easily.

The approach generated the following graph, discussed in another post:

This image has an empty alt attribute; its file name is image-2-1024x819.png

This approach requires common UNIX commands available in the PATH. It also requires:

The data generating script is written for a modern UNIX box. I have tested it on Linux. The script will download the command ‘datasets’ by NCBI, to collect sequence from GenBank; it will also download the command ‘faSplit’ by UC Santa Cruz, to split FASTA files. The script can be run in Mac OSX only if the GNU ‘head’ and ‘tail’ commands are available in the PATH. They can be installed with brew.

sudo apt-get install emboss parallel mg unzip

curl -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod a+x datasets
./datasets download virus genome taxon SARS2 --host human --geo-location USA --filename SARS2-hum-USA.zip
unzip SARS2-hum-USA.zip

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faSplit
chmod u+x faSplit

./faSplit byname ncbi_dataset/data/genomic.fna sequences/

pushd sequences ; ls *.fa > ../sequences.txt ; popd

cp sequences/MN985325.1.fa oldest.fasta

mkdir stretcher

parallel -j4  stretcher oldest.fasta sequences/{} -gapopen 10 -gapextend 1 -out stretcher/{}.stretcher < sequences

echo "seq1,seq2,substitutions" > substitutions.csv

for align in `ls *.stretcher`; do
    ./count_subsitutions ${align} >> substitutions.csv
done

cat stretcher/*.stretcher > complete.stretcher

grep -A 10 '# 1: MN985325.1' complete.stretcher | grep  '# ' | grep -v '# 1: MN985325.1' | sed -s -E 's/# 2: ([a-zA-Z0-9\.]+)/\1:/' | sed -s 's/# /  /' | yq r - -j | jq > complete.json

jq -c '{ "accession" : .accession, "collectionDate" : .isolate.collectionDate}' ncbi_dataset/data/data_report.jsonl | jq -s . | \
    jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' > collectionDates.csv

Briefly, the script downloads all US-collected SARS-CoV-2 sequences, published in GenBank. It downloads the faSplit command from UCSC. faSplit splits the multi-sequence FASTA file from NCBI to individual FASTA files in the ‘sequences’ directory.

The EMBOSS stretcher command is run to align MN985325.1 to each SARS-CoV-2 genome in GenBank. The alignment markx2 formatted output is saved into the stretcher directory. The EMBOSS stretcher command performs fast global alignment, using a rapid version of the Needleman-Wunsch alignment algorithm. For very large genomes, the EMBOSS needle command (which performs a more robust alignment) would have been more appropriate. For a small 30kb genome, stretcher is more than appropriate, and it runs a couple of order of magnitude faster than needle.

All alignment outputs are collected in the file complete.stretcher.

Grepping, sed and some clever usage of yq and jq extract the metadata from the stretcher.complete file to generate stretcher.json

Some jq magic is used to generate a csv file containing accession numbers and collection dates.

Time required to run the script to completion:

6.5 hours on an Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (8 cores)
8 hours on a AMD A10-7700K Radeon R7 @ 3.40Ghz ( 4 cores)

The following R program creates the graph at the top:

library(jsonlite)
library(yaml)
library(ggplot2)

options(repr.plot.width=20, repr.plot.height=16)

complete = data.frame(do.call(rbind, fromJSON("complete.json")))

df = data.frame(
    accession = row.names(complete),
    similarity = as.double(gsub("\\%\\)", "", gsub("[0-9]*/[0-9]* \\(", "", complete$Similarity)))
)

cdates = read.csv("collectionDates.csv", stringsAsFactors=F)

cdates$collectionDate = as.POSIXct(cdates$collectionDate, format="%Y-%m-%d")
cdates = cdates[!is.na(cdates$collectionDate),]

m = merge(df, cdates, by='accession')

agg = aggregate( similarity ~ collectionDate, m, mean)

options(repr.plot.width=20, repr.plot.height=16)
ggplot(agg, aes(x=collectionDate, y=similarity)) + geom_line() + 
geom_smooth() + ylab("Average Similarity to MN985325.1") + xlab("Sequence Collection Date") +
theme(text = element_text(size = 28))