
Molecular Evolution of the SARS-CoV-2 Virus

As described in another post, the daily Case Fatality Rate (CFR) of the Covid-19 pandemic in the US appears to have decreased steadily throughout October 2020.

There are many possible hypotheses for why that should be the case. The two leading hypotheses are:

  • The SARS-CoV-2 virus has been mutating to a less lethal form
  • We (a collective we, a world-wide we) have vastly over-estimated the CFR of the SARS-CoV-2 virus, and also underestimated the coefficient needed to convert crude CFR into adjusted CFR

This post explores whether there is any hint, in molecular data, indicating that the SARS-CoV-2 virus has been mutating over time to a less lethal form.

We know that the SARS-CoV-1 virus turned into a non-lethal virus by acquiring a 29-nucleotide deletion that impaired replication and expression. The deletion, which appeared relatively early in the epidemic, prevented the 2003-2004 SARS-CoV-1 outbreak from becoming a global epidemic. Some of us encountered the mutated form of SARS-CoV-1 as a simple cold. Perhaps those who encountered SARS-CoV-2 with minimal or no symptoms have immune protection due to their encounter with SARS-CoV-1.

We also know that the mutation rate of RNA viruses is so damn high, at least relative to their DNA cousins.

The NCBI/GenBank sequence repository, as of early November 2020, contains 21,167 SARS-CoV-2 sequences published by labs across the US. The oldest published SARS-CoV-2 sequence in terms of collection date (collected in the Wuhan region of China) is also present, as sequence MN985325.1, collected on January 19, 2020.

The general approach of this post is to perform pairwise alignments of each SARS-CoV-2 sequence contained in GenBank with MN985325.1; then, we record the similarity, and we analyze the evolution of similarity to MN985325.1 over time. The detailed approach is described in another post.
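As a rough illustration of the idea, here is a minimal sketch of a global pairwise alignment (Needleman-Wunsch) followed by a similarity score. It is not the pipeline used for this post: the scoring values, the function names, and the definition of similarity (identical columns divided by alignment length) are all assumptions made for the example, and a pure-Python quadratic algorithm like this one is only practical on toy sequences, not on full 30 kb genomes.

```python
def global_align(ref, query, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings.

    The scoring values are illustrative assumptions, not the post's settings.
    """
    n, m = len(ref), len(query)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning ref[i-1] with query[j-1].
        return match if ref[i - 1] == query[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two bases
                score[i - 1][j] + gap,            # gap in the query
                score[i][j - 1] + gap,            # gap in the reference
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a.append(ref[i - 1]); b.append(query[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(ref[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(query[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


def similarity(aligned_ref, aligned_query):
    """Percent of alignment columns where both sequences carry the same base."""
    matches = sum(
        1 for r, q in zip(aligned_ref, aligned_query) if r == q and r != "-"
    )
    return 100.0 * matches / len(aligned_ref)


if __name__ == "__main__":
    # Toy sequences, not real SARS-CoV-2 data.
    ref_aln, query_aln = global_align("ACGTACGT", "ACGTCGT")
    print(ref_aln)
    print(query_aln)
    print(similarity(ref_aln, query_aln))
```

With this definition, every gap column lowers the similarity, so a sequence with unamplified regions scores lower even if it carries no real mutations.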

The results of this analysis are quite surprising:

The graph shows the average similarity of SARS-CoV-2 virus sequenced in the USA per collection date. There are clearly outliers, which we’ll discuss later.
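The per-date averaging behind a plot like this can be sketched as follows. The record layout (a collection date paired with a similarity percentage), the function name, and the values are hypothetical, chosen only to show the grouping step.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def mean_similarity_by_date(records):
    """Group (collection_date, similarity) pairs and average each group."""
    by_date = defaultdict(list)
    for collected, sim in records:
        by_date[collected].append(sim)
    # Sort by date so the result can be plotted as a time series.
    return {day: mean(values) for day, values in sorted(by_date.items())}


if __name__ == "__main__":
    # Made-up values for illustration only.
    toy_records = [
        (date(2020, 3, 1), 99.0),
        (date(2020, 3, 1), 97.0),
        (date(2020, 1, 19), 100.0),
    ]
    for day, avg in mean_similarity_by_date(toy_records).items():
        print(day.isoformat(), avg)
```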

Looking at this graph, one cannot miss the dip in similarity to MN985325.1 in the March through August time-frame. What happened? Was the virus mutating so damn fast? And is it now reverting to a slower mutation rate? Huh?

I have been struggling for quite some time to understand what happened to sequencing over the spring and summer.

These days, most, but not all, Molecular Biology labs around the US use Illumina sequencers.

I have been very lucky to be in touch with Dr. Pascale Leonard and Dr. Anastacia Griego of the Scientific Laboratory Division/Molecular Biology, New Mexico Department of Health. Dr. Leonard’s lab sequences SARS-CoV-2 samples for the purpose of tracing infections, and publishes its sequences in the GenBank repository. I used MW228234.1 (sequenced by Dr. Griego and coworkers) as an example to understand what happened during the spring and summer.

MW228234.1, relative to the original Wuhan sequence MN985325.1, displays:

  • similarity of 98.0%
  • 15 single nucleotide mutations
  • trimmed 5′ and 3′ ends
  • a good-sized deletion of 263 bases at position 27,481

Is that a true deletion, or did something happen during sequencing?
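A breakdown like the one listed above can be read off an aligned pair of sequences. The sketch below is one hedged way to do it: it assumes both strings are already aligned and equally long, with `-` marking gaps, and it treats leading and trailing gaps in the query as trimmed ends rather than deletions. The function name and the returned keys are hypothetical. Note that it only locates gaps; it cannot tell a biological deletion from an amplification failure.

```python
import re


def summarize_differences(aligned_ref, aligned_query):
    """Count trimmed ends, single-base substitutions, and internal gap runs."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")

    n = len(aligned_query)
    lead = n - len(aligned_query.lstrip("-"))   # gap columns at the 5' end
    trail = n - len(aligned_query.rstrip("-"))  # gap columns at the 3' end
    core_end = n - trail

    # Substitutions: columns where both sequences have a base and they differ.
    snps = sum(
        1
        for r, q in zip(aligned_ref[lead:core_end], aligned_query[lead:core_end])
        if r != "-" and q != "-" and r != q
    )

    # Internal gap runs in the query, reported as (1-based reference position, length).
    deletions = []
    for run in re.finditer(r"-+", aligned_query[lead:core_end]):
        start_col = lead + run.start()
        ref_pos = len(aligned_ref[:start_col].replace("-", "")) + 1
        deletions.append((ref_pos, run.end() - run.start()))

    return {
        "trimmed_5prime": lead,
        "trimmed_3prime": trail,
        "snps": snps,
        "deletions": deletions,
    }


if __name__ == "__main__":
    # Toy alignment, not real SARS-CoV-2 data.
    print(summarize_differences("AACGTACGTTT", "--CGAA--TT-"))
```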

Dr. Griego graciously took some time out of her busy days to explain that virus samples are sequenced using protocols published by the ARTIC Network. The ARTIC Network is a worldwide project to standardize sequencing protocols for virus outbreaks. ARTIC has published three versions of the protocol to amplify and sequence the SARS-CoV-2 virus. The first two versions did not fully amplify certain parts of the virus genome, leading to partial sequences. The third version (v3) has since been published, and its latest update came out in August. v3 minimizes the amplification gaps, leading to complete sequences.

At this time, allow me to take a left turn and lead you down a different path. The brief interaction I had with Dr. Leonard and Dr. Griego brought back to my attention something that we tend to forget: while we hide behind our computers at home, complaining about lockdowns between a Zoom call and a Microsoft Teams meeting, there are brave people who go into work every day; they handle samples of a deadly virus, risking infecting themselves, or bringing the infection home, every day. I am sure they work under the strictest safety rules and protocols, for their own protection and the protection of their families; I am sure they observe those protocols meticulously. Still, the thought must be in their minds. They are true heroes. We can number-crunch an R program here and a Python script there, but without their courage there would be no possibility to collect data and advance knowledge. I tip my hat to honor Dr. Leonard and Dr. Griego, and all those around the world who work hard, in the middle of a pandemic, to protect us.

End of the left turn, and back to the main road. What happened over the spring and summer? Likely, most US laboratories were using ARTIC protocols v1 and v2 to sequence their samples. Partial sequences were produced, with gaps where amplification failed. As the v3 protocol became more and more widely adopted, more complete sequences were produced, and the average similarity per collection date started climbing.

And the outliers? The ability to fully sequence a sample varies widely among laboratories. Some labs publish sequences that are 35% or 50% complete; these labs cause the sharp dips you see in the plot above. Others produce far more complete sequences. Worthy of note is the Utah Public Health Laboratory Infectious Disease: they have published clean, 100% complete sequences without gaps, using the ARTIC protocol v3 (October 2020 update) and the dideoxy Sanger sequencing protocol (the dideoxy protocol exposes one to nasty mutagenic chemicals and radioactive tracers). Heroes.
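One simple way to set such outliers aside is an interquartile-range filter (Tukey fences). This is only a sketch of a common technique, not necessarily the rule behind the plots in this post; the function name, the multiplier `k=1.5`, and the sample values are assumptions.

```python
from statistics import quantiles


def drop_outliers(similarities, k=1.5):
    """Keep values within k * IQR of the first and third quartiles.

    Needs at least two values, because statistics.quantiles requires them.
    """
    q1, _, q3 = quantiles(similarities, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [s for s in similarities if low <= s <= high]


if __name__ == "__main__":
    # Made-up similarities: eleven near-complete sequences and one 35% one.
    toy = [99.5, 99.6, 99.7, 99.8, 99.6, 99.7, 99.6, 99.7, 99.5, 99.8, 99.6, 35.0]
    print(drop_outliers(toy))
```

A filter like this works when badly incomplete sequences are rare on a given date. If a large share of that day's sequences are incomplete, the quartiles themselves shift and the fences widen, so little gets removed.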

Even with simple outlier removal, it is still possible to see the spring-summer reduction in sequencing quality:

ARTIC indicates that the new protocol update should permit sequencing coverage well over 99.8%. Yet for samples collected as late as mid-October 2020, the average sequence similarity is still under 99.8%. Is it molecular evolution or sequencing mishaps?

What conclusions can be drawn from these data?

  1. The SARS-CoV-2 virus, as sequenced in Wuhan, China, never made it stateside. Or maybe a few cases did make it stateside, but the virus mutated rapidly.
  2. Sequence data in GenBank is too incomplete to reliably detect potential deletions that might weaken the virus.
  3. Sequence variance across different labs in different states could be caused by lock-downs; reduced travel could have caused segregation of different virus strains in different states, some more lethal, others less so.
  4. GenBank data could be further cleaned by arbitrarily deciding to ignore all gaps/deletions and focusing on the accumulation of single nucleotide mutations. It could be an interesting and fruitful approach.
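The last idea can be sketched as a gap-insensitive identity: compare only the alignment columns where both sequences carry a called base, and count the mismatches. The function name is hypothetical, and skipping `N` (an uncalled base) along with `-` is an extra assumption of this sketch rather than something taken from the analysis above.

```python
def snp_identity(aligned_ref, aligned_query, skip="-N"):
    """Return (snp_count, compared_columns, percent_identity), ignoring gaps.

    Columns where either sequence has a character listed in `skip` are left
    out of both the numerator and the denominator.
    """
    compared = 0
    snps = 0
    for r, q in zip(aligned_ref.upper(), aligned_query.upper()):
        if r in skip or q in skip:
            continue
        compared += 1
        if r != q:
            snps += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return snps, compared, 100.0 * (compared - snps) / compared


if __name__ == "__main__":
    # Toy alignment, not real SARS-CoV-2 data.
    print(snp_identity("ACGTACGT", "AC-TACNA"))
```

With this measure, a sequence that is only half complete but otherwise identical to the reference scores 100%, so the accumulation of point mutations over time is no longer masked by amplification gaps.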

Last modified: November 23, 2020