Sunday, May 10, 2020

When did the virus responsible for COVID-19 jump species?



With the Australian government following the US government in trying to impute culpability to the Chinese handling of the outbreak of COVID-19, and with the Chinese trying to exonerate themselves and, implausibly, shift the blame to the Americans, the exact timing of the jump of the virus responsible for COVID-19 (SARS-CoV-2) from animals to humans has become politically fraught.  Fortunately, some information about that event can be gleaned from phylogenetic data from the virus itself.  The timing cannot be determined exactly, for a number of reasons, but this approach has the advantage of being free from politics and from possible suppression of data.

So far as I have been able to determine from an extensive search of Google Scholar, there has been just one peer-reviewed paper tackling that subject: "Evolutionary history, potential intermediate animal host, and cross‐species analyses of SARS‐CoV‐2" by Li et al (1).  They state in the abstract, "Based on Bayesian time‐scaled phylogenetic analysis using the tip‐dating method, we estimated the time to the most recent common ancestor and evolutionary rate of SARS‐CoV‐2, which ranged from 22 to 24 November 2019...", and more precisely, in the text, "Our results also suggest that the virus originated on 24 November 2019...".  The purpose of this post is to discuss those results in more detail, and what they actually mean.  I will also discuss the results of Andrew Rambaut of the University of Edinburgh (2), which were published directly to the web (and hence subject only to informal peer review).

Updated: 14/5/2020

Methods and Results

Li et al used the BEAST program on 70 SARS-CoV-2 genomes obtained from around the world.  BEAST is a standard software package that uses Bayesian analysis to estimate rates of mutation and the time to the Most Recent Common Ancestor of a set of genomes (3).  They begin by testing all combinations of three methodological choices:  whether the rate of evolution is constant or varies through time (strict versus relaxed clock) (4); whether the a priori Probability Density Function on the rate of evolution is a strictly defined log normal distribution or a "continuous-time Markov Chain (CTMC) reference prior" (constraint dating versus tip dating); and whether the population pool is assumed a priori to be constant in size or growing exponentially (constant versus exponential).
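The set of model combinations described above can be enumerated mechanically.  What follows is a minimal Python sketch; the labels are shorthand chosen for illustration, not BEAST option names, and the code does not run any phylogenetic analysis.

```python
from itertools import product

# Illustrative labels only (assumed shorthand, not BEAST option names).
clock_models = ["strict", "relaxed"]
rate_priors = ["constraint dating", "tip dating"]
tree_priors = ["constant", "exponential"]

# Every combination of the three two-way choices.
combinations = list(product(clock_models, rate_priors, tree_priors))

for clock, rate_prior, tree_prior in combinations:
    print(f"{clock} clock, {rate_prior}, {tree_prior} population")

# The subset that uses the tip-dating rate prior.
tip_dating = [c for c in combinations if c[1] == "tip dating"]
print(f"{len(combinations)} combinations, {len(tip_dating)} with tip dating")
```

Three two-way choices give eight combinations, four of which use tip dating.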

They show their results, using all combinations of the three parameters, in their Table 1:


Li et al observe that using constraint dating, they get results with a very high uncertainty range, and in some cases ranges and means which are inconsistent with known facts.  For instance, using a strict clock model, constraint dating, and an exponential prior, they show an upper bound on the date of the Most Recent Common Ancestor (MRCA) of September 6th, 2019, a result that is absurd given the R0 and the time of first known onset of the disease.  They take this as an empirical disproof of the constraint-dating method in this case, and thereafter consider only the results from tip dating (which I have highlighted in green above).  These yield mean estimates of the date of the MRCA between November 22nd and November 24th.  Their combined uncertainty range extends from October 23rd to December 16th.

Andrew Rambaut (who is a co-author of references (3) and (4)) also uses the BEAST software package.  He ran the software on 75 (February 12th) and 86 (February 24th) genomes, using both constant and exponentially growing population pools, but later dropped the results derived by assuming a constant population because "...recent data strongly supports a model of growth...".  He does not specify his other assumptions.  He reported mean dates of the MRCA of November 29th and November 17th for the 75 and 86 genome data pools respectively, with combined uncertainty intervals ranging from August 17th to December 20th.

Interpreting the Results

It might be tempting to look at the mean dates and assume that they support an earlier origin of the virus than that implied by the official Chinese account, but that would be a mistake.  That is, in part, because all the individual results show a negatively skewed probability distribution.  We know that because in all cases the gap between the lower bound and the mean is larger than that between the upper bound and the mean.  In a distribution with no skew, the mean (ie, the value with the least average squared error in the estimate) is also the median (ie, the value for which there is a 50% chance that the actual value is higher, and a 50% chance that it is lower).  In a skewed distribution, however, they typically do not align, and specifically, in a negatively skewed distribution, the median is typically larger than the mean, as illustrated in this set of graphs from Wikipedia:
What that tells us is that the actual date of the Most Recent Common Ancestor has a better than even chance of being more recent than the various means reported by Li et al and by Andrew Rambaut.  In fact, given the combined results, there is a better than 50% probability that the MRCA of the various strains of SARS-CoV-2 was to be found in the last week of November 2019.
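The relationship between skew, mean and median can be checked numerically.  The following Python sketch uses an assumed toy distribution (a negated gamma distribution with arbitrarily chosen parameters), not the posterior of any of the studies discussed; it only illustrates that a long tail towards earlier dates typically places the mean earlier than the median.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed toy model: days relative to a reference date, with a long tail
# towards earlier (more negative) values. Negating a gamma variate gives
# a negatively skewed distribution.
samples = sorted(-random.gammavariate(2.0, 10.0) for _ in range(100_000))

mean = statistics.fmean(samples)
median = statistics.median(samples)
lower = samples[int(0.025 * len(samples))]  # approximate 2.5th percentile
upper = samples[int(0.975 * len(samples))]  # approximate 97.5th percentile

# Fraction of samples that fall later than the mean.
share_later_than_mean = sum(x > mean for x in samples) / len(samples)

print(f"mean {mean:.1f}, median {median:.1f}")
print(f"lower gap {mean - lower:.1f}, upper gap {upper - mean:.1f}")
print(f"share of samples later than the mean: {share_later_than_mean:.2f}")
```

In this toy case the gap from the lower bound to the mean exceeds the gap from the mean to the upper bound, the median falls later than the mean, and more than half of the samples are later than the mean.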

Clearly this is consistent with the official Chinese account, on which the first known victim had an onset of symptoms on December 1st, and ergo became infected in the last week of November.  However, the uncertainty of the results is sufficient that it does not exclude the only other theory with any (if limited) plausibility.  On that theory, reported by the South China Morning Post, the first patient was actually admitted to hospital on the 17th of November, and hence was infected in the first or second week of November.  As the earliest limit on the date of the MRCA estimated phylogenetically lies in October or August, these data do not exclude that possibility.  (As it happens, I believe that account is rendered very unlikely on other grounds, but that is not relevant to the discussion of these results.)  But though not totally excluded, on these results the alternative account is rendered significantly less likely than the official account.  It is difficult to determine how much less likely, because the alternative account gives only a date of first admission rather than a date of infection, and because I do not know the precise shape of the posterior probability distributions (beyond the fact that they have a negative skew).

There are a few extra complexities to consider.  First, the MRCA of the virus may not have existed in 'patient zero'.  It is possible that only one line of the virus survived after the first few human-to-human infections, so that the MRCA postdated 'patient zero'.  It is also possible that multiple humans were infected from an animal source, so that there was no 'patient zero' and the MRCA of the human form of the virus predates any human infection.  The former possibility pushes the date of first human infection backwards in time relative to the MRCA, while the latter brings it forward.  Neither is likely to do so to any significant degree relative to the range of uncertainty in the original estimate of the timing of the MRCA.  However, they do mean some additional uncertainty in estimating the time of first human infection from the date of the MRCA of the virus.

Update, 14/5/2020:

I had the good fortune today to stumble across another phylogenetic estimate of the date of the MRCA of various strains of SARS-CoV-2, in this case by Kristian Andersen of Scripps (6).  Andersen made an estimate of the time of the MRCA on January 25th based on 27 genomes, with an update on January 28th based on 30 genomes.  Unlike the other estimates I have discussed, Andersen reports the median, and shows the distribution of the results.  He found a median of December 1st, 2019, with a 95% confidence range from October 20th to December 20th.  (In the earlier version, the results were a median of December 2nd, and a confidence interval from October 1st to December 22nd.)

Andersen also used BEAST to conduct the analysis, with settings equivalent to Li et al's strict clock model, tip-dating clock prior, and constant coalescent tree prior (as best I can determine).  With those settings, Li et al found a mean of November 24th, and a 95% confidence interval from October 29th to December 15th.  It is likely that a significant proportion of the difference between the headline dates reported by the two comes down to Andersen choosing to report the median rather than the mean, although the median of Li et al's result is likely to be earlier than December 1st (as their distribution is more compact, bringing mean and median closer together).
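The skew check from the section on interpreting the results can be applied directly to the Li et al figures just quoted.  Here is a minimal Python sketch of the date arithmetic; the variable names are illustrative, and the dates are the ones given in the paragraph above.

```python
from datetime import date

# Li et al: strict clock, tip dating, constant population (dates quoted above).
lower_bound = date(2019, 10, 29)
mean_estimate = date(2019, 11, 24)
upper_bound = date(2019, 12, 15)

# Distance in days from the mean to each end of the interval.
lower_gap = (mean_estimate - lower_bound).days
upper_gap = (upper_bound - mean_estimate).days

print(f"lower bound to mean: {lower_gap} days")
print(f"mean to upper bound: {upper_gap} days")

# A longer lower gap indicates a tail stretching towards earlier dates.
negatively_skewed = lower_gap > upper_gap
print(f"consistent with negative skew: {negatively_skewed}")
```

The lower gap is 26 days and the upper gap is 21 days, so this result fits the pattern of a tail stretching towards earlier dates.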

Unlike the other studies I have discussed in this post, Andersen does publish the density functions for his results.  I have annotated the graph with approximate equivalent dates in the main body of the distribution, and added red marks to indicate the median and the outer limits of the 95% confidence interval, to make the graph easier to read.  The red marks were placed by eye, so should be treated as an approximate guide only.  This graph shows that the distribution has only one mode, and that it does not have any of the unusual features that could cause the median of a negatively skewed distribution to be less than or equal to its mean.  That gives me greater confidence in the bolded sentence in the section on interpreting the results above.






1) Li et al, "Evolutionary history, potential intermediate animal host, and cross‐species analyses of SARS‐CoV‐2", Journal of Medical Virology, 2020.
2) Rambaut, Andrew, "Phylogenetic analysis of nCoV-2019 genomes", http://virological.org/t/phylodynamic-analysis-176-genomes-6-mar-2020/356 (accessed 11/5/2020)
3) Drummond et al, "Bayesian Phylogenetics with BEAUti and the BEAST 1.7", Molecular Biology and Evolution, 2012.
4) Drummond et al, "Relaxed Phylogenetics and dating with confidence", PLOS Biology, 2006.
5) Webber, Florian, "The Coalescent Model", https://cme.h-its.org/exelixis/web/teaching/seminar2016/Example1.pdf (accessed 11/5/2020)
6) Andersen, Kristian, "Clock and TMRCA based on 27 genomes" (accessed 14/5/2020)
