New covid analysis – a critique

The headline of the Daily Mirror caught my eye on the BBC news website this morning, where it claimed that nearly one third of the UK population had had Covid-19. That was an eye opener, and I really wanted to try and understand where this figure came from, so I tracked down and studied the original scientific paper. It’s an interesting analysis (and simple enough for me to understand) but I’ve quite a few questions, and so I’m not convinced. I’ve put this blogpost together because, for what ended up being a major UK newspaper headline, there are (imo) significant questions. In this post, I hope to argue that blog readers should be very cautious about accepting the final Daily Mirror headline.

I should say that I believe that published scientific findings mark the start, and not the end, of a discussion. (Although that viewpoint appears to run counter to what a lot of scientific papers usually state.) Therefore, I won’t say that other scientists are wrong, only that I have another version of their narrative. My narrative may be, or may not be, compelling: that is for you – reader – and the interested community to decide.

Just two ‘housekeeping points’ before I really get into it. I’ve written this post in fonts of two colours: the normal (black) font is for general readers and the red font is for those who want to dig into some detail. I would also have liked to have shown the graphs from the paper in this post, but they are protected by copyright (sorry about that, but the paper is open access so you can access it at the link above).

What I agree with in the paper

The paper has an interesting way of looking at the data: if it can be shown that the greater the number of infections the faster Covid-19 spreads, then the lower the number of infections, the slower the virus spreads.

The paper proposes that the number of people infected with Covid-19 could have doubled every three days before lockdown, but could this really lead to over a quarter of the UK population having the virus? For me the answer is a slim ‘possibly’: if the first Covid-19 case was found in the UK on 31st January, and the infection numbers doubled every 3 days, a third of the UK population could, maybe if all the conditions were correct, be infected by now.
Wikipedia reports that the first case in the UK was on 31st January, and I’ll assume that for the one case observed there were between 10 and 100 undiagnosed Covid-19 cases in the community. To get to 18 million infected people (30% of the UK population) would then take between 17.5 and 21 doublings. At three days per doubling that gives between 52 and 63 days, which means that somewhere between 23rd March and 3rd April 30% of the UK population could have been infected. (Note, this assumes that Covid-19 infections grew exponentially throughout, whereas my SIR model suggests that the doubling time would have increased to about 4 or 5 days by the time the percentage had got to about 25%, but I won’t worry about that just now.) The UK went into lockdown in stages starting on 20th March, and the authors’ model shows that it took another 2 weeks for the infection rate to fall to around 1. So it could be possible for 30% of the UK population to be infected by mid-to-end April.
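The doubling arithmetic above can be checked with a short sketch. It assumes pure exponential growth with a fixed 3-day doubling time, a UK population of 60 million, and my assumed starting points of 10 and 100 undiagnosed cases; the function and constant names are just for illustration.

```python
import math
from datetime import date, timedelta

UK_POPULATION = 60_000_000   # population figure used in this post
TARGET_FRACTION = 0.30       # 30% of the population
DOUBLING_DAYS = 3            # doubling time proposed for pre-lockdown spread
START = date(2020, 1, 31)    # first reported UK case (per Wikipedia)

def days_to_reach(target, seed_cases, doubling_days=DOUBLING_DAYS):
    """Return (doublings, days) needed to grow from seed_cases to target."""
    doublings = math.log2(target / seed_cases)
    return doublings, doublings * doubling_days

target = UK_POPULATION * TARGET_FRACTION  # 18 million people
for seed in (100, 10):  # assumed undiagnosed cases on 31st January
    doublings, days = days_to_reach(target, seed)
    reached = START + timedelta(days=round(days))
    print(f"seed={seed}: {doublings:.1f} doublings, {days:.0f} days, around {reached}")
```

With these assumptions the calculation gives roughly 17.5 and 20.8 doublings, so the 52 to 63 day window above comes from rounding the upper figure up to 21 doublings.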

Questions I have about this paper

One of the questions I have is the relationship between the infection rate and the number of people who become infected. The paper seems to suggest that the spread of the disease stops when everyone has been infected. That’s not true for all diseases. In fact, when Ro values are <1 (as this paper suggests) the virus dies out faster than it spreads, and so a large proportion of the population isn’t infected at all. With an Ro value of 1.1, only about 20% of a population would catch the disease (see the red text below for an explanation). This is relevant for two reasons: firstly, the authors of this paper suggest that the Ro value is now 0.8 and so the disease is vanishing from the population (by their calculations), and not spreading to the whole population; and secondly, the authors use the whole UK population in their calculations because they argue that the disease can only stop spreading when everyone has been infected. Different parts of the paper use different parts of this argument.
The SIR model indicates that the proportion of a population that becomes infected is very dependent upon the infection rate. I have derived the equation below from equation 14 in this reference.
Ro = ln(Se) / (Se - 1) and Se + Ne = 1
Ro = -ln(1 - Ne) / Ne
Ro = Ro value (see my previous post for an explanation)
Se = Proportion of people still susceptible at the end of the epidemic.
Ne = Proportion of people infected by the end of the epidemic.
When you plot Ne (the proportion of people infected) for different values of Ro you get a curve that starts at zero for Ro = 1 and climbs towards 1 as Ro increases.
So at low Ro values (less than about 1.4) less than half the population becomes infected, because the disease dies out faster than it can spread.
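For anyone who wants to reproduce that curve, here is a minimal sketch that solves Ro = -ln(1 - Ne) / Ne for Ne numerically. The bisection approach and the function names are my own choices for illustration, not something taken from the paper or the reference.

```python
import math

def r0_from_final_size(ne):
    """Ro implied by a final infected proportion ne (0 < ne < 1)."""
    return -math.log1p(-ne) / ne

def final_size(r0, tol=1e-10):
    """Solve Ro = -ln(1 - Ne) / Ne for Ne by bisection."""
    if r0 <= 1.0:
        return 0.0  # the disease dies out before a large outbreak develops
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # r0_from_final_size increases with ne, so move towards the target
        if r0_from_final_size(mid) < r0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

for r0 in (0.8, 1.1, 1.5, 2.0, 3.0):
    print(f"Ro = {r0}: final proportion infected = {final_size(r0):.2f}")
```

With Ro = 1.1 this gives a final infected proportion of about 0.18, close to the 20% quoted above, and half the population is reached at an Ro of about 1.39.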

In this paper the authors propose that the relationship between the total number of cases and the infection rate can be expressed as a mathematical equation, and so (the authors argue) when the infection rate becomes zero the total population of the UK (60 million) will all have been infected. If you then use this equation to calculate the number of cases reported when the infection rate becomes zero, you find that only 6.6 cases are reported per thousand people. Therefore the 100,000 or so reported infections right now (at the end of April, beginning of May 2020) would actually be 18 million (or 30% of the UK population). But this all depends on the accuracy of that mathematical relationship.

Let’s roll back a bit and look at the assumption that there is a relationship between the number of cases in a population and the rate of new cases. If you go to Figure 3 in the paper it does seem to propose that, because the red dots are clustered to one side and the green dots are more spread out, a high number of cases indicates a low rate of Covid-19 spread. However, I have three issues with this plot: firstly, the high case regions (shown in red) are plotted ‘at the front’ of the diagram, so it becomes difficult to see how many low case regions (green and light green) are behind the high case dots; secondly, the association needs to be made by ‘eye’ (qualitatively) rather than quantitatively; thirdly, if you removed 3 or 4 of the ‘green cases’ with high infection rates (as outliers) I’m not sure if the conclusions would have been as convincing.

Moving onto Figure 4, where I think the main conclusions of the paper come from. When the mathematical relationship between the total number of cases and the infection rate is calculated, it gives an R-squared value of 0.22 (which is really quite low). Figure 4 has a wide spread of results (again if 3 or 4 outliers were removed from the analysis then the results would be significantly different). The authors state in the introduction section:

“…that only one factor, the total reported cases /1,000 population is significantly associated with the daily infection rate.”

However, with an R-squared value of 0.22 the fitted relationship only accounts for 22% of the variation in the data, leaving the rest to other external factors (which might include the various social, geographical and biological forces), so I can’t agree with the statements of ‘only one factor’ or ‘significantly associated’.
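To make the R-squared point concrete, here is a small sketch of how it is calculated for a straight-line fit, and how sensitive it is to dropping a single point. I don’t have the paper’s data, so xs and ys stand in for any list of paired values (for example cases per thousand and infection rate by region); the function names are mine.

```python
def r_squared(xs, ys):
    """R-squared of a straight-line least-squares fit of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # For a simple linear fit, R-squared is the squared correlation
    return (sxy * sxy) / (sxx * syy)

def leave_one_out(xs, ys):
    """R-squared recalculated with each point removed in turn."""
    results = []
    for i in range(len(xs)):
        kept_x = xs[:i] + xs[i + 1:]
        kept_y = ys[:i] + ys[i + 1:]
        results.append(r_squared(kept_x, kept_y))
    return results
```

If the values returned by leave_one_out swing widely around the full-data R-squared, the fit is leaning heavily on a few points, which is the worry I have about the outliers in Figure 4.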

Before moving on, I need to explain that when we use these sorts of ‘mathematically fitted’ equations, it’s really useful to understand what the ‘range’ of values in the equation might be. In this paper the authors have used the intercept to make predictions about what happens to the number of accumulated infections when the infection rate becomes zero. However, they haven’t stated what the range of values (in their final equation) might be from their data: they make one statement, of the ‘best fit’, and calculate everything else from there. This is such an important part of the publicised conclusion of this paper (by which I mean the Daily Mirror headline and the University of Manchester press office) that the intercept value needs a range. Without the data that formed this plot (and I can’t find any links to their data files on the journal website) it’s difficult to rebuild it and do the calculations myself, but by eye I suspect that the value at which the infection rate becomes zero can be anywhere between 5 and 25 per thousand.
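If the data were available, one way to put a range on that value would be a bootstrap, sketched below. This is not the authors’ method: it assumes a straight-line fit of infection rate (ys) against reported cases per thousand (xs), and both lists are placeholders for the points behind Figure 4, which I don’t have.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def zero_crossing(xs, ys):
    """Value of x at which the fitted line reaches y = 0."""
    a, b = fit_line(xs, ys)
    return -a / b

def bootstrap_range(xs, ys, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the zero crossing."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: no slope can be fitted
        a, b = fit_line(bx, by)
        if b == 0:
            continue  # flat line never reaches zero
        estimates.append(-a / b)
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (n_resamples - 1))]
    upper = estimates[int((1.0 - tail) * (n_resamples - 1))]
    return lower, upper
```

Calling bootstrap_range(xs, ys) would return a lower and upper value for the zero crossing. With a fit as scattered as an R-squared of 0.22 implies, I would expect that interval to be wide, but that is a guess until someone runs it on the real data.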

So they are arguing that the infection rate goes to zero only when everyone in the UK has been infected, and their equation says that the infection rate goes to zero when there are 6.6 reported cases per 1000 (when, they say, there should be 1000 per 1000). That means that about 150 people really have had the disease for every person reported. (See the calculations above.)
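That ratio is simple enough to check in a few lines. The sketch below uses the 6.6 per thousand figure and the 60 million population quoted in this post; the names are only for illustration.

```python
UK_POPULATION = 60_000_000        # population figure used in this post
REPORTED_PER_1000_AT_ZERO = 6.6   # reported cases per 1000 when the rate hits zero

def real_cases_per_reported(reported_per_1000, infected_per_1000=1000.0):
    """How many real infections stand behind each reported case."""
    return infected_per_1000 / reported_per_1000

multiplier = real_cases_per_reported(REPORTED_PER_1000_AT_ZERO)

# Reported cases that would correspond to 30% of the UK being infected
implied_reported = 0.30 * UK_POPULATION / multiplier

print(f"real cases per reported case: {multiplier:.0f}")
print(f"reported cases implied by 18 million infections: {implied_reported:.0f}")
```

The multiplier comes out at a little over 150, and 18 million infections corresponds to roughly 119,000 reported cases, which fits with the ‘100,000 or so’ reported at the time.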

Here is where I take a different viewpoint and, from the data in this paper, and some of my other work, I can build a different narrative:

  1. The intercept has no range associated with it, so if I assume that the intercept value could be 25 per thousand, then there are 40 real cases for every person formally diagnosed with Covid-19. That’s still a bit on the high side, but it’s not inconsistent with the Imperial College paper’s proposal that 5% of those with Covid would end up in hospital, which would put the number of infected people per reported case at 20. At 20 per reported case, the proportion of infected people in the UK is about 3%.
  2. The paper assumes that the disease only stops spreading when the entire UK population has it. However, the SIR model predicts that if the Ro value was 1.1, only about 20% of the population would catch the disease, which gives a final value of 200 real cases per 1000 people, and so there would be 30 people (200/6.6) infected per case reported. Which again is not inconsistent with the 5% figure reported in the Imperial College paper.
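The numbers in these two alternatives can be reproduced with the short sketch below. The 25 per thousand intercept, the 5% hospitalisation figure, the 100,000 reported cases and the 20% final size are the values assumed in this post, not outputs of the paper.

```python
UK_POPULATION = 60_000_000   # population figure used in this post
REPORTED_CASES = 100_000     # 'the 100,000 or so' reported infections

def infected_per_reported(infected_per_1000, reported_per_1000):
    """Real infections per reported case at the end of the epidemic."""
    return infected_per_1000 / reported_per_1000

# Alternative 1: everyone ends up infected, but the intercept is 25 per 1000
scenario_1 = infected_per_reported(1000, 25)

# Cross-check: if 5% of infections end up in hospital (and get reported)
hospital_ratio = 1 / 0.05
share_infected = hospital_ratio * REPORTED_CASES / UK_POPULATION

# Alternative 2: only 20% are ever infected (Ro of about 1.1), intercept 6.6
scenario_2 = infected_per_reported(200, 6.6)

print(f"alternative 1: {scenario_1:.0f} infected per reported case")
print(f"5% hospitalised: {hospital_ratio:.0f} per reported case, "
      f"{share_infected:.1%} of the UK infected")
print(f"alternative 2: {scenario_2:.0f} infected per reported case")
```

All three ratios (40, 20 and about 30) are far below the 150 implied by the paper’s headline figure, which is the point of the alternative narrative.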

I should say that neither of the calculations above can be considered definitive. They are illustrative of an alternative but, I think, realistic analysis based on the data and charts in the paper.

What I learned from this paper

In my previous work I’d assumed that infected people were ill (and so part of the I group) for 20 days; this paper highlights that people spread the virus for only 5 days. That means that I need a more nuanced approach (which I’ll need to think about a bit).


I’m not an expert in this field, but I have questions about the analysis and the conclusions in this paper. BUT (and it’s a big BUT), scientists need to publish results and ideas so that the whole community (and especially the scientific community) can comment, enter a discussion and replicate the findings. We need the ability to be right, as well as wrong, and ‘maybes-aye-maybes-naw’ too! Part of the problem here is the media’s misunderstanding of that process, and this adds to my growing anxiety about the gap between ‘what scientists publish papers for’ and what the media (and public) think those papers ‘prove’. But that’s for another blogpost.

Postscript (1 hour later)

Internet Corryvreckan! I was going to tweet the lead author of the Covid analysis paper, Dr Adrian Heald, about this blogpost with the idea of discussing my counter-narrative and (hopefully) learning. But his feed’s being bombarded big time, and I’m staying well out of that. I think my points have not been raised in the twitter discussion so far, but I’m not going to add any fuel to that twitter firestorm! ….And so my anxiety about how science effectively functions just grows!
