# The problem of testing

Some friends and I have been swapping ideas online about Covid-19 testing, and that got me thinking about some of the tricky decisions the government (or health authorities) is facing. What looks like a normal ‘do more, see more’ is actually ‘do more, see the same’.

The problem is that when you test for a disease which exists in the community at a low level (a few percent), getting consistent, reliable results is statistically very difficult unless you run a large number of tests.

Light rain needs a big puddle.

There are two ways of thinking about this: imagination and maths. Imagination gives a qualitative feel to the argument and maths gives a quantitative value (or a series of values). I’ll leave it up to you to decide which one you want to go for: the ‘imagination version’ is below and you can geek out to the maths version at the very bottom of the post.

Imagine…

You’re walking the dog along a lakeside on a day when the rain comes and goes. You’ve your hood up and your headphones on. How do you tell how fast it’s raining? If it’s raining fast you can see the raindrops in a small puddle. If it’s light rain you need to look out to the lake to see if the drops are landing on the larger water surface. If the lake is all still, it’s not raining at all. It’s the same with disease testing: when the number of people infected with Covid-19 is low (by which I mean less than 10%), you need a lot of tests to get reliable numbers.

Low numbers need better test kits

Sometimes, if it’s overcast, you might want to know if it’s raining very lightly, so you look out over the lake and you can see, very occasionally, rings spreading out from different points on the water. But it’s not actually raining: it’s insects landing on the surface. Your puddle/lake test has given you a false positive, and this is only really noticeable when there’s very little, or no, rainfall.

As with rainfall, medical tests often have one result, with two flavours. In the case of Covid-19 the result is positive (you have, or have had, the virus) or negative (you have not got, or had, the virus). The problem comes with the small percentage of results that are the wrong way round. A medical test should have a high sensitivity (where a true positive is reported as a positive), and a high specificity (where a true negative is reported as a negative).

Let’s take an imaginary test which has a sensitivity of 90% and a specificity of 90%, and use it to screen a population of 1000 people when 10 people (1% of the population) have the condition we’re screening for. After everyone has been tested: nine people would be correctly told they had the condition, and one would be told they didn’t have the condition when they did; 891 people would be correctly told they didn’t have the condition, and 99 would be told they had the condition when they didn’t. Those 99 false positives (roughly ten times the number of real cases) would then go on to have more investigations (which may be invasive and carry risk of injury or harm) to determine if they have the condition or not.
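If you want to play with the screening numbers yourself, here is a minimal Python sketch of the arithmetic. The function name `screen` and its parameters are made up for illustration, and it only works out the expected counts (it ignores chance variation from one screening run to the next).

```python
def screen(population, infected, sensitivity, specificity):
    """Expected outcome counts when everyone in `population` is tested once."""
    healthy = population - infected
    true_pos = infected * sensitivity    # infected people correctly flagged
    false_neg = infected - true_pos      # infected people the test misses
    true_neg = healthy * specificity     # healthy people correctly cleared
    false_pos = healthy - true_neg       # healthy people wrongly flagged
    # Share of all positive results that belong to people who are really infected.
    real_share = true_pos / (true_pos + false_pos)
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
        "false_pos": false_pos,
        "real_share_of_positives": real_share,
    }


# The imaginary test from the text: 1000 people, 10 infected, 90% / 90%.
result = screen(population=1000, infected=10, sensitivity=0.9, specificity=0.9)
for name, value in result.items():
    print(f"{name}: {value:.3g}")
```

With these inputs only 9 of the 108 positive results (about 8%) belong to people who really have the condition. Try raising `infected` to 250 and watch that share climb.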

This might put into perspective some of the issues the government is having with getting the tests for Covid-19 up and running. All we get from the headlines is that the ‘government had bought tests that don’t work’. But for these sorts of tests ‘not working’ needs more nuance: it depends on what proportion of the people you’re screening will have the condition, and what the consequences of false positives or false negatives are for the people tested (and, in the case of Covid-19, the wider community).

At the end of April a paper was put on a pre-print server (see note 1) by a team of Californian scientists reporting that the prevalence of Covid-19 infection was several times higher than expected (at the time they cited 2-3%), and from that figure they concluded that the fatality rate of Covid-19 was less than 0.2%. The study has been criticized for using a test with a low specificity value while studying a disease with a low infection rate within the community (although the authors say they have compensated for that). The screening example above shows that there are no easy answers to that problem.

Finally, it’s worth mentioning that the issues around sensitivity and specificity, and their impact on whole-population medical screening programs, are actively debated.

Geographical location

What makes the whole testing issue even more complicated is that different parts of a single country will have different levels of infection: this link (accessed on 19th May 2020) shows that the English local authorities of Sunderland and Gateshead (in the north east corner) have a Covid-19 infection level 5 times that of Dorset (in the south west corner).

At a low disease level, reliable results only come about from a very accurate test, and lots of them. Where should a government put its testing resources? In areas with low infection rates, where disease spread is not so great but you need large numbers of tests to get an accurate result, or in the areas where there is a high rate of infection, where more tests are needed because more people are sick?

The science bit – concentrate!

Well done for reading this far! If science and maths aren’t your thing, this might be the time to stop reading and go back to Twitter or Facebook. It gets a bit geeky from here!!

The limitation of sample size occurs across different branches of science. In analytical testing, it is known that as the concentration of a compound being measured decreases the precision of the results gets worse (as measured by the relative standard deviation). This is known as Horwitz’s Trumpet after the FDA scientist who first noticed it.
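For the curious, the trumpet is usually summarised by the Horwitz equation, which predicts the relative standard deviation (in %) from the concentration alone. The formula isn’t given above, so treat this as a sketch of the commonly quoted form, with the concentration expressed as a mass fraction (1 means 100%, 1e-6 means one part per million). The function name is made up for illustration.

```python
import math


def horwitz_rsd_percent(mass_fraction):
    """Predicted relative standard deviation (%) from the commonly quoted
    Horwitz equation: RSD = 2 ** (1 - 0.5 * log10(C)), C as a mass fraction."""
    return 2 ** (1 - 0.5 * math.log10(mass_fraction))


# Walk down the concentration scale: pure substance, 1%, 1 ppm, 1 ppb.
for label, c in [("100%", 1.0), ("1%", 1e-2), ("1 ppm", 1e-6), ("1 ppb", 1e-9)]:
    print(f"{label:>6}: predicted RSD about {horwitz_rsd_percent(c):.0f}%")
```

The predicted spread doubles for every hundredfold drop in concentration, going from about 2% for a pure substance to about 16% at one part per million, which is what gives the plot its trumpet shape.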

It’s seen in powder mixing, where it’s known as ‘scale of scrutiny’. Consider the three diagrams below where the grids have 1%, 5% and 20% of their squares set as black. Imagine using a 3 x 3 square to ‘select’ part of the image for sampling. A small sampling square cannot give an accurate estimate of the proportion of black boxes until that proportion gets to a larger value.
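The same thought experiment can be run in code. This is only a sketch under my own assumptions: a 30 x 30 grid with the black squares scattered at random, sampled by every possible 3 x 3 window. The grid size, the seed and the function names are all made up for illustration.

```python
import random


def make_grid(size, fraction_black, rng):
    """Square grid of 0/1 cells with the requested share of black (1) cells."""
    n_black = round(size * size * fraction_black)
    cells = [1] * n_black + [0] * (size * size - n_black)
    rng.shuffle(cells)
    return [cells[row * size:(row + 1) * size] for row in range(size)]


def window_estimates(grid, window=3):
    """Fraction of black cells seen by every possible window x window sample."""
    size = len(grid)
    estimates = []
    for top in range(size - window + 1):
        for left in range(size - window + 1):
            black = sum(grid[top + i][left + j]
                        for i in range(window) for j in range(window))
            estimates.append(black / window ** 2)
    return estimates


rng = random.Random(1)  # fixed seed so the run is repeatable
for fraction in (0.01, 0.05, 0.20):
    estimates = window_estimates(make_grid(30, fraction, rng))
    empty = sum(e == 0 for e in estimates) / len(estimates)
    print(f"{fraction:.0%} black: {empty:.0%} of 3 x 3 samples see no black at all")
```

A 3 x 3 sample can only ever report multiples of 1/9 (about 11%), so it has no way of reporting 1% or 5%, and on the 1% grid the great majority of samples come back completely empty.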

Then there’s the heavy maths approach!

It took me a while to sort this one out, and I spent some time looking up an old maths book on combinatorial theory (!). The easiest way for me to think about this was using binomial distributions and I started building them by hand in Excel before a quick Google showed me the ‘BINOM.DIST’ function I needed!

An example of the binomial distribution problem might be answering the question ‘what is the chance of me throwing two dice and getting a one and a six?’. The probability of getting a one on the first die is 1/6, and the probability of getting a six on the second die is 1/6, so multiplying those we get 1/36. However, getting a six on the first die and a one on the second also counts (and that’s a different ‘permutation’), so we now have 2 x 1/36 = 1/18: 1/18 being the actual chance of ‘getting a one and a six by throwing two dice’. By using probability equations for the two possible outcomes (1&6, and 6&1) and the number of different permutations you can get (two) the binomial equation can be derived.
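If you don’t trust the multiplication, you can simply list all 36 ways two dice can land and count the ones that qualify. A minimal Python sketch (the variable names are mine):

```python
from fractions import Fraction
from itertools import product

# Every (first die, second die) combination: 6 x 6 = 36 equally likely rolls.
rolls = list(product(range(1, 7), repeat=2))

# Keep the rolls that show exactly a one and a six, in either order.
hits = [roll for roll in rolls if set(roll) == {1, 6}]

chance = Fraction(len(hits), len(rolls))
print(hits)    # [(1, 6), (6, 1)]
print(chance)  # 1/18
```

Counting the orderings and multiplying by the probability of any one ordering is the same trick the binomial equation uses.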

The graphs below show the binomial distributions for 10, 20 and 50 tests, at infection levels of 1%, 5%, 10% and 25% of the population. The lines on these graphs show the probability of getting a particular result if you were to do that number of tests. The legend also gives you where the ‘peak of the curve’ should occur, and what the overall probability of being able to measure any values on the curve at all would be.

Let’s look at an example. Take the second plot, where we test 20 samples from a population in which 25% of the people are infected. We’d perhaps expect to find five samples which test positive (25% of 20 is 5), but in fact we only get that number of positive tests 20% of the time!! Roughly 13% of the time you’d get 3 positive tests (60% of the true value), and about 11% of the time you’d get 7 positive tests (140% of the true value).
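You don’t need Excel to check these figures. Here is a short Python sketch of the binomial formula (number of orderings, times the probability of any one ordering). It assumes a perfect test and that each sample is an independent random pick from the population; the function name is my own.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k positives in n independent tests when a
    fraction p of the population is infected (perfect test assumed)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 20, 0.25
for k in (3, 5, 7):
    print(f"{k} positives out of {n}: {binomial_pmf(k, n, p):.3f}")
```

This gives about 0.20 for five positives and a little over 0.10 for three or for seven, so the ‘expected’ answer turns up only about one time in five.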

“What sort of mathematical sorcery is this?” I hear you cry. Well, try this. Toss a coin 40 times, keep a note of how many heads and tails you get, and keep trying it until you get exactly 20 heads and 20 tails! You will get there after a few repeats! That’s the binomial distribution.
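If you’d rather not wear out your thumb, here is a small Python sketch of the same experiment. It works out the exact chance of a 20/20 split and then simulates a few runs; the seed and the function name are arbitrary choices of mine.

```python
import random
from math import comb

# Exact chance of exactly 20 heads in 40 fair tosses.
p_even_split = comb(40, 20) / 2 ** 40
print(f"chance of a 20/20 split: {p_even_split:.4f}")
print(f"average number of attempts needed: {1 / p_even_split:.1f}")


def attempts_until_even_split(rng, tosses=40):
    """Repeat a run of `tosses` coin flips until heads and tails are equal."""
    attempts = 0
    while True:
        attempts += 1
        heads = sum(rng.random() < 0.5 for _ in range(tosses))
        if heads * 2 == tosses:
            return attempts


rng = random.Random(2020)  # fixed seed so the run is repeatable
print([attempts_until_even_split(rng) for _ in range(10)])
```

The exact chance is about 0.125, so the ‘obvious’ result of 20 heads and 20 tails only turns up about one time in eight.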

So what do these plots tell us? Most of the time, at the 1% infection level, we wouldn’t find a single positive sample: even with 50 tests the total probability that we find any positives is less than 0.5 (or 50%). With 10 tests we only get above a 0.9 probability of detecting any positive samples when we’re at a 25% infection level.

In summary: most of the media narrative seems to be based upon an arithmetical approach to sampling and analysis, but, even without considering sensitivity and specificity, we need a probabilistic story. And that’s not really one for bedtime!

___________________

Notes

1. Pre-print servers have become super-popular with scientists studying Covid-19 in the last 6 months. Publishing a paper in the usual scientific journals can take weeks (if not months!), so many scientists are now putting their research findings onto pre-print servers to share information faster. The scientific community is worried that pre-print papers have not been critiqued by ‘peer-review’, so there is no ‘quality control’ of the work, and it may end up being wrong or misleading people. However, peer-review is far from a guarantee of repeatable and robust science: the difficulty of reproducing published results is known as the ‘replication crisis’.