Can’t see the wood for the trees? Making sense of data during a global pandemic

The HealthWatch Award 2020 was presented on 20th October to Professor Jennifer Rogers. Professor Rogers is Head of Statistical Research and Consultancy, PHASTAR and Vice President (External Affairs) at the Royal Statistical Society. It was a happy coincidence that her presentation was on World Statistics Day. A recording of her 50-minute presentation in full including the slide show and the questions and answers that followed can be seen online here. An adapted transcript of her talk follows.

News headlines are telling us what we should do, how we should live our lives, but headlines can be misleading and our own personal experiences can skew our understanding. Would it not be better to give people the tools they need to ask the right questions?

Let’s take for example the humble bacon sandwich, and warnings from news stories that eating bacon boosts your risk of cancer. Headlines like this in recent years caused bacon sales to plummet. But when we see headlines like this, we are looking at something called “relative risk”. It is not answering the real question, which is: what is our own individual “absolute risk”?

What is the chance of getting pancreatic cancer? The charity Cancer Research UK says that one person in 80 might develop it in their lifetime. If eating bacon is supposed to boost your risk of developing pancreatic cancer by 20%, that is actually increasing the absolute risk from 5 in 400 to 6 in 400 individuals. So, although 20% sounds scary, it’s 20% of what was quite a small number to start with.
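The arithmetic behind that comparison can be sketched in a few lines. This is only an illustration using the figures quoted above (a lifetime risk of 1 in 80 and a relative increase of 20%); the function name is made up for the example.

```python
from fractions import Fraction

def absolute_risk_after(baseline: Fraction, relative_increase: Fraction) -> Fraction:
    """Apply a relative increase (e.g. 20%) to a baseline absolute risk."""
    return baseline * (1 + relative_increase)

baseline = Fraction(1, 80)                       # lifetime risk quoted in the text
raised = absolute_risk_after(baseline, Fraction(20, 100))

per_400_before = baseline * 400                  # people affected per 400, before
per_400_after = raised * 400                     # people affected per 400, after
extra_per_400 = per_400_after - per_400_before   # the absolute change

print(per_400_before, per_400_after, extra_per_400)
```

The relative risk is a 20% increase, but the absolute change is one extra person in every 400.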

2020 has been a really interesting year for medical statistics. Government briefings are presenting data that are updated on a daily basis. There are all sorts of important questions about the virus that causes Covid-19 – how it is spread, where is the risk, what treatments are most effective. Amidst the flood of data, there has never been a more important time to use it to inform decisions.

So now we’re going to look at some of the challenges behind even some of the easiest questions.

How many Covid-19 cases are there in the UK? We have become used to seeing daily reported numbers of cases, and terms like 7-day moving averages, but how can we know how prevalent the disease is in the general population? Are cases going up or down? Are we any worse off now than we were in March?

The number of reported cases can only ever be a proxy, to help estimate what is going on in the general population. Because the way we test people has changed so much over the last 10 months, we can’t easily compare case numbers detected now with what was detected in March. Prevalence is defined as the proportion of the population who are positive. But we don’t have the capacity to test everyone, so we have to make estimates based on assumptions. Let’s assume the test is 100% accurate, and we detect 100 positive cases amongst 1000 tests. That gives us 10% prevalence, which suggests that 10% of people might have the disease. But we can’t know that for certain. In reality the test will sometimes give us false positives, and false negatives. Also, we are assuming the sample of the population being tested is picked at random. But we’re not doing that. We are mainly only testing people who have symptoms or who have been in contact with others who have symptoms.
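To see how an imperfect test alone shifts the numbers, here is a minimal sketch. It assumes a truly random sample and uses hypothetical values for sensitivity (the chance a true case tests positive) and specificity (the chance a non-case tests negative). These values are chosen only for illustration and are not figures for any real Covid-19 test.

```python
def expected_positive_rate(true_prevalence: float,
                           sensitivity: float,
                           specificity: float) -> float:
    """Share of tests expected to come back positive in a random sample."""
    true_positives = sensitivity * true_prevalence
    false_positives = (1 - specificity) * (1 - true_prevalence)
    return true_positives + false_positives

# 10% true prevalence, as in the example above.
perfect = expected_positive_rate(0.10, sensitivity=1.0, specificity=1.0)
# Hypothetical test characteristics, for illustration only.
imperfect = expected_positive_rate(0.10, sensitivity=0.80, specificity=0.99)

print(f"perfect test:   {perfect * 1000:.0f} positives per 1000 tests")
print(f"imperfect test: {imperfect * 1000:.0f} positives per 1000 tests")
```

Under these assumed values the share of positive tests no longer equals the true prevalence, and that is before allowing for the fact that the people being tested are not a random sample.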

What effect does this have? Reported numbers of cases in the UK did not seem to come down as fast as they did in other countries. But if we look at the same period there was an increase in the number of tests. So, prevalence of the disease is likely to be lower now than it was in March, even though numbers of positive tests are higher.

Who we are testing has been changing. At first it was mainly people hospitalized with bad symptoms. Lots of the general population had the disease but were not tested. Then we started testing the general public, using drive-in centres, and national surveillance data was being generated by random sampling of people. The test and trace system came in, with contacts of cases being invited for tests. Then schools started, and there are plenty of anecdotal stories of youngsters being sent home with a bit of a cold or freshers’ flu and not allowed to go back without a negative test, so the system has been inundated with people who were probably negative. This surge in demand for tests had such an impact on testing capacity that many people who were positive had difficulty getting tested. So, figures on test results alone make it difficult to infer anything about the actual prevalence of Covid-19.

Can we use actual deaths as a proxy for case numbers? It is certainly a hard end point. Data from the Office for National Statistics uses information from actual death certificates. Look at a graph of deaths over time: there is a reduction after the peak in April, then growth.

But the demographic has changed in that time. There are different risks associated with different demographics. In April, most cases of Covid-19 were being recorded in the older populations, who were turning up in hospital with severe symptoms, and who were at highest risk of dying from the disease. But at the end of September, much younger people were testing positive with the disease. The spread of the virus is changing, so we can’t use the death data to estimate the prevalence.

On the one hand, it is great that the government is being transparent about the figures, but when it comes to the number theatre of showing daily reported cases on the news, is it the most useful way to explain what is really going on? There have been instances of good communication at government press conferences, but I’ve also spotted a couple of things that I have taken issue with, and here are my biggest bugbears:

Rates among schoolchildren: On 30th September, Chris Whitty, when talking about figures on weekly test positivity, showed a graph that he said showed that rates among school age children are not going up. I am not sure I entirely agreed with him! The graph showed test positivity by different age groups, and it is true that the positivity rates among the younger groups didn’t seem to be changing based on the information in that graph. But that wasn’t necessarily true. Assume that we are testing children and the test positivity rate is 10%. Test more children and positivity remains 10%, but the number of cases will be going up. Over the period shown in the graph, from the beginning to the end of September, what was actually happening with testing? In fact, the actual number of tests over this period increased by a factor of 2 or 3. So just because the positivity rates stayed the same throughout, that doesn’t mean that the number of cases that were found to be positive also stayed the same. It seemed a slightly sneaky interpretation of the data that told a narrative that I think was quite convenient at the time.
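The point about positivity can be checked with a quick calculation. In this sketch the weekly test counts are hypothetical; only the 10% positivity rate and the roughly threefold rise in testing are taken from the text.

```python
def positives_found(tests_done: int, positivity_percent: int) -> int:
    """Number of positive results at a given test positivity rate."""
    return tests_done * positivity_percent // 100

# Hypothetical test counts: testing triples while positivity stays at 10%.
start_of_month = positives_found(10_000, 10)
end_of_month = positives_found(30_000, 10)

print(start_of_month, end_of_month)
```

A flat positivity rate here goes hand in hand with three times as many detected cases.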

Exponential growth: This is a phrase we hear quite a lot. On 21st September we saw a plot that showed what would happen if cases doubled every 7 days, and we were told that by the middle of October we could reach 50,000 new cases. On 30th September, when Patrick Vallance was questioned about this, he said, “Doubling means things get very big quickly”. But really, “exponential growth” doesn’t necessarily mean “fast”. It is more concerned with the way speed is changing. The speed of growth is proportional to the size of the population. Think of it in terms of doubling times. If the doubling time is a day, that gets big really quickly. But with a doubling time of 10 years, that gets big a lot more slowly. In the month before lockdown on 23rd March, it is possible to calculate that cases were doubling every 2-3 days. Now, I don’t want to say that cases will not increase rapidly again. But exponential growth should not be a blanket term used to scare people. Different surveillance studies looking at growth rates give us different values. The REACT-1 study at Imperial College London is doing random population sampling of volunteers to learn about community transmission, and is using this to estimate rates of prevalence, and what growth and doubling time might look like. In the latest publication it estimated that between 18th September and 5th October it was doubling every 29 days.
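For readers who want to see how a doubling time can be calculated from two counts, here is a minimal sketch. It assumes growth is exactly exponential between the two observations, and the case counts in the example are invented for illustration. It is not the method used by REACT-1 or any other study mentioned here.

```python
import math

def doubling_time(count_start: float, count_end: float, days_elapsed: float) -> float:
    """Doubling time in days, assuming pure exponential growth between two counts."""
    growth_rate = math.log(count_end / count_start) / days_elapsed  # per day
    return math.log(2) / growth_rate

# Invented counts: 100 cases growing to 800 over 9 days is three doublings.
print(round(doubling_time(100, 800, 9), 2))
```

The same growth law covers a doubling time of one day and one of ten years; only the rate differs.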

Now, the government figures given to us then projected a doubling of the number of cases every 7 days. But if you make it every 29 days, the curve is much flatter, which gives us 6,200 cases daily. The government presented the 7-day doubling time data with no measures of uncertainty. Look at what we actually ended up with in the middle of October: although there was a surge early on, there were lockdowns in place in some areas, and now it does seem like the doubling time is coming down.
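How much the assumed doubling time matters can be sketched as follows. The starting value of 1,000 daily cases and the 28-day horizon are assumptions made for this example, not figures from the government chart, so the outputs illustrate the shape of the two curves rather than reproduce the numbers quoted above.

```python
def growth_factor(days: float, doubling_time_days: float) -> float:
    """Multiplier applied to the starting value after `days` of exponential growth."""
    return 2 ** (days / doubling_time_days)

HORIZON_DAYS = 28      # assumed projection window (four weeks)
START_CASES = 1_000    # hypothetical starting number of daily cases

fast = START_CASES * growth_factor(HORIZON_DAYS, 7)    # doubling every 7 days
slow = START_CASES * growth_factor(HORIZON_DAYS, 29)   # doubling every 29 days

print(f"7-day doubling:  {fast:,.0f} daily cases")
print(f"29-day doubling: {slow:,.0f} daily cases")
```

Over four weeks, 7-day doubling multiplies cases 16-fold, while 29-day doubling has not quite doubled them once.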

There are attempts to get better estimates of the figures. The strategy of the REACT study is to invite people randomly to take part. The ONS survey is inviting households who have taken part in other surveys, doing home visits and offering reimbursement. This is not as random a sample as in the REACT study, and the estimated prevalence figures are different: REACT puts it at 45,000 new cases a day, while ONS estimates 17,200 a day. These differences may be due to the different sampling strategies that they have. But the findings for both will be interesting.

Covid has been a unique opportunity, with lots of challenges around messy data. We are in a pandemic that is brand new, and the knowledge landscape is changing constantly. There are many issues I’ve not been able to touch on here. One is the quality of the data, because there is always a balance between data that is available quickly but may be less accurate or harder to interpret meaningfully, and the data that we have to wait for but is more reliable. I’ve not mentioned risk data, but David Spiegelhalter has done some interesting work comparing risks of dying from coronavirus with annual general mortality rates at different ages. He has found that the risk of death from Covid-19 mirrors the way the risk of death generally increases with age.

This winter I’m going to be working with ITV as their Covid-19 statistician. I’ve been looking at how cases are distributed within cities, and finding that increases in cases in cities are being driven by increased numbers of students. It is also interesting that we are still seeing excess deaths at home but fewer deaths in hospital. There may be a conversation to be had not just about numbers of deaths but also quality of death. Test accuracy is another topic – new tests on the way may be more rapid, but will they be accurate? And of course, clinical trials for vaccines and for treatments, to be sure that anything we get on the market will be safe and effective.

Good data alone doesn’t help us make important decisions. We need that narrative, and statisticians will continue to be essential in the fight against Covid. At the end of this we will have a new appreciation of the value of numbers, and we will be leaning on statisticians to lead the way through this.

Adapted from a presentation by Jennifer Rogers, 20th October 2020