Covid19: a New Way to Look at the Data

Irene's Cauldron
7 min read · Apr 16, 2020

Summary: since case numbers depend heavily on test numbers, looking at case numbers without the test numbers is not enough. This article proposes a new way to visualize the data and understand where we are.

I’ve also made a presentation on my personal channel. The purpose of the presentation is to provide a new perspective to understand the development of the pandemic. It does not provide any advice regarding medical or public health issues. Due to limited time, I only looked at the state-level data in the US.

The data comes from 3 sources: the number of confirmed cases from NYT, the number of tests provided by the Covid Tracking Project, and the global data from Our World in Data Org. I really appreciate their work to maintain the datasets. All the data used in this presentation and article was last updated on April 11.

There are various data visualization dashboards on the Internet. The majority of them either demonstrate the exponential nature of the cases or map the growth in different regions. These plots are useful in their own ways. But from a data scientist’s perspective, I feel something is missing, especially when you look at maps like the one below. We all know that states like New York and New Jersey are fighting a tough war, but what about states that currently have far fewer confirmed cases? Are they necessarily safer?

source: me

Before continuing on that question, let me ask you, what kind of data will make you feel safe during such a pandemic?

Suppose your community has 100 people, of which 5 people are infected, and there are two scenarios.

Even though the case numbers are the same in the 2 scenarios, I think you would agree that we would feel much safer in the first scenario. The key, as you have probably discovered, is the percentage of people who were tested in order to find the infected ones.

How exactly will the number of tests affect our estimates of the virus spread? We can use this illustration to find out. The grey blocks are the untested population, the green blocks are the tested and negative, and the red ones are the tested and positive. Each day, as more tests are given, the cumulative positive rate changes. Depending on the distribution of the infected population, we may heavily underestimate the spread of the virus in the early days, when only a small percentage of the population has been tested.

Of course, the animation is an oversimplification of reality. There are several hidden assumptions: first, the unknown number of infections is constant (the virus does not spread while more tests are given); second, this community has the capacity to test all of its population, regardless of whether they have symptoms or have been exposed to the virus.
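To make the animation concrete, here is a minimal sketch of the same toy model in Python. It reuses the 100-person community with 5 infections from the example above and keeps both hidden assumptions (infections stay fixed, everyone can be tested). The function name, the 10 tests per day, and the random testing order are my own assumptions for illustration, not part of the original animation.

```python
import random


def cumulative_positive_rates(population, infected, tests_per_day, seed=0):
    """Return the cumulative positive rate after each day of testing.

    Toy model: the number of infections is fixed and people are tested
    in a random order (both are simplifying assumptions).
    """
    rng = random.Random(seed)
    # 1 marks an infected person, 0 a healthy one.
    people = [1] * infected + [0] * (population - infected)
    rng.shuffle(people)  # random testing order, an assumption of this sketch

    rates = []
    for tested in range(tests_per_day, population + 1, tests_per_day):
        positives = sum(people[:tested])  # positives found so far
        rates.append(positives / tested)  # cumulative positive rate
    return rates


rates = cumulative_positive_rates(population=100, infected=5, tests_per_day=10)
for day, rate in enumerate(rates, start=1):
    print(f"day {day}: {day * 10}% tested, cumulative positive rate {rate:.1%}")
```

Once everyone has been tested, the rate is exactly 5%. On earlier days it depends on who happened to be tested first, which is the point of the animation: with few tests, the rate can be far from the true value in either direction.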

However, in reality, these assumptions do not hold and we can’t use the sample positive rate to estimate the population’s infection rate directly. First, the virus is spreading every minute, so the actual infection rate is constantly changing. Also, we can’t test the whole population, so only the people that meet their local testing criteria can be tested. These people might be more likely to have positive results than the general population, and each testing batch is not drawn randomly from the population.

But the same principle holds: the more tests we have, the more meaningful all the statistics will become. Case numbers alone lack the key information of how closely we are tracking the virus, and are CAPPED by test numbers, which are influenced by so many man-made decisions.

A scatter plot that adds the percentage of population tested is what I would propose to evaluate each state. As shown in the plots below, the y-axis is the positive rate (# of positive results / # of tests) and the x-axis is the percentage of the population tested (# of tests / # of the population). Each bubble corresponds to a state; the size of the bubble is proportional to the number of confirmed cases in that state; the color corresponds to the positive rate.
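The two axes are simple ratios. The sketch below computes them for a couple of placeholder states; the state names and all numbers are made up for illustration and are not taken from the NYT or Covid Tracking Project data, and the function name is hypothetical.

```python
def scatter_coordinates(positives, tests, population):
    """Return (x, y) for one bubble in the proposed scatter plot.

    x = share of the population tested  (# of tests / # of the population)
    y = positive rate                   (# of positive results / # of tests)
    """
    pct_tested = tests / population
    # Guard against division by zero for a state with no tests yet.
    positive_rate = positives / tests if tests else 0.0
    return pct_tested, positive_rate


# Placeholder numbers, NOT real data.
states = {
    "State A": {"positives": 500, "tests": 2_000, "population": 1_000_000},
    "State B": {"positives": 500, "tests": 20_000, "population": 1_000_000},
}

for name, s in states.items():
    x, y = scatter_coordinates(s["positives"], s["tests"], s["population"])
    print(f"{name}: {x:.2%} of population tested, positive rate {y:.1%}")
```

Both placeholder states have the same case count, which would also set the bubble size, but they land in very different places on the plot because one has tested ten times more people.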

So the first such plot shows where each state was on March 10, at the beginning of the spread in the US. Back then, almost all the states had very few tests, and most of them had a positive rate below 20%. The one state that had officially entered the war and started to ramp up tests was Washington state. New York state had a very high positive test rate before it expanded testing.

March 10

One week later, the positive rate in New York state started to drop to below 14% as more tests were given. On March 20, however, the positive rate increased again, even with 5 times more tests than 5 days earlier.

March 20

In the days that followed, as more states declared states of emergency, we can see the bubbles grow larger and move up faster. This summarizes the dreadful trend: most states saw the positive rate increase as more tests were given.

The virus was spreading faster than we could track it.

April 5

On April 5, Oklahoma turned out to be the state with the highest positive rate. But as we explained earlier, with its limited number of tests given, the positive rate is not really useful. We need to see if the positive rate remains that high when more tests are given. Luckily, it seems that Oklahoma did not follow the trajectory of New York.

This new scatter plot below shows how the situation in Oklahoma developed. Each bubble now corresponds to a different date. Since the x-axis uses cumulative data, the bubbles can only move to the right, so going from left to right means moving forward in time. The shade of the bubbles also becomes darker as time goes on. For better comparison, the size of the bubble now corresponds to the number of new cases, not cumulative cases. It’s a good sign that as more tests are given, the positive rate does not move up.
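Here is a rough sketch of how one state's trajectory could be derived from cumulative daily counts. The function name and the input numbers are hypothetical; the point is only to show how the three quantities in the plot (x position, y position, bubble size) relate to the raw series.

```python
def trajectory(cumulative_tests, cumulative_positives, population):
    """Turn cumulative daily series into one bubble per day.

    Each bubble has:
      pct_tested    -> x-axis (cumulative, so it never decreases)
      positive_rate -> y-axis (cumulative positives / cumulative tests)
      new_cases     -> bubble size (difference from the previous day)
    """
    points = []
    previous_positives = 0
    for tests, positives in zip(cumulative_tests, cumulative_positives):
        points.append({
            "pct_tested": tests / population,
            "positive_rate": positives / tests if tests else 0.0,
            # For the first day this equals the cumulative count so far.
            "new_cases": positives - previous_positives,
        })
        previous_positives = positives
    return points


# Placeholder series, NOT real data for any state.
for p in trajectory([100, 300, 600], [10, 30, 45], population=10_000):
    print(p)
```

In this made-up series the positive rate holds steady and then falls as testing expands, which is the kind of shape described for Oklahoma above.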

Trajectory of Oklahoma over Time

In contrast, below is the trajectory of New York state. Even though the state has expanded testing drastically, the positive rate does not go down and the daily new cases remain high. (But one good thing is that the positive rate does not seem to be increasing.)

Trajectory of New York State over Time

By plotting different states in this same framework, we can compare their relative situations. As the plot below suggests, New Jersey has an even higher positive rate than New York and lags behind in providing more tests. As for Massachusetts, represented by the blue bubbles, its positive rate is increasing faster than the three others as more tests are given. That is not a good sign.

Comparison of States

One problem with this plot is that the comparison between states may not be fair, whether we use the absolute number of tests or the percentage of the population tested. You can argue against either choice.

So what will be a clear sign that things are improving? Below is the trajectory of South Korea, the country that set an example of quickly expanding testing capacity and getting things under control. It’s believed that South Korea has passed its peak of the pandemic. In this plot, we can see that both the positive rate and the number of daily new cases started to drop very soon.

To sum up, a good trajectory looks like this: as more tests are given, both the positive rate and the number of new cases go down.

We cannot look at case numbers alone because, in the extreme scenario, if we simply stopped giving tests, the number of new cases would immediately drop to zero.
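As a rough illustration of this rule of thumb, the sketch below compares two points on a trajectory. The function name, the dictionary keys, and the numbers are all hypothetical, and comparing only two days is a deliberate simplification, not a real epidemiological test.

```python
def looks_improving(earlier, later):
    """Rule of thumb from the text: things look better only if testing
    kept expanding AND both the positive rate and new cases went down."""
    more_tests = later["pct_tested"] > earlier["pct_tested"]
    lower_rate = later["positive_rate"] < earlier["positive_rate"]
    fewer_cases = later["new_cases"] < earlier["new_cases"]
    return more_tests and lower_rate and fewer_cases


# Placeholder points, NOT real data.
last_week = {"pct_tested": 0.010, "positive_rate": 0.20, "new_cases": 400}

# Testing expanded, positive rate and new cases both fell.
kept_testing = {"pct_tested": 0.020, "positive_rate": 0.15, "new_cases": 250}

# Testing stopped: new cases drop to zero, but nothing was actually measured.
stopped_testing = {"pct_tested": 0.010, "positive_rate": 0.20, "new_cases": 0}

print(looks_improving(last_week, kept_testing))     # True
print(looks_improving(last_week, stopped_testing))  # False
```

The second case is the extreme scenario described above: the case count alone looks great, but because the share of the population tested did not move, the check refuses to call it an improvement.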

Trajectory of South Korea

Lastly, I want to say: let’s support our health care providers and essential workers by staying home and practicing social distancing.

Note:

This is a personal project that I finished over three weekends. There might be mistakes due to the limited time. Feedback is welcome, and thank you for reading/watching.
