Visualizing Covid-19 Case Infection Rates across US States — July 2020

Determining relative Covid-19 rates by US State and County

What is Covid-19?

Reported Covid-19 Cases and Deaths

Given the severity of Covid-19, which has been linked to over 100,000 deaths in the United States (as of July 2020), I am surprised by how little public data we have on cases and deaths. The information we do have is updated daily and consists of detected case counts and likely death counts from Covid-19 by region. In this article, I will visualize Covid-19 transmission across the United States to get a better understanding of where and how fast the disease is spreading. If you want to jump straight to the implementation, look at this GitHub notebook.

Why analyze Covid-19?

I find this topic both timely and, from a data perspective, very interesting: the better we can quantify and understand the spread of Covid-19, the better we can collectively take steps to prevent further spread. We can also confirm and identify trends that may not be picked up by other media sources.

Covid-19 Case Data Sources

Any data analysis is only as good as the underlying data. The public data we do have is fairly rudimentary and consists of cases and deaths by geographic unit such as States and Counties. In this article, I will use a well curated and updated dataset from the Johns Hopkins Center for Systems Science and Engineering Github:

Covid Data Format

The data we will be using is a fairly simple set of rows of case counts by State and County. To get a State total, we have to sum the rows for a given state. Here is a code sample that displays the data, showing integer Covid-19 case counts by state (“Province_State”) and county (“Admin2”) by date in “wide” format:

Simple Schema — Rows of Case Counts by State and County (“Admin2”)
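As a rough illustration of the state-total step described above, here is a minimal sketch using pandas. The tiny DataFrame and its counts are made up for the example; only the column names “Province_State” and “Admin2” come from the dataset described here, and the variable names wide, date_cols and state_totals are hypothetical.

```python
import pandas as pd

# Tiny synthetic frame in the same "wide" layout: one row per county,
# one column per date. The counts are invented for illustration only.
wide = pd.DataFrame({
    "Province_State": ["New York", "New York", "Florida"],
    "Admin2": ["New York", "Kings", "Miami-Dade"],
    "7/1/20": [100, 80, 50],
    "7/2/20": [110, 85, 65],
})

# Every column that is not an identifier is treated as a date column.
date_cols = [c for c in wide.columns if c not in ("Province_State", "Admin2")]

# Sum the county rows belonging to each state to get one row per state.
state_totals = wide.groupby("Province_State")[date_cols].sum()
print(state_totals)
```

Grouping on “Province_State” keeps the wide layout, so each state ends up with a single row of cumulative counts per date.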

Questions we’re answering

One positive aspect of having limited data is that you can only attempt to answer some basic questions about Covid-19 in the United States:

  • Where is the relative number of cases increasing?
  • When did these cases increase?
  • How do we compare cases across different regions (states or counties)?

Questions we’re not answering (no public data)

The Johns Hopkins dataset is simple to use and understand but it does not answer other questions related to these topics:

  • Rate of Non-Pharmaceutical mitigation (wearing masks, social distancing, avoiding large gatherings)
  • Population or Population Density of Infected States or Counties
  • Weather (or other local factors) experienced during this period

“Flattening the Curve”

What curve are they referring to? Really, I didn’t know which curve the CDC or others were pointing to, so I had to draw my own conclusions from plotting the Covid-19 case data:

Inhibiting new infections to reduce the number of cases at any given time — known as “flattening the curve” — allows healthcare services to better manage the same volume of patients. — Wikipedia

The chart below shows the confirmed Covid-19 cases in the State of New York by county. To be explicit, in this instance, there is a New York State, New York County, and a New York City. I’ve used the variable ‘area’ to denote a generic term for a geography, in this case a county.
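One way to get from the wide table to a per-area series suitable for plotting is to reshape it to “long” format. The sketch below is an assumption about how that could be done, not the notebook's actual code: the sample counts are invented, and the column names date and cases are chosen for the example (only ‘area’ is named in the text).

```python
import pandas as pd

# Invented sample rows in the wide layout (one column per date).
wide = pd.DataFrame({
    "Province_State": ["New York", "New York"],
    "Admin2": ["New York", "Kings"],
    "7/1/20": [100, 80],
    "7/2/20": [110, 85],
})

# Reshape to one row per (county, date) and rename the county column
# to the generic 'area' term.
long_df = wide.melt(
    id_vars=["Province_State", "Admin2"],
    var_name="date",
    value_name="cases",
).rename(columns={"Admin2": "area"})

# Parse the date strings so the series can be sorted and plotted over time.
long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")
print(long_df)
```

With the data in this shape, each value of ‘area’ can be drawn as its own line on a chart.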

Decreasing Cases ‘S’ curve

While New York experienced a significant absolute number of cases, it has recently slowed the rate of growth. We see other large early Covid states such as New Jersey and Illinois also showing a slowdown of Covid-19 cases over time, resulting in an ‘S’ shape for the cumulative cases:

Manhattan’s New York County in Cornflower Blue
Chicago’s Cook County in Baby Blue
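To make the ‘S’ shape concrete, here is a small sketch that uses a logistic curve as a stand-in for cumulative cases. The logistic form and all of its parameters are illustrative assumptions, not a fit to any real state or county. The point is only that a flattening cumulative curve corresponds to daily new cases that rise, peak, and then fall toward zero.

```python
import numpy as np

# Illustrative assumption: a logistic curve stands in for cumulative cases.
days = np.arange(0, 100)
cumulative = 10000 / (1 + np.exp(-0.15 * (days - 50)))

# Daily new cases are the day-over-day differences of the cumulative series.
daily_new = np.diff(cumulative)

# The cumulative curve is steepest where daily new cases peak.
peak_day = int(np.argmax(daily_new))
print("daily new cases peak around day", peak_day)
print("daily new cases on the final day:", round(float(daily_new[-1]), 2))
```

The cumulative series never decreases, so “flattening” shows up as the daily differences shrinking, not as the curve itself turning downward.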

States with Increasing Cases

While several early states show a flattening of the curve, other states, including Florida, Arizona, Texas and California, show ever-increasing Covid cases. These are clearly not plateauing: the cumulative curves keep growing and lack the ‘S’ shape.

Maricopa county in light Green.
Harris, Dallas counties with growing cases.
Los Angeles county in light pink.

How do we compare rates across States?

We can see which states are obviously growing and which show a flat ‘S’ curve, but how do we compare one state to another? Comparing the absolute number of cases is not appropriate, as states have dramatically different populations and infection start dates, as well as density and cultural differences. A common metric would allow us to compare relative growth normalized to size.

  • Sum the slope over last ’n’ periods to determine flatness of overall curve
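import numpy as np

y = slice_df.head(14)['cases']  # case counts from the first 14 rows of the slice
lastY = y.iloc[0]  # cases seen at start of period (positional lookup, independent of index labels)
x = np.arange(0, len(y))
polynomialDegree = 1  # degree 1 = straight-line fit
res = np.polyfit(x, y, polynomialDegree)
# y = res[0] * x + res[1] <---- res[0]=slope and res[1]=intercept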
  1. change = slope fit for that set of cases
  2. size = cases for that week
  3. lastN = days relative to current date (7 day periods), e.g. -98 is 14 weeks ago (98 = 7 * 14)
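The three quantities listed above (change, size, lastN) could be produced per 7-day window roughly as follows. This is a minimal sketch, not the notebook's actual code: the function name weekly_slopes, the newest-first window order, and the reading of size as the cumulative count at the end of each window are all assumptions.

```python
import numpy as np

def weekly_slopes(cases, period=7):
    """Fit a straight line to each `period`-day window, newest window first.

    `cases` is a sequence of cumulative counts ordered oldest to newest.
    """
    values = np.asarray(cases, dtype=float)
    rows = []
    for w in range(len(values) // period):
        end = len(values) - w * period
        window = values[end - period:end]
        x = np.arange(len(window))
        slope, intercept = np.polyfit(x, window, 1)
        rows.append({
            "lastN": -w * period,         # days relative to the latest date
            "change": float(slope),       # fitted cases per day in this window
            "size": float(window[-1]),    # cumulative cases at the window's end
        })
    return rows

# Usage with synthetic data growing by exactly 10 cases per day for 21 days.
rows = weekly_slopes([10 * d for d in range(21)])
for row in rows:
    print(row)
```

Because the synthetic series is perfectly linear, every window should report a fitted slope of about 10 cases per day.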

Detailed Example of Numpy PolyFit

Here is a sample showing how we fit a polynomial to a set of values. We use this to fit the number of cases and estimate a slope for that series.

Rough m=Slope Calculation: y=mx + b
  • At the start of the week, Florida had 5473 cases
  • One week later, 14 weeks ago (lastN = -98 = 7 * 14), Florida had 13324 cases
  • Increase of 7851 over 7 days (~1120 per day)
Normalized Case Growth over Time

Using the above polynomial fitting, we can model the growth in the early states and how that growth has slowed. The bubble size represents the cumulative number of cases (which, being cumulative, can only grow or stay flat), while the fitted week-over-week slope shows a downward trend:

  • New Jersey (red), neighboring New York, was closely correlated
  • Illinois (green) rose then dropped in cases

Normalize by Percentage Change

Up to now we have the change in absolute cases and a slope calculation, but we cannot easily compare the number of cases between, say, a densely populated state and a sparsely populated state.

For example, on March 10 (n=150) there were 75 times as many cases as on March 3 (n=2): an increase of 74 times the starting count, or 7,400%.
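The arithmetic behind that comparison, as a small sketch. The helper name relative_change is hypothetical and only wraps the usual (new - old) / old formula.

```python
def relative_change(old, new):
    """Return the change from `old` to `new` as a multiple of `old`."""
    return (new - old) / old

# March 3 (n=2) to March 10 (n=150).
change = relative_change(2, 150)
print(change)        # 74.0
print(change * 100)  # 7400.0, the same change expressed as a percentage
```

Because the result is relative to the starting count, it can be compared between a state with thousands of cases and one with only a handful.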

Rank States by Flattening or Growing Curve

With the number of cases, we can now rank the states by normalizing the cases as a percent change over the last N days. Here we set N=14 to rank by the percentage change over the last 14 days:

Rank States by % Change over last N days
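A ranking like this could be computed along the following lines. The state names and counts below are synthetic placeholders, and the layout (one column per state, one row per day, oldest first) is an assumption made for the sketch, not the notebook's actual structure.

```python
import pandas as pd

# Synthetic cumulative counts: one column per state, one row per day,
# oldest first. All numbers are invented for illustration.
cumulative = pd.DataFrame({
    "State A": [1000 + 10 * d for d in range(15)],
    "State B": [100 + 20 * d for d in range(15)],
    "State C": [5000 + 5 * d for d in range(15)],
})

N = 14
start = cumulative.iloc[-(N + 1)]  # counts N days before the latest row
end = cumulative.iloc[-1]          # latest counts

# Percent change over the last N days, largest increase first.
pct = ((end - start) / start * 100).sort_values(ascending=False)
print(pct)
```

In this synthetic example State B ranks first even though it has the fewest cases, which is the effect the percentage normalization is meant to expose.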

Root Awakening

During this most recent time period, we can now identify the states that have had the most change, even if the number of cases is relatively small. In this instance, Idaho (the potato state) has shown the largest increase in the last 14 days:

County-level Growth Rates

We can apply a similar calculation to identify the rates of Covid cases within a state. For example, in Florida most counties show small rates of increase in Covid-19 cases, but large counties such as Miami-Dade, Broward and Hillsborough are still growing rapidly, exhibiting exponential increases:

States without Lockdowns

Finally we look at the results of states that did not implement a lockdown. As far as I know these states were: Arkansas, Iowa, Nebraska, North Dakota, South Dakota, Utah and Wyoming. Do you see any obvious trend with these states?

Conclusion and Take-aways

We can look at the public Covid-19 case data from Johns Hopkins to identify where Covid-19 cases are growing and where growth is slowing. This notebook will help you visualize the rate of cases by state and county. By fitting a slope to the growth of cases, we can first roughly sort states, then normalize the data via percentage change to compare across different states, for example New York and Florida. Furthermore, by using a similar model for county-by-county cases, we can compare across counties, even those in different states.

Predictive Models

We can also use this data to make predictions using Markov Chain Monte Carlo (MCMC), but given the range and changing behavior of state populations with respect to Covid-19, it seems we need more data that captures state-specific behavior. For example, the current datasets do not measure social distancing, population mobility, or face covering compliance, making it difficult to assess the effectiveness of non-pharmaceutical mitigations. Until then, we can use this notebook to track the case statistics and make inferences on past behavior.

Next Article — Grading the Results

We can grade other municipalities by their case rate relative to their population. We can then compare across States, Counties, and Countries that have had varying approaches to Covid-19. This would help us identify what factors contributed to different outcomes in differing states or counties. From there we can take a similar approach to grading the impact of Covid-19 on a specific country, state or county.

