Visualizing Covid-19 Case Infection Rates across US States — July 2020
What is Covid-19?
On February 11, 2020 the World Health Organization announced an official name for the disease that is causing the 2019 novel coronavirus outbreak, abbreviated as COVID-19. In COVID-19, ‘CO’ stands for ‘corona,’ ‘VI’ for ‘virus,’ and ‘D’ for disease. Formerly, this disease was referred to as “2019 novel coronavirus” or “2019-nCoV”.
Reported Covid-19 Cases and Deaths
Given the severity of Covid-19, which has been linked to over 100,000 deaths in the United States (as of July 2020), I am surprised by how little public data we have on cases and deaths. The information we do have is updated daily and consists of detected case counts and likely death counts from Covid-19 by region. In this article, I will visualize Covid-19 transmission across the United States to get a better understanding of where and how fast the disease is spreading. If you want to jump straight to the implementation, look at this GitHub notebook.
Why analyze Covid-19?
I feel this topic is very timely and, from a data perspective, it is very interesting to quantify the spread of Covid-19; the better we understand the spread, the better we can collectively take steps to prevent it. We can also confirm and identify trends that may not be picked up by media sources.
Covid-19 Case Data Sources
Any data analysis is only as good as the underlying data. The public data we do have is fairly rudimentary and consists of cases and deaths by geographic unit, such as states and counties. In this article, I will use a well-curated and regularly updated dataset from the Johns Hopkins Center for Systems Science and Engineering GitHub:
CSSEGISandData - Overview
Covid Data Format
The data we will be using is a fairly simple table of case counts by state and county, with one row per county. To get a state total, we have to sum the rows for a given state. Here is a code sample that displays the data, showing integer Covid-19 case counts by state (“Province_State”) and county (“Admin2”) by date in “wide” format:
We have similar format data for Covid-19 deaths but I will not be analyzing that data in this notebook.
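As a rough illustration of the wide format and the per-state summation described above, here is a minimal sketch. The numbers are made up, the helper name `state_totals` is hypothetical, and the real Johns Hopkins file carries extra metadata columns that would also need to be excluded before summing.

```python
import pandas as pd

# Made-up sample in the same "wide" layout: one row per county, one column per date.
wide = pd.DataFrame({
    "Province_State": ["New York", "New York", "Florida"],
    "Admin2": ["Kings", "Queens", "Miami-Dade"],
    "6/15/20": [100, 200, 50],
    "6/16/20": [110, 220, 70],
})

def state_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Sum the county rows of each state, keeping one column per date."""
    # Assumption: every column other than these two identifiers is a date column.
    date_cols = [c for c in df.columns if c not in ("Province_State", "Admin2")]
    return df.groupby("Province_State")[date_cols].sum()

totals = state_totals(wide)
print(totals)
```

Each row of `totals` is then a single state's cumulative case series, which is the shape the later state-to-state comparisons need.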
Questions we’re answering
One positive aspect of having limited data is that it narrows the scope: we can only attempt to answer some basic questions about Covid-19 in the United States:
- Where is the absolute number of cases increasing?
- Where is the relative number of cases increasing?
- When did these cases increase?
- How do we compare cases across different regions (states or counties)?
Questions we’re not answering (no public data)
The Johns Hopkins dataset is simple to use and understand but it does not answer other questions related to these topics:
- Age of those infected by Covid-19
- Rate of Non-Pharmaceutical mitigation (wearing masks, social distancing, avoiding large gatherings)
- Population or Population Density of Infected States or Counties
- Weather (or other local factors) experienced during this period
“Flattening the Curve”
What curve are they referring to? I really didn’t know which curve the CDC or others were pointing to, so I had to draw my own conclusions from plotting the Covid-19 case data:
While I am not an epidemiologist, this seemed to suggest that the rate of transmission, given similar behavior over time, will slow, hence “flatten the curve” for transmission. To create this chart, we use a stacked area plot for the State of New York, with each county represented as a colored band.
Inhibiting new infections to reduce the number of cases at any given time — known as “flattening the curve” — allows healthcare services to better manage the same volume of patients. — Wikipedia
The chart below shows the confirmed Covid-19 cases in the State of New York by county. To be explicit, in this instance, there is a New York State, New York County, and a New York City. I’ve used the variable ‘area’ to denote a generic term for a geography, in this case a county.
We see that the largest contribution to the Covid-19 cases was from Manhattan, which is in New York County, and on June 17 there were slightly over 210,000 cumulative confirmed cases:
Let’s look at a GIF of the complete Plotly-enabled area plot, which shows cases per county over time and lets us select and deselect individual counties:
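A stacked area plot like this needs the wide table reshaped into "long" form, with one row per county per date. The sketch below shows one way to do that and hand the result to Plotly Express; the helper names `to_long` and `plot_area` and the sample numbers are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd

def to_long(wide: pd.DataFrame, state: str) -> pd.DataFrame:
    """Reshape one state's county rows from wide (one column per date) to long."""
    rows = wide[wide["Province_State"] == state]
    long_df = rows.melt(
        id_vars=["Province_State", "Admin2"],
        var_name="date",
        value_name="cases",
    ).rename(columns={"Admin2": "area"})  # 'area' is the generic geography name
    long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")
    return long_df.sort_values(["date", "area"]).reset_index(drop=True)

def plot_area(long_df: pd.DataFrame):
    """Stacked area chart with one colored band per area."""
    import plotly.express as px  # imported here so the reshaping works without Plotly
    return px.area(long_df, x="date", y="cases", color="area")

# Made-up sample data in the wide layout.
wide = pd.DataFrame({
    "Province_State": ["New York", "New York", "Florida"],
    "Admin2": ["Kings", "Queens", "Miami-Dade"],
    "6/15/20": [100, 200, 50],
    "6/16/20": [110, 220, 70],
})
long_df = to_long(wide, "New York")
print(long_df)
```

Calling `plot_area(long_df).show()` would render the interactive chart, where clicking a legend entry hides or shows that county's band.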
Decreasing Cases ‘S’ curve
While New York experienced a significant absolute number of cases, it has recently slowed its rate of growth. Other states with large early outbreaks, such as New Jersey and Illinois, also show a slowdown of Covid-19 cases over time, resulting in an ‘S’ shape for the cumulative cases:
We see a similar ‘S’ shape in Illinois, with Chicago in Cook County having the bulk of Covid-19 cases:
Notice how New Jersey has a more even distribution across its counties, but still has the ‘S’ shape representing a flattened curve:
States with Increasing Cases
While several early states show a flattening of the curve, other states show ever-increasing Covid-19 cases: Florida, Arizona, Texas and California. These are clearly not plateauing, as we can see from the continued growth and the lack of an ‘S’ curve shape.
How do we compare rates across States?
We can see which states are obviously growing and which have a flat ‘S’ curve, but how do we compare one state to another? Comparing absolute numbers of cases is not appropriate, as states have dramatically varying populations and infection start dates, as well as density and cultural differences. A common metric would allow us to compare relative growth normalized to size.
What steps should we take to detect flatness and enable comparisons?
- Week over week calculate the rate of change (‘slope’)
- Sum the slope over last ’n’ periods to determine flatness of overall curve
For this we’re going to fit a first-degree polynomial, i.e. the line y = mX + B, to the last 14 days of cases using numpy’s polyfit:
import numpy as np
y = slice_df.head(14)['cases']
lastY = y.iloc[0] # cases seen at start of period
x = np.arange(0, len(y))
polynomialDegree = 1 # degree 1 fits a straight line
res = np.polyfit(x, y, polynomialDegree) # y = res[0] * X + res[1], where res[0] = slope and res[1] = intercept
We now get normalized ‘change’ values relative to the previous week’s value:
- since100 = number of days it took for state to reach 100 cases
- change = slope fit for that set of cases
- size = cases for that week
- lastN = days relative to current date (7 day periods), e.g. -98 is 14 weeks ago (98 = 7 * 14)
So we see the following, where ‘change’ is the slope that best fits the day-over-day growth in cases over a 14-day period. The larger the slope, the faster the day-over-day growth in cases. Concretely, if the ‘change’ is 1.05, then there are 5% more cases each day within that 14-day window.
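One way to obtain a multiplicative daily factor such as 1.05 is to fit the line to the logarithm of the case counts and exponentiate the slope. This is a swapped-in technique shown only to make the "5% more cases each day" reading concrete; the function name `daily_growth_factor` is hypothetical and this is not necessarily how the notebook computes ‘change’.

```python
import numpy as np

def daily_growth_factor(cases) -> float:
    """Estimate the average day-over-day multiplier by a line fit on log(cases).

    Assumes every value in `cases` is positive, since log(0) is undefined.
    """
    y = np.asarray(cases, dtype=float)
    x = np.arange(len(y))
    slope, _intercept = np.polyfit(x, np.log(y), 1)
    return float(np.exp(slope))

# Made-up series: 1000 cases growing by exactly 5% per day for 14 days.
cases = 1000 * 1.05 ** np.arange(14)
factor = daily_growth_factor(cases)
print(round(factor, 4))
```

On real data the daily factor will not be constant, so the result is best read as an average over the window.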
Detailed Example of Numpy PolyFit
Here is a sample showing how we fit a polynomial to a set of values; we use this to fit the number of cases and estimate a slope for that series.
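A minimal sketch of such a fit, using a made-up series that grows by exactly 50 cases per day so the expected slope is easy to check by eye:

```python
import numpy as np

# Made-up cumulative cases: 100 at the start of the window, 50 new cases per day.
y = 100 + 50 * np.arange(14)
x = np.arange(len(y))

# Degree 1 asks polyfit for a straight line; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, 1)

# Evaluate the fitted line at each day to compare it with the data.
fitted = np.polyval([slope, intercept], x)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Because `np.polyfit` returns coefficients from the highest power down, the slope is the first element and the intercept the second.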
From the above we get a slope, which we capture in column ‘change’.
- Approximately 15 weeks ago (lastN = -105 = 7 * 15)
- Florida had 5473 cases
- One week later, 14 weeks ago (lastN = -98 = 7 * 14 )
- Florida had 13324 cases
- Increase of 7851 (~1100 per day)
Normalized Case Growth over Time
Using the above polynomial fitting, we can model the growth of the early states and how they have slowed that growth. While the bubble size represents the cumulative number of cases (which, as a cumulative number, will only grow or stay flat), we do see a downward trend in the fitted week-over-week change:
We can conclude the following from the above:
- New York (blue) had the most cases (largest bubble)
- New Jersey (red), neighboring New York, was closely correlated
- Illinois (green) rose then dropped in cases
Normalize by Percentage Change
Up to now we have the change in absolute cases and a slope calculation, but we cannot easily compare the number of cases between, say, a densely populated state and a sparsely populated state.
We can use the weekly percentage change to enable state to state comparisons. This allows us to compare states of varying populations.
Specifically, we use a 7-day moving percentage change to compare results across states. This section of code calculates the percentage change after 7 days:
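A minimal sketch of a 7-day percentage change, assuming a daily pandas Series of cumulative cases for one state. The function name `weekly_pct_change` and the sample values are made up for illustration.

```python
import pandas as pd

def weekly_pct_change(cases: pd.Series) -> pd.Series:
    """Percentage change of each day's value relative to the value 7 days earlier."""
    return cases.pct_change(periods=7) * 100

# Made-up series that doubles every 7 days, so every weekly change should be 100%.
dates = pd.date_range("2020-06-01", periods=15, freq="D")
cases = pd.Series([100 * 2 ** (i / 7) for i in range(15)], index=dates)

pct = weekly_pct_change(cases)
print(pct.dropna().round(1))
```

The first 7 entries are NaN because there is no value a week earlier to compare against; dropping them leaves one percentage per day that can be plotted on a common scale for every state.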
We then have this result, where we can compare across states over the last Ndays = 90 days.
For these geographically similar states, we can see the % change decreasing over the last 90 days:
The x-scale is 90 days and we see a dramatic decrease; later we can change the scale to zoom in on the right side of the plot, say the last n=30 days.
Rank States by Flattening or Growing Curve
With the number of cases, we can now rank the states by normalizing the cases and measuring the percent change over the last N days. In this case, we set N=14 to rank by the last 14 days (as a percentage):
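The ranking can be sketched as follows, assuming a table of cumulative cases with one row per state and one column per date in chronological order. The function name `rank_by_growth`, the placeholder state names and the numbers are all hypothetical.

```python
import numpy as np
import pandas as pd

def rank_by_growth(totals: pd.DataFrame, n_days: int = 14) -> pd.Series:
    """Percent change in cumulative cases over the last n_days, lowest growth first."""
    latest = totals.iloc[:, -1]
    earlier = totals.iloc[:, -1 - n_days]  # column n_days before the latest one
    rate = (latest - earlier) / earlier * 100
    return rate.sort_values().rename("rate")

# Made-up cumulative cases for three placeholder states over 15 days.
dates = pd.date_range("2020-06-17", periods=15, freq="D")
totals = pd.DataFrame(
    {
        "State A": np.linspace(1000, 1100, 15),  # +10% over 14 days
        "State B": np.linspace(200, 400, 15),    # +100% over 14 days
        "State C": np.linspace(500, 525, 15),    # +5% over 14 days
    },
    index=dates,
).T  # transpose so that rows are states and columns are dates

ranked = rank_by_growth(totals, n_days=14)
print(ranked)
```

States at the top of `ranked` have the flattest curves over the window, and those at the bottom are growing fastest relative to their own starting point.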
In the dataframe above, a numerically lower rate represents lower case growth over the past N=14 days, with New York and Connecticut showing the lowest growth over this period. We can now sort the US states to see how they compare; note that the Y-scale is in absolute cases:
We can normalize to percentage change and then compare the states that are decreasing over the last 90 days:
Similarly we can now quantify US states that are not doing as well:
We can see this via a normalized view by looking at percentage change:
During this most recent time period, we can now identify the states that have had the most change, even if the number of cases is relatively small. In this instance, Idaho (the potato state) has shown the largest increase in the last 14 days:
Over a slightly longer term, N=30, we can now identify states that have done poorly over this period:
County-level Growth Rates
We can apply a similar calculation to identify the growth rates of Covid-19 cases within a state. In Florida, for example, most counties show small rates of increase, but large counties such as Miami-Dade, Broward and Hillsborough are still growing quickly and exhibit exponential increases:
States without Lockdowns
Finally we look at the results of states that did not implement a lockdown. As far as I know these states were: Arkansas, Iowa, Nebraska, North Dakota, South Dakota, Utah and Wyoming. Do you see any obvious trend with these states?
Based on the chart it is difficult to isolate trends, as the Y scale is in absolute number of cases. Looking at the percentage change charts, we see that these states all have higher rates of transmission, in the range of 5–20% per day, over the last 30 days:
Conclusion and Take-aways
We can look at the public Covid-19 case data from Johns Hopkins to identify where Covid-19 cases are growing and shrinking. This notebook will help you visualize the rate of cases by state and county. By fitting a slope to the growth of cases, we can first roughly sort the states, then normalize the data via percentage change to compare across different states, thereby enabling a comparison between, say, New York and Florida. Furthermore, by using a similar model for county-by-county cases, we can compare across counties, even those in another state.
We could also use this data to make predictions using Markov Chain Monte Carlo (MCMC), but given the range and changing behavior of state populations with respect to Covid-19, it seems we need more data that captures state-specific behavior. For example, the current datasets do not measure social distancing, mobility of populations, or face covering compliance, making it difficult to assess the effectiveness of non-pharmaceutical mitigations. Until then, we can use this notebook to track the case statistics and make inferences about past behavior.
Next Article — Grading the Results
We can grade other municipalities by their case rate relative to their population. We can then compare across states, counties, and countries that have had varying approaches to Covid-19. This would help us identify what factors contributed to different outcomes in differing states or counties. From there we can take a similar approach to grading the impact of Covid-19 on a specific country, state or county.