On December 31, 2019, the CDC became aware of a cluster of pneumonia cases in Wuhan, China. These, as we later learned, were the first known cases of COVID-19. By January 30, 2020, the WHO had declared the COVID-19 outbreak a global health emergency. By mid-March, much of the U.S. was in lockdown, and many statewide stay-at-home orders had been issued.
As of December 2020, the pandemic is still growing, with an estimated total of more than 74 million cases worldwide. It has been a rollercoaster of a year, but hopefully, with the recent vaccine approvals, we will see this pandemic come to an end soon. Until then, people around the world are encouraged to stay at home as much as possible and maintain social distancing measures.
It has been almost a year since the pandemic started. Thanks to the hard work of data scientists, we have data for confirmed cases, deaths, and recoveries for every country around the world for each day since this pandemic started. With that data, we can do a lot of analysis and learn more about how the virus has affected different countries (and different states/provinces within those countries) and possibly even build a model using machine learning based on this data.
This tutorial will walk you through the entire data science pipeline using data related to COVID-19, and hopefully by the end of it you will have learned a thing or two.
This page is broken up into two parts. The first is a basic analysis of COVID-19 at the global scale, covering every country included in the dataset I used. The second part performs the same analysis on COVID-19 within the United States at the state level, then goes into a deeper analysis of the US, including the development of a machine learning model. Much of the code from the first part is repeated in the second part, so I will explain the repeated code more briefly to reduce redundancy.
We will start with our imports. These are all the necessary libraries that we will be using for this tutorial.
# !pip install folium
# !pip install geopandas
import pandas as pd
import folium
from folium import plugins
from folium.plugins import TimeSliderChoropleth
import branca.colormap as cm
import numpy as np
import math
from matplotlib.ticker import FormatStrFormatter
import matplotlib.ticker as mtick
import scipy
from scipy import stats
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
Next, we will import our data from a csv and take a look.
# https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
df = pd.read_csv('Data/Covid/covid_19_data.csv')
pd.set_option('display.max_rows', None)
df.head()
# Quick look at the most recent totals for each country
dftemp = df.copy()
# Collapse province/state rows into one row per country per day
dftemp = dftemp.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
# Keep only the latest (largest) cumulative count for each country
dftemp = dftemp.sort_values('Confirmed', ascending=False).drop_duplicates(['Country/Region'])
dftemp.sort_values(by=['Country/Region'], ascending=False).astype(str).head()
Let's start by making a line graph of cases over time.
We'll make a copy of the original dataframe so that we can make edits for our line graph. The original dataset breaks the cases down by province/state. For this first graph, we are only interested in the worldwide totals, so we will sum the data across all regions for each observation date.
dfl = df.copy()
# Sum the confirmed cases across all countries for each day
dfl = dfl.groupby(['ObservationDate'], as_index=False).sum()
# Remove all unnecessary data
dfl = dfl[['ObservationDate','Confirmed']]
dfl.head()
# Convert the observation date strings to a pd datetime
dfl['ObservationDate'] = dfl['ObservationDate'].apply(pd.to_datetime)
# Adjust the scale of case numbers to millions
dfl['Confirmed'] = dfl['Confirmed']/1000000
# Makes each column a different date for plotting
dfl = dfl.set_index('ObservationDate')
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,5)
ax = dfl.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Number of Cases')
ax.set_title('Total Confirmed Cases Over Time')
formatter = FormatStrFormatter('%0.0fM')
ax.yaxis.set_major_formatter(formatter)
Let's choose a few specific countries to see how their confirmed cases compare.
dftemp = df.copy()
dftemp = dftemp.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
dftemp = dftemp.sort_values('Confirmed', ascending=False).drop_duplicates(['Country/Region'])
dftemp = dftemp.sort_values(by=['Confirmed'], ascending=False).astype(str).head(7)
dftemp
dfl = df.copy()
# Select the countries we want to see
country_list = dftemp['Country/Region'].tolist()
dfl = dfl.loc[dfl['Country/Region'].isin(country_list)]
# Repeat steps from last line graph
dfl = dfl.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
# Remove all unnecessary data
df_confirmed = dfl.copy()
df_confirmed = df_confirmed[['Country/Region','ObservationDate','Confirmed']]
# Convert the observation date strings to a pd datetime
df_confirmed['ObservationDate'] = df_confirmed['ObservationDate'].apply(pd.to_datetime)
# Adjust the scale of case numbers to millions
df_confirmed['Confirmed'] = df_confirmed['Confirmed']/1000000
# Adjust dataframe for plotting
df_confirmed = df_confirmed.pivot(index='ObservationDate', columns='Country/Region', values='Confirmed')
plt.rcParams["figure.figsize"] = (10,10)
ax = df_confirmed.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Number of Cases')
ax.set_title('Total Confirmed Cases Over Time')
formatter = FormatStrFormatter('%0.0fM')
ax.yaxis.set_major_formatter(formatter)
Let's take a look at the same graph but for deaths instead of total cases.
df_deaths = dfl.copy()
df_deaths = df_deaths[['Country/Region','ObservationDate','Deaths']]
# Convert the observation date strings to a pd datetime
df_deaths['ObservationDate'] = df_deaths['ObservationDate'].apply(pd.to_datetime)
# Adjust the scale of death counts to thousands
df_deaths['Deaths'] = df_deaths['Deaths']/1000
# Adjust dataframe for plotting
df_deaths = df_deaths.pivot(index='ObservationDate', columns='Country/Region', values='Deaths')
plt.rcParams["figure.figsize"] = (10,10)
ax = df_deaths.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Number of Deaths')
ax.set_title('Total Deaths Over Time')
formatter = FormatStrFormatter('%0.0fK')
ax.yaxis.set_major_formatter(formatter)
While this graph is interesting, it doesn't tell us a whole lot about how each of these countries has handled the outbreak overall, since a country with a smaller population is far less likely to make it onto this list, even if a large percentage of its population was infected. Let's redo this analysis using cases as a percentage of each country's total population.
# Data from: https://population.un.org/wpp/Download/Standard/CSV/
df_pop = pd.read_csv('Data/WPP2019_TotalPopulationBySex.csv')
# We only want data for 2020
df_pop = df_pop[df_pop['Time'] == 2020]
# We want just one row per country
df_pop = df_pop.groupby(['Location', 'Time'], as_index=False).mean()
# Get rid of unnecessary columns
df_pop = df_pop[['Location','PopTotal']]
# Population is in thousands, let's fix that
df_pop['PopTotal'] = df_pop['PopTotal']*1000
# Rename 'Location' to 'Country/Region' so the merge works properly
df_pop = df_pop.rename(columns={'Location': 'Country/Region'})
Now we have a list of countries and population totals for the year 2020! Next, we will want to merge this with our existing dataframe. We will use a left merge on the existing dataframe (since that is the one we care about keeping) and then check whether we are missing any population data.
The code below displays only the countries whose population total is NaN, meaning they did not merge properly with our population data.
# We will not be keeping this dataframe, which is why it's named 'dftemp'
dftemp = df.copy()
# Left merge because we only care about countries that we have covid data for
dftemp = dftemp.merge(df_pop, on='Country/Region', how='left')
dftemp = dftemp[np.isnan(dftemp['PopTotal'])]
dftemp.groupby(['Country/Region'], as_index=False).mean().head()
Using this list, let's fill in as much of the missing data as we can. Unfortunately, we will not have data for every single country; however, the smaller countries with little or no data likely won't have a meaningful effect on the analysis we are doing.
df_pop = df_pop.replace({'Country/Region':'Russian Federation'},'Russia')
df_pop = df_pop.replace({'Country/Region':'United States of America'},'US')
df_pop = df_pop.replace({'Country/Region':'Iran (Islamic Republic of)'},'Iran')
df_pop = df_pop.replace({'Country/Region':'Bahamas'},'Bahamas, The')
df_pop = df_pop.replace({'Country/Region':'Azerbaijan'},'Azerbaijan')
df_pop = df_pop.replace({'Country/Region':'United Kingdom'},'UK')
df_pop = df_pop.replace({'Country/Region':'Bolivia (Plurinational State of)'},'Bolivia')
df_pop = df_pop.replace({'Country/Region':'Brunei Darussalam'},'Brunei')
df_pop = df_pop.replace({'Country/Region':'Democratic Republic of the Congo'},'Congo (Kinshasa)')
df_pop = df_pop.replace({'Country/Region':'Congo'},'Congo (Brazzaville)')
df_pop = df_pop.replace({'Country/Region':'Curaçao'},'Curacao')
df_pop = df_pop.replace({'Country/Region':'Czechia'},'Czech Republic')
df_pop = df_pop.replace({'Country/Region':'Timor-Leste'},'East Timor')
df_pop = df_pop.replace({'Country/Region':'China, Hong Kong SAR'},'Hong Kong')
df_pop = df_pop.replace({'Country/Region':'China'},'Mainland China')
df_pop = df_pop.replace({'Country/Region':'Republic of Korea'},'South Korea')
df_pop = df_pop.replace({'Country/Region':'China, Taiwan Province of China'},'Taiwan')
df_pop = df_pop.replace({'Country/Region':'United Republic of Tanzania'},'Tanzania')
df_pop = df_pop.replace({'Country/Region':'Venezuela (Bolivarian Republic of)'},'Venezuela')
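As a quick sanity check (optional, and not shown in the original output), we can repeat the earlier missing-data check using the renamed population table; the few names left over should only be the small regions mentioned above.
# Sanity check: which countries still have no population match after the renames?
dftemp = df.merge(df_pop, on='Country/Region', how='left')
print(dftemp[dftemp['PopTotal'].isna()]['Country/Region'].unique())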
Then we can run our merge code again.
df_merge = df.copy()
df_merge = df_merge.merge(df_pop, on='Country/Region', how='left')
Finally, let's add a column that is confirmed cases divided by the population total to get us a confirmed percentage for each country.
df_merge['ConfirmedPercentage'] = df_merge['Confirmed']/df_merge['PopTotal']*100
df_merge.head()
Now let's do some analysis using our new ConfirmedPercentage category.
We'll start with another line graph. Let's see whether the top 7 countries are the same when we use ConfirmedPercentage instead of Confirmed.
dftemp = df_merge.copy()
dftemp = dftemp.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
dftemp = dftemp.sort_values('ConfirmedPercentage', ascending=False).drop_duplicates(['Country/Region'])
dftemp = dftemp.sort_values(by=['ConfirmedPercentage'], ascending=False).astype(str).head(7)
dftemp
import matplotlib.ticker as mtick
dfl = df_merge.copy()
# Select the countries we want to see
country_list = dftemp['Country/Region'].tolist()
dfl = dfl.loc[dfl['Country/Region'].isin(country_list)]
# Repeat steps from last line graph
dfl = dfl.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
# Remove all unnecessary data
df_confirmed = dfl.copy()
df_confirmed = df_confirmed[['Country/Region','ObservationDate','ConfirmedPercentage']]
# Convert the observation date strings to a pd datetime
df_confirmed['ObservationDate'] = df_confirmed['ObservationDate'].apply(pd.to_datetime)
# Adjust dataframe for plotting
df_confirmed = df_confirmed.pivot(index='ObservationDate', columns='Country/Region', values='ConfirmedPercentage')
plt.rcParams["figure.figsize"] = (10,10)
ax = df_confirmed.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Percentage of Population')
ax.set_title('Confirmed Cases As Percentage of Population Over Time')
formatter = mtick.PercentFormatter()
ax.yaxis.set_major_formatter(formatter)
This change to percentage of population gives us a bit more perspective. While the US is still near the top, it shows that there are a number of other countries with worse case rates relative to their total population. Andorra has a significantly higher percentage than any other country, at around 9%.
Next, we will be making a time-based choropleth to illustrate the spread of the virus over time. There are a few different use cases and ways we can display our data using a time slider choropleth. To start, we will use it to show the overall spread of the virus from its beginning in January 2020 until early December 2020.
I used the following website as a reference to better understand TimeSliderChoropleth, since some of its parameters, such as styledict, have very little documentation: https://www.jumpingrivers.com/blog/interactive-maps-python-covid-19-spread/
We will start by simply making a copy of the dataframe so we can make modifications to it specific to the time slider choropleth.
# dfc: Data frame choropleth
dfc = df.copy()
Some countries have different labels than our geography database (which we will get to below). We must fix these labels so that they match.
dfc = dfc.replace({'Country/Region':'US'},'United States of America')
dfc = dfc.replace({'Country/Region':'Mainland China'},'China')
dfc = dfc.replace({'Country/Region':'UK'},'United Kingdom')
dfc = dfc.replace({'Country/Region':'Republic of the Congo'},'Congo (Brazzaville)')
Some countries are broken down by province/state. This means that the data is dispersed throughout these regions. Our geographic data only accounts for countries, so we will have to sum the data for the country as a whole.
dfc = dfc.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
# Convert time to proper format for TimeSliderChoropleth
dfc['ObservationDate'] = pd.to_datetime(dfc['ObservationDate']).astype(int) / 10**9
# Take the log of the data since some countries have significantly higher case counts than others
dfc['Confirmed Adjusted'] = dfc['Confirmed'].apply(lambda x: math.log10(1+x)) # Add 1 to account for zeros
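To make the effect of this transform concrete, here is a tiny illustration (the case counts below are made-up values) of how log10(1 + x) compresses a huge range of counts onto a small, comparable scale for coloring:
# Illustration only: zeros stay at 0, while counts in the millions map to single digits
for cases in [0, 9, 999, 99999, 9999999]:
    print(cases, '->', round(math.log10(1 + cases), 2))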
Next, we need to calculate the color data for the TimeSliderChoropleth. This color map evenly distributes a range of color hex codes across a dataframe column based on its values. In this case, the color map is applied to the number of confirmed cases. The specific color map used doesn't affect the data, just the colors used in the map.
max_color = max(dfc['Confirmed Adjusted'])
min_color = min(dfc['Confirmed Adjusted'])
cmap = cm.linear.YlOrRd_09.scale(min_color, max_color)
dfc['Color'] = dfc['Confirmed Adjusted'].apply(cmap)
Next, we must get the geometry data for all countries. This is done using the built-in geopandas naturalearth_lowres dataset, as shown here: https://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/TimeSliderChoropleth.ipynb
import geopandas as gpd
assert "naturalearth_lowres" in gpd.datasets.available
datapath = gpd.datasets.get_path("naturalearth_lowres")
gdf = gpd.read_file(datapath)
Next, we must create a style dictionary to store the color and opacity values for each date for each country. This is stored as a dictionary of dictionaries: each entry in the outer dictionary corresponds to a country, and its inner dictionary contains the color value for each date.
A styledict is defined as: A dictionary where the keys are the geojson feature ids and the values are dicts of {time: style_options_dict}
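For illustration, one entry of the finished styledict looks roughly like the snippet below; the feature id, timestamps, and colors here are made-up values, and the real dictionary is built in the loop that follows.
# Illustrative structure only, not real data
styledict_example = {
    '0': {  # geojson feature id, as a string
        '1579651200.0': {'color': '#ffffb2', 'opacity': 0.8},  # one inner entry per observation date
        '1579737600.0': {'color': '#fd8d3c', 'opacity': 0.8},
    },
}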
styledict = {}
for index, row in gdf.iterrows():
    cdata = dfc[dfc['Country/Region'] == row['name']]
    temp_dict = {}
    for i, r in cdata.iterrows():
        temp_dict[str(r['ObservationDate'])] = {'color': r['Color'], 'opacity': 0.8}
    styledict[str(index)] = temp_dict
Finally, we need to create the folium map. Note that I was unable to map the data properly for a couple countries. Those countries will stay white even after adjusting the slider.
map_osm = folium.Map(min_zoom=2, max_bounds=True)
g = TimeSliderChoropleth(
data=gdf.to_json(),
styledict=styledict,
).add_to(map_osm)
_ = cmap.add_to(map_osm)
map_osm
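If you want to track down which countries failed to map, one rough way (a sketch, not part of the original code) is to compare the name columns of the two dataframes directly:
# Countries in the covid data with no matching shape, and shapes with no covid data
covid_names = set(dfc['Country/Region'].unique())
geo_names = set(gdf['name'].unique())
print(sorted(covid_names - geo_names))
print(sorted(geo_names - covid_names))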
As a US citizen who is much more familiar with the political, legal, and cultural differences between areas of this country, I would love to take a look at how COVID-19 has affected different states within the US. This analysis could easily be done using state or provincial data from another country, but for this tutorial I will use the US. I also believe the US is a great country to do this on because there are vast differences in population density, climate, and culture from state to state. I would love to see whether we can find any statistically significant correlation between any of these factors and COVID-19 case rates. Let's get started!
First we will need to create a dataframe with just the data for each US state.
df_us = df.copy()
df_us = df_us[df_us['Country/Region']=='US']
Next, we will repeat the steps above to create a line graph of the states with the most cases. I won't include full explanations, since we have already covered all of this above.
dftemp = df_us.copy()
dftemp = dftemp.groupby(['Province/State', 'ObservationDate'], as_index=False).sum()
dftemp = dftemp.sort_values('Confirmed', ascending=False).drop_duplicates(['Province/State'])
dftemp = dftemp.sort_values(by=['Confirmed'], ascending=False).astype(str).head(7)
dfl = df_us.copy()
# Repeat steps from last line graph
dfl = dfl.groupby(['Province/State', 'ObservationDate'], as_index=False).sum()
# Save current dataframe for later
df_tot = dfl.copy()
# Select the states we want to see
state_list = dftemp['Province/State'].tolist()
dfl = dfl.loc[dfl['Province/State'].isin(state_list)]
# Remove all unnecessary data
df_confirmed = dfl.copy()
df_confirmed = df_confirmed[['Province/State','ObservationDate','Confirmed']]
# Convert the observation date strings to a pd datetime
df_confirmed['ObservationDate'] = df_confirmed['ObservationDate'].apply(pd.to_datetime)
# Adjust the scale of case numbers to millions
df_confirmed['Confirmed'] = df_confirmed['Confirmed']/1000000
# Adjust dataframe for plotting
df_confirmed = df_confirmed.pivot(index='ObservationDate', columns='Province/State', values='Confirmed')
plt.rcParams["figure.figsize"] = (10,10)
ax = df_confirmed.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Number of Cases')
ax.set_title('Total Confirmed Cases Over Time')
formatter = FormatStrFormatter('%0.0fM')
ax.yaxis.set_major_formatter(formatter)
Unsurprisingly, some of the US's most populous states have the most cases. While this is interesting to see graphically, I would be more interested in seeing how the different states compare relative to each other. I believe the best way to do this is to look at the proportion of people in each state who tested positive for COVID-19. To do this, we can import and merge total population data for each state and calculate the percentage of people in each state who have been infected.
# State population data (2018 estimates)
df_pop = pd.read_csv('Data/State Populations.csv')
# Rename 'State' to 'Province/State' so the merge works properly
df_pop = df_pop.rename(columns={'State': 'Province/State'})
df_pop = df_pop.rename(columns={'2018 Population': 'PopTotal'})
df_pop.head(10)
Now that we have the data organized nicely, let's merge this with our existing data and calculate the confirmed percentage.
df_merge = df_us.copy()
df_merge = df_merge.merge(df_pop, on='Province/State', how='inner')
df_merge['ConfirmedPercentage'] = df_merge['Confirmed']/df_merge['PopTotal']*100
df_merge.tail(10)
Perfect. Since we only need the total number of confirmed cases for this analysis, we will keep only the most recent row for each state (the number of confirmed cases in the table is cumulative, so the most recent row holds the total).
dftemp = df_merge.copy()
# dftemp = dftemp.groupby(['Province/State', 'ObservationDate'], as_index=False).mean()
dftemp = dftemp.sort_values('ConfirmedPercentage', ascending=False).drop_duplicates(['Province/State'])
dftemp = dftemp.sort_values(by=['ConfirmedPercentage'], ascending=False).astype(str).head(7)
Now we can run the code to create a line graph using our new confirmed percentage column and see which states have the highest percentage of covid cases.
# dfl: dataframe line (graph)
dfl = df_merge.copy()
# Select the states we want to see
country_list = dftemp['Province/State'].tolist()
# dfl = dfl.loc[dfl['Province/State'].isin(country_list)]
# Repeat steps from last line graph
dfl = dfl.groupby(['Province/State', 'ObservationDate'], as_index=False).sum()
# Remove all unnecessary data
df_confirmed = dfl.copy()
df_confirmed = df_confirmed[['Province/State','ObservationDate','ConfirmedPercentage']]
# Convert the observation date strings to a pd datetime
df_confirmed['ObservationDate'] = df_confirmed['ObservationDate'].apply(pd.to_datetime)
# Adjust dataframe for plotting
df_plot = df_confirmed.copy()
df_plot = df_plot.loc[dfl['Province/State'].isin(country_list)]
df_plot = df_plot.pivot(index='ObservationDate', columns='Province/State', values='ConfirmedPercentage')
plt.rcParams["figure.figsize"] = (10,10)
ax = df_plot.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Percentage of Population')
ax.set_title('Confirmed Cases As Percentage of Population Over Time')
formatter = mtick.PercentFormatter()
ax.yaxis.set_major_formatter(formatter)
This is much more interesting! This new graph shows the top 7 states with the highest percentages of covid cases. This is very different from our original graph which showcased the overall number of cases.
These steps are the same as last time with a slight modification for the US, so I won't be including all the explanations.
# dfc: Data frame choropleth
dfc = df.copy()
dfc = dfc.replace({'Country/Region':'US'},'United States of America')
dfc = dfc[dfc['Country/Region'] == 'United States of America']
# Convert time to proper format for TimeSliderChoropleth
dfc['ObservationDate'] = pd.to_datetime(dfc['ObservationDate']).astype(int) / 10**9
# Take the log of the data since some states have significantly higher case counts than others
dfc['Confirmed Adjusted'] = dfc['Confirmed'].apply(lambda x: math.log10(1+x)) # Add 1 to account for zeros
max_color = max(dfc['Confirmed Adjusted'])
min_color = min(dfc['Confirmed Adjusted'])
cmap = cm.linear.YlOrRd_09.scale(min_color, max_color)
dfc['Color'] = dfc['Confirmed Adjusted'].apply(cmap)
state_path = 'Data/Country_Geo/us-states.json'
state_data = gpd.read_file(state_path)
styledict = {}
for index, row in state_data.iterrows():
    cdata = dfc[dfc['Province/State'] == row['name']]
    temp_dict = {}
    for i, r in cdata.iterrows():
        temp_dict[str(r['ObservationDate'])] = {'color': r['Color'], 'opacity': 0.8}
    styledict[str(index)] = temp_dict
map_osm = folium.Map(location=[48, -102], zoom_start=3, min_zoom=2, max_bounds=True)
g = TimeSliderChoropleth(
data=state_data.to_json(),
styledict=styledict,
).add_to(map_osm)
_ = cmap.add_to(map_osm)
map_osm
For the last part of this tutorial, let's develop a machine learning model to see if we can use a variety of different factors to accurately predict the number of covid cases in each state. For this, we will build a multiple linear regression model with statsmodels.
In order to create an effective model, we must first gather some more data unique to each state that we believe might have an impact on its rates of COVID-19. There are a number of different datasets that could prove useful for this model. I researched and selected the datasets that I believed would be the most useful; however, this could certainly be expanded on.
The COVID-19 pandemic has caused some significant and rapid changes in the way we live our lives in order to stay as safe as possible. Along with this have come some polarizing opinions about the best way to handle the pandemic. Some are in favor of increasing restrictions on businesses and individuals to protect the more vulnerable members of society and reduce the load on our hospitals. Others argue that these restrictions starve small businesses and lower-income individuals who need to continue working in order to survive. These differences of opinion have resulted in the politicization of pandemic policy and have also influenced individual behavior with regard to mask use and respect for social distancing.
Because of this, I believe that political ideology could have an impact on covid rates from state to state. I found a dataset that scores each state based on the average political views of both its citizens and its politicians. After running an analysis using each of these scores, I found the score for citizen political ideology to be more useful, so I will be using that number in this analysis.
# .dta file meant for the Stata program; luckily pandas can read it
df_partisan = pd.io.stata.read_stata('Data/stateideology_v2018.dta')
# Let's get an average from the past 5 years
years = [2013,2014,2015,2016,2017]
df_partisan = df_partisan[df_partisan['year'].isin(years)]
# For some reason, 2014 and 2015 are missing the state names
# This code fills in that missing data
df_partisan = df_partisan.replace(r'^\s*$', np.nan, regex=True)
df_partisan = df_partisan.fillna(method='ffill')
# Take the mean across these five years
df_partisan = df_partisan.groupby(['statename'], as_index=False).mean()
df_partisan.sort_values('citi6016', ascending=True)
# Let's also make sure the columns align
df_partisan = df_partisan.rename(columns={'statename': 'Province/State'})
# And let's rename the column to something more meaningful
df_partisan = df_partisan.rename(columns={'citi6016': 'CitizenIdeology'})
df_partisan = df_partisan.rename(columns={'inst6017_nom': 'GovernmentIdeology'})
# df confirmed partisan
df_cp = df_confirmed.copy()
df_cp = df_cp.sort_values('ConfirmedPercentage', ascending=False).drop_duplicates(['Province/State'])
df_cp = df_cp.merge(df_partisan, on='Province/State', how='inner')
df_cp.sort_values(by=['ConfirmedPercentage'], ascending=False).astype(str).head(10)
df_cp.tail()
x = df_cp['ConfirmedPercentage']
y = df_cp['CitizenIdeology']
ax = df_cp.plot(x='ConfirmedPercentage',y='CitizenIdeology',kind='scatter',figsize=(10,10))
# Add labels to each point
df_cp[['ConfirmedPercentage','CitizenIdeology','Province/State']].apply(lambda x: ax.text(*x),axis=1)
# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.xlabel('Confirmed Percentage')
plt.ylabel('<-Conservative Liberal->')
plt.title('Ideology vs Confirmed Percentage')
# Display r squared value
import scipy
from scipy import stats
slope, intercept, r, p, std_err = scipy.stats.linregress(x, y)
print("R Squared: "+str(r**2))
While this correlation isn't perfect, it is much stronger than I was expecting. An R-squared value of 0.32 on real-world data that is affected by many different factors is notable for our analysis. There certainly are some outliers, but overall the data follows a noticeable trend between political ideology and percentage of confirmed cases. I believe this will be a useful feature for our model.
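If you want to go one step beyond the R-squared value, linregress also returns a p-value for the slope, which is a quick way to check whether the trend is statistically significant. A small sketch, reusing the x and y defined above:
# Report both the strength (R squared) and the significance (p-value) of the linear fit
result = scipy.stats.linregress(x, y)
print("R Squared: " + str(result.rvalue**2))
print("p-value: " + str(result.pvalue))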
Next, let's take a look at population density to see if we can find any correlation between that and confirmed percentage of cases.
# Data from: https://www.census.gov/data/tables/2010/dec/density-data-text.html
df_density = pd.read_csv('Data/pop_density.csv')
# df_density = df_density[['STATE_OR_REGION','2010_POPULATION','2010_DENSITY']]
df_density['2010_DENSITY'] = df_density['2010_DENSITY'].str.replace(',','')
df_density['2010_DENSITY'] = pd.to_numeric(df_density['2010_DENSITY'],errors='coerce')
df_density.head()
# df confirmed partisan density
df_cpd = df_cp.copy()
df_density = df_density.rename(columns={'STATE_OR_REGION': 'Province/State'})
df_density = df_density[['Province/State','2010_DENSITY']]
df_cpd = df_cpd.merge(df_density, on='Province/State', how='inner')
x = df_cpd['ConfirmedPercentage']
y = df_cpd['2010_DENSITY']
ax = df_cpd.plot(x='ConfirmedPercentage',y='2010_DENSITY',kind='scatter',figsize=(10,10))
# Add labels to each point
df_cpd[['ConfirmedPercentage','2010_DENSITY','Province/State']].apply(lambda x: ax.text(*x),axis=1)
# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.xlabel('Confirmed Percentage')
plt.ylabel('Density (People per square mile)')
plt.title('Density vs Confirmed Percentage')
# Display r squared value
import scipy
from scipy import stats
slope, intercept, r, p, std_err = scipy.stats.linregress(x, y)
print("R Squared: "+str(r**2))
# df_density.head(100)
Unfortunately, this R-squared value is significantly lower than the one for political ideology, and the outliers seem to be skewing our trendline as well. This is surprising, because you might expect more densely populated states to have higher case rates, since there are more people per unit area.
Next, we will import some data from census.gov containing population estimates broken down by age, sex, and race for each state. For this next graph, we will be looking at the average age of citizens in each state to see if we can find a correlation between average age and covid percentage rates.
# Data from: https://www.census.gov/data/datasets/time-series/demo/popest/2010s-state-detail.html
# df_asr: df age sex race
df_asr = pd.read_csv('Data/sc-est2019-alldata6.csv')
# Remove columns we don't need
df_asr = df_asr[['NAME','SEX','ORIGIN','RACE','AGE','POPESTIMATE2019']]
df_age = df_asr.copy()
# SEX == 0 and ORIGIN == 0 are "total" rows in the census file; drop them to avoid double counting
df_age = df_age[df_age.SEX != 0]
df_age = df_age[df_age.ORIGIN != 0]
df_age = df_age[['NAME','AGE','POPESTIMATE2019']]
df_age = df_age.groupby(['NAME', 'AGE'], as_index=False).sum()
df_age['weighted'] = df_age['AGE'] * df_age['POPESTIMATE2019']
df_age = df_age.groupby(['NAME'], as_index=False).sum()
df_age['AverageAge'] = df_age['weighted'] / df_age['POPESTIMATE2019']
df_age = df_age[['NAME','AverageAge']]
df_age.head(10)
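To be explicit about what the code above computes, AverageAge is a population-weighted mean of the single-year ages. As a quick cross-check (a sketch; it assumes df_asr is still in memory and uses Maryland as an arbitrary example state):
# Weighted mean age for one state, which should match its AverageAge value above
md = df_asr[(df_asr['NAME'] == 'Maryland') & (df_asr['SEX'] != 0) & (df_asr['ORIGIN'] != 0)]
print(np.average(md['AGE'], weights=md['POPESTIMATE2019']))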
df_cpa = df_cpd.copy()
df_age = df_age.rename(columns={'NAME': 'Province/State'})
df_cpa = df_cpa.merge(df_age, on='Province/State', how='inner')
df_cpa.head()
x = df_cpa['ConfirmedPercentage']
y = df_cpa['AverageAge']
ax = df_cpa.plot(x='ConfirmedPercentage',y='AverageAge',kind='scatter',figsize=(10,10))
# Add labels to each point
df_cpa[['ConfirmedPercentage','AverageAge','Province/State']].apply(lambda x: ax.text(*x),axis=1)
# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.xlabel('Confirmed Percentage')
plt.ylabel('Average Age')
plt.title('Average Age vs Confirmed Percentage')
# Display r squared value
import scipy
from scipy import stats
slope, intercept, r, p, std_err = scipy.stats.linregress(x, y)
print("R Squared: "+str(r**2))
# df_cpa.head(100)
With an R-squared value of 0.26, we are getting a correlation closer to what we had with political ideology. There seems to be a relationship between a younger average age and a higher percentage of covid cases. This would make some sense, since younger people are less severely affected by covid and may be more likely to expose themselves to more people. While we can't necessarily draw any conclusions about why this correlation exists, there does appear to be some relationship between average age and confirmed case percentages.
While I don't expect to see any correlation with this next set of data, let's run the analysis since we already have the data imported into our table. This next graph compares the 'average' sex of each state (a population-weighted mean of the census codes, where male is 1 and female is 2) with its covid percentage rates.
df_sex = df_asr.copy()
# Keep only the male/female rows and drop the ORIGIN "total" rows, as we did for age
df_sex = df_sex[df_sex.SEX > 0]
df_sex = df_sex[df_sex.ORIGIN != 0]
# df_sex = df_sex[['NAME','SEX','AGE','POPESTIMATE2019']]
df_sex = df_sex.groupby(['NAME', 'SEX'], as_index=False).sum()
df_sex['weighted'] = df_sex['SEX'] * df_sex['POPESTIMATE2019']
df_sex = df_sex.groupby(['NAME'], as_index=False).sum()
df_sex['AverageSex'] = df_sex['weighted'] / df_sex['POPESTIMATE2019']
df_sex = df_sex[['NAME','AverageSex']]
df_sex.head()
df_cpas = df_cpa.copy()
df_sex = df_sex.rename(columns={'NAME': 'Province/State'})
df_cpas = df_cpas.merge(df_sex, on='Province/State', how='inner')
df_cpas.head()
x = df_cpas['ConfirmedPercentage']
y = df_cpas['AverageSex']
ax = df_cpas.plot(x='ConfirmedPercentage',y='AverageSex',kind='scatter',figsize=(10,10))
# Add labels to each point
df_cpas[['ConfirmedPercentage','AverageSex','Province/State']].apply(lambda x: ax.text(*x),axis=1)
# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.xlabel('Confirmed Percentage')
plt.ylabel('Average Sex (Male-1 Female-2)')
plt.title('Sex vs Confirmed Percentage')
# Display r squared value
import scipy
from scipy import stats
slope, intercept, r, p, std_err = scipy.stats.linregress(x, y)
print("R Squared: "+str(r**2))
Unsurprisingly, there doesn't seem to be a significant correlation between sex and covid rates, although our R-squared value of 0.14 is slightly higher than the one for density. If any correlation exists, it would seem to imply that states with a higher proportion of men have higher percentages of covid cases.
Now let's take a look at average temperature and covid rates per state. I expect to see some correlation here, since warmer or colder weather affects how much time people spend indoors or outdoors.
# Data from: https://corgis-edu.github.io/corgis/csv/weather/
df_weather = pd.read_csv('Data/weather.csv')
# Fix column labels and remove unnecessary data
df_weather['Station.State'] = df_weather['Station.State'].replace(['DE'],'Delaware')
df_weather['Station.State'] = df_weather['Station.State'].replace(['VA'],'Virginia')
df_weather = df_weather[df_weather['Station.State'] != 'Puerto Rico']
df_weather = df_weather.groupby(['Station.State'], as_index=False).mean()
df_weather = df_weather[['Station.State','Data.Temperature.Avg Temp']]
df_weather.head()
df_cpast = df_cpas.copy()
df_weather = df_weather.rename(columns={'Station.State': 'Province/State'})
df_weather = df_weather.rename(columns={'Data.Temperature.Avg Temp': 'AverageTemp'})
df_cpast = df_cpast.merge(df_weather, on='Province/State', how='inner')
df_cpast.head()
x = df_cpast['ConfirmedPercentage']
y = df_cpast['AverageTemp']
ax = df_cpast.plot(x='ConfirmedPercentage',y='AverageTemp',kind='scatter',figsize=(10,10))
# Add labels to each point
df_cpast[['ConfirmedPercentage','AverageTemp','Province/State']].apply(lambda x: ax.text(*x),axis=1)
# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.xlabel('Confirmed Percentage')
plt.ylabel('Average Temperature (F)')
plt.title('Temperature vs Confirmed Percentage')
# Display r squared value
slope, intercept, r, p, std_err = scipy.stats.linregress(x, y)
print("R Squared: "+str(r**2))
With an R-squared of 0.03, average temperature has the weakest correlation of all the data we have looked at so far. This surprised me the most, since there seems to be no correlation at all. As I explained before running this analysis, I had thought that temperature's effect on how much time people spend indoors or outdoors would influence covid rates, but that does not seem to be the case.
Finally, let's actually run the machine learning model with our variables. We will use statsmodels to run our multiple linear regression, fitting the model with some of the data we examined above.
Let's start by running the model with just one variable to see how well it fits our data.
X = df_cpa['CitizenIdeology']
y = df_cpa['ConfirmedPercentage']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Now let's run it again, this time using multiple variables, to see if we can improve our fit. Our last R-squared was 0.32.
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.formula.api as smf
x = df_cpast[['CitizenIdeology','AverageAge','AverageTemp']]
y = df_cpast['ConfirmedPercentage']
model = smf.ols(formula='ConfirmedPercentage ~ CitizenIdeology + AverageAge + AverageTemp', data=df_cpast).fit()
plt.rcParams["figure.figsize"] = (12,10)
model.summary()
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.formula.api as smf
# df_cpast = df_cpast.rename()
df_cpast = df_cpast.rename(columns={'2010_DENSITY': 'Density'})
x = df_cpast[['CitizenIdeology','AverageAge','AverageTemp','Density','AverageSex']]
y = df_cpast['ConfirmedPercentage']
model = smf.ols(formula='ConfirmedPercentage ~ CitizenIdeology + AverageAge + AverageTemp + AverageSex + Density', data=df_cpast).fit()
plt.rcParams["figure.figsize"] = (12,10)
model.summary()
pred = model.predict(x)
plt.scatter(y,pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
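As a final sanity check (a sketch, not part of the original analysis), we can pull the fit statistics directly off the fitted statsmodels result instead of reading them from the summary table, and compute a simple in-sample error for the predictions:
# R-squared, adjusted R-squared, and mean absolute error of the in-sample predictions
print("R Squared: " + str(model.rsquared))
print("Adjusted R Squared: " + str(model.rsquared_adj))
print("Mean Absolute Error: " + str(np.mean(np.abs(y - pred))))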
The COVID-19 pandemic has had a massive impact on the world and has caused some significant changes in the way we all live our lives. As unfortunate of an event as this is, it is great to see a wealth of publicly available data on a global scale relating to the pandemic.
I found the two most interesting parts of this tutorial to be the choropleth map and the machine learning process. Being able to see the progression of the virus using the slider on the map was fascinating, and so was the process of finding other datasets to combine with the covid data for the machine learning model. I was especially intrigued by the correlation we saw between political ideology and percentage rates of COVID-19 in each state.
Utilizing the libraries and tools freely available within Python, we can manipulate and analyze this data to learn some really interesting things about how this virus has affected different countries and the states/provinces within them. It will be interesting to see how the results of this analysis change over time as more data becomes available, especially as more and more vaccines roll out. Additionally, while this tutorial covers a lot of the analysis you can do with this dataset, there are certainly many ways in which you could expand on it to learn even more about COVID-19's effects on the world around us. Combining this data with other publicly available datasets can lead to an even more thorough and detailed understanding of this pandemic and its effects.