CMSC320 Final Project: How Politics Impacts COVID

Joy Wang, Stephanie Wang, Lucinda Zhou


Throughout 2020 and 2021, there has been great variance in how different states have handled the pandemic.

Access to tests, vaccine compliance, mask mandates, and other aspects all vary greatly from state to state. However, these measures seem to correlate somewhat with the party affiliation of each state. For example, in September of 2021, vaccination rates in counties that voted for Biden in the 2020 election were on average 12% higher than in counties that voted for Trump. In August of 2021, many Democratic-led states like California and New York had issued statewide mask requirements for schools, while Republican-led states like Texas and Florida instead pushed for policies that would prevent schools from requiring masks. Officials in the latter states argued that mask requirements would violate people's civil rights.

However, there have been studies on the effectiveness of masks and vaccines in preventing the transmission of COVID. As such, we want to see whether the differences between the two parties, which lead to differences in policy, compliance, and general attitude towards the pandemic, have any effect on the prevalence of the virus. In other words, we will look for any correlation between the political leaning of each state and its number of positive infections.

Our null hypothesis is that there is no relation between state political leanings and how well they dealt with COVID, while our alternative hypothesis is that blue states dealt better with COVID than red states. We will mainly be considering the rate of positive infections as a measure of how well states combatted COVID, but will also consider other statistics in our dataset, such as deaths and hospitalizations.

Data Curation and Parsing

For our final project, we decided to analyze the dataset from the COVID Tracking Project, which is one of the open datasets from Microsoft. We used the CSV version of the dataset and read it into a pandas dataframe with pandas read_csv().
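As a minimal sketch of this loading step, the snippet below reads a tiny in-memory CSV instead of the real file. The rows are made-up placeholder values, the state codes AA and BB are placeholders, and the dataframe name covid_data is our assumption for the main table.

```python
import io
import pandas as pd

# Stand-in for the downloaded CSV; the real file has many more rows and columns.
# All numbers here are made up for illustration.
sample_csv = io.StringIO(
    "date,state,positive,death\n"
    "2020-04-01,AA,100,2\n"
    "2020-04-02,AA,130,3\n"
    "2020-04-01,BB,20,0\n"
)

# With the real file, the first argument would be the path to the CSV.
covid_data = pd.read_csv(sample_csv, parse_dates=["date"])
print(covid_data.head())
```

Parsing the date column up front is optional, but it makes later filtering by time period easier.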

While reading Microsoft's documentation for this dataset, we noticed that some columns were described as "Deprecated", so we did some preliminary data cleaning by dropping those columns from the dataframe using the drop() function. Since we are analyzing how states handled COVID based on political leanings, we will not be considering territories or DC, so we will remove those from our main dataframe.
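The cleaning described above might look roughly like the following. The column name old_metric is a hypothetical stand-in for a deprecated column, and the list of non-state codes is our assumption about how territories and DC appear in the 'state' column.

```python
import pandas as pd

# Toy data with made-up numbers.
covid_data = pd.DataFrame({
    "state": ["MD", "DC", "PR", "VT"],
    "positive": [100, 50, 30, 10],
    "old_metric": [1, 2, 3, 4],  # hypothetical deprecated column
})

deprecated_cols = ["old_metric"]                   # hypothetical list
non_states = ["DC", "PR", "GU", "VI", "AS", "MP"]  # assumed codes for DC and territories

# Drop the deprecated columns, then keep only rows for the 50 states.
covid_data = covid_data.drop(columns=deprecated_cols)
covid_data = covid_data[~covid_data["state"].isin(non_states)].reset_index(drop=True)
print(covid_data)
```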

To properly compare states, we should consider per capita rates, as differently sized states will have different numbers just on the basis of population size. As such, we need to get the population data for each state. We obtained the Excel file with state populations from the US Census Bureau, so we can be fairly sure of its accuracy.

Positive Infections and Party Affiliations

Grouping by state

We will be using the table cleaned up earlier, which has information about the number of positive infections, hospitalizations, deaths, and recoveries. Since we are looking at the overall performance of each state, we only want the total number of infections across the entire time period. As such, we group the data by state and take the max value from each column, since the columns are cumulative. We then display this new table to see if there are any initial issues with it.

Cleaning the Dataframe

Since we will only be utilizing the 'state', 'positive', 'hospitalized_cumulative', 'recovered', and 'death' columns mentioned above to establish a relationship between states, we will create a new dataframe called lrdata which only contains the necessary columns for our linear regression models. We will drop the rest of the columns, and the new dataframe should look like this.

Since our data analysis is focused on comparing separate states on these different characteristics, we will group the dataframe by the states in the 'state' column. Also, because the values in each of these columns represent cumulative numbers, it makes no sense to take the average of their values. Instead, we will take the max of these values, and assign that as the representative value for each state in each column. Thus, we will get the following dataframe.
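A small sketch of the column selection and group-by-max step, using made-up daily rows for two placeholder states (AA and BB):

```python
import numpy as np
import pandas as pd

# Made-up cumulative daily rows for two placeholder states.
covid_data = pd.DataFrame({
    "state": ["AA", "AA", "BB", "BB"],
    "positive": [10, 25, 3, 7],
    "hospitalized_cumulative": [1, 4, np.nan, np.nan],
    "recovered": [0, 5, 1, 2],
    "death": [0, 1, 0, 0],
    "negative": [50, 90, 20, 40],  # a column we do not need for lrdata
})

cols = ["state", "positive", "hospitalized_cumulative", "recovered", "death"]

# Because the columns are cumulative, the max per state is the latest total.
lrdata = covid_data[cols].groupby("state").max()
print(lrdata)
```

A state that never reported a column (BB's hospitalizations here) stays NaN after the max, which is why missing values show up later.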

Now that we have a dataframe that has all the values we'd like to use and is sorted by state, we need to standardize it by dividing the values for each state by its population size in order to make our observations useful. In order to do this, we will add the population sizes from the populations dataframe as a column in the lrdata dataframe. We will then divide all of the other columns by the values in this column. After that we will have this dataframe, where all values are standardized.
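One way to do this per capita division is sketched below. The column name 'population' and the layout of the populations dataframe are assumptions, and the numbers are made up.

```python
import pandas as pd

lrdata = pd.DataFrame(
    {"positive": [200, 50], "death": [4, 1]},
    index=pd.Index(["AA", "BB"], name="state"),
)
# Assumed layout of the Census-derived table: one row per state.
populations = pd.DataFrame({"state": ["BB", "AA"], "population": [500, 1000]})

# Assigning a Series aligns on the state index, so row order does not matter.
lrdata["population"] = populations.set_index("state")["population"]

# Divide every statistic column by the state's population.
stat_cols = ["positive", "death"]
lrdata[stat_cols] = lrdata[stat_cols].div(lrdata["population"], axis=0)
print(lrdata)
```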

There are some missing data values for some states in the hospitalized_cumulative and recovered columns. However, we are only plotting the positive values in this section and that column has data for all 50 states, so we are good to go.

2020 Election Dataset

We will now need data on the political leanings of each state. Since the pandemic occurred during the 2020 election, data from this election will be the most relevant and interesting to analyze. We found a Kaggle dataset that contains the number of votes broken down by county.

We can group this data by state to use in our analysis. The data only shows the raw number of votes for each party by default, but vote percentages are what we want to analyze. As such, for each state, we will take the number of votes for the Democratic and Republican parties and divide each by the total number of votes cast. This will act as an indicator of the political leanings of the state.
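The vote share calculation could be sketched as follows. The column names ('county', 'party', 'total_votes'), the party codes, and the output names dem_pct and rep_pct are hypothetical, since the layout of the Kaggle file is not shown here; the vote counts are made up.

```python
import pandas as pd

# Made-up county-level rows for two placeholder states.
votes = pd.DataFrame({
    "state": ["AA", "AA", "AA", "BB", "BB"],
    "county": ["x", "x", "x", "z", "z"],
    "party": ["DEM", "REP", "OTH", "DEM", "REP"],
    "total_votes": [60, 30, 10, 20, 80],
})

# Sum county votes into one row per state and one column per party.
by_party = votes.pivot_table(
    index="state", columns="party", values="total_votes",
    aggfunc="sum", fill_value=0,
)

# The denominator is all votes cast, including other parties.
total = by_party.sum(axis=1)
vote_share = pd.DataFrame({
    "dem_pct": by_party["DEM"] / total * 100,
    "rep_pct": by_party["REP"] / total * 100,
})
print(vote_share)
```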

Merging the tables and categorizing

Now that we have the data for the percentage of votes in each state that were for the democratic and republican parties in a column, we can merge this column into the covid_data from earlier that has the data on infections.

We can now categorize each state based on the percentage of its votes that went to the Democratic and Republican parties. If a state had more votes cast for the Republican party, it is considered red. If it had more votes cast for the Democratic party, it is considered blue.
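A sketch of the merge and the red/blue labeling, with made-up values. The column names dem_pct, rep_pct, and party are our own illustrative choices.

```python
import numpy as np
import pandas as pd

covid_data = pd.DataFrame({
    "state": ["AA", "BB", "CC"],
    "positive": [0.07, 0.04, 0.10],  # made-up per capita rates
})
vote_share = pd.DataFrame({
    "state": ["AA", "BB", "CC"],
    "dem_pct": [60.0, 55.0, 30.0],   # made-up percentages
    "rep_pct": [38.0, 43.0, 68.0],
})

# Inner merge keeps only states present in both tables.
merged = covid_data.merge(vote_share, on="state", how="inner")

# Red if Republican share is strictly larger; otherwise blue
# (an exact tie would fall into blue under this rule).
merged["party"] = np.where(merged["rep_pct"] > merged["dem_pct"], "red", "blue")
print(merged)
```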

Comparing red and blue states using t-test

We have now categorized the states as either red or blue. Since we want to see if there is a difference in COVID infections between states based on their political alignment, we can use an unpaired two-sample t-test to compare the positive rates between the blue and red groups of states. In our case, the null hypothesis is that there is no difference in positive rates between red and blue states. Our alternative hypothesis is that the positive infection rates differ between blue and red states. We can use scipy to perform this test.
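A minimal sketch of the test with scipy.stats.ttest_ind, run on made-up rates. We assume the default equal-variance form here, which may differ from what the notebook ran.

```python
import pandas as pd
from scipy import stats

# Made-up per capita positive rates for eight placeholder states.
merged = pd.DataFrame({
    "party": ["red"] * 4 + ["blue"] * 4,
    "positive": [0.10, 0.11, 0.12, 0.09, 0.05, 0.06, 0.07, 0.04],
})

red = merged.loc[merged["party"] == "red", "positive"]
blue = merged.loc[merged["party"] == "blue", "positive"]

# Unpaired two-sample t-test; a positive t means the red mean is higher.
t_stat, p_value = stats.ttest_ind(red, blue)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Passing equal_var=False would run Welch's t-test instead, which does not assume the two groups have equal variances.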

The t-test had a p-value of 0.0004 and a t-score of 3.8. The p-value is less than 0.05, the significance level corresponding to 95% confidence, so the difference is statistically significant and we can reject the null hypothesis. As such, there is most likely a difference in positive rates between blue and red states.

Linear regression

However, while we know that there is a difference, we do not yet know the direction of the relationship. We can determine the direction and shape of the relationship by plotting the percentage of Republican votes and the positive rates of states on a scatterplot.

Looking at the scatterplot, there does seem to be a positive linear correlation present. We can then use sklearn to run a linear regression and statsmodels to get a p-value and see if there truly is one.

Based on these results, it seems that there exists a positive relationship between the percentage of Republican votes in the 2020 election and the positive cases in each state. For every increase of 1% in the Republican votes of a state, the model predicts a 0.0013 increase in the percent of positive cases. The p-value for the coefficient is approximately 0, which is less than the significance level of 0.05 used for a typical 95% confidence level. However, the p-value for the constant is 0.163, greater than the significance level of 0.05. This means that there is enough evidence to suggest that there is a relationship between the two variables (the slope is nonzero), but there is not enough evidence to suggest that the constant term differs from zero.

Overall, this means that the more Republican a state is, the more positive infections it tends to have. Going back to our original question, blue states do seem to have fewer infections overall than red states.

Linear Regression Modeling for COVID Statistics

In order to make any observations regarding states' responses to COVID, we need to establish a relationship between the data in the dataframe as measures of how well a state handled COVID. We have chosen the total number of people who have tested positive for COVID-19 so far as our independent variable for these models, and we will be examining three different dependent variables. One is the total number of people who have gone to the hospital for COVID-19 so far, including those who have since recovered or died. Another is the total number of people who have recovered from COVID-19 so far. Finally, we have the total number of people who have died as a result of COVID-19 so far. Based on context, these values should be dependent on positive cases, and how well the state is handling positive cases. We will construct a linear regression model which depicts the relationship between positive cases and each of these dependent variables. Depending on how well each state is handling their COVID cases, we will be able to see where they lie in comparison to other states based on the linear regression line.

Graphing the Data and Lin Reg Models

Using the cleaned up dataframe lrdata, we can now create linear regression models which will establish a relationship between the data and different states. We will be creating three graphs with linear regression models for three different relationships. The method used for each model is the same. First, we will drop rows with NaN values in the columns that we are graphing, since those values will interfere with modeling. Next, we will graph all of the points as a scatter plot, labeling each point with the state it represents. Then we will use numpy polyfit() to create a linear regression model for this relationship, and graph it as well. We will then use this depiction of the data to make observations for the states regarding each relationship.
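The fitting part of this method can be sketched as below, with plotting left out. The data and the placeholder state codes are made up, and the residual and above_line names are ours. A positive residual corresponds to a point above the regression line.

```python
import numpy as np
import pandas as pd

# Made-up per capita values; state EE is missing hospitalization data.
lrdata = pd.DataFrame(
    {
        "positive": [0.02, 0.04, 0.06, 0.08, 0.05],
        "hospitalized_cumulative": [0.002, 0.004, 0.009, 0.008, np.nan],
    },
    index=pd.Index(["AA", "BB", "CC", "DD", "EE"], name="state"),
)

# Drop rows with NaN in either column being modeled.
pair = lrdata[["positive", "hospitalized_cumulative"]].dropna()

# Degree-1 polyfit returns (slope, intercept) of the least-squares line.
slope, intercept = np.polyfit(pair["positive"], pair["hospitalized_cumulative"], deg=1)

# Residual > 0: more hospitalizations than the line expects for that state.
expected = slope * pair["positive"] + intercept
residual = pair["hospitalized_cumulative"] - expected
above_line = residual[residual > 0].index.tolist()
print("above the line:", above_line)
```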

Positive Cases vs. Cumulative Hospitalized

Above, we see a depiction of the relationship between positive cases and the number of people who are hospitalized. Intuitively, a higher number of positive cases suggests more people who are hospitalized, and the linear regression line supports this. The states with their point above this line have more hospitalizations than the lin reg model expects at that number of positive cases, while states with their point below this line have fewer hospitalizations than the lin reg model expects at that number of positive cases. It is ideal for a state to be below the line, since that suggests they are doing well at mitigating hospitalization among positive cases in comparison to other states. Some states which are notable for this in the plot include VT, NH, PA, and IA. On the other hand, AL, whose point lies far above the line, may not be handling hospitalization rates as well.

Positive Cases vs. Cumulative Recovered

Plotted above, we see a depiction of the relationship between positive cases and the number of people who have recovered. Intuitively, a higher number of positive cases suggests that more people would have recovered, and the linear regression line supports this. The states with their point above this line have more people recovered than the lin reg model expects at that number of positive cases, while states with their point below this line have fewer people recovered than the lin reg model expects at that number of positive cases. It is ideal for a state to be above the line, since that suggests they are doing well at encouraging recovery among positive cases in comparison to other states. Some states which are notable for this in the plot include VT, NH, and IA. On the other hand, CT and KY, whose points lie far below the line, may not be supporting and speeding up recovery as well.

Positive Cases vs. Cumulative Death

Above, we see a depiction of the relationship between positive cases and the number of people who have died. Intuitively, a higher number of positive cases suggests there are more people who have died, and the linear regression line supports this. The states with their point above this line have more deaths than the lin reg model expects at that number of positive cases, while states with their point below this line have fewer deaths than the lin reg model expects at that number of positive cases. It is ideal for a state to be below the line, since that suggests they are doing well at mitigating the number of deaths among positive cases in comparison to other states. Some states which are notable for this in the plot include AK, CA, and UT. On the other hand, NJ, MA, and CT, whose points lie far above the line, may not be as successful at lowering death rates among positive COVID cases.

General Observations of Lin Reg Models

Now that we have compared the states using three different linear regression models, we can see whether any general observations emerge. First, we look across the three plots for states that do consistently well in all of the models. It is noticeable that VT, NH, IA, and PA did well in terms of keeping hospitalization rates low and supporting high recovery rates, and that AK and CA did well in terms of keeping death rates low. On the other hand, some states did poorly at managing these aspects of COVID, such as AL, CT, and KY. It is interesting to note that, based on our earlier categorization of states as red or blue, a majority of the states that handled COVID well are blue states (4/6), and the states that did not do so well were mostly red states (2/3). These observations are consistent with our hypothesis that blue states handled COVID better than red states in the U.S.

Using K-Means Clustering to Group States

Since we are looking to see if there's a difference in how states compare to each other in terms of COVID data based on political leaning, as further analysis, we can also try using k-means clustering on our data to see if states are clustered into distinct groups based on their COVID statistics. Particularly, we want to see if we get two distinct groups that match the political leanings of red vs blue.

Data Cleaning

We can start out by simply dropping irrelevant columns. Since the majority of our columns, and in particular important columns such as positive and negative, are cumulative, we'll look at cumulative values for each state.

We can see that many of these columns have NaN values. K-means clustering cannot be done with NaN values, and there is no accurate way for us to determine what the missing values may be. As such, we'll simply remove all columns with NaN values. The extra benefit of this is that k-means clustering becomes inaccurate with too many dimensions, so we would have needed to remove columns or do dimension reduction regardless.
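Removing every column that contains a NaN can be done with dropna along the column axis. A small sketch with made-up values follows; the names state_totals and cluster_data are ours.

```python
import numpy as np
import pandas as pd

# Made-up cumulative totals; 'recovered' is missing for one state.
state_totals = pd.DataFrame(
    {
        "positive": [200, 50, 120],
        "death": [4, 1, 3],
        "total_test_results": [3000, 900, 1500],
        "recovered": [150, np.nan, 90],
    },
    index=pd.Index(["AA", "BB", "CC"], name="state"),
)

# Drop any column that has at least one NaN value.
cluster_data = state_totals.dropna(axis="columns")
print(cluster_data.columns.tolist())
```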

As previously stated, we want to perform our data analysis on per capita rates rather than raw numbers, so that results are not inaccurately biased by differing state populations. Since most of our data is in 2020, we'll use the April 1, 2020 population estimates.

We can see from the first few data points that the different variables seem to be on different scales. We'll standardize our data points to prevent inaccuracies due to unit differences.
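One common way to standardize is the z-score: subtract each column's mean and divide by its standard deviation. The sketch below assumes that is what is meant here and uses made-up per capita values.

```python
import pandas as pd

# Made-up per capita rates on very different scales.
cluster_data = pd.DataFrame(
    {
        "positive": [0.02, 0.04, 0.06],
        "death": [0.001, 0.002, 0.003],
        "total_test_results": [0.5, 1.0, 1.5],
    },
    index=pd.Index(["AA", "BB", "CC"], name="state"),
)

# z-score each column: mean 0 and (population) standard deviation 1.
scaled = (cluster_data - cluster_data.mean()) / cluster_data.std(ddof=0)
print(scaled)
```

sklearn's StandardScaler performs the same transformation and would be a reasonable alternative.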

Clustering with Three Dimensions

Since our data has three dimensions, which is relatively low, we can go ahead and try to run k-means clustering on it. First, to get a sense of what our data looks like, we can make a 3D scatterplot of the points.

From just looking at the graph, we don't seem to have two distinct groups. Still, we can do clustering on this and replot to see how the different groups fall. Since we're looking to find a difference between red and blue states, we'll use two clusters. We use k-means++ to make sure our initial clusters are selected well for accurate results.
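A minimal sketch of two-cluster k-means with k-means++ initialization in sklearn. The points are made up and deliberately well separated; n_init and random_state are our choices for reproducibility, not values taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up standardized points (positive, death, testing) in two obvious groups.
X = np.array([
    [-1.0, -1.1, -0.9],
    [-0.9, -1.0, -1.0],
    [-1.1, -0.9, -1.1],
    [1.0, 1.1, 0.9],
    [0.9, 1.0, 1.0],
    [1.1, 0.9, 1.1],
])

# init="k-means++" spreads out the initial centroids before iterating.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
```

Note that the cluster numbers themselves are arbitrary: rerunning on different data can swap which group is labeled 0 and which is labeled 1.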

From our graph, we can see that Cluster 1 has many more states than Cluster 2. Cluster 1 seems to have overall above average rates of deaths and a comparatively larger rate of testing. Cluster 2 has far fewer states and has clearly below average rates of positive cases and comparatively lower rates of testing. To gain more clarity on how closely our groupings match political leanings, we can see what fraction of each cluster corresponds to which political party.
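The party breakdown of each cluster can be computed with a cross-tabulation. This sketch uses made-up assignments, and the column names 'cluster' and 'party' are illustrative.

```python
import pandas as pd

# Made-up cluster and party labels for six placeholder states.
states = pd.DataFrame({
    "cluster": [1, 1, 1, 1, 2, 2],
    "party": ["red", "red", "blue", "red", "blue", "blue"],
})

# Raw counts of red and blue states in each cluster.
counts = pd.crosstab(states["cluster"], states["party"])

# normalize="index" turns each row into fractions that sum to 1.
fractions = pd.crosstab(states["cluster"], states["party"], normalize="index")
print(counts)
print(fractions)
```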

Cluster 1, the larger group, seems somewhat closely split between red and blue states, although the red states do outnumber blue. Cluster 2 is majority blue states, with only one red state (Alaska). If we consider how red vs blue states fall between the two groups, we note that while a significant portion of blue states fall into Cluster 2, all but one red state fall into Cluster 1. As Cluster 1 seems to be the grouping of states performing worse overall (higher rates of deaths, positive cases, and testing compared to Cluster 2), this matches up to some extent with our previous findings. However, it is important to note that the majority of blue states still also ended up in Cluster 1, which indicates that most blue states still do not see a significant improvement in terms of these three variables as compared to red states.

Clustering with Two Dimensions

Our previous clustering included three variables: positive rates, death rates, and testing rates. The testing rate variable still seems to be a bit of an outlier. Although one may be able to connect testing rates to close contacts and risk of exposure, people's reasons for testing still vary from test to test, and a test doesn't necessarily mean a positive case. Compared to the other two variables, testing rates don't indicate as clearly how successful or unsuccessful a state is in terms of how they deal with COVID. Furthermore, most studies regarding COVID rates in red versus blue states do not consider total test numbers. For example, a BMC Public Health article considering the effects of politics on COVID talks about vaccination, positive, and death rates, but not rates of testing. As such, it may be more accurate to remove the total_test_results column and only focus on positive rates and death rates for our k-means clustering.

Similar to our previous scatterplot, we don't see two clear groupings in how states are spread out. We'll run the same k-means clustering algorithm to see how states are grouped when only considering death rates and positive test rates.

Although not exactly the same, the grouping seems to be similar to our previous 3D clustering. Cluster 1 and Cluster 2 seem to be switched from before. Here, Cluster 1 is the smaller grouping, and we see clearly that it has below average rates of both positive cases and deaths. Cluster 2 is the larger grouping, and overall seems to have above average rates of positive cases and deaths. Like last time, we'll take a look at the fraction of blue to red states in each cluster to see how our clusters match up with political leanings.

This grouping is almost exactly the same as before if we switch Cluster 1 and Cluster 2. We note that compared to the previous smaller cluster, our current smaller cluster (Cluster 1) has two more states: MD and VA (all other states are the same). As before, our smaller cluster has a strong majority of blue states, with only one red state (AK). Our larger cluster, Cluster 2, has all the other red states, as well as a large number of blue states. Again, this corresponds to how in general, blue states seem to be doing better, as the smaller group which is doing notably better is comprised of mainly blue states. However, as before, we have many blue states in the larger group as well. The clustering indicates that in terms of their positive and death rates, the 17 blue states in Cluster 2 are closer to the majority of red states than they are to the other cluster. To see if there is any relation between party and performance within the cluster, we can try replotting the scatterplot, distinguishing both cluster and party. We leave the state labels off for this plot so that the color patterns are easier to see.

In Cluster 1, the blue states seem to follow a general line, while the red state is an outlier (comparatively higher positive rates but very low death rates). In Cluster 2, red and blue states seem to be fairly evenly mixed.

Because Cluster 2 contains the majority of states and has a good mix of both red and blue states, we might consider Cluster 2 as a representation of how most states, and the US in general, are doing with COVID. Cluster 1 is the section of states that are doing comparatively better, and the fact that the majority of these states are blue may help to explain why we found a correlation between party and positive rates. Although we don't have a clear grouping by political leaning when we use k-means clustering, our findings still support the idea that blue states as a group are performing better than red states in terms of COVID response.

Conclusion

The purpose of our project is to analyze a table of COVID statistics from each state by parsing, modifying, and graphing this data. We then hypothesize how the actions and policies of different states may affect specific COVID statistics, such as how many positive cases a state accumulates over time, and how many of those positive cases turn into hospitalizations, recoveries, or deaths.

Although it is difficult to conclude exactly what causes the difference in positive infections between blue and red states, it is plausible that some of the policy differences mentioned earlier (mask mandates, vaccine requirements, general attitude surrounding compliance with the policies, etc.) were a big factor in why such a difference exists. Many of these health issues have turned partisan, and this appears to be having an effect on citizens. Other factors, like lack of access to pharmacies and other health facilities or population density, could also play a role in these results.