In [22]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns # for visualization

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler       # scaling data
from sklearn.neighbors import KNeighborsRegressor    # regressor

from sklearn.model_selection import GridSearchCV     # for grid search
from sklearn.pipeline import make_pipeline           # for making pipelines
from sklearn.preprocessing import LabelEncoder

# Assess the accuracy of predictions
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from IPython.display import HTML
import scipy
from datetime import datetime
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
import sklearn as sk
from sklearn import ensemble
from sklearn import metrics
from sklearn.neural_network import MLPRegressor
import matplotlib.patches as mpatches
import statsmodels.formula.api as smf                 # linear modeling

import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None
%matplotlib inline

shooting_and_sales_per_pop = pd.read_csv('./data/shootings_and_sales_per_pop.csv')
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Show raw code."></form>''')
Out[22]:

Analysis of U.S. Gun Sales

Within the last decade, gun violence and gun control have become serious issues, with opinions growing polarized and political over which solution is the safest and most practical for a modern America. Many people believe that the United States is in an arms race against itself and that the need for citizens to carry firearms is obsolete. Others believe it is natural to want to protect yourself and your home with a firearm.

With the wide spectrum of opinion that has grown in America, we wanted to pose a question near the root of the problem: Can gun sales be predicted by mass shootings?

Data Sources

We gathered data from a wide variety of sources in our research. The first source we used was the Gun Violence Archive, which provided key data on our primary variable of interest: gun violence. The GVA was founded in 2013 to provide public data on mass shootings. This data provided the baseline for the time series in our models.

The second source we used was the National Instant Criminal Background Check System. From this source we derived the total number of gun sales, our dependent variable.

Finally, we used population data from the U.S. Census Bureau to normalize different features at the state level. Unfortunately, this data is yearly rather than monthly, so it is less precise. We felt the yearly counts still represented the population accurately enough, so we applied each year's count to all of its months.

Data Cleaning and Feature Creation

The data preparation was a three-part process. The first step was to concatenate all of the mass shooting data, released by year, into one dataframe. This was convenient since all of our shooting data came from the same source. The second step focused on gun sales in the US. This data was also packaged nicely; the only work needed was to split the month and year apart and force them into integers for merging, and to total the counts for the US as a whole. Then, state population was added to the sales dataframe to later create features relative to population.
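As a rough sketch of these first two steps (the shooting file names and the 'YYYY-MM' date format are hypothetical placeholders, not the actual file layout):

import glob
import pandas as pd

# Concatenate the yearly mass shooting CSVs into one dataframe.
shootings = pd.concat(
    (pd.read_csv(f) for f in glob.glob('./data/mass_shootings_*.csv')),
    ignore_index=True)

# Split the sales data's date column (assumed 'YYYY-MM') into integer
# year and month columns for merging and totaling.
sales = pd.read_csv('./data/nics_background_checks.csv')
parts = sales['month'].str.split('-', expand=True).astype(int)
sales['year'], sales['month'] = parts[0], parts[1]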

The final step in creating the dataframe was to merge the shooting data with the gun sale data. Once that was complete, we were ready for feature creation. The first feature we created was the number of people affected, the sum of the number of people injured and the number of people killed. Then, to quantify large incidents, we created a boolean marking whether more than 10 people were affected in an incident. The next two features were the average numbers of deaths and injuries per incident in each month. Finally, we normalized the values by the population of the state in which they occurred. A condensed sketch of these steps follows, along with the full list of resulting columns.
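Continuing the sketch above, the feature creation looks roughly like the following (the groupby keys, pandas named aggregation, and the per-100,000 scale are assumptions based on the column list below and the axis labels used later):

# Incident-level features: total people affected, and a flag for
# incidents affecting more than 10 people.
shootings['affected'] = shootings['injured'] + shootings['killed']
shootings['affected_over_10'] = (shootings['affected'] > 10).astype(int)

# Collapse incidents to one row per state, year, and month.
monthly = shootings.groupby(['state', 'year', 'month']).agg(
    killed=('killed', 'sum'), killed_max=('killed', 'max'),
    injured=('injured', 'sum'), injured_max=('injured', 'max'),
    affected_max=('affected', 'max'),
    affected_over_10=('affected_over_10', 'sum'),
    incidents=('killed', 'size'),
).reset_index()

# Per-incident monthly averages.
monthly['killed_ave'] = monthly['killed'] / monthly['incidents']
monthly['injured_ave'] = monthly['injured'] / monthly['incidents']

# Merge with the sales/population data; state-months with no incidents
# get zeros. Then normalize per capita (per 100,000 residents).
merged = monthly.merge(sales, on=['state', 'year', 'month'], how='right').fillna(0)
for col in ['handgun', 'long_gun', 'multiple', 'other', 'totals',
            'killed', 'injured', 'incidents']:
    merged[col + '_pop'] = merged[col] / merged['population'] * 100000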

  • state, year, month: These are our indexing columns, data is sorted by the month and year for each state.
  • population: The total yearly population of the state applied over all the months of the year.
  • handgun: The number of background checks conducted for handguns.
  • long_gun: The number of background checks conducted for long guns.
  • multiple: The number of background checks conducted for multiple guns.
  • other: The number of background checks conducted for other guns.
  • total: The total number of background checks conducted in a particular state, year and month.
  • killed: The number of people killed each month in mass shooting incidents in each state.
  • killed_max: The highest number of people killed in a single incident in a month.
  • injured: The number of people injured each month in mass shooting incidents in each state.
  • incidents: The number of mass shooting incidents in a month in a state.
  • injured_max: The highest number of people injured in a single incident in a month.
  • affected_max: The highest number of people affected in a single incident in a month.
  • affected_over_10: The number of incidents in a month that affected more than 10 people.
  • killed_ave: The average number of people killed per incident in each month.
  • injured_ave: The average number of people injured per incident in each month.
  • handgun_pop: Handgun count normalized per capita for each state in a given year.
  • long_gun_pop: Long gun count normalized per capita for each state in a given year.
  • multiple_pop: Multiple-gun count normalized per capita for each state in a given year.
  • other_pop: Other gun count normalized per capita for each state in a given year.
  • permit_pop: Number of permits rechecked normalized per capita for each state in a given year.
  • totals_pop: Total number of permits normalized per capita for each state in a given year.
  • killed_pop: Total number of people killed normalized per capita for each state in a given year.
  • injured_pop: Total number of people injured normalized per capita for each state in a given year.
  • incidents_pop: Total number of mass shooting incidents normalized per capita for each state in a given year.

Exploration of Data

Our final dataframe had 3,120 rows and 30 columns, containing monthly gun sale, population, and mass shooting information for each state. To better understand this large dataset we looked for visual relationships between our potential predictor variables and gun sales. The following is a set of insights we gained from our research.

Seasonality within both gun sales and mass shootings:

In [23]:

# creating Dataframes
gunsale_population_by_shooting = pd.read_csv('./data/shootings_and_sales_per_pop.csv')
united_states_totals = gunsale_population_by_shooting[gunsale_population_by_shooting.state ==  "United States"]

# First graphs size, rows, title and color set.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,6))
fig.suptitle('Seasonality of background checks and mass shootings from 2013 - 2017')
palette = sns.color_palette("bright", 5)

# Plotting each
sns.lineplot(x="month", y="totals",
   hue="year", palette = palette,
   data=united_states_totals, ax=ax1)
ax1.set(xlabel='Month', ylabel='Total Firearm Background Checks')

sns.lineplot(x="month", y="incidents",
            hue="year", palette = palette,
            data=united_states_totals, ax=ax2)
ax2.set(xlabel='Month', ylabel='Number of Mass shooting incidents')

plt.show()

The two series have nearly opposite seasons: gun sales peak toward December, while mass shooting incidents rise toward the end of the summer. One important characteristic to notice is that specific events are visible in the background check graph; for example, June 2016 is an outlier, coinciding with the Orlando massacre.

How do a state's number of mass shootings and its number of background checks relate?

In [24]:
# Remove the country totals
states_only = gunsale_population_by_shooting[gunsale_population_by_shooting.state != "United States"]

# Aggregates for different years.
states_2013_totals = states_only[states_only["year"] == 2013].groupby(["state","year"]).sum()
states_2017_totals = states_only[states_only["year"] == 2017].groupby(["state","year"]).sum()

# First graphs size, rows, title and color set.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,6))
fig.suptitle('Number of Mass Shooting incidents VS Total Firearm Background Checks by State 2013 & 2017')
palette = sns.color_palette("bright", 5)


# Plotting
sns.regplot(x="totals", y="incidents", data=states_2013_totals, ax=ax1)
ax1.set(xlabel='Total Firearm Background Checks', ylabel='Number of Mass Shooting Incidents')

# Adds labels to rows with state name that I preselected.
for line in range(0,states_2013_totals.shape[0]):
    if states_2013_totals.reset_index().state[line] in ["California", "Texas", "Illinois", "Kentucky"]:
        ax1.text(states_2013_totals.totals[line]+0.2, states_2013_totals.incidents[line],
                 states_2013_totals.reset_index().state[line], horizontalalignment='left', size='medium',
                 color='black', weight='semibold')


sns.regplot(x="totals", y="incidents", data=states_2017_totals, ax=ax2)
ax2.set(xlabel='Total Firearm Background Checks', ylabel='Number of Mass Shooting Incidents')

# Adds labels to rows with state name that I preselected.
for line in range(0,states_2017_totals.shape[0]):
    if states_2017_totals.reset_index().state[line] in ["California", "Texas", "Illinois", "Kentucky", "Florida", "Ohio"]:
        ax2.text(states_2017_totals.totals[line]+0.2, states_2017_totals.incidents[line],
                 states_2017_totals.reset_index().state[line], horizontalalignment='left', 
                 size='medium', color='black', weight='semibold')

plt.show()

While the regression plot may suggest some correlation, many states have no mass shootings at all, which skews the fit. The state itself appears to have a large influence on the number of mass shootings, as Illinois and California illustrate.

What the background check totals are made up of:

In [25]:
# Used for the time series.
dates = pd.DataFrame({"month": united_states_totals.month,
                      "year": united_states_totals.year,
                      "day": 1
                     })

# Create a separate dataframe for the stacked plot
just_gun_sales = pd.DataFrame({"date": pd.to_datetime(dates),
                              "long gun checks": united_states_totals.long_gun,
                               "handgun checks": united_states_totals.handgun,
                               "multi-gun checks": united_states_totals.multiple,
                               "permit checks": united_states_totals.permit,
                               "other": united_states_totals.other
                              })

stacked_check_plot = just_gun_sales.plot.area(x="date", title="Monthly Background Checks")
stacked_check_plot.set(xlabel="Date", ylabel="Number of Background checks in Month")
stacked_check_plot
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a28fb1e48>

Multi-gun background checks are rare compared with the other categories, which helps our model's accuracy since we are using the checks to estimate the number of guns sold. The seasonality of the background checks is visible again, as are spikes from specific events such as the 2016 Orlando massacre.

How does the population of a state influence the number of background checks:

In [26]:
sns.set(style="ticks")
#first_month = states_only[(states_only.year == 2013) & (states_only.month == 1)]
last_month = states_only[(states_only.year == 2017) & (states_only.month == 12)]
pop_totals_graph = sns.jointplot(last_month.population, last_month.totals, kind="hex", height=10, color="#4CB391")
pop_totals_graph.set_axis_labels('State Population in Millions', 'Number of Background checks in Month', fontsize=16).fig.suptitle('December 2017')
pop_totals_graph
Out[26]:
<seaborn.axisgrid.JointGrid at 0x1a28e99860>

We wanted to understand whether it was reasonable to normalize the number of background checks by the state's population. Both distributions appear right-skewed, but the relationship between the two variables seems clear from the graph. There are outliers such as Kentucky, but overall we concluded it was reasonable to normalize by state population.

Feature Selection

In [27]:
overall = pd.read_csv('./data/shootings_and_sales_per_pop.csv')
us_df = overall[overall.state=='United States']
df = overall[overall.state!='United States']
training = ['year','population','month',
            'killed','state',
            'killed_max','injured','incidents',
            'injured_max','affected_max','affected_over_10',
            'killed_ave','injured_ave',
            'killed_pop','injured_pop','incidents_pop']
testing = ['totals_pop']

le = sk.preprocessing.LabelEncoder()
le.fit(df.state)
df.state = le.transform(df.state)
us_df.state = 0

df.totals_pop = df.totals_pop.astype(int)
us_df.totals_pop = us_df.totals_pop.astype(int)


train = sk.preprocessing.scale(df[training])
us_train = sk.preprocessing.scale(us_df[training])
In [28]:
# Split into 80% training and 20% test sets
X_train, X_test, y_train, y_test  = sk.model_selection.train_test_split(train,df[testing],train_size=.8,random_state=10)

# Create the RFE object and compute a cross-validated score,
# ranking feature subsets by negative mean squared error.
svr = sk.svm.SVR(kernel='linear')
# NOTE: StratifiedKFold expects class labels; plain KFold is the
# usual choice for a continuous target like ours.
rfecv = RFECV(estimator=svr, step=1, cv=StratifiedKFold(10),
              scoring='neg_mean_squared_error')
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
rfecv.fit(X_train, y_train)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.title('Recursive Feature Elimination with Cross Validation')
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation(5) score (neg_mean_squared_error)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
Optimal number of features : 10

Here are our final features

In [29]:
# rfecv.support_ is a boolean mask over the candidate features,
# True where a feature survived elimination (ranking == 1).
features = [t for t, r in zip(training, rfecv.support_) if r]
features
Out[29]:
['year',
 'population',
 'month',
 'state',
 'injured',
 'incidents',
 'injured_ave',
 'killed_pop',
 'injured_pop',
 'incidents_pop']

Models

Statistical Model

As for the statistical modeling, we first wanted to see which features correlate with each other and, most importantly, to identify which features correlate with the number of gun sales, i.e. total gun sales (_totals_pop_), for each state, year, and month.

In [30]:
shooting_and_sales_per_pop = shooting_and_sales_per_pop[shooting_and_sales_per_pop.state != 'United States']
# Compute pairwise correlations between all numeric features
corr = shooting_and_sales_per_pop.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(11,9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a25585c88>

Total firearm background checks per capita each month in each state

Next we chose linear regression for our statistical analysis, to predict gun sales for each state from a combination of features such as:

formula_1 = totals_pop ~ year + population + month + state + injured + incidents + injured_ave + killed_pop + injured_pop + incidents_pop.

We chose linear regression because our data contains continuous numerical values that are easy to test against an independent variable like time (year, month).
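Because state is a string column, the formula interface (via patsy) treatment-codes it into one dummy variable per state, with the first level absorbed into the intercept as the reference; that is why the summary below reports coefficients like state[T.Alaska]. A minimal illustration on toy data:

import pandas as pd
import statsmodels.formula.api as smf

# Toy frame: patsy expands the categorical 'state' into dummies.
toy = pd.DataFrame({
    'totals_pop': [120.0, 95.0, 130.0, 88.0, 112.0, 91.0],
    'year': [2013, 2013, 2014, 2014, 2015, 2015],
    'state': ['Alabama', 'Alaska'] * 3,
})
toy_model = smf.ols('totals_pop ~ year + state', data=toy).fit()

# 'Alabama' (the first level) is the reference; 'state[T.Alaska]' is
# the estimated offset relative to it.
print(toy_model.params)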

In [32]:
full_model =  smf.ols(formula='totals_pop ~ year + population + month + state + injured + incidents + injured_ave + killed_pop + injured_pop + incidents_pop', data=shooting_and_sales_per_pop).fit()
full_model.summary()
Out[32]:
OLS Regression Results
Dep. Variable: totals_pop R-squared: 0.785
Model: OLS Adj. R-squared: 0.781
Method: Least Squares F-statistic: 185.9
Date: Wed, 05 Dec 2018 Prob (F-statistic): 0.00
Time: 19:40:40 Log-Likelihood: -22735.
No. Observations: 3060 AIC: 4.559e+04
Df Residuals: 3000 BIC: 4.595e+04
Df Model: 59
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -7.902e+04 1.19e+04 -6.656 0.000 -1.02e+05 -5.57e+04
state[T.Alaska] -468.9021 256.980 -1.825 0.068 -972.778 34.974
state[T.Arizona] -400.8092 139.260 -2.878 0.004 -673.864 -127.754
state[T.Arkansas] -512.6301 134.777 -3.804 0.000 -776.895 -248.365
state[T.California] 2742.9990 2038.618 1.346 0.179 -1254.232 6740.230
state[T.Colorado] -239.5561 83.051 -2.884 0.004 -402.399 -76.713
state[T.Connecticut] -528.9792 106.263 -4.978 0.000 -737.334 -320.624
state[T.Delaware] -1000.1604 245.171 -4.079 0.000 -1480.880 -519.441
state[T.District of Columbia] -1442.0224 261.484 -5.515 0.000 -1954.728 -929.317
state[T.Florida] 996.4571 923.785 1.079 0.281 -814.858 2807.772
state[T.Georgia] -51.5707 328.200 -0.157 0.875 -695.091 591.950
state[T.Hawaii] -1292.3417 218.326 -5.919 0.000 -1720.426 -864.257
state[T.Idaho] -602.0182 205.165 -2.934 0.003 -1004.296 -199.741
state[T.Illinois] 732.9636 482.744 1.518 0.129 -213.580 1679.507
state[T.Indiana] 295.5659 129.433 2.284 0.022 41.779 549.352
state[T.Iowa] -787.4543 128.002 -6.152 0.000 -1038.434 -536.474
state[T.Kansas] -672.5187 138.483 -4.856 0.000 -944.050 -400.988
state[T.Kentucky] 4790.1987 79.370 60.353 0.000 4634.574 4945.823
state[T.Louisiana] -421.4438 76.333 -5.521 0.000 -571.115 -271.773
state[T.Maine] -785.8199 223.448 -3.517 0.000 -1223.947 -347.692
state[T.Maryland] -692.5737 101.665 -6.812 0.000 -891.914 -493.233
state[T.Massachusetts] -594.1349 138.219 -4.299 0.000 -865.149 -323.121
state[T.Michigan] -105.1696 312.288 -0.337 0.736 -717.490 507.151
state[T.Minnesota] -107.8436 84.429 -1.277 0.202 -273.388 57.701
state[T.Mississippi] -536.3152 134.371 -3.991 0.000 -799.785 -272.846
state[T.Missouri] -135.8663 104.844 -1.296 0.195 -341.440 69.708
state[T.Montana] -372.0032 240.415 -1.547 0.122 -843.399 99.392
state[T.Nebraska] -980.8369 192.016 -5.108 0.000 -1357.334 -604.340
state[T.Nevada] -847.2086 139.463 -6.075 0.000 -1120.660 -573.757
state[T.New Hampshire] -511.5946 223.371 -2.290 0.022 -949.570 -73.619
state[T.New Jersey] -527.4796 256.754 -2.054 0.040 -1030.912 -24.047
state[T.New Mexico] -701.4599 181.527 -3.864 0.000 -1057.390 -345.530
state[T.New York] 611.8156 895.904 0.683 0.495 -1144.833 2368.464
state[T.North Carolina] 51.6002 319.645 0.161 0.872 -575.145 678.345
state[T.North Dakota] -633.3233 256.630 -2.468 0.014 -1136.512 -130.135
state[T.Ohio] 164.6444 410.482 0.401 0.688 -640.210 969.499
state[T.Oklahoma] -368.8373 94.347 -3.909 0.000 -553.828 -183.847
state[T.Oregon] -519.2400 90.022 -5.768 0.000 -695.752 -342.728
state[T.Pennsylvania] 427.3682 480.052 0.890 0.373 -513.897 1368.633
state[T.Rhode Island] -1227.2478 238.866 -5.138 0.000 -1695.606 -758.890
state[T.South Carolina] -434.3286 75.265 -5.771 0.000 -581.905 -286.752
state[T.South Dakota] -489.8120 250.353 -1.956 0.051 -980.692 1.068
state[T.Tennessee] -33.7937 128.440 -0.263 0.792 -285.633 218.045
state[T.Texas] 1700.8488 1350.018 1.260 0.208 -946.205 4347.903
state[T.Utah] -439.9549 134.084 -3.281 0.001 -702.860 -177.049
state[T.Vermont] -983.0153 263.534 -3.730 0.000 -1499.740 -466.290
state[T.Virginia] -201.7917 222.926 -0.905 0.365 -638.895 235.311
state[T.Washington] -147.1962 157.845 -0.933 0.351 -456.692 162.300
state[T.West Virginia] -265.8508 195.178 -1.362 0.173 -648.546 116.845
state[T.Wisconsin] -301.3672 93.051 -3.239 0.001 -483.818 -118.916
state[T.Wyoming] -568.1382 265.915 -2.137 0.033 -1089.533 -46.744
year 39.9643 5.957 6.709 0.000 28.285 51.644
population -9.947e-05 5.98e-05 -1.664 0.096 -0.000 1.77e-05
month 2.7826 2.160 1.288 0.198 -1.453 7.018
injured 3.1093 5.941 0.523 0.601 -8.539 14.757
incidents -19.0183 25.166 -0.756 0.450 -68.362 30.325
injured_ave -1.0871 6.647 -0.164 0.870 -14.120 11.946
killed_pop -137.6268 184.631 -0.745 0.456 -499.643 224.389
injured_pop -48.6501 185.239 -0.263 0.793 -411.858 314.557
incidents_pop 297.6218 905.673 0.329 0.742 -1478.181 2073.425
Omnibus: 2603.561 Durbin-Watson: 0.680
Prob(Omnibus): 0.000 Jarque-Bera (JB): 941060.931
Skew: -3.074 Prob(JB): 0.00
Kurtosis: 88.692 Cond. No. 1.53e+10


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.53e+10. This might indicate that there are
strong multicollinearity or other numerical problems.
In [57]:
shooting_and_sales_per_pop['mult_pred']=full_model.predict()
shooting_and_sales_per_pop.plot('totals_pop', 'mult_pred', kind='scatter')
# Red identity line: points on it would be perfect predictions
plt.plot(shooting_and_sales_per_pop.totals_pop, shooting_and_sales_per_pop.totals_pop, color='red')
plt.show()
In [56]:
full_model_2 =  smf.ols(formula='totals_pop ~ year + population + month + state', data=shooting_and_sales_per_pop).fit()
full_model_2.summary()

formula_2 = totals_pop ~ year + population + month + state.

In [58]:
shooting_and_sales_per_pop['mult_pred_2']=full_model_2.predict()
shooting_and_sales_per_pop.plot('totals_pop', 'mult_pred_2', kind='scatter')
plt.plot(shooting_and_sales_per_pop.totals_pop, shooting_and_sales_per_pop.totals_pop, color='red')
plt.show()
In [59]:
# First graphs size, rows, title and color set.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(17,6))
fig.suptitle('A combination of factors that contributed to number of background checks & mass shootings from 2013 - 2017')
palette = sns.color_palette("bright", 5)

# Plotting each
sns.lineplot(x="month", y="mult_pred",
   hue="year", palette = palette,
   data=shooting_and_sales_per_pop, ax=ax1)
ax1.set(xlabel='Month', ylabel='Total Firearm Background Checks Predicted')

sns.lineplot(x="month", y="incidents_pop",
            hue="year", palette = palette,
            data=shooting_and_sales_per_pop, ax=ax2)
ax2.set(xlabel='Month', ylabel='Mass shooting incidents per capita')

sns.lineplot(x="month", y="killed_pop",
            hue="year", palette = palette,
            data=shooting_and_sales_per_pop, ax=ax3)
ax3.set(xlabel='Month', ylabel='People killed in mass shootings per capita')

plt.show()

Machine Learning Models

In [36]:
train = sk.preprocessing.scale(df[features])
us_train = sk.preprocessing.scale(us_df[features])
In [37]:
rf = ensemble.RandomForestRegressor()
gb = ensemble.GradientBoostingRegressor()
svr = sk.svm.SVR()
knn = sk.neighbors.KNeighborsRegressor()
mlp = sk.neural_network.MLPRegressor(max_iter=2000,solver='lbfgs')
train = sk.preprocessing.scale(df[features])

X_train, X_test, y_train, y_test  = sk.model_selection.train_test_split(train,df[testing],train_size=.8,random_state=10)


y_train = y_train.values.ravel()
y_test = y_test.values.ravel()


modelnames = ['random_forest','gradient_booster','support_vector_regression','knn','multi_layer_perceptron']

df_results = pd.DataFrame(data = {'model':modelnames,'mse':0,'mse_err':0})


models =[rf,gb,svr,knn,mlp]
i = 0

num_iters = 10

for model in models:
    
    mse = []
    
    for k in range(num_iters):
        model.fit(X_train,y_train)
        y_pred =  model.predict(X_test)
        mse.append(metrics.mean_squared_error(y_test,y_pred))
        
    
    
    for mname,mdata in zip(['mse'],[mse]):
        
        err_name = "{}_err".format(mname)
        ave = np.mean(mdata)
        df_results.loc[i,mname] = ave
        df_results.loc[i,err_name] = np.std(mdata) / np.sqrt(len(mdata))
    
    i+=1
    
#df_results contains the models and their mean squared error and error of that values.
In [60]:
# Color palette for the model comparison bars
colors = ['#008DAB','#FCE700','#E81D64','#4A4A4A','#FF7F00']

plt.rcParams["figure.figsize"] = (20,15)

bigfont = 20
smallfont = 14
model_names = ['random_forest','gradient_booster','support_vector_regression','knn','multilayer_perceptron']
score_names = ['mse']

patches = []
for model_name,color in zip(model_names,colors):
    patches.append(mpatches.Patch(color=color, label=model_name))



plt.bar(modelnames,df_results.loc[:,'mse'],yerr=df_results.loc[:,'mse_err'],color=colors)
#plt.xticks(np.arange(4),model_names,rotation=90,fontsize=smallfont)

plt.title('Initial Model Comparison',fontsize=bigfont,position=(0.5,1.05))
plt.ylabel('MSE')
plt.xlabel('Model')
plt.legend(handles=patches,fontsize=bigfont)
#plt.savefig('initial_model_compare.png',dpi=500,format='png')
Out[60]:
<matplotlib.legend.Legend at 0x1a2def81d0>
In [61]:
df_results
Out[61]:
model mse mse_err
0 random_forest 54836.071670 1.464443e+03
1 gradient_booster 88677.397057 3.955438e+01
2 support_vector_regression 493640.136304 0.000000e+00
3 knn 502383.548562 1.840688e-11
4 multi_layer_perceptron 251134.539102 5.070308e+04

Model of Choice

Based on the initial model comparison above, we decided to go forward with the random forest regressor for hyperparameter tuning. Full code and the grid search process for tuning are in elias_feature.ipynb.

A quick overview of the random forest regressor: it is an ensemble of decision trees that vote on the resulting value for the given inputs. Each tree is trained on a different sample of the data and features, which helps prevent overfitting and lets each variable's importance grow according to how essential it is to predicting the correct result.
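A deliberately simplified sketch of that idea on synthetic data (scikit-learn's actual RandomForestRegressor re-samples candidate features at every split rather than once per tree):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(10)
X = rng.rand(200, 5)                      # synthetic features
y = 3 * X[:, 0] + 0.1 * rng.randn(200)    # synthetic target

trees, subsets = [], []
for _ in range(10):
    rows = rng.choice(len(X), size=len(X), replace=True)  # bootstrap sample
    cols = rng.choice(X.shape[1], size=3, replace=False)  # feature subset
    trees.append(DecisionTreeRegressor(max_depth=12).fit(X[rows][:, cols], y[rows]))
    subsets.append(cols)

# For regression, the ensemble's "vote" is the mean of the tree predictions.
y_hat = np.mean([t.predict(X[:, c]) for t, c in zip(trees, subsets)], axis=0)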

Interestingly, the tuned values for minimum samples per leaf and minimum samples per split are close to the defaults. This makes sense given that our data is not nearly extensive enough to justify large minimum leaf and split sizes for each decision tree. Max depth and n_estimators also reflect the small dataset, with limited depth preventing overfitting and a small number of estimators sufficing to reach a conclusive decision.
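The tuning itself lives in elias_feature.ipynb; a sketch of what such a grid search looks like (this particular grid is an assumption, centered on the values used below):

from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

# Hypothetical grid around the reported parameters.
param_grid = {
    'max_depth': [6, 12, 18, None],
    'max_features': [4, 6, 8],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [10, 50, 100],
}
search = GridSearchCV(ensemble.RandomForestRegressor(random_state=10),
                      param_grid, scoring='neg_mean_squared_error', cv=5)
search.fit(X_train, y_train)
print(search.best_params_)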

In [62]:
rf = ensemble.RandomForestRegressor(random_state=10,
                                    criterion='mse',
                                    max_depth=12,
                                   max_features=8,
                                   min_samples_leaf=1,
                                   min_samples_split=2,
                                   n_estimators=10)
rf.fit(X_train,y_train)
y_pred =  rf.predict(X_test)

We did an 80-20 split of training to testing data because our quantity of data is low and we wanted to keep more data for training.

Training our random forest regressor on the training data and then predicting on the test data produces the following graph. The x-axis is a random assortment of months, but we can see that the MSE is about 20% better after tuning than with the default model.
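One way to sanity-check that improvement (a sketch reusing the split and y_pred from the cells above; the exact numbers depend on the split and random state) is to score an untuned forest on the same data:

# Compare the tuned forest's test MSE against an untuned one.
default_rf = ensemble.RandomForestRegressor(random_state=10)
default_rf.fit(X_train, y_train)
default_mse = metrics.mean_squared_error(y_test, default_rf.predict(X_test))
tuned_mse = metrics.mean_squared_error(y_test, y_pred)
print('default MSE: {:.0f}  tuned MSE: {:.0f}'.format(default_mse, tuned_mse))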

In [63]:
# Re-create the same split (same random_state) so y_test is a dataframe again for plotting.
X_train, X_test, y_train, y_test  = sk.model_selection.train_test_split(train,df[testing],train_size=.8,random_state=10)

plt.rcParams["figure.figsize"] = (20,15)
ax = y_test.reset_index(drop=True).plot()
pd.DataFrame(y_pred).plot(ax=ax)

plt.title('Guns sold actual vs rf predictions')
plt.ylabel('Guns Sold per 100,000')
plt.xlabel('random months from test data')
plt.legend(['actual guns sold','rf predictions guns sold'])
plt.show()
In [64]:
mse = metrics.mean_squared_error(y_test,y_pred)
print('The MSE of this graph is {}'.format(mse))
The MSE of this graph is 49108.127592249475

To look at how the model did across all of our data, we then predicted for every row we had; the graph below represents each state by month. The MSE decreases, but that's expected since we are running the model on data it was trained on.

In [65]:
y_pred = rf.predict(train)
plt.rcParams["figure.figsize"] = (20,15)
ax = df[testing].reset_index(drop=True).plot()
pd.DataFrame(y_pred).plot(ax=ax)

plt.title('Guns sold actual vs rf predictions')
plt.ylabel('Guns Sold per 100,000')
plt.xlabel('state per month from 2013-2017')
plt.legend(['actual guns sold','rf predictions guns sold'])
plt.show()
In [66]:
mse = metrics.mean_squared_error(df[testing].reset_index(drop=True),y_pred)
print('The MSE of this graph is {}'.format(mse))
The MSE of this graph is 25727.749775950797

Having looked at the results at a state level, we then zoomed out to a national level to see how well our model predicts gun sales, and to compare seasonality importance with just shooting data importance.

Below we use a random forest regressor trained on national data to predict guns sold using the same feature set that we selected earlier.

In [67]:
us_rf = ensemble.RandomForestRegressor(random_state=10,
                                    criterion='mse',
                                    max_depth=12,
                                   max_features=8,
                                   min_samples_leaf=1,
                                   min_samples_split=2,
                                   n_estimators=10)
us_X_train, us_X_test, us_y_train, us_y_test  = sk.model_selection.train_test_split(us_train,us_df[testing],train_size=.8,random_state=10)
us_rf.fit(us_X_train,us_y_train)
Out[67]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=12,
           max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=10,
           verbose=0, warm_start=False)
In [68]:
y_pred_us = us_rf.predict(us_X_test)
plt.rcParams["figure.figsize"] = (20,15)
ax = us_y_test.reset_index(drop=True).plot()
pd.DataFrame(y_pred_us).plot(ax=ax)
# ax.xticks=
plt.title('National: Guns sold actual vs rf predictions')
plt.ylabel('Guns Sold per 100,000')
plt.xlabel('Monthly from 2013-2017')
plt.legend(['actual guns sold','rf predictions guns sold'])
plt.show()
In [69]:
mse = metrics.mean_squared_error(us_y_test.reset_index(drop=True),y_pred_us)
print('The MSE of this graph is {}'.format(mse))
The MSE of this graph is 20830.062500000004

Below we predict gun sales using only the shooting incident data (plus population and state) from the selected feature list, removing year and month to strip seasonality from the model.

In [70]:
non_seasonal = ['population','state','injured','incidents','injured_ave','killed_pop','injured_pop','incidents_pop']
non_seasonal_train = sk.preprocessing.scale(df[non_seasonal])
non_seasonal_us_train = sk.preprocessing.scale(us_df[non_seasonal])
non_seasonal_rf = ensemble.RandomForestRegressor(random_state=10,
                                    criterion='mse',
                                    max_depth=12,
                                   max_features=8,
                                   min_samples_leaf=1,
                                   min_samples_split=2,
                                   n_estimators=10)
non_s_us_X_train, non_s_us_X_test, non_s_us_y_train, non_s_us_y_test  = sk.model_selection.train_test_split(non_seasonal_us_train,us_df[testing],train_size=.8,random_state=10)
non_seasonal_rf.fit(non_s_us_X_train,non_s_us_y_train)

non_s_y_pred = non_seasonal_rf.predict(non_s_us_X_test)
plt.rcParams["figure.figsize"] = (20,15)
ax = non_s_us_y_test.reset_index(drop=True).plot()
pd.DataFrame(non_s_y_pred).plot(ax=ax)

plt.title('National: Guns sold actual vs rf shootings based predictions')
plt.ylabel('Guns Sold per 100,000')
plt.xlabel('Monthly from 2013-2017')
plt.legend(['actual guns sold','shooting only rf predictions guns sold'])
plt.show()
In [71]:
mse = metrics.mean_squared_error(non_s_us_y_test.reset_index(drop=True),non_s_y_pred)
print('The MSE of this graph is {}'.format(mse))
The MSE of this graph is 24691.534166666665

Below we predict gun sales using only seasonality data (year and month, plus population and state), removing all shooting data from the model.

In [72]:
seasonal = ['year','month','population','state']
seasonal_train = sk.preprocessing.scale(df[seasonal])
seasonal_us_train = sk.preprocessing.scale(us_df[seasonal])
seasonal_rf = ensemble.RandomForestRegressor(random_state=10,
                                    criterion='mse',
                                    max_depth=12,
                                    max_features=4,
                                    min_samples_leaf=1,
                                    min_samples_split=2,
                                    n_estimators=10)
s_us_X_train, s_us_X_test, s_us_y_train, s_us_y_test  = sk.model_selection.train_test_split(seasonal_us_train,us_df[testing],train_size=.8,random_state=10)
seasonal_rf.fit(s_us_X_train,s_us_y_train)

s_y_pred = seasonal_rf.predict(s_us_X_test)
plt.rcParams["figure.figsize"] = (20,15)
ax = s_us_y_test.reset_index(drop=True).plot()
pd.DataFrame(s_y_pred).plot(ax=ax)

plt.title('National: Guns sold actual vs rf seasonal based predictions')
plt.ylabel('Guns Sold per 100,000')
plt.xlabel('Monthly from 2013-2017')
plt.legend(['actual guns sold','rf predictions guns sold'])
plt.show()
In [73]:
mse = metrics.mean_squared_error(s_us_y_test.reset_index(drop=True),s_y_pred)
print('The MSE of this graph is {}'.format(mse))
The MSE of this graph is 11416.606666666668

Below we present all three models for predicting gun sales at the national level. Despite some differences in MSE between the models, their predictions are all fairly close.

In [74]:
plt.rcParams["figure.figsize"] = (20,15)
ax = us_y_test.reset_index(drop=True).plot()
pd.DataFrame(y_pred_us).plot(ax=ax)
pd.DataFrame(s_y_pred).plot(ax=ax)
pd.DataFrame(non_s_y_pred).plot(ax=ax)

plt.title('National: Guns sold based on month')
plt.ylabel('Guns Sold per 100,000')
plt.xlabel('Monthly from 2013-2017')
plt.legend(['actual guns sold','combination rf predictions guns sold','season only rf predictions guns sold','shooting only rf predictions guns sold'])
plt.show()
In [75]:
# saves state prediction csvs
result_states = pd.DataFrame(columns=['prediction','actual','state','month'])
for st in range(51):
    state_name = le.inverse_transform([st])
    temp = df[df.state == st]
    
    state_X_train, state_X_test, state_y_train, state_y_test  = sk.model_selection.train_test_split(temp[features],temp[testing],train_size=.8,random_state=10)
    
    state_X_training = sk.preprocessing.scale(state_X_train)
    state_X_testing = sk.preprocessing.scale(state_X_test)
    
    state_y_pred = rf.predict(state_X_testing)
    temp_save = pd.DataFrame(state_y_pred).join(state_y_test.reset_index(drop=True))
    temp_save['state']=state_name[0]
    # Use the test rows' dates, since the predictions and actuals come from the test split.
    months = state_X_test.month.reset_index(drop=True)
    years = state_X_test.year.reset_index(drop=True)
    temp_save['month'] = months + (years-2013)*12
    temp_save.columns = ['prediction','actual','state','month']
    result_states = result_states.append(temp_save,ignore_index=True)


result_states.to_csv('./data/state_predictions.csv')
In [76]:
HTML('''<div class='tableauPlaceholder' id='viz1544066036206' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ac&#47;Actual_vs_prediction_comparison&#47;Sheet1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Actual_vs_prediction_comparison&#47;Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ac&#47;Actual_vs_prediction_comparison&#47;Sheet1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'> var divElement = document.getElementById('viz1544066036206'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';vizElement.parentNode.insertBefore(scriptElement, vizElement);</script>''')
Out[76]:

Interpretation and Analysis

Through modeling, by trying to predict gun sales (represented by the number of firearm background checks), we evaluated the importance of mass shootings. After finding our most accurate model and tuning it, we compared the model's predictions to the actual number of gun sales. We then removed the mass shooting data from our model and tested it again. With the mass shooting data our model had a mean squared error of roughly 21,006, while without it the mean squared error was roughly 11,378. We also ran a model without date-related information, removing seasonality, and found a mean squared error of roughly 24,961. This suggests that seasonality is the best predictor of gun sales: the shooting-only model's MSE is roughly double that of the seasonal model, and its trend line mostly hovers around the average, with only a few accurate predictions. A big caveat is that our training and test data are limited to 59 months, so we do not have enough data points to know whether this result is coincidental or reflects a real predictor.