Indicators of Deprivation (2)

Indicators of Deprivation Part 2 (See the code)

In this piece of analysis, I try to determine if there are any variables which can predict the Deprivation score of an area in England. I downloaded loads of data from the Office for National Statistics, and linked all the datasets together using the geography codes.

In Part 1 I collated the data from six different spreadsheets and manipulated and cleaned it into a format which can be easily analysed.

 

WARNING: This analysis gets pretty statisticky at times; I’ve tried to explain my processes and reasoning as clearly as possible, but let me know if you have any trouble understanding!

Part 2 focuses on actually analysing the data. First of all I fitted a simple linear regression of the average deprivation score for each area on each variable in turn, using the Sklearn library. This is a great machine learning library, although I barely scratch the surface here! From the function's output, I created a DataFrame which contains information about:

  • the Regression Coefficient (the slope of the line – how much the Y variable changes for a given change in X)
  • the Intercept (where the regression line crosses the Y-axis)
  • the RSS (the residual sum of squares, which measures how well the line fits the data – lower values are better)
  • the Variance (how spread out the data is)
  • R-Squared (the proportion of the variance in the Y variable that is explained by knowing the X variable – higher values are better)
  • the P-value (how likely it is to see a relationship at least this strong if there were really no relationship at all – lower values are better, and under 0.05 is the usual threshold for calling a result statistically significant)
  • and the Standard Error (helps to assess the precision of any predictions).
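To make these definitions concrete, here is a minimal sketch that works most of them out by hand for a single explanatory variable. It is not the code used in this analysis (that relied on Sklearn and SciPy); the function name and the data are made up for illustration, and the standard error shown is assumed to be the standard error of the slope. The P-value is left out because it needs the t-distribution, which `scipy.stats.linregress` reports as `pvalue`.

```python
import math

def simple_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx                      # regression coefficient
    intercept = mean_y - slope * mean_x    # where the line crosses the Y-axis

    # RSS: squared gaps between the observed and the fitted values.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: spread of y around its own mean.
    tss = sum((yi - mean_y) ** 2 for yi in y)

    r_squared = 1 - rss / tss
    # Standard error of the slope (assumption: this is the one reported).
    std_err = math.sqrt(rss / (n - 2) / sxx)

    return {
        "slope": slope,
        "intercept": intercept,
        "rss": rss,
        "r_squared": r_squared,
        "std_err": std_err,
    }

# Made-up numbers, purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
for name, value in simple_regression(x, y).items():
    print(f"{name}: {value:.4f}")
```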

 

I used both the Sklearn and SciPy libraries to get the regression information, as each library provides only some of the information which I’m interested in. From running a linear regression model on the variables in each category (Age, Communal Living, Ethnicity, Religion and Population Density), I can decide which of the variables in each category I want to include in the final model. Reducing the number of variables is really important; I decided on 7 different variables, which resulted in 127 different combinations (every non-empty subset of 7 variables, or 2^7 - 1) – had I chosen an eighth variable this would have risen to 255 combinations, and a ninth would increase it to 511 combinations! I don’t have access to enough computing power to run that many models, and it also makes sense to inspect the explanatory variables manually to get a feel for how they interact with the response variable.
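The way the number of combinations grows can be checked with a few lines of Python. This is only a sketch: the helper name `all_subsets` and the placeholder variable names are invented for the example, not taken from the analysis code.

```python
from itertools import combinations

def all_subsets(variables):
    """Return every non-empty combination of the given variables."""
    subsets = []
    for size in range(1, len(variables) + 1):
        subsets.extend(combinations(variables, size))
    return subsets

# Placeholder names standing in for the chosen explanatory variables.
for n in (7, 8, 9):
    variables = [f"var_{i}" for i in range(1, n + 1)]
    print(n, "variables give", len(all_subsets(variables)), "combinations")
```

Each extra variable roughly doubles the number of models to fit, because every existing combination can appear either with or without the new variable.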

 

The next step was to create and apply the matrix of variable combinations. I had problems running a multiple regression with both Sklearn and SciPy, so I ended up using Statsmodels for this bit. With the information about each regression model, I could rank each of the 127 variable combinations by its ‘goodness’. I ranked the models on:

  • RSS
  • R-Squared
  • P-value
  • Standard Error
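One simple way to combine rankings like these is to rank the models on each measure separately and then add the ranks together. The sketch below assumes that is what "combined rank" means and that every measure carries equal weight; the function names, the dictionary keys and the three example models are all invented for illustration.

```python
def rank_values(values, higher_is_better=False):
    """Rank values from 1 (best); tied values share the same rank."""
    ordered = sorted(values, reverse=higher_is_better)
    return [ordered.index(v) + 1 for v in values]

def combined_rank(models):
    """Sum each model's rank across the four measures (lower is better)."""
    # For R-squared a higher value is better; for the rest, lower is better.
    measures = {
        "rss": False,
        "r_squared": True,
        "p_value": False,
        "std_err": False,
    }
    totals = [0] * len(models)
    for measure, higher_is_better in measures.items():
        ranks = rank_values([m[measure] for m in models], higher_is_better)
        totals = [t + r for t, r in zip(totals, ranks)]
    return totals

# Made-up results for three hypothetical variable combinations.
models = [
    {"name": "A", "rss": 10, "r_squared": 0.90, "p_value": 0.01, "std_err": 0.10},
    {"name": "B", "rss": 20, "r_squared": 0.80, "p_value": 0.02, "std_err": 0.20},
    {"name": "C", "rss": 15, "r_squared": 0.85, "p_value": 0.50, "std_err": 0.15},
]
totals = combined_rank(models)
best = models[totals.index(min(totals))]
print(best["name"], totals)
```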

The model which had the lowest combined rank across these measures is:

                                                      coef   std err       t  P>|t|  [95.0% Conf. Int.]
Lives in a household                             8.305e-05  1.19e-05   6.975  0.000  5.96e-05     0.000
Black/African/Caribbean/Black British: African      0.0002     0.000   1.789  0.074 -1.88e-05     0.000
Other ethnic group: Any other ethnic group         -0.0002  4.16e-05  -4.303  0.000    -0.000 -9.71e-05
No religion: Total                              -8.921e-05  3.99e-05  -2.235  0.026    -0.000 -1.07e-05
Religion not stated                                -0.0004     0.000  -2.100  0.037    -0.001  -2.4e-05
Density                                             0.2641     0.028   9.453  0.000     0.209     0.319
const                                              12.2105     0.736  16.594  0.000    10.763    13.658
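To read a table like this as a prediction, multiply each variable's value for an area by its coefficient and add the constant. The sketch below does that with the coefficients as they are rounded in the table; the function name and the input figures for the example area are hypothetical, and the units of each variable are assumed to match those used when the model was fitted.

```python
# Coefficients copied from the table above, as rounded there.
COEFFICIENTS = {
    "Lives in a household": 8.305e-05,
    "Black/African/Caribbean/Black British: African": 0.0002,
    "Other ethnic group: Any other ethnic group": -0.0002,
    "No religion: Total": -8.921e-05,
    "Religion not stated": -0.0004,
    "Density": 0.2641,
}
CONST = 12.2105

def predict_score(area):
    """Predicted deprivation score: const + sum of coef * value."""
    return CONST + sum(coef * area[name] for name, coef in COEFFICIENTS.items())

# A hypothetical area; these figures are invented, not from the ONS data.
example_area = {
    "Lives in a household": 150000,
    "Black/African/Caribbean/Black British: African": 2000,
    "Other ethnic group: Any other ethnic group": 500,
    "No religion: Total": 40000,
    "Religion not stated": 10000,
    "Density": 20.0,
}
print(round(predict_score(example_area), 2))
```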

 

 

I hope you’ve found this analysis interesting and helpful. The introduction for Part 1 is available here and the code is here if you’d like to refresh your memory of how I actually got the data.

I try to explain my analytical methods and thinking in plain English, and I really take the time to explain what each bit of my code does, but if there’s anything that you don’t understand, send me an email or ask a question in the comments.

 

Follow me on Twitter, Github and Plotly, add me on LinkedIn and visit my Website.
