Indicators of Deprivation Part 2 (See the code)
In this piece of analysis, I try to determine whether there are any variables that can predict the Deprivation score of an area in England. I downloaded loads of data from the Office for National Statistics and linked all the datasets together using the geography codes.
In Part 1 I collated the data from six different spreadsheets and manipulated and cleaned it into a format which can be easily analysed.
WARNING: This analysis gets pretty statisticky at times; I’ve tried to explain my processes and reasoning as clearly as possible, but let me know if you have any trouble understanding!
Part 2 focuses on actually analysing the data. First of all I trained a regression model for each variable against the average deprivation score for that area, using the Sklearn library. Sklearn is a great machine learning library, although I barely scratch the surface of it here! From the regression results, I create a DataFrame which contains information about:
- the Regression Coefficient (the slope of the line – how much the Y variable changes for a given change in X)
- the Intercept (where the regression line crosses the Y-axis)
- the RSS (the Residual Sum of Squares – how well the line fits the data; lower values are better)
- the Variance (how spread out the data is)
- R-Squared (the percentage of the variance in Y that is explained by knowing the X variables – higher values are better)
- the P-value (helps to determine if the regression equation is statistically significant – lower values are better, under 0.05 is pretty good)
- and the Standard Error (helps to assess the precision of any predictions).
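As a rough sketch of how those statistics can be collected in one pass (the column names here are placeholders, and I'm using SciPy's `linregress` for the per-variable fits rather than reproducing the exact code from the post):

```python
import numpy as np
import pandas as pd
from scipy import stats

def regression_summary(df, response, predictors):
    """Fit a simple linear regression of `response` on each predictor
    and collect the fit statistics into a single DataFrame."""
    rows = []
    for col in predictors:
        x, y = df[col].values, df[response].values
        slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
        residuals = y - (slope * x + intercept)
        rows.append({
            "variable": col,
            "coefficient": slope,          # slope of the fitted line
            "intercept": intercept,        # where the line crosses the Y-axis
            "RSS": np.sum(residuals ** 2), # residual sum of squares
            "variance": np.var(y),         # spread of the response
            "r_squared": r_value ** 2,     # share of variance explained
            "p_value": p_value,            # significance of the slope
            "std_err": std_err,            # precision of the slope estimate
        })
    return pd.DataFrame(rows).set_index("variable")
```

Calling `regression_summary(data, "deprivation_score", candidate_columns)` then gives one row per candidate variable, which makes it easy to sort and compare them within each category.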
I used both the Sklearn and SciPy libraries to get the regression information, as each library provides only some of the information I'm interested in. From running a linear regression model on the variables in each category (Age, Communal Living, Ethnicity, Religion and Population Density), I can decide which of the variables in each category I want to include in the final model. Reducing the number of variables is really important: I settled on 7 different variables, which resulted in 127 different combinations – had I chosen an eighth variable this would have risen to 255 combinations, and a ninth would have increased it to 511! I don't have access to enough computing power to run that many models, and it also makes sense to inspect the explanatory variables manually to get a feel for how they interact with the response variable.
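Those counts come from the number of non-empty subsets of the chosen variables, which is 2^n − 1. A quick sanity check with `itertools` (the variable names here are just placeholders):

```python
from itertools import combinations

# Hypothetical stand-ins for the seven chosen explanatory variables
variables = [f"var_{i}" for i in range(1, 8)]

# Enumerate every non-empty subset: sum of C(7, r) for r = 1..7
combos = [combo
          for r in range(1, len(variables) + 1)
          for combo in combinations(variables, r)]

print(len(combos))  # 127, i.e. 2**7 - 1
```

Adding an eighth placeholder variable to the list bumps the count to 255, and a ninth to 511, matching the figures above.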
The next step was to create and apply the matrix of variable combinations. Both Sklearn and SciPy had problems with multiple regression, so I ended up using Statsmodels for this part. With the information about each regression model, I could rank each of the 127 variable combinations by their ‘goodness’. I ranked the models on:
- Standard Error
The model which had the lowest combined rank for each of these measures is:
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
| --- | --- | --- | --- | --- | --- |
| Lives in a household | 8.305e-05 | 1.19e-05 | 6.975 | 0.000 | 5.96e-05 0.000 |
| Black/African/Caribbean/Black British: African | 0.0002 | 0.000 | 1.789 | 0.074 | -1.88e-05 0.000 |
| Other ethnic group: Any other ethnic group | -0.0002 | 4.16e-05 | -4.303 | 0.000 | -0.000 -9.71e-05 |
| No religion: Total | -8.921e-05 | 3.99e-05 | -2.235 | 0.026 | -0.000 -1.07e-05 |
| Religion not stated | -0.0004 | 0.000 | -2.100 | 0.037 | -0.001 -2.4e-05 |
I try to explain my analytical methods and thinking in plain English, and I take the time to explain what each bit of my code does. If there's anything you don't understand, send me an email or ask a question in the comments.