# Indicators of Deprivation Part 2 (See the code)

In this piece of analysis, I try to determine if there are any variables which can predict the Deprivation score of an area in England. I downloaded loads of data from the Office for National Statistics, and linked all the datasets together using the geography codes.

In Part 1 I collated the data from six different spreadsheets and manipulated and cleaned it into a format which can be easily analysed.

WARNING: This analysis gets pretty statisticky at times; I’ve tried to explain my processes and reasoning as clearly as possible, but let me know if you have any trouble understanding!

Part 2 focuses on actually analysing the data. First of all, I trained a regression model for each variable against the average deprivation score for its area, using the Sklearn library. This is a great machine learning library, although I barely scratch the surface here! From these regressions, I built a DataFrame which contains information about:

- the Regression Coefficient (the slope of the line – how much the Y variable changes for a given change in X)
- the Intercept (where the regression line crosses the Y-axis)
- the RSS (how well the line fits the data – lower values are better)
- the Variance (how spread out the data is)
- R-Squared (the proportion of the variance in Y that is explained by knowing the X variable – higher values are better)
- the P-value (helps to determine if the regression equation is statistically significant – lower values are better, under 0.05 is pretty good)
- and the Standard Error (helps to assess the precision of any predictions).
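A minimal sketch of how these per-variable statistics can be collected into one DataFrame. The column names and the `regression_stats` helper are my own illustration, not the post's actual code; `scipy.stats.linregress` alone supplies the slope, intercept, R, p-value and standard error, with the RSS and variance computed from the residuals:

```python
import numpy as np
import pandas as pd
from scipy import stats

def regression_stats(df, x_cols, y_col):
    """Fit a simple linear regression of y on each x column and
    collect the summary statistics in a single DataFrame."""
    rows = []
    for col in x_cols:
        x, y = df[col].values, df[y_col].values
        slope, intercept, r, p, std_err = stats.linregress(x, y)
        residuals = y - (slope * x + intercept)
        rows.append({
            "variable": col,
            "coefficient": slope,          # slope of the fitted line
            "intercept": intercept,        # where the line crosses the Y-axis
            "RSS": np.sum(residuals ** 2), # residual sum of squares (lower is better)
            "variance": np.var(y),         # spread of the response variable
            "r_squared": r ** 2,           # share of variance explained
            "p_value": p,                  # significance of the slope
            "std_err": std_err,            # precision of the slope estimate
        })
    return pd.DataFrame(rows)
```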

I used both the Sklearn and SciPy libraries to get the regression information, as each library provides only some of the information which I’m interested in. From running a linear regression model on the variables in each category (Age, Communal Living, Ethnicity, Religion and Population Density), I can decide which of the variables in each category I want to include in the final model. Reducing the number of variables is really important; I decided on 7 different variables which resulted in 127 different combinations – had I chosen an eighth variable this would have risen to 255 combinations, and a ninth would increase it to 511 combinations! I don’t have access to enough computing power to run that many models, and it also makes sense to inspect the explanatory variables manually to get a feel for how they interact with the response variable.
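The combination counts above come from the number of non-empty subsets of n variables, which is 2^n − 1. A quick sketch with `itertools.combinations` (the variable names here are placeholders) shows where 127 comes from:

```python
from itertools import combinations

def variable_combinations(variables):
    """Every non-empty subset of the chosen variables: 2**n - 1 in total."""
    combos = []
    for k in range(1, len(variables) + 1):
        combos.extend(combinations(variables, k))
    return combos

# Seven variables give 2**7 - 1 = 127 candidate models;
# an eighth would give 255, and a ninth 511.
combos = variable_combinations(["v1", "v2", "v3", "v4", "v5", "v6", "v7"])
print(len(combos))  # 127
```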

The next step was to create and apply the matrix of variable combinations. Both Sklearn and SciPy gave me problems with multiple regression, so I ended up using Statsmodels for this part. With the information about each regression model, I could rank each of the 127 variable combinations by their ‘goodness’. I ranked the models on:

- RSS
- R-Squared
- P-value
- Standard Error

The model which had the lowest combined rank for each of these measures is:

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Lives in a household | 8.305e-05 | 1.19e-05 | 6.975 | 0.000 | 5.96e-05, 0.000 |
| Black/African/Caribbean/Black British: African | 0.0002 | 0.000 | 1.789 | 0.074 | -1.88e-05, 0.000 |
| Other ethnic group: Any other ethnic group | -0.0002 | 4.16e-05 | -4.303 | 0.000 | -0.000, -9.71e-05 |
| No religion: Total | -8.921e-05 | 3.99e-05 | -2.235 | 0.026 | -0.000, -1.07e-05 |
| Religion not stated | -0.0004 | 0.000 | -2.100 | 0.037 | -0.001, -2.4e-05 |
| Density | 0.2641 | 0.028 | 9.453 | 0.000 | 0.209, 0.319 |
| const | 12.2105 | 0.736 | 16.594 | 0.000 | 10.763, 13.658 |

I hope you’ve found this analysis interesting and helpful. The introduction for Part 1 is available here and the code is here if you’d like to refresh your memory of how I actually got the data.

I try to explain my analytical methods and thinking in plain English, and I really take the time to explain what each bit of my code does, but if there’s anything that you don’t understand, send me an email or ask a question in the comments.
