Indicators of Deprivation Part 1 (See the code)
In this piece of analysis, I try to determine whether there are any variables that can predict the Deprivation score of an area in England. I downloaded loads of data from the Office for National Statistics and linked all the datasets together using their geography codes.
The variables of interest in this analysis are:
- Number of people in broad age categories (children, adults and pensioners)
- Number of people living in a communal establishment and number of communal establishments (used here as a rough proxy for homelessness)
- Number of people in broad ethnicity categories (7 ethnicity categories + 1 ‘Other’)
- Number of people in broad religion categories (Christianity, Buddhism, Hinduism, Islam, Judaism and Sikhism + ‘Other Religion’, ‘No Religion’ and ‘Religion not stated’)
- Population density
The first part of this analysis deals with how to wrangle and manipulate the data into a format which can be easily analysed. I had to overcome a lot of problems when loading the data in from different sources (.csv, .xls etc.), and the solutions I found will hopefully be applicable to any analyses which you do! Please take and reuse my code as you see fit – you can fork it at my Github.
Some of the problems which I faced when wrangling this data:
- Duplicated records for each Local Authority District (LAD). Each LAD comprises several smaller areas – some data sources only had information for the smaller areas, so I had to take the average of these smaller areas to get a score for the LAD using the Pandas groupby function.
- Combining and summing columns. The age data came in 100 separate columns; far too many to run through a regression! I had to combine the columns into three broad categories, summing the number of people of each age in the particular range.
- Stripping out rows and columns with missing data. The Communal Living data (in particular) came in a very human-readable format which is unfortunately difficult to read in as a DataFrame. I used the Pandas drop and dropna functions an awful lot!
- .csv encoding. Stackoverflow provided a great answer which I used to read in a .csv file using a different encoding.
- Changing text to numbers and removing commas used as thousand separators. It took me ages to work out what the problem was, and only slightly less time to solve it!
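The wrangling steps above can be sketched in a few lines of Pandas. Everything in this snippet is invented for illustration – the file contents, column names and geography codes are not the real ONS data (which has 100 age columns, not three) – but the operations are the ones described: reading with a fallback encoding, dropping empty rows and columns, stripping thousand separators, summing age columns into broad bands, and averaging the smaller areas up to LAD level with groupby.

```python
import io
import pandas as pd

# A tiny stand-in for the raw downloads: per-area rows with a LAD geography
# code, single-year age columns (abridged here), and a count column that
# uses commas as thousand separators. Encoded as latin-1 to mimic a
# non-UTF-8 .csv file.
raw = io.BytesIO(
    (
        "geo_code,lad_code,age_0,age_25,age_70,communal_residents\n"
        'E0001,E060001,120,300,80,"1,250"\n'
        'E0002,E060001,95,280,60,"2,040"\n'
        'E0003,E060002,110,310,90,"980"\n'
    ).encode("latin-1")
)

# Files that fail to load as UTF-8 can often be read with another encoding.
df = pd.read_csv(raw, encoding="latin-1")

# Drop any fully empty rows/columns left over from a human-readable layout.
df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")

# Strip the thousand-separator commas and convert the text to numbers.
df["communal_residents"] = (
    df["communal_residents"].astype(str).str.replace(",", "").astype(int)
)

# Collapse single-year age columns into broad categories. Abridged: the
# real data would sum age_0..age_17, age_18..age_64 and age_65..age_99.
df["children"] = df[["age_0"]].sum(axis=1)
df["adults"] = df[["age_25"]].sum(axis=1)
df["pensioners"] = df[["age_70"]].sum(axis=1)

# Average the smaller areas up to one row per LAD.
lad = df.groupby("lad_code")[
    ["children", "adults", "pensioners", "communal_residents"]
].mean()
print(lad)
```

The same pattern – clean each column, then aggregate with groupby – applies regardless of which messy source format the data arrived in.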
The second part of the analysis looks at how to actually analyse the data. I use the Sklearn, Statsmodels and SciPy libraries to do a multiple regression on the variables. Because of the large number of variables that could potentially be included, I wrote a function which runs the multiple regression on every combination of variables and ranks each combination based on several statistics for that model (R², p-values, etc.).
WARNING: This analysis gets pretty statisticky at times; I’ve tried to explain my processes and reasoning as clearly as possible, but let me know if you have any trouble understanding!
To do the data manipulation I used the Pandas library – this is the standard library for doing data analysis in Python; hopefully this tutorial will help you to learn how to use it!
The code for Part 1 is available here, and if you want to read ahead you can find the introduction for Part 2 here and the code here. I try to explain my analytical methods and thinking in plain English, and I really take the time to explain what each bit of my code does but if there’s anything that you don’t understand, send me an email or ask a question in the comments.