A Principal Component Analysis (PCA)-based framework for automated variable selection in geodemographic classification

Author

Yunzhe Liu; Alex Singleton; Daniel Arribas-Bel

Published

October 2, 2019

Yunzhe Liu; Alex Singleton; Daniel Arribas-Bel (2019). Geo-spatial Information Science, 22(4), 251-264. DOI: 10.1080/10095020.2019.1621549

Abstract

A geodemographic classification aims to describe the most salient characteristics of a small area zonal geography. However, such representations are influenced by the methodological choices made during their construction. Of particular debate are the choice and specification of input variables, with the objective of identifying inputs that add value but also aim for model parsimony. Within this context, our paper introduces a principal component analysis (PCA)-based automated variable selection methodology that has the objective of identifying candidate inputs to a geodemographic classification from a collection of variables. The proposed methodology is exemplified in the context of variables from the UK 2011 Census, and its output compared to the Office for National Statistics 2011 Output Area Classification (2011 OAC). Through the implementation of the proposed methodology, the quality of the cluster assignment was improved relative to 2011 OAC, manifested by a lower total within-cluster sum of square score. Across the UK, more than 70.2% of the Output Areas (OAs) occupied by the newly created classification (i.e. AVS-OAC) outperform the 2011 OAC, with particularly strong performance within Scotland and Wales.

Extended Summary

This research develops an automated method to select the most relevant variables for creating neighbourhood classifications (geodemographics) using Principal Component Analysis, addressing a key challenge in how to choose which data should be used when grouping small areas by their characteristics. Geodemographic classifications organise small geographical areas into clusters that share similar socio-economic, demographic, and housing characteristics, based on the principle that people with similar attributes tend to live in similar neighbourhoods. The traditional approach to selecting variables for these classifications relies heavily on manual expert judgement and stakeholder consultation, which can be time-consuming and subjective.

The paper presents a five-stage automated methodology that uses Principal Component Analysis to identify the most important variables from a large pool of candidates. The methodology first generates principal components from input variables, then defines threshold values for testing, iteratively removes components whilst quantifying variable contributions, examines correlations between retained variables using minimum spanning trees, and finally clusters the filtered variables using the K-means algorithm.

The research tested this approach using 171 variables from the 2011 UK Census, ultimately selecting 74 variables to create a new classification system called AVS-OAC (Automated Variable Selection Output Area Classification). When compared against the official 2011 Output Area Classification produced by the Office for National Statistics, the automated method demonstrated superior performance. The AVS-OAC achieved lower total within-cluster sum of squares scores, indicating that areas within each cluster were more similar to each other than in the official classification. Across the UK, over 70% of output areas were better represented by the automated classification than the official version, with particularly strong performance in Scotland and Wales. The new classification retained 40 demographic variables, 24 socio-economic variables, and 10 housing variables, showing a higher proportion of demographic characteristics compared to the official classification.

The research demonstrates that automated variable selection can produce geodemographic classifications that are statistically superior to expert-driven approaches whilst maintaining interpretability and practical utility. This methodology offers significant advantages for future census-based classifications, potentially reducing the time and resources required for variable selection whilst improving statistical performance. The approach is particularly valuable given the increasing volume of available data and the computational challenges of processing high-dimensional datasets in geodemographic analysis.
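To make the idea of "quantifying variable contributions" more concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function names, the contribution measure (squared loadings weighted by each retained component's share of variance), the threshold, and the synthetic data are all assumptions made for illustration, and the minimum spanning tree and K-means stages are omitted.

```python
import numpy as np

def variable_contributions(X, n_components):
    """Contribution of each variable to the retained principal components.

    Illustrative measure (an assumption, not taken from the paper): squared
    loadings, weighted by each retained component's share of variance.
    """
    # PCA on the correlation matrix, i.e. on standardised variables.
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each unit eigenvector has squared entries summing to 1, so the
    # weighted contributions also sum to 1 across variables.
    weights = eigvals / eigvals.sum()
    return (eigvecs ** 2) @ weights

def select_variables(X, n_components, threshold):
    """Indices of variables whose contribution meets the threshold."""
    contrib = variable_contributions(X, n_components)
    return np.flatnonzero(contrib >= threshold), contrib

# Synthetic data: three variables share a latent factor, three are pure noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
signal = latent + 0.3 * rng.normal(size=(200, 3))
noise = rng.normal(size=(200, 3))
X = np.hstack([signal, noise])

# Hypothetical threshold: the share each variable would have if all
# variables contributed equally.
selected, contrib = select_variables(X, n_components=1, threshold=1 / X.shape[1])
print(selected, contrib.round(3))
```

In this toy setting the variables driven by the shared factor dominate the first component, so they pass the threshold while the noise variables do not. The paper's actual thresholds and iteration scheme are not described in this summary and may differ.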

Key Findings

Automated PCA-based variable selection produces geodemographic classifications with superior statistical performance compared to expert-driven manual selection methods.
The automated methodology selected 74 variables from 171 candidates, outperforming the official 2011 Output Area Classification in over 70% of UK areas.
Scotland and Wales showed particularly strong improvements; in every Scottish unitary authority, more than 70% of Output Areas were better represented by AVS-OAC than by the 2011 OAC.
The approach demonstrates that computational methods can maintain interpretability whilst reducing reliance on subjective expert judgement in neighbourhood classification.
Lower total within-cluster sum of squares scores indicate that automated selection creates more homogeneous and statistically coherent geographical clusters.
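The total within-cluster sum of squares used as the quality measure above can be computed as in the short Python sketch below. The function name and the toy data are hypothetical, and the sketch assumes both cluster assignments are scored on the same variables; how the paper placed AVS-OAC and the 2011 OAC on a comparable footing is not stated in this summary.

```python
from collections import defaultdict

def total_wcss(rows, labels):
    """Total within-cluster sum of squared distances to cluster centroids."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    total = 0.0
    for members in groups.values():
        # Centroid: per-variable mean over the cluster's members.
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            (x - c) ** 2 for row in members for x, c in zip(row, centroid)
        )
    return total

# Four toy areas described by two variables each.
rows = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
compact = [0, 0, 1, 1]   # groups nearby areas together
mixed = [0, 1, 0, 1]     # groups distant areas together

print(total_wcss(rows, compact))  # 4.0
print(total_wcss(rows, mixed))    # 100.0
```

A lower score means areas sit closer to their cluster centroid, which is the sense in which one classification is said to produce more homogeneous clusters than another.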

Citation

@article{liu2019principal,
  author = {Yunzhe Liu and Alex Singleton and Daniel Arribas-Bel},
  title = {A Principal Component Analysis (PCA)-based framework for automated variable selection in geodemographic classification},
  journal = {Geo-spatial Information Science},
  year = {2019},
  volume = {22},
  number = {4},
  pages = {251--264},
  doi = {10.1080/10095020.2019.1621549}
}