Segmentation using large language models: A new typology of American neighborhoods

Authors

Alex D. Singleton; Seth Spielman

Published

April 22, 2024

Alex D. Singleton; Seth Spielman (2024). Segmentation using large language models: A new typology of American neighborhoods. EPJ Data Science, 13(1). DOI: 10.1140/epjds/s13688-024-00466-1

Abstract

In the United States, recent changes to the National Statistical System have amplified the geographic-demographic resolution trade-off: when working with demographic and economic data from the American Community Survey (ACS), zooming in geographically means losing resolution demographically because of very large margins of error. In this paper, we present a solution to this problem in the form of an AI-based, open, and reproducible geodemographic classification system for the United States built on small-area estimates from the ACS. We apply a partitioning clustering algorithm to a range of socio-economic, demographic, and built environment variables. Our approach uses an open-source software pipeline that ensures adaptability to future data updates. A key innovation is the integration of GPT-4, a state-of-the-art large language model, to generate intuitive cluster descriptions and names. This represents a novel application of natural language processing in geodemographic research and showcases the potential for human-AI collaboration within the geospatial domain.
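The paper's full pipeline is open source, but as a rough orientation to the method described above, the following is a minimal Python sketch of a two-tier partitioning classification. It is illustrative only: the input file name, the clipping constant, and the fixed six sub-clusters per Group are assumptions (the published classification has 39 Types, so the per-Group counts evidently vary), and scikit-learn's KMeans stands in for whatever implementation the authors used.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical input: one row per Census Block Group, columns holding
# proportion-scale ACS variables (e.g. share in poverty, share renting).
bg = pd.read_csv("acs_block_groups.csv", index_col="GEOID")

# Constrained logit: clip proportions away from 0 and 1 so the
# transform stays finite, then spread them onto the real line.
eps = 0.001
clipped = bg.clip(eps, 1 - eps)
logits = np.log(clipped / (1 - clipped))

X = StandardScaler().fit_transform(logits)

# Tier 1: partition all Block Groups into the 7 Groups.
groups = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)

# Tier 2: re-cluster within each Group to produce the finer Types.
# The fixed six sub-clusters per Group here are an assumption.
types = np.empty(len(X), dtype=object)
for g in np.unique(groups):
    mask = groups == g
    sub = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X[mask])
    types[mask] = [f"{chr(65 + g)}{t + 1}" for t in sub]  # labels like "A1"

Logit-transforming clipped proportions before standardization keeps bounded percentage variables from distorting the Euclidean distances that k-means relies on.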

Extended Summary

This research develops an innovative artificial intelligence-based geodemographic classification system to address critical data quality problems in the American Community Survey (ACS). The study tackles the fundamental trade-off whereby finer geographic detail comes at the cost of demographic reliability: in 72% of areas, the margin of error around the poverty estimate exceeds the estimate itself.

The methodology applies k-means clustering to 247 socio-economic, demographic, and built environment variables covering over 200,000 US Census Block Groups. These variables were organized within a conceptual framework spanning economy, environment, and population domains, with rigorous preprocessing including correlation analysis and a constrained logit transformation. A key innovation is the integration of the GPT-4 large language model, via retrieval-augmented generation, to automatically create intuitive neighborhood descriptions and names, a novel application of natural language processing in academic geodemographic research.

The result is a comprehensive two-tier classification that organizes America into 7 major Groups (A-G) subdivided into 39 specific Types, ranging from ‘Commuting Families’ to ‘Urban Melting Pot’. Each Type includes a detailed pen portrait describing neighborhood characteristics, demographics, housing patterns, and employment profiles. Validation through internal consistency measures and external evaluation against national eviction data demonstrated the system’s effectiveness in capturing meaningful spatial patterns and predicting real-world phenomena: eviction rates varied substantially across neighborhood Types, illustrating the classification’s utility for policy analysis.

This open-source approach addresses limitations of existing commercial geodemographic systems by providing transparent, reproducible methods accessible to researchers and policymakers. The AI automation sharply reduces the time-consuming manual work traditionally required to write neighborhood descriptions while maintaining accuracy through human oversight. More broadly, the research demonstrates that combining many noisy but unbiased variables can produce more reliable insights than any individual measure, supporting a shift from single-variable approaches to contextual, multidimensional neighborhood analysis. This methodology offers particular value for understanding complex social phenomena such as housing instability, poverty, and demographic change across diverse American communities, providing essential tools for evidence-based policy development and social research.
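To make the GPT-4 step concrete, here is a hedged sketch of how cluster naming might be wired up with the openai Python client (v1 API). The prompt wording, the z-score profile format, and the function name are assumptions for illustration; the paper's actual retrieval-augmented prompts are not reproduced here.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_cluster(cluster_id: str, profile: dict[str, float]) -> str:
    """Ask GPT-4 for a name and pen portrait, grounding the prompt in
    the cluster's own statistics (the retrieval step of the RAG setup)."""
    stats = "\n".join(f"- {var}: {val:+.2f} (z-score vs national mean)"
                      for var, val in profile.items())
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a geodemographics analyst. Name the cluster "
                        "and write a short, factual pen portrait. Use only "
                        "the statistics provided."},
            {"role": "user",
             "content": f"Cluster {cluster_id} deviates from the national "
                        f"average as follows:\n{stats}"},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage with two made-up variables:
print(describe_cluster("A1", {"pct_commuters": 1.8, "pct_families": 1.2}))

Grounding the prompt in the cluster's own statistics, and keeping a human reviewer in the loop as the paper describes, is what keeps the generated portraits factual rather than free-form.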

Key Findings

  • GPT-4 integration enables automated generation of accurate neighborhood descriptions, reducing manual classification time while maintaining 96% factual accuracy
  • The classification system organizes 200,000+ US neighborhoods into 7 Groups and 39 Types using 247 demographic and environmental variables
  • Eviction analysis validates the system’s utility, revealing significant neighborhood-level variations in housing instability patterns across different demographic contexts
  • Open-source methodology addresses commercial system limitations by providing transparent, reproducible geodemographic analysis accessible to researchers and policymakers
  • Multidimensional approach overcomes American Community Survey data quality issues by aggregating noisy signals into reliable neighborhood insights (a toy illustration of this effect follows the list)
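The last point, that averaging many noisy but unbiased measurements beats trusting any single one, can be seen in a few lines of simulation. This toy example assumes independent errors; real ACS margins of error are correlated across variables, so the practical gain is smaller, but the direction of the effect is the same.

import numpy as np

rng = np.random.default_rng(0)

truth = 0.20    # true neighborhood poverty rate
sigma = 0.10    # ACS-scale noise on a single estimate
n_vars = 50     # number of distinct noisy indicators

# One noisy, unbiased estimate vs. the average of many.
single = truth + rng.normal(0, sigma, size=10_000)
combined = truth + rng.normal(0, sigma, size=(10_000, n_vars)).mean(axis=1)

print(f"single-variable RMSE: {np.sqrt(np.mean((single - truth) ** 2)):.3f}")
print(f"combined-signal RMSE: {np.sqrt(np.mean((combined - truth) ** 2)):.3f}")
# With independent errors the RMSE shrinks by roughly sqrt(n_vars),
# which is the statistical intuition behind clustering many noisy
# ACS variables instead of trusting any one of them.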

Citation


@article{singleton2024segmentation,
  author = {Singleton, Alex D. and Spielman, Seth},
  title = {Segmentation using large language models: A new typology of American neighborhoods},
  journal = {EPJ Data Science},
  year = {2024},
  volume = {13},
  number = {1},
  doi = {10.1140/epjds/s13688-024-00466-1}
}