Mapping Great Britain’s semantic footprints through a large language model analysis of Reddit comments
Cillian Berragan; Alex Singleton; Alessia Calafiore; Jeremy Morley (2024). Computers, Environment and Urban Systems, 110, 102121. DOI: 10.1016/j.compenvurbsys.2024.102121
Abstract
Observed regional variation in geotagged social media text is often attributed to dialects, where features in language are assumed to exhibit region-specific properties. While dialects are seen as a key component in defining the identity of regions, there are a multitude of other geographic properties that may be captured within natural language text. In our work, we consider locational mentions that are directly embedded within comments on the social media website Reddit, providing a range of associated semantic information, and enabling deeper representations between locations to be captured. Using a large corpus of geoparsed Reddit comments from UK-related local discussion subreddits, we first extract embedded semantic information using a large language model, aggregated into local authority districts, representing the semantic footprint of these regions. These footprints broadly exhibit spatial autocorrelation, with clusters that conform with the national borders of Wales and Scotland. London, Wales, and Scotland also demonstrate notably different semantic footprints compared with the rest of Great Britain.
Extended Summary
This research investigates how semantic information in social media discussions reveals regional geographic identities across Great Britain, moving beyond traditional dialect analysis to examine deeper cultural and contextual associations with places. The study analysed over 850,000 Reddit comments from UK-related local discussion forums, using advanced natural language processing techniques to extract semantic meanings from text that mentioned specific locations. Rather than relying on geotagged posts that may not relate to their tagged location, the research focused on comments that explicitly mentioned place names, capturing what users actually say about different areas.
The methodology employed large language models to generate ‘semantic footprints’ for each local authority district, creating numerical representations that capture the contextual meaning and cultural associations embedded in discussions about places. These semantic footprints were then analysed for spatial patterns using statistical techniques including Moran’s I analysis and hierarchical clustering to identify geographically coherent regions.
The findings reveal significant spatial autocorrelation in semantic footprints, with three distinct clusters emerging that closely align with established administrative boundaries. Scotland and Wales form semantically coherent regions that are distinctly different from England, whilst London and its surrounding areas constitute a third unique semantic region. Interestingly, major cities in Scotland and Wales (Glasgow, Edinburgh, Cardiff) showed greater semantic similarity to English regions than to their respective countries, suggesting these urban centres share cultural connections that transcend national borders.
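To make the spatial autocorrelation test concrete, the following is a minimal sketch of global Moran's I in plain Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes each district is summarised by a single scalar (for example, one component of its footprint), and the four-district adjacency matrix and values are hypothetical toy data.

```python
def morans_i(values, weights):
    """Global Moran's I for scalar values and a square spatial weights matrix.

    I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    where S0 is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / s0) * cross / variance_sum


# Hypothetical toy example: four districts in a row, each adjacent
# to its immediate neighbours (binary contiguity weights).
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1.0, 1.0, -1.0, -1.0]    # similar values sit next to each other
alternating = [1.0, -1.0, 1.0, -1.0]  # neighbours always differ

i_clustered = morans_i(clustered, weights)
i_alternating = morans_i(alternating, weights)
```

Positive values indicate that neighbouring districts tend to hold similar values, and negative values indicate that neighbours tend to differ. In practice a significance test (typically by permutation) would accompany the statistic; that step is omitted here for brevity.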
The research employed zero-shot classification techniques to explore national identity markers in the text, finding that comments about Scottish and Welsh locations were more strongly associated with their respective national identities, whilst English regions showed stronger association with British identity overall. London demonstrated particularly high confidence values for both British and English identities compared to other regions.
These semantic footprints capture vernacular geography: the informal, place-based knowledge that emerges from people’s lived experiences and cultural understanding of places. Unlike dialect studies that focus on vocabulary differences, this approach reveals deeper semantic associations that reflect cultural identities, shared experiences, and collective perceptions of different areas. The research demonstrates that social media discussions contain rich geographic information that maps onto real-world administrative and cultural boundaries, providing insights into how digital communities understand and represent regional identities across Britain.
Key Findings
- Semantic footprints from Reddit comments exhibit significant spatial autocorrelation, clustering along national boundaries of Wales and Scotland
- London emerges as a semantically distinct region separate from the rest of England, reflecting its unique cultural and economic position
- Major cities in Scotland and Wales show greater semantic similarity to English regions than their respective countries
- Welsh and Scottish locations demonstrate stronger national identity associations whilst English regions align more with British identity
- Social media discussions capture vernacular geography that maps meaningfully onto established administrative and cultural boundaries
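The similarity comparisons in the findings above rest on aggregating comment-level representations into one footprint per district and then comparing footprints. The sketch below shows one plausible way to do this; mean pooling and cosine similarity are assumptions made for illustration rather than the confirmed method, and the district names and two-dimensional vectors are hypothetical stand-ins for real language-model embeddings.

```python
import math
from collections import defaultdict


def mean_pool(vectors):
    """Average a list of equal-length vectors element by element."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_footprints(comment_embeddings):
    """Group (district, vector) pairs by district and mean-pool each group."""
    grouped = defaultdict(list)
    for district, vector in comment_embeddings:
        grouped[district].append(vector)
    return {district: mean_pool(vecs) for district, vecs in grouped.items()}


# Hypothetical comment embeddings, each tagged with the district it mentions.
comments = [
    ("District A", [1.0, 0.0]),
    ("District A", [0.8, 0.2]),
    ("District B", [0.9, 0.1]),
    ("District C", [0.0, 1.0]),
]

footprints = build_footprints(comments)
sim_ab = cosine(footprints["District A"], footprints["District B"])
sim_ac = cosine(footprints["District A"], footprints["District C"])
```

Here District A's footprint is far closer to District B's than to District C's, which is the kind of pairwise relationship a hierarchical clustering step would then group into regions.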
Citation
@article{berragan2024mapping,
author = {Cillian Berragan and Alex Singleton and Alessia Calafiore and Jeremy Morley},
title = {Mapping Great Britain's semantic footprints through a large language model analysis of Reddit comments},
journal = {Computers, Environment and Urban Systems},
year = {2024},
volume = {110},
pages = {102121},
doi = {10.1016/j.compenvurbsys.2024.102121}
}