Transformer based named entity recognition for place name extraction from unstructured text

Author

Cillian Berragan; Alex Singleton; Alessia Calafiore; Jeremy Morley

Published

April 3, 2023

Cillian Berragan, Alex Singleton, Alessia Calafiore and Jeremy Morley (2023). Transformer based named entity recognition for place name extraction from unstructured text. International Journal of Geographical Information Science, 37(4), 747–766. DOI: 10.1080/13658816.2022.2133125

Abstract

Place names embedded in online natural language text present a useful source of geographic information. Despite this, many methods for the extraction of place names from text use pre-trained models that were not explicitly designed for this task. Our paper presents five custom-built Named Entity Recognition (NER) models and evaluates them against three popular pre-built models for place name extraction. The models are evaluated using a set of manually annotated Wikipedia articles with reference to the F1 score metric. Our best performing model achieves an F1 score of 0.939 compared with 0.730 for the best performing pre-built model. Our model is then used to extract all place names from Wikipedia articles in Great Britain, demonstrating the ability to more accurately capture unknown place names from volunteered sources of online geographic information.

Extended Summary

This research investigates how to improve the extraction of place names from unstructured online text by developing custom named entity recognition (NER) models designed specifically for geographical applications. Existing pre-built natural language processing models were not originally designed for place name extraction and often fail to identify informal or hyper-localised place references that don’t appear in official administrative databases. The study addresses this limitation by creating task-specific models trained on geographically focused data rather than general-purpose text corpora.

The methodology involved collecting 42,222 Wikipedia abstracts for geographical locations in Great Britain using DBpedia queries. From this corpus, 200 articles were manually annotated to identify place names, with tokens labelled using a specialised annotation scheme that focuses exclusively on geographic place references rather than the broader entity categories used by general NER models. Five custom models were developed using different architectures, including traditional bidirectional LSTM networks and modern transformer-based approaches (BERT, RoBERTa and DistilBERT). These models were trained using the AllenNLP framework and evaluated against three popular pre-built NER systems, including the spaCy and Stanza models commonly used in existing geoparsing applications. The evaluation used standard precision, recall and F1 score metrics on held-out test data.

The findings demonstrate significant improvements in place name extraction accuracy. The best-performing custom model (BERT-based) achieved an F1 score of 0.939, substantially outperforming the best pre-built model (Stanza), which scored 0.730. All transformer-based custom models showed statistically significant improvements over existing solutions. The custom models particularly excelled at recall, identifying place names that pre-built systems missed because of their focus on broader entity categories such as ‘geopolitical entities’ or ‘locations’ rather than place names specifically.

When applied to the full Wikipedia corpus, the DistilBERT model extracted 614,672 place names, of which 99,697 were unique. Notably, 62,178 unique place names were identified that do not appear in the GeoNames gazetteer, including granular references such as road names, alternative place names, and informal geographic references commonly used in natural language but absent from official databases.

The research demonstrates the importance of task-specific model development in geographic information extraction. By moving away from general-purpose annotation schemes towards place-focused training data, the study shows how volunteered geographic information sources such as Wikipedia can be mined more effectively for geographic knowledge. This approach has significant implications for improving geoparsing systems, enriching geographic databases with vernacular place names, and enhancing location-based services that rely on understanding geographic references in natural language text.
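
To make the data-collection step concrete, the short Python sketch below queries the public DBpedia SPARQL endpoint for English abstracts of places located in the United Kingdom. It is an illustration only: the endpoint, predicates, country filter and result limit are assumptions chosen for readability, not the authors' actual query.

# Illustrative only: the predicates and filters below are assumptions, not the
# query used in the paper.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT ?place ?abstract ?lat ?long WHERE {
  ?place dbo:country dbr:United_Kingdom ;
         dbo:abstract ?abstract ;
         geo:lat ?lat ;
         geo:long ?long .
  FILTER (lang(?abstract) = "en")
}
LIMIT 1000
""")

results = sparql.query().convert()
abstracts = [b["abstract"]["value"] for b in results["results"]["bindings"]]
print(f"Retrieved {len(abstracts)} abstracts")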
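
The annotation and model-training step can be pictured with a token-classification sketch. The paper trains its models with the AllenNLP framework; the stand-in below uses the Hugging Face transformers Trainer instead, and the place-only tag set (O, B-PLACE, I-PLACE), the toy annotated sentences and the hyperparameters are all assumptions made for illustration.

# Sketch only: the paper uses AllenNLP; this stand-in uses the Hugging Face
# Trainer. Tag names, toy data and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

labels = ["O", "B-PLACE", "I-PLACE"]
label2id = {l: i for i, l in enumerate(labels)}

# Two toy annotated sentences standing in for the 200 hand-labelled articles.
train = Dataset.from_dict({
    "tokens": [["Ormskirk", "is", "a", "town", "in", "West", "Lancashire"],
               ["The", "river", "flows", "through", "Chester"]],
    "tags": [["B-PLACE", "O", "O", "O", "O", "B-PLACE", "I-PLACE"],
             ["O", "O", "O", "O", "B-PLACE"]],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def encode(batch):
    # Align word-level place tags with sub-word tokens; ignore pads and
    # continuation pieces with the -100 label.
    enc = tokenizer(batch["tokens"], is_split_into_words=True,
                    truncation=True, padding="max_length", max_length=64)
    all_labels = []
    for i, tags in enumerate(batch["tags"]):
        word_ids = enc.word_ids(batch_index=i)
        lab, prev = [], None
        for wid in word_ids:
            if wid is None or wid == prev:
                lab.append(-100)
            else:
                lab.append(label2id[tags[wid]])
            prev = wid
        all_labels.append(lab)
    enc["labels"] = all_labels
    return enc

train = train.map(encode, batched=True, remove_columns=["tokens", "tags"])

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-cased", num_labels=len(labels),
    id2label=dict(enumerate(labels)), label2id=label2id)

args = TrainingArguments(output_dir="place-ner", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=5e-5)
trainer = Trainer(model=model, args=args, train_dataset=train,
                  tokenizer=tokenizer)
trainer.train()
trainer.save_model("place-ner")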
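
Evaluation and the corpus-wide extraction can be sketched in the same spirit. The entity-level precision, recall and F1 scores below are computed with the seqeval package (the paper reports these metrics, though its exact tooling is not stated here), and the extraction uses the transformers token-classification pipeline; the "place-ner" checkpoint, the test sentence and the tiny GeoNames lookup are placeholders.

# Sketch only: the checkpoint path, test sentence and gazetteer are placeholders.
from seqeval.metrics import precision_score, recall_score, f1_score
from transformers import pipeline

# Entity-level scores on a held-out test example (toy tags).
gold = [["B-PLACE", "I-PLACE", "O", "O", "B-PLACE"]]
pred = [["B-PLACE", "I-PLACE", "O", "O", "O"]]
print(precision_score(gold, pred), recall_score(gold, pred), f1_score(gold, pred))

# Extract place spans from an unseen abstract with the fine-tuned model.
ner = pipeline("token-classification", model="place-ner",
               aggregation_strategy="simple")
spans = ner("Crosby is a coastal town north of Liverpool on Merseyside.")
places = {s["word"] for s in spans}

# Compare against a gazetteer, e.g. a set of place names loaded from the
# GeoNames GB extract (loading code omitted).
geonames = {"Liverpool", "Merseyside"}
print(places - geonames)  # candidate place names absent from the gazetteer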

Key Findings

  • Custom-built named entity recognition models achieve an F1 score of 0.939, versus 0.730 for the best pre-built model, on place name extraction
  • Task-specific training data significantly outperforms general-purpose models, particularly improving recall of geographic place references
  • Wikipedia corpus analysis reveals 62,178 unique place names absent from GeoNames gazetteer, including informal and granular geographic references
  • Transformer-based models (BERT, RoBERTa, DistilBERT) demonstrate superior performance over traditional bidirectional LSTM architectures for geographic text processing
  • Place-focused annotation schemes prove more effective than general entity recognition categories for extracting geographic information from natural language

Citation


@article{berragan2023transformer,
  author = {Cillian Berragan and Alex Singleton and Alessia Calafiore and Jeremy Morley},
  title = {Transformer based named entity recognition for place name extraction from unstructured text},
  journal = {International Journal of Geographical Information Science},
  year = {2023},
  volume = {37},
  number = {4},
  pages = {747--766},
  doi = {10.1080/13658816.2022.2133125}
}