Language Modeling for Epigraphs: a BERT model for EDR Latin Epigraphs text completion

Published in IEEE CyberHumanities, 2025

The Epigraphic Database Roma (EDR) is the most comprehensive and precise collection of Ancient Roman inscriptions, with over one hundred thousand entries curated by the International Federation of Epigraphic Databases. Because these inscriptions span many centuries, many have suffered erosion that has left parts of their text missing. Our objective is to reconstruct these lost segments. To achieve this, we plan to fine-tune LatinBERT, the leading language model for Latin, on the EDR database. The result will be a specialized language model that fills in the gaps in these ancient texts. This model is a stepping stone toward language models trained on inscriptions.
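
As a rough illustration of how text completion can be framed as masked language modeling, the sketch below turns an inscription with bracketed restorations into a masked input and a list of target words. It is not the pipeline used in the paper: the function name `make_training_pair`, the square-bracket notation for restored text, the sample inscription, and the one-mask-per-word simplification are all assumptions made for this example (a BERT model would mask subword tokens produced by its own tokenizer).

```python
import re

# BERT-style mask placeholder; the token used by a real tokenizer may differ.
MASK = "[MASK]"


def make_training_pair(text: str) -> tuple[str, list[str]]:
    """Replace each bracketed restoration with one mask per word.

    Assumes restored text is written in square brackets, e.g. "[sacrum]".
    Returns the masked text and the words the model should predict, in order.
    """
    targets: list[str] = []

    def mask_restoration(match: re.Match) -> str:
        words = match.group(1).split()
        targets.extend(words)
        return " ".join([MASK] * len(words))

    masked = re.sub(r"\[([^\[\]]+)\]", mask_restoration, text)
    return masked, targets


# Hypothetical inscription text, invented for illustration only.
masked, targets = make_training_pair("Dis Manibus [sacrum] Iulia [vixit annis] XX")
print(masked)   # Dis Manibus [MASK] Iulia [MASK] [MASK] XX
print(targets)  # ['sacrum', 'vixit', 'annis']
```

A fine-tuned masked language model would then be asked to predict a word for each mask position, and its predictions could be compared against the target list.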