EMNLP 2025

November 06, 2025

Suzhou, China


Digitization is essential yet challenging for preserving historical heritage. This paper investigates the potential and limitations of large multimodal models (LMMs) in historical document digitization. Although LMMs now offer advanced text recognition, they remain focused primarily on contemporary documents with modern layouts and high-resource languages. To bridge this gap, we introduce CHURRO, a large unified benchmark and dataset comprising over 150 historical corpora. CHURRO includes 100,471 pages spanning 22 centuries of textual heritage, covering handwritten and printed documents from 48 language clusters, historical language variants, and dead languages such as Latin and Sanskrit. The dataset features diverse layouts representative of real-world archival scenarios. We evaluate state-of-the-art LMMs and OCR systems on CHURRO and find that all models struggle with historical documents. Gemini 2.5 Pro is by far the best-performing model, yet it achieves only 78.7% and 67.5% normalized Levenshtein similarity on printed and handwritten documents, respectively. Importantly, fine-tuning a 3-billion-parameter multimodal model on CHURRO improves its performance substantially, by 12.4% (printed) and 25.4% (handwritten), reaching 75.9% (printed) and 66.7% (handwritten), respectively. These results highlight the untapped potential of targeted fine-tuning for historical document digitization.
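The abstract reports scores as normalized Levenshtein similarity. The exact normalization used in the paper is not given here, but a common definition is one minus the edit distance divided by the length of the longer string, so 1.0 means an exact transcription match. A minimal sketch under that assumption:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein_similarity(ref: str, hyp: str) -> float:
    """1 - edit_distance / max_length; 1.0 for identical strings."""
    if not ref and not hyp:
        return 1.0
    return 1.0 - levenshtein_distance(ref, hyp) / max(len(ref), len(hyp))
```

For example, `normalized_levenshtein_similarity("kitten", "sitting")` is 1 - 3/7 ≈ 0.571, since three edits separate the two strings.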

Downloads

Slides
Paper
Transcript (English, automatic)
