NLP Workshop: Prototyping Discovery: An AI Platform to Test Processing 3.8 Million Archival Documents

Thursday, February 12, 2026

17:30h

Presented by

Haedar Raad Hadi (Hoover Institution - Stanford University)

https://www.hoover.org/profiles/haidar-hadi

Prototyping Discovery: An AI Platform to Test Processing 3.8 Million Archival Documents

Abstract

Some 3.8 million archival documents have remained inaccessible for over 17 years due to a lack of funding and staff resources. Traditional transcription approaches would require decades of manual labor. To evaluate whether AI can change this, we have developed a pilot web application to test and validate an AI-assisted approach. This presentation demonstrates our prototype platform, which combines multiple AI models (Claude, GPT-4, Gemini), a learning RAG system, and an intuitive workspace for archivists. Our goal with this pilot is to test whether 70-80% AI accuracy is sufficient to create searchable descriptions and enable digital discovery. We are now ready to conduct sample testing to evaluate whether this approach can make dark archives discoverable, giving researchers access to previously hidden historical materials without requiring proportional increases in funding.

About this workshop

The aim of this workshop is to promote technical and practical exchanges between researchers who use NLP methods. Participants are encouraged to go into the details of their code (R/Python), share tips, and discover new methods and models.

Periodicity: Thursdays from 12h15 to 13h30, by videoconference.
