In Codice Ratio is a research project to study techniques for analyzing the contents of historical documents conserved in the Vatican Apostolic Archives. In this paper, we present our efforts to develop a system that supports the automatic transcription of medieval manuscripts while keeping the effort of collecting training data minimal. We focus on crowdsourcing as a means of scalable training data collection that does not require experts: using crowdsourced character symbols, we train a custom convolutional neural network that jointly learns to identify correct character shapes and to recognize characters. Our approach generates candidate transcriptions by submitting over-segmented character strokes and their combinations to this classifier, then ranks them and chooses the most promising outputs by combining the recognition confidence with character-level and word-level statistical language models. We conducted experiments on an unreleased corpus, the Vatican Registers: training our system on 20 pages annotated by the crowd, we obtained a character error rate (CER) of 19%. Comparisons with an off-the-shelf system trained on 20 pages annotated with the same process, and with a professional system trained on more than 300 pages transcribed by skilled paleographers, demonstrate the opportunities of the proposed approach.
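
The 19% figure above is a character error rate. As a minimal sketch, assuming the common definition (Levenshtein edit distance between the system output and the reference transcription, divided by the reference length), it could be computed as shown here. The abstract does not state the exact formula or normalization used in the paper, and the function names `levenshtein` and `cer` are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate under the assumed definition:
    edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a hypothesis that differs from a seven-character reference by one substituted character has a CER of 1/7, and a CER of 19% corresponds to roughly one character edit for every five reference characters.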

In Codice Ratio: A crowd-enabled solution for low resource machine transcription of the Vatican Registers / Nieddu, E.; Firmani, D.; Merialdo, P.; Maiorino, M.. - In: INFORMATION PROCESSING & MANAGEMENT. - ISSN 0306-4573. - 58:5(2021), pp. 1-20. [10.1016/j.ipm.2021.102606]

In Codice Ratio: A crowd-enabled solution for low resource machine transcription of the Vatican Registers

Nieddu E.; Firmani D.
2021

OCR; digital libraries; handwriting recognition
01 Journal publication::01a Journal article
Files attached to this item
File: Nieddu_Codice-ratio_2021.pdf
Access: archive managers only
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 2.64 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1638654