Artificial Intelligence reveals the secret texts of the Vatican

28 June 2018 | Written by La redazione

A new project, In Codice Ratio, aims to use the latest advances in artificial intelligence to read and transcribe the documents kept in the Vatican Secret Archives and make them available for research, with the help of a group of high school students.

Eighty-five kilometers of shelves hold the history of the last twelve centuries: among them you can find the papal bull that excommunicated Martin Luther, the plea for help that Mary, Queen of Scots sent to Pope Sixtus V before her execution, and even a request by Henry VIII for the annulment of his marriage. The Vatican Secret Archives is an inestimable treasure that nevertheless risks remaining inaccessible to most.

In fact, of those 85 kilometers, only a very small part has been scanned and made available online, and even less has been transcribed into machine-readable text and made searchable. In practice, to consult these documents you need to travel to Rome, gain access to the Vatican Secret Archives and browse every single page by hand.

Soon, however, a solution could come from a very innovative project (don’t let the Latin name fool you): In Codice Ratio. The project was created in collaboration with a group of researchers at the University of Roma Tre and, if it succeeds, it could open the way to transcribing and making usable many other documents in historical archives all over the world. In Codice Ratio aims to use artificial intelligence, combining digital image analysis techniques with existing optical character recognition (OCR) software, to transcribe the texts of the Vatican Secret Archives, which are often very complex and difficult to read.

OCR has been used for years to scan books and other printed materials, but it cannot do the same with handwritten texts, which make up most of the material in the Vatican Secret Archives.

The five researchers who coordinate In Codice Ratio (Paolo Merialdo, Donatella Firmani, Elena Nieddu and Serena Ammirati of the University of Roma Tre, and Marco Maiorino of the Vatican Secret Archives) have managed to sidestep this problem thanks to an invention called “jigsaw segmentation”. The team explained that this system doesn’t split words into letters but into segments closer to single pen strokes, so that the text can be broken down into small portions, like the pieces of a jigsaw.
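To give a rough idea of what cutting a word into stroke-like pieces can look like, here is a minimal sketch. It is not the team’s actual algorithm: it assumes a black-and-white image stored as rows of 0s and 1s (1 = ink) and simply cuts wherever the amount of ink in a column reaches a local minimum. The names `column_ink`, `cut_points` and `segment`, and the toy image `word`, are invented for this illustration.

```python
def column_ink(image):
    """Count the ink pixels (1s) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def cut_points(profile):
    """Return the columns where the ink count hits a local minimum."""
    cuts = []
    for x in range(1, len(profile) - 1):
        if profile[x] < profile[x - 1] and profile[x] <= profile[x + 1]:
            cuts.append(x)
    return cuts


def segment(image):
    """Split the image into vertical strips (the "jigsaw pieces")."""
    profile = column_ink(image)
    bounds = [0] + cut_points(profile) + [len(profile)]
    return [[row[a:b] for row in image] for a, b in zip(bounds, bounds[1:])]


# Toy "word": three thick strokes joined by thin connecting lines.
word = [
    [1, 1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 1],
]

pieces = segment(word)
print([len(piece[0]) for piece in pieces])  # width of each piece
```

On this toy image the cuts fall on the two thin connecting lines, giving three pieces. A real system would then have to decide which neighbouring pieces belong together to form a letter.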

To train the system to interpret these portions of text, the team used a method as unusual as it is effective: students from 24 Italian schools were recruited and, over the last few months, they helped the researchers build a database containing thousands of examples of characters extracted from the manuscripts. Through an online application, and with the support of a group of expert palaeographers, the students taught the software the shape of each character of the medieval Latin alphabet. “Initially, the idea of involving high-school students was considered foolish,” said Merialdo, who coordinates In Codice Ratio, in an interview for The Atlantic. “But now the machine is learning thanks to their efforts. I like that a small and simple contribution by many people can indeed contribute to the solution of a complex problem”.

Reading the “jigsaw pieces” and assembling them into the letters that make up the words is not enough to transcribe a text. The system has therefore also been taught a bit of practical intelligence, or common sense, through the “ingestion” of over 1.5 million Latin tomes, so that it learns which combinations of letters are plausible in Latin.
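One simple way to picture this kind of “common sense” is a model that counts which letters tend to follow which in Latin, and then prefers the candidate reading whose letter sequence looks most familiar. The sketch below is only an illustration under that assumption: the article does not say what kind of language model the project uses, and the seven-word corpus, the function names and the `alphabet_size` smoothing value are all invented here.

```python
import math
from collections import Counter


def train_bigrams(words):
    """Count letter pairs; ^ and $ mark the start and end of a word."""
    pairs, firsts = Counter(), Counter()
    for word in words:
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            pairs[(a, b)] += 1
            firsts[a] += 1
    return pairs, firsts


def log_score(word, pairs, firsts, alphabet_size=28):
    """Add-one smoothed log-probability of a word's letter sequence."""
    padded = "^" + word + "$"
    return sum(
        math.log((pairs[(a, b)] + 1) / (firsts[a] + alphabet_size))
        for a, b in zip(padded, padded[1:])
    )


# Tiny stand-in for a large Latin corpus (hypothetical example data).
corpus = ["dominus", "domini", "domino", "deus", "dei", "gratia", "gratiam"]
pairs, firsts = train_bigrams(corpus)

# Three candidate readings of the same handwritten word.
candidates = ["domini", "dqmini", "domxni"]
best = max(candidates, key=lambda w: log_score(w, pairs, firsts))
print(best)  # domini
```

The two misreadings contain letter pairs that never occur in the corpus, so they score lower than the genuine Latin word.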

In a test, the system was asked to transcribe documents from the Vatican Registers, a subset of the Secret Archives of more than 18,000 pages consisting of letters to European kings, rulings on legal matters, and other correspondence. It held up well: although about a third of the transcribed words contained typos, the software correctly identified 96% of the handwritten letters.
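The two figures may look contradictory, but a quick back-of-the-envelope calculation shows how they can fit together. Assuming, as a simplification not stated by the researchers, that each letter is read correctly with probability 0.96 independently of the others, a word is free of typos only if every one of its letters is right:

```python
letter_accuracy = 0.96  # share of letters read correctly, from the article


def word_error_rate(length, accuracy=letter_accuracy):
    """Chance that a word of the given length has at least one wrong letter,
    assuming letter errors are independent (a simplifying assumption)."""
    return 1 - accuracy ** length


for length in (5, 8, 10):
    print(length, round(word_error_rate(length), 2))
# 5 0.18
# 8 0.28
# 10 0.34
```

Under this assumption, roughly one in three words of about ten letters would contain at least one error, even with 96% of letters read correctly.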

It is a promising result that, as the historian of philosophy and paleographer Rega Wood explains, still has its limits: “The problems will arise for manuscripts that are not professionally written but copied by nonprofessionals”; moreover, where there’s only a small sample of material to work with, “it is not only more accurate, but just as quick to make transcriptions without such technology”.

What is certain is that, as with all artificial intelligence systems, there is ample room for improvement, and In Codice Ratio will have the opportunity to get better over time. In the meantime, experts agree that the technology behind the system, namely jigsaw segmentation combined with software training through crowdsourcing, can easily be adapted to read texts in other languages and other scripts.

We just have to wait and see what secrets In Codice Ratio will be able to unveil.