More ancient texts are discovered throughout the Near East every year in both the Hebrew and Aramaic languages. Analyzing these texts is extremely important for researchers studying the culture and history of the region.
Since many of these inscriptions are damaged over time due to earthquakes, fires, political conflicts, and other natural and human-related causes Epigraphists – experts who are responsible for reconstructing, translating, and dating inscriptions and finding any relevant circumstance, leaving it to historians to determine and interpret the recorded events – have until now used time-consuming manual procedures to estimate the missing content. This has been a major challenge in reconstructing the missing parts of these valuable writings.
Now, students in the software and information systems engineering department at Ben-Gurion University of the Negev (BGU) in Beersheba have approached this challenge as an extended masked language modeling task where the damaged content can comprise single characters, character n-grams (partial words), single complete words, and multi-word n-grams. This study is the first attempt to apply the masked language modeling approach to corrupted inscriptions in Hebrew and Aramaic languages, both using the Hebrew alphabet consisting mostly of consonant symbols.In their final project under the supervision of Prof. Mark Last; and fourth-year undergraduate students Niv Fono, Harel Moshayof, Eldar Karol, and Itay Asraf applied the masked language modeling approach to corrupted inscriptions in Hebrew and Aramaic.
Their model, called “Embible,” was highlighted at the latest meeting of the European Chapter of the Association for Computational Linguistics last month. They published their findings in the journal ACL Anthology under the title “Embible: Reconstruction of Ancient Hebrew and Aramaic Texts Using Transformers.”
The system analyzed thousands of sentences from the Jewish Bible
The students trained the system on 22,144 sentences from the Hebrew Bible. The system was tested on the other 536 sentences with significant success. An ensemble of word and character prediction models had the highest accuracy.“We can help historians who have devoted their lives to recreating these ancient texts as accurately as possible,” he concluded. Last, “Furthermore, I believe the model can be extended to cover other morphologically rich ancient languages.”