

Author: Martin VOLK

Building a Multilingual Heritage Corpus. A Case Study in Digital Humanities

Abstract: In the Text+Berg project we have built a large multilingual heritage corpus of alpine texts. In this presentation I will share our experiences and lessons learned. We have digitized and annotated all yearbooks of the Swiss Alpine Club (SAC) from 1864 until 2011. The texts include mountaineering reports and articles about the geology, biology and history of mountainous regions around the globe. Our annotations comprise linguistic information (e.g. part-of-speech tags and lemmas) as well as person names and toponym classes (e.g. mountains, glaciers, lakes and cabins). The Text+Berg corpus currently consists of around 22 million tokens in both German and French, of which about 5 million tokens are translations (i.e. parallel texts). In addition, the corpus contains Italian, English and Romansh texts with fewer than 400,000 tokens each. The 150-year sequence of digitized books provides new opportunities for linguistic research: it enables the quantitative analysis of diachronic language change as well as the study of typical language structures, linguistic topoi, and figures of speech. Moreover, the digital books allow quick access to texts and pictures for analysis and for information retrieval purposes. We will argue that the Text+Berg project is a prototypical case of digital humanities, with a large collection of heritage documents being structured and annotated for multi-purpose access and long-term storage.