Translation Mining

The methodology used in BASTA for a cross-linguistic analysis of tense and aspect is Translation Mining, a technique developed at Utrecht University (van der Klis, Le Bruyn, de Swart 2017) that requires sentence-aligned multilingual corpora.

Translation Mining consists of five steps:

1 – Corpus creation

The corpus consists of the English original of Harry Potter and the Philosopher’s Stone and its translations into Slavic and Baltic languages (Belarusian, Bulgarian, Croatian, Czech, Latvian, Lithuanian, Macedonian, Polish, Russian, Serbian, Slovak, Slovenian, Ukrainian). The books are converted to digital text via optical character recognition, manually checked for spelling errors, and aligned at paragraph level with the English original. A Python script then divides the texts into narrative fragments and dialogues, and the texts are tagged for parts of speech using spaCy and Stanza. Finally, the texts are aligned at sentence and word level using Uplug (Tiedemann, 2003).
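The division into narrative fragments and dialogues can be illustrated with a minimal sketch. The project's actual script is not described here, so everything below is an assumption: the function name `split_narrative_dialogue` is hypothetical, and direct speech is assumed to be enclosed in curly or straight double quotes, which will not hold for every edition or language.

```python
import re

# Assumption: direct speech is enclosed in curly double quotes (“ … ”) or
# straight double quotes (" … "). Real editions may use other conventions.
QUOTE_RE = re.compile(r'“[^”]*”|"[^"]*"')

def split_narrative_dialogue(paragraph):
    """Return (narrative_fragments, dialogue_fragments) for one paragraph."""
    # Quoted spans, with the surrounding quotation marks stripped.
    dialogue = [m.group(0)[1:-1].strip() for m in QUOTE_RE.finditer(paragraph)]
    # Everything between the quoted spans counts as narrative.
    narrative = [part.strip() for part in QUOTE_RE.split(paragraph) if part.strip()]
    return narrative, dialogue

# Invented sample sentence, not taken from the corpus.
sample = '“I know that,” said Anna. She closed the door. “We should go.”'
narrative, dialogue = split_narrative_dialogue(sample)
print(narrative)
print(dialogue)
```

A quote-based split like this is only a first pass; manual checking would still be needed wherever quotation marks are nested or missing.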

2 – Extraction of tense and aspect forms

The relevant forms are extracted in the web application PreSelect. Annotators select the verb phrases in a randomly chosen fragment and assign them tense and aspect values.

3 – Cross-linguistic alignment

The selected forms are manually matched with their counterparts in the other languages. This is done in the web-based software TimeAlign (https://translation-mining.basta.uwr.edu.pl/), which lets annotators choose matching forms by clicking on the relevant words in the translation. This step produces aligned sets of parallel forms from all the languages under analysis.

4 – Analysis of the data

The data are analyzed by creating a dissimilarity matrix for the sets of parallel forms obtained in step 3. A distance function treats sets as similar if the tense attributions match in all languages; such sets are assigned a value of 0. The value grows with each mismatch: 1 is added for every mismatch in the set (e.g., a set in which two forms do not match is assigned a value of 2). The distance function is used to build the dissimilarity matrix, represented as a table.
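One possible reading of this distance function is sketched below: each aligned set is a tuple with one tense label per language, and the distance between two sets is the number of languages in which their labels differ. The language codes, the labels, and the function names are hypothetical and serve only to illustrate the counting.

```python
from itertools import combinations

# Hypothetical aligned sets: one tense/aspect label per language, in the
# order given by LANGUAGES. The labels are illustrative, not project data.
LANGUAGES = ["en", "pl", "ru"]
aligned_sets = [
    ("simple past", "past perfective", "past perfective"),
    ("simple past", "past perfective", "past perfective"),
    ("simple past", "past imperfective", "past imperfective"),
    ("present perfect", "past perfective", "past perfective"),
]

def distance(a, b):
    """Count the languages in which two aligned sets carry different labels."""
    return sum(1 for x, y in zip(a, b) if x != y)

def dissimilarity_matrix(sets):
    """Build a symmetric table of pairwise distances with zeros on the diagonal."""
    n = len(sets)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = distance(sets[i], sets[j])
    return matrix

matrix = dissimilarity_matrix(aligned_sets)
for row in matrix:
    print(row)
# [0, 0, 2, 1]
# [0, 0, 2, 1]
# [2, 2, 0, 3]
# [1, 1, 3, 0]
```

The first two sets match in every language and get a distance of 0, while the first and third differ in two languages and get a distance of 2.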

5 – Visualization

The matrix obtained in step 4 is plotted using multidimensional scaling, with the algorithm from the scikit-learn package (Pedregosa et al., 2011) in Python, and visualized with the nvd3 package (http://nvd3.org/). The visualization consists of a map, showing the use of tenses and aspects in the different languages and allowing for a cross-linguistic comparison.
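As a rough illustration of the scaling step, the sketch below feeds a small hand-written dissimilarity matrix to `sklearn.manifold.MDS`. The matrix values are invented, and the settings (two dimensions, a fixed random seed) are assumptions rather than the project's configuration. The `dissimilarity="precomputed"` argument is the name used in long-standing scikit-learn releases; newer releases may rename it.

```python
import numpy as np
from sklearn.manifold import MDS

# Invented toy matrix: pairwise distances between four aligned sets.
dissimilarities = np.array([
    [0, 0, 2, 1],
    [0, 0, 2, 1],
    [2, 2, 0, 3],
    [1, 1, 3, 0],
], dtype=float)

# "precomputed" tells MDS that the input already contains distances,
# not feature vectors. random_state fixes the starting configuration.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarities)

# One (x, y) point per aligned set; nearby points have similar tense use.
for index, (x, y) in enumerate(coords):
    print(f"set {index}: x={x:.2f}, y={y:.2f}")
```

The resulting coordinates are what a plotting library would then draw as the map; their absolute values and orientation carry no meaning, only the relative distances do.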

Books in our corpus

Belarusian: Гары Потэр і Філасофскі Камень, 2022, ed. Yanushkevich; translator: A. Piatrovich
Bulgarian: Хари Потър и философският камък, 2000, ed. Egmont; translator: T. Dzhebanova
Croatian: Harry Potter i kamen mudraca, 2000, ed. Algoritam; translator: Z. Crnković
Czech: Harry Potter a kámen mudrců, 2000, ed. Albatros Media; translator: V. Medek
English: Harry Potter and the Philosopher’s Stone, 2012, ed. Pottermore Limited; author: J. K. Rowling
Latvian: Harijs Poters un Filozofu akmens, 2021, ed. Zvaigzne ABC; translator: I. Josts
Lithuanian: Haris Poteris ir Išminties akmuo, 2020, ed. Alma littera; translator: Z. Marienė
Macedonian: Хари Потер и Каменот на мудроста, 2020, ed. Ars Lamina; translator: V. Stojanovski
Polish: Harry Potter i kamień filozoficzny, 2000, ed. Media Rodzina; translator: A. Polkowski
Russian: Гарри Поттер и философский камень, 2022, ed. Machaon, Azbuka-Attikus; translator: M. Spivak
Serbian: Hari Poter i kamen mudrosti, 2008, ed. Evro-Giunti; translators: V. Roganović, D. Roganović
Slovak: Harry Potter a kameň mudrcov, 2015, ed. IKAR; translator: J. Petrikovičová
Slovenian: Harry Potter – Kamen modrosti, 2017, ed. Mladinska knjiga; translator: J. J. Kenda
Ukrainian: Гаррі Поттер і філософський камінь, 2002, ed. A-BA-BA-GA-LA-MA-GA; translator: V. Morozov

Tutorials

Paragraph Alignment on Notepad++
Corpora Preprocessing