The Linguist 58,6 - Dec/Jan 2020

LINGUIST vs MACHINE

Do measures that pit machine translation against human interpreters give an accurate picture, asks Erwin La Cruz

The fight for translation supremacy between machines and humans rages on. The reports from the front line are delivered in code: LEPOR 0.68, NIST 0.56, BLEU 0.46, METEOR 0.56, WER 0.39. These acronyms are translation quality evaluation metrics. The metrics, and the algorithms that calculate them, are an integral part of machine translation (MT) technology, making it possible to evaluate and refine the output of MT systems. But could this same technology be applied to assess interpreting students? Before we can answer that question, we need to understand what translation quality metrics are.

Training and evaluating MT systems requires big data. Neural-network translation models are trained on huge collections of sentences and their translations. The output of these systems is so large that assessing its quality manually is impractical in terms of cost and time. Researchers have solved this problem by developing algorithms that calculate translation quality metrics automatically. These metrics assume that quality can be evaluated by focusing on two aspects: accuracy and fluency. Accuracy refers to lexical equivalence between the original and its translation, while fluency is determined by how grammatical a translated sentence is.

In 2018, Microsoft announced that its automatic translation system had achieved parity with human Chinese translators, reporting its MT achievement as matching human translations with a 0.69 BLEU score (where 0.7 is considered very good).1 The Bilingual Evaluation Understudy (BLEU) score is one of the most common metrics used to evaluate MT quality. It is easy to compute and to implement in different languages, as it does not need language-specific parsers or synonym sets. BLEU scores also correlate closely with rankings of translation quality made by human assessors.

As with other translation metrics, the basic unit of analysis for BLEU is the sentence. Take the sentence hay un gato en la alfombra and its corresponding reference 'there is a cat on the mat' in a Spanish-English parallel corpus: how would the BLEU algorithm assess the translation candidate 'mat on is cat the'? First it takes into account how many of the reference's words appear in the candidate – in this case five out of seven, which suggests an accurate translation. However, comparing individual words tells us little about how readable a sentence is. Although accurate in terms of lexical equivalence, a translation like 'mat on is cat the' is neither fluent nor grammatical. To account for fluency – or grammatical adequacy – the BLEU algorithm awards a higher score when longer word sequences match between the candidate and the reference sentence. The candidate 'the cat is on the mat' gets a higher score because it matches the longer sequence 'on the mat'.

The BLEU score is calculated for individual sentences, but it is really meant to be used as a metric for the translation of a whole corpus. The final score for a given

Image caption: AI TAKEOVER? Microsoft claims that its machine translator has reached parity with human linguists based on BLEU analysis, but does a high BLEU score really indicate a quality translation?
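The n-gram comparison described in the article can be sketched in a few lines of code. The Python below is a minimal illustration, not the algorithm used by Microsoft or by any standard MT toolkit: it tokenises by splitting on spaces, looks only at single words and two-word sequences, applies no smoothing, and scores one sentence against one reference rather than a whole corpus. The sentences are the article's own examples.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All n-grams of length n in a token list, with their counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Share of candidate n-grams that also appear in the reference,
    clipping each n-gram at its count in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def simple_bleu(candidate, reference, max_n=2):
    """BLEU-style sentence score: geometric mean of the n-gram precisions
    (here only up to bigrams, with no smoothing) times a brevity penalty
    for candidates shorter than the reference. Real BLEU is a corpus-level
    score and normally counts sequences up to four words long."""
    candidate, reference = candidate.split(), reference.split()
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = exp(sum(log(p) for p in precisions) / max_n)
    brevity_penalty = min(1.0, exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * geo_mean

reference = "there is a cat on the mat"
# Every word matches the reference, but no two-word sequence does.
print(simple_bleu("mat on is cat the", reference))      # 0.0
# Matching longer sequences such as 'on the mat' lifts the score.
print(simple_bleu("the cat is on the mat", reference))  # roughly 0.49
```

Run on the two candidates, the word salad 'mat on is cat the' drops to zero because no two-word sequence matches, while 'the cat is on the mat' scores roughly 0.49, reproducing the ranking the article describes; a full BLEU implementation would add smoothing and aggregate over many sentences.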
