The Linguist is a languages magazine for professional linguists, translators, interpreters, language professionals, language teachers, trainers, students and academics with articles on translation, interpreting, business, government, technology
Issue link: https://thelinguist.uberflip.com/i/1189092
FEATURES | @Linguist_CIOL | DECEMBER/JANUARY | The Linguist 23

corpus is the mean score of all individual sentences. The scale goes from 0 to 1, with 1 meaning a perfect match between the candidate translation and the reference sentences. In practical terms, a BLEU score of 0.70 is very good, while one below 0.20 means the translation is of no practical use.

WHEN HIGH-SCORING VERSIONS FAIL

I set out to investigate whether the BLEU score could be used as an effective assessment tool for interpreting students. My study involved 44 students of community interpreting, covering 18 languages, including majority languages such as Arabic, Japanese and Spanish, and languages of limited diffusion such as Nepali, Tongan and Kayan.

Experienced professional interpreters of the students' languages assessed each student's performance as the student interpreted two dialogues on medical consultations of about 500 words each. Omissions, additions and distortions in the renditions were annotated on a copy of the original dialogue script. Syntactic changes were not recorded, so a candidate translation such as 'this pain is affecting my life' for the reference 'my life has been affected by this pain' was considered equivalent and no annotation was required. Equally, semantically equivalent terms were not marked in the script, for example 'I have a lot of pain' for 'I am very sore'. Furthermore, short paraphrases, such as 'gastroenteritis' for 'inflammation of the stomach and bowels', were accepted as valid translations and not annotated on the script. The assessors were also asked to give each student a pass or fail mark, considering how accurate and fluid their renditions were in both dialogues.

BLEU scores were calculated from the annotated scripts using the Natural Language Toolkit,² a package for the programming language Python that includes several resources for language analysis, corpus linguistics and machine translation.
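To make the scoring idea concrete, here is a minimal sketch of a sentence-level BLEU calculation and the corpus mean described above. It is a simplified, unsmoothed stand-in written in plain Python, not the script used in the study; the function names and the tokeniser are assumptions made for illustration. In practice the Natural Language Toolkit offers `sentence_bleu` in `nltk.translate.bleu_score`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Unsmoothed BLEU for one candidate against one reference (0 to 1)."""
    ref, cand = tokenize(reference), tokenize(candidate)
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count at its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: one empty n-gram level gives zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: candidates shorter than the reference are penalised.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


def mean_bleu(pairs):
    """Mean of the sentence scores over (reference, candidate) pairs."""
    scores = [sentence_bleu(ref, cand) for ref, cand in pairs]
    return sum(scores) / len(scores)


pairs = [
    ("I have a lot of pain", "I have a lot of pain"),
    ("my life has been affected by this pain", "this pain is affecting my life"),
]
print([sentence_bleu(ref, cand) for ref, cand in pairs])  # [1.0, 0.0]
print(mean_bleu(pairs))  # 0.5
```

The second pair scores zero here because the two sentences share no three-word sequence, even though they mean the same thing. That is why a raw comparison of this kind would penalise the syntactic changes that the assessors in the study were told to treat as equivalent.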
The students' scores were higher than 0.7 ('very good'), which indicates that the translations were quite accurate and fluid. However, the oral assessment (a pass or fail mark) revealed that the students needed a much higher score (0.86) to get a pass. So it seems that the assessors were more demanding about the quality of the output. Incidentally, the analysis did not indicate that language was a relevant factor for scores.

Once I had determined the passing threshold for the students, I was able to compare human interpreting with MT. The original assessment scripts in Arabic, French, Japanese, Mandarin and Spanish were translated using Google Translate and Microsoft Translator. The same assessors who marked the students' assessments marked these translations; they were not told that the scripts had been translated by machines. As before, omissions, alterations and distortions were annotated on a copy of the scripts. Once again, the scores were very high, with a range from 0.84 to 0.95.

I used the model that had been trained with data from the interpreting students to calculate the probability of the MT output passing. The results indicated that the Spanish and Arabic translations had a high probability of passing, but the Mandarin and Japanese translations had a low probability. In reality, only the Spanish translations got a pass from the assessors. Even the Microsoft translation with a BLEU score of 0.98 and an 89% probability of passing failed the assessment.

This is because a translation that is 'accurate' at word level may not be accurate at a pragmatic level. Take the sentence 'I will be unable to work for a while, and that means less money'. The translation 'If I am unable to work for a while, I will also be able to make money and change plans' will get a high BLEU score, even though it says the opposite of its reference.
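One way to see what a word-matching metric rewards in this example is to look for the longest word sequence that the reference and the candidate share. The sketch below does only that. It is an illustration with hypothetical function names, not the scoring procedure used in the study, and it says nothing about the BLEU value itself.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())


def longest_shared_run(reference, candidate):
    """Return the longest run of consecutive words found in both sentences."""
    ref, cand = tokenize(reference), tokenize(candidate)
    best_len, best_end = 0, 0
    prev = [0] * (len(cand) + 1)
    for i in range(1, len(ref) + 1):
        curr = [0] * (len(cand) + 1)
        for j in range(1, len(cand) + 1):
            if ref[i - 1] == cand[j - 1]:
                # Extend the shared run that ended at the previous word pair.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return ref[best_end - best_len:best_end]


reference = "I will be unable to work for a while, and that means less money"
candidate = ("If I am unable to work for a while, I will also be able "
             "to make money and change plans")
print(longest_shared_run(reference, candidate))
# ['unable', 'to', 'work', 'for', 'a', 'while']
```

The two sentences share a six-word sequence, which is exactly the kind of overlap that n-gram matching counts in the candidate's favour. Nothing in the comparison registers that 'less money' has become 'able to make money'.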
The low pass rate reflects a well-known challenge in MT: the farther apart the source and target languages are, the more difficult automatic translation becomes.

IS BLEU A USEFUL TOOL?

The performance of interpreting students and MT systems was high in comparison to the values reported in most MT studies. This was expected, given that syntactic differences were not marked and semantic equivalences were allowed. However, a high BLEU score does not mean the translation is good enough for a professional interpreter. BLEU scores do indicate a high match between candidate and reference, but they are not fine-tuned enough to pick up small lexical or syntactical mismatches that can have a huge impact on how a translation will be understood by a human recipient. For example, when the reference 'in your case, one option is acromioclavicular

[Images: WARNING SIGNS. Automated systems (top) can output major translation errors and still score highly, which could be dangerous in a medical setting (above). Images © Shutterstock]