What Do Post-Editors Correct? A Fine-Grained Analysis of SMT and NMT Errors

The recent improvements in neural MT (NMT) have driven a shift from statistical MT (SMT) to NMT. However, to assess the usefulness of MT models for post-editing (PE) and to gain detailed insight into the output they produce, we need to analyse the most frequent errors and how they affect the task. We present a pilot study of a fine-grained analysis of MT errors based on post-editors' corrections for an English to Spanish medical text translated with SMT and NMT. We use the MQM taxonomy to compare the two MT models and obtain a categorized classification of the errors produced. Although results show great variation among post-editors' corrections, for this language combination post-editors correct fewer errors in the NMT output. NMT also produces fewer accuracy errors, and its errors are less critical.


Introduction
Post-editing (PE) of machine translation (MT) has become a very common practice in the translation industry over the last decade (Lommel and Depalma, 2016). PE increases productivity compared with human translation (Aranberri, 2014) without having a negative impact on quality (Plitt and Masselot, 2010), and it also reduces costs, as post-editing is usually paid less per word than translating from scratch (Guerberof, 2009). Post-editors "edit, modify and/or correct pre-translated text that has been processed by an MT system from a source language into (a) target language(s)" (Allen, 2003: 293).
Statistical MT (SMT) was the dominant approach in MT for many years. However, recent evaluation campaigns (Bojar et al., 2018; Barrault et al., 2019) have shown that neural MT (NMT) outperforms previous systems in terms of quality.
These results have driven a technological shift from SMT to NMT in most translation industry scenarios. When assessing the usefulness of MT models for PE, it is essential to analyse the most frequent errors and how they affect the task. Although recent studies suggest that NMT reduces errors and produces more fluent output (Castilho et al., 2017; Toral and Sánchez-Cartagena, 2017), each error type affects the PE effort differently (Daems et al., 2017).
Error annotation has been used to study the quality of MT products (Vilar et al., 2006; Costa et al., 2015; Popovic, 2018) and to investigate whether an MT output is fit for post-editing (Denkowski and Lavie, 2012). However, it is usually conducted as a task separate from post-editing, even though the two tasks are highly related. In fact, post-editing can be understood as an implicit error annotation, as the edits post-editors enter are intended as corrections of translation errors (Popovic and Arcan, 2016). Even though translators' edits may reflect preferential or stylistic changes and do not always correspond to errors (Koponen, 2013; Koponen and Salmi, 2017; Koponen et al., 2019), we annotated the actual modifications introduced into the raw MT output, as many correct translations of the same source text are possible. Moreover, analysing corrections from different translators working on the same text can give valuable insight to better understand the variability patterns among translators, that is, how different professionals modify the raw MT output.
We present a pilot study of a fine-grained analysis of MT errors based on post-editors' corrections for an English to Spanish medical text translated with SMT and NMT.
Our goal is to study the types of errors post-edited for these two MT models for this text type and language combination, and to analyse the differences among translators post-editing the same MT output. In Section 2, we present previous work analysing the differences between these two MT models and the errors they produce. We then present our methodology, covering both the MT systems trained for our study and the PE set-up.
In the following section we detail the customized MQM taxonomy we used for the error annotation process, and we then present the annotation results for each post-edited version. Finally, we discuss the results and outline our future research.

SMT versus NMT
Automatic metrics such as BLEU (Papineni et al., 2002) are widely used to assess MT quality and have been used to show that in many cases NMT models outperform SMT systems. For example, Junczys-Dowmunt et al. (2016) studied 30 different translation directions from the United Nations Parallel Corpus, and Wu et al. (2016) assessed the quality of NMT and SMT outputs for Wikipedia entries translated with the two MT models. However, metrics like BLEU exploit mainly surface-matching characteristics that are largely insensitive to more subtle nuances, and they have been shown to underestimate NMT quality compared with rankings obtained from human reviewers (Shterionov et al., 2018). More recent evaluation campaigns have confirmed that NMT architectures produce better results across different automatic metrics and human evaluations (Bojar et al., 2018; Barrault et al., 2019). These campaigns, however, do not directly address errors, and a more nuanced analysis of the errors produced is needed.
One of the first papers analysing the impact of SMT and NMT on post-editing was Bentivogli et al. (2016). They carried out a study on post-editing NMT and SMT outputs of English to German translated TED talks and concluded that NMT generally decreased the post-editing effort, although SMT yielded better results for longer segments. Toral and Sánchez-Cartagena (2017) broadened the scope of that paper by adding different language combinations and metrics, and concluded that although NMT generally yielded better quality, it was negatively affected by sentence length, and the improvement was not perceivable in all language pairs. Castilho et al. (2017) compared SMT and NMT across four language pairs with different automatic metrics and human evaluation methods, highlighting some strengths and weaknesses of NMT, which in general yielded better results. Their study focused especially on post-editing and used PET (Aziz et al., 2012), a computer-assisted translation tool that records time and keystrokes, to compare educational-domain outputs from both systems using different metrics. They concluded that NMT reduced word order errors and improved fluency for certain language pairs, so fewer segments required post-editing, especially because of a reduction in the number of morphological errors. However, they detected neither a decrease in PE effort nor a clear improvement in omission and mistranslation errors. Koponen et al. (2019) presented a comparison of PE changes performed on NMT, rule-based MT (RBMT) and SMT output for the English-Finnish language combination, in an experiment with 33 translation students. The study outlined the strategies participants adopted to post-edit the different outputs, which contributed to the understanding of the NMT, RBMT and SMT approaches, and it also concluded that PE effort was lower for NMT than for SMT. Klubička et al. (2017, 2018) compared the errors produced by English-Croatian pure phrase-based, factored phrase-based and NMT systems by performing a manual evaluation via error annotation of the systems' outputs. Two annotators used a metric compliant with MQM (multidimensional quality metrics), and the results showed that NMT reduced the number of errors considerably. Ye and Toral (2020) also conducted a fine-grained human evaluation to compare the Transformer and recurrent approaches to neural MT for the English-Chinese combination. Following a tailored MQM taxonomy, they observed that the Transformer produced an overall better translation, reducing the number of errors related to accuracy, fluency and comprehensibility.
Even though a product-based analysis of the errors in the MT output can help to understand MT quality, it is not enough to measure the actual effort involved in PE, which Krings (2001) defined as the sum of three aspects: temporal, technical and cognitive effort. Some errors may be easy to identify but require a lot of editing, while others require little editing but are difficult to spot. For example, lack of coherence, shifts in meaning and structural issues have proved to be good indicators of post-editing effort (Daems et al., 2017).
The analysis and classification of MT errors has always been a valuable tool for improving MT systems, and some automatic or semi-automatic tools have been developed for this task. Addicter (Zeman et al., 2011) automatically detects and displays common translation errors using a first-order Markov model to align reference words with hypothesis words. Hjerson (Popovic, 2011) uses WER alignments and compares the sets of words identified as erroneous due to a mismatch with the reference. However, error classification is usually conducted manually, because currently available tools are still not able to distinguish detailed error classes and are prone to confusing mistranslations, omissions and additions. This task is usually performed by annotators who identify the errors in the MT output with or without a reference translation. With the widespread use of post-editing in the translation workflow, however, the analysis of post-editing corrections is receiving more and more attention (Popovic, 2018), as it can also be understood as an implicit error annotation. Costa et al. (2015), for instance, proposed an error taxonomy tailored for Romance languages; in their study, highly ranked sentences clearly showed a low number of grammatical errors, and high inter-annotator agreement between two annotators was reported.
The translation industry has also developed error taxonomies which have been incorporated into quality metrics. For the purpose of evaluation, many companies use error-based models that seek to "identify errors, classify them, allocate them to a severity level and apply penalty points with a view to deciding whether or not the translation meets a specific pass mark" (O'Brien, 2011a, p. 58). The LISA QA metric was initially designed to promote best translation and localization practices for the software and hardware industries. Although it is no longer in use, its methods are still applied in translation quality evaluation. The metric includes three severity levels, but no weighting, and consists of a set of 20, 25 or 123 error categories, depending on how they are counted. The SAE J2450 metric originated in the automotive industry and includes seven primary error categories, covering areas such as terminology, meaning, structure, spelling, punctuation and completeness, together with two severity levels. In contrast to LISA, it focuses on linguistic quality and includes no formatting or style issues. It also includes two meta-rules to help evaluators make a decision in cases of ambiguity.
The TAUS Dynamic Quality Framework (DQF) uses different tools, including an error taxonomy, for the evaluation of translation quality. It was recently harmonized with the Multidimensional Quality Metrics (MQM), which will be explained in detail in Section 4, to offer translation professionals and researchers a unified model.

PE set-up
We used the two previously trained MT systems to translate a 791-word fragment of a medical text. For the error annotation we used the MQM framework (Lommel et al., 2015). This framework offers the possibility of describing and defining custom translation quality metrics. Its goal is to provide a flexible vocabulary of quality issue types and a way to use these elements to generate quality scores. Instead of imposing a unique metric for all situations, it provides a detailed catalogue of different quality issue types, including standardised names and definitions, that can be used to describe particular metrics for specific tasks. The hierarchical structure groups errors into major issues (such as Fluency and Accuracy) which can be further specified into detailed error types. This enables different levels of granularity, from a coarse analysis to a fine-grained metric, and also facilitates the customisation of the framework for different language combinations. For example, if the analysis focuses on grammar errors, this category can be further specified to include a detailed error description for all the MT output issues encountered. The framework also includes a guide for annotators (http://www.qt21.eu/downloads/annotatorsGuidelines-2014-06-11.pdf) and a decision tree designed to standardize the categorization process.
For our analysis, we used four main categories: Terminology, Style, Accuracy and Fluency. The Fluency category covers, for example, all grammar mistakes produced by the MT system, and we further detailed it to specify the corresponding error types. Apart from punctuation, capital letters and spelling, we grouped errors mainly according to the grammatical category of the error detected. Furthermore, we included word order (which also covers modifications of the syntactic order of the sentence) and what we have called redundancy. This category usually refers to references within the same sentence or the previous sentence which the MT system has repeated, but which should have been omitted or expressed with another kind of reference. That is, taking the context into account, there is a redundant translation, which constitutes a grammatical problem in the Spanish output. For instance, in one segment "pacientes" was removed the second time it appeared, as lexical repetitions should be avoided in Spanish within the same sentence where possible.
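As a rough illustration, the sketch below encodes this customized hierarchy as a plain Python mapping. The subcategory labels are our own shorthand for the error types discussed above, not necessarily the exact names used during annotation.

```python
# Illustrative sketch (not the authors' exact labels) of the customized
# MQM-style taxonomy described above: four top-level categories, with
# Fluency further subdivided into detailed error types.
TAXONOMY = {
    "Terminology": ["wrong term"],
    "Style": ["stylistic preference"],
    "Accuracy": ["mistranslation", "omission", "addition"],
    "Fluency": ["punctuation", "capitalization", "spelling",
                "grammar", "word order", "redundancy"],
}

def is_valid_label(category: str, subcategory: str) -> bool:
    """Return True if the pair is a label defined in the taxonomy."""
    return subcategory in TAXONOMY.get(category, ())

assert is_valid_label("Fluency", "redundancy")
```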

Results
All post-edited versions were manually annotated with the customized MQM taxonomy by one of the authors, who has extensive experience as a translator for this language combination and had previously annotated MT errors for research and industrial purposes. Once the annotation process was completed, we calculated the number of corrections for each category and the mean for each MT system, as can be seen in Table 2. As it was a medical text, a considerable number of errors were caused by the use of wrong terminology. This is in line with previous research, which has shown that in-domain MT outputs usually present a high number of terminology-related errors (Hayakawa and Arase, 2020). However, even though the two MT models were trained on the same data, translators corrected more terminology issues in the NMT version.
If we remove from the total results the errors attributed to style, which in most cases correspond to elective corrections introduced into the MT output, the results also show that the NMT output produced fewer errors (128 errors for SMT versus 119.5 for NMT). We also rated each annotated error according to its severity: neutral, minor, major or critical. We used the four categories included in MQM and the definitions suggested by O'Brien (2011):
• Neutral: Stylistic corrections which do not really imply an error, including corrections of issues, features and expressions that do not have a negative impact on the MT output.
• Minor: Noticeable errors that do not have a negative impact on meaning and are not confusing or misleading.
• Major: Errors that are considered to have a negative impact on meaning.
• Critical: Errors which have major effects on the overall meaning and can compromise product usability and consumer safety and health.

Table 3. Severity of the annotated errors post-edited by each translator.
As we can see in Table 3, critical errors were clearly reduced in the two NMT post-edited versions, which seems to indicate that NMT conveyed the meaning of the source text better. These results can be directly linked to the accuracy errors detected in both systems, where NMT performed better at reproducing the whole meaning of the source segment in the target.
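For readers who want to aggregate such annotations into a single figure, the sketch below shows one conventional way of turning per-error severity labels into a penalty score, in the spirit of the error-based industry models quoted earlier (O'Brien, 2011a). The weights are illustrative assumptions; our study reports raw counts per severity level, not weighted scores.

```python
# Hedged sketch: converting per-error severity labels into a penalty
# score. The weights below are illustrative assumptions, not values
# from the study, which reports raw counts only.
from collections import Counter

WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 10}

def penalty_score(labels: list[str]) -> int:
    """Sum penalty points over a list of severity labels."""
    counts = Counter(labels)
    return sum(WEIGHTS[level] * n for level, n in counts.items())

print(penalty_score(["minor", "minor", "major", "critical"]))  # -> 17
```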
Finally, we counted the number of words corresponding to each corrected error in order to calculate the error ratio (Klubička et al., 2018). For each version we divided the number of words that contain an error by the total number of words in the final post-edited version:

Error ratio = Words with errors / Total number of words
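A minimal sketch of this computation follows, assuming errors are represented as (start, end) token spans; this span format is our assumption for illustration, not the format used in the annotation environment.

```python
# Error ratio as defined above: tokens covered by at least one
# annotated error span, divided by the total token count of the
# post-edited version. Span ends are exclusive.
def error_ratio(tokens: list[str], error_spans: list[tuple[int, int]]) -> float:
    """Fraction of tokens covered by at least one annotated error span."""
    flagged = set()
    for start, end in error_spans:
        flagged.update(range(start, end))
    return len(flagged) / len(tokens)

# Hypothetical example: 3 of 6 tokens fall inside error spans.
tokens = "los pacientes recibieron la dosis correcta".split()
print(f"{error_ratio(tokens, [(1, 2), (4, 6)]):.1%}")  # -> 50.0%
```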

As we can see in Table 4, the percentage of errors is consistent with the global number of errors annotated in each post-edited version. Even though there is considerable variability among the SMT versions, the mean error ratio of the two post-editors who worked on the SMT output (25.6%) is slightly higher than the mean for the translators who post-edited the NMT output (23.1%).
Table 4. Error ratio for each MT system and post-editor.

Discussion and future work

PE is a practice that will keep growing in the near future, and it is necessary to assess the MT output and understand translators' corrections in order to ensure a satisfactory post-editing process and a final translation of good quality. Error analysis is a useful tool to achieve this: it can help detect the most frequent errors of each MT system and prevent the repetitive errors which can be most tedious for post-editors. In our analysis of an English to Spanish medical text, NMT slightly reduced the number of errors, especially those related to omissions or mistranslations of the source text.
This is reflected in the greater number of critical errors in the SMT version. Even though NMT is usually found to be more fluent than SMT, for this language combination and domain the mean number of fluency errors was roughly the same, as was the number of style corrections. NMT conveys the source meaning better but still has problems producing publishable-quality documents for the medical domain.
There was also great variability among translators. Even though we had only two post-edited versions for each MT output, the versions with a higher number of corrections tend to show increases across the accuracy, fluency and style categories alike, while the terminology category seems to vary independently of the other three. Within the fluency category, the largest divergences are found in word order and prepositions.
Our future experiments will include increasing the pool of post-editors for a given text in order to study variability among translators in more detail and to correlate specific error categories with increased PE effort. We will also broaden the domains and language combinations studied.