The Riddle of (Literary) Machine Translation Quality

This study aims to gauge the reliability and validity of metrics and algorithms in evaluating the quality of machine translation in a literary context. Ten machine translated versions of a literary story, provided by four different MT engines over a period of three years, are compared by applying two quantitative quality estimation scores (BLEU and a recently developed literariness algorithm). The comparative analysis provides an insight not only into the quality of stylistic and narratological features of machine translation, but also into more traditional quality criteria, such as accuracy and fluency. It is found that the evaluations are not always in agreement and that they lack nuance. It is suggested that metrics and algorithms cover only parts of the notion of "quality", and that a more fine-grained approach is needed if the potential literary quality of machine translation is to be captured and possibly validated using those instruments.


Introduction
In recent years, firm claims have been made about the progress of neural machine translation (NMT) (Wu et al., 2016, Castilho et al., 2017). Initially, these claims focused on the quality of LSP translation (Chu et al., 2017, Speerstra, 2018, Jia et al., 2019, Kosmaczewska and Train, 2019, among others). More recently, the possibilities of integrating machine translation (MT) into the production of literary translations have been tested, and research increasingly acknowledges that, owing to improvements in the quality of MT output, MT may also be leveraged in literary contexts in the near future (Voigt and Jurafsky, 2012, Toral and Way, 2014, Toral and Way, 2018, Guerberof and Toral, 2022). Moreover, for the language pair English into Dutch, research increasingly covers the grounds for establishing a fully integrated MT-supported or even MT-driven approach to literary translation production (Tezcan et al., 2019, Webster et al., 2020, Macken et al., 2022).
However, as Félix Do Carmo (2022) recently pointed out, claims about output quality often rely on misleading conceptual constructs of "quality". This is not surprising, given that "translation quality" has always been "an elusive and indeterminate" concept (see Holmes, 1988 and Moorkens et al., 2018). Still, the problematic conceptualisation and operationalisation of "quality" in the context of MT has led to some peculiar developments and misplaced claims. Automatic metrics, such as BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007) and COMET (Rei et al., 2020), have gained serious popularity in recent years. These metrics measure only specific aspects of quality, as their developers themselves acknowledge; yet they are widely used, primarily because of their (alleged) correlation with human evaluation. Frequent use of these metrics has led to misplaced confidence, with a good many scholars assuming they effectively do measure quality (see Do Carmo, 2022, who refers to the highly insightful work of Marie, 2022a, 2022b). This misplaced belief is reinforced both by scholars who rely on the metrics without taking a critical stance and by scholars who present research on said automatic metrics in a highly selective manner (see Do Carmo, 2022). Additionally, in translation technology research, evaluative judgments are often passed by individuals who lack the required expertise to judge the quality: "Software engineers have judged MT like a blind person judging colour" (Van Egdom et al., 2017 [own translation, from Dutch]). Evaluation is often performed by people who were simply available and willing to help out.
This article contributes to the ongoing debate on quality in literary translation when MT is involved. The study aims to provide a fillip to the ongoing methodological discussion about the operationalisation of the notion of "quality" through the use of automatic metrics and algorithms. After a qualitative analysis of a literary short story ("Wrote a letter…" by the American writer Donald Barthelme), the quality of 10 first-output translations produced by 4 different engines is analysed. First, these translations are assessed from a linguistic and stylistic vantage point by human evaluators with varying translation experience. In the second and third phases of the experiment, BLEU scores and literariness scores are calculated. The goal is to determine the extent to which human and automatic judgments effectively correlate in the context of literary translation. This research takes into account NMT's potential for self-learning, and tries to do justice to developments in custom MT by drawing a customised engine into the analysis.

Method
For this experiment, "Wrote a letter…" (524 words) was selected, a short story published by Donald Barthelme in 1980 that had not been translated into Dutch prior to this study. Subsequently, a literary translator with 15 years of professional experience agreed to translate the story into Dutch, as if it were a proper assignment aimed at publication, thus creating a human translation (HT) that would act as a reference. The source text (ST) was then machine translated ten times. Three different open access NMT engines - DeepL, Systran and Google NMT - were used, each providing one translation per year. These annual translation snapshots, produced over a period of three years, served to track progress and to determine whether the capabilities of the MT engines would improve over time as a result of self-learning as well as added data and algorithmic capacity. As customisation is considered to provide a major impetus to MT output quality, one more version of the ST was produced using a custom MT engine trained on literary data (Koehn and Knowles, 2017, Matusov, 2019, Saunders, 2022). The customised system used for this experiment was built by Toral et al. (see Toral et al., 2020, Toral et al., 2021). The system, dubbed "S3Big", is based on a Transformer model and trained with in-domain data, in this case a parallel corpus of 500 English novels and their Dutch translations, plus additional monolingual data consisting of 4,435 literary works (see Toral et al., 2020, Toral et al., 2021). The custom MT does not necessarily feature as an additional aspect of the developments over time, but rather as a testing ground for state-of-the-art customisation of engines for literary purposes (vis-à-vis the main baseline engines).

Quality Assessment
A preliminary literature review was conducted to investigate the literary characteristics of the ST. For the qualitative analysis, a total of 28 text items were selected from the ST. In human evaluation, a broad distinction can be made between holistic and analytical assessment; item-related methods are rooted in the analytical tradition. Both methods have their advantages and disadvantages (see Van Egdom et al., 2018). Holistic evaluation methods, on the one hand, consider the text as a whole, but the evaluation hinges mainly on the impression the assessor has of a target text.
On the other hand, the analytical method starts from characteristics of the source or target text but pays little attention to the text as a whole. Analytical methods simply aim at detecting errors in the target text (e.g. the DQF and MQM frameworks). In this research, a preselection of items was used. Preselection methods are aimed at identifying difficulties in a translation task, taking into account the characteristics of the source text, the contrast between source and target languages, and the translation brief. Items typically used in research on translation quality include words and fragments that are likely to pose a problem to translators and/or translation engines (see Nord, 1988).1 For this study, the items or "rich points" were related to three criteria: (1) accuracy, (2) fluency and (3) style.
The first two criteria are often used to assess the general text quality of MT output (Wu et al., 2016, Castilho et al., 2017), while the third was employed to shed light on the literary characteristics of the MTs (Castilho and Resende, 2022). The text items were verified by two native Dutch assessors with profound literary expertise and near-native knowledge of the English language and the cultural frames involved. The same two assessors then evaluated the TT solutions, classifying them as correct, undesirable or incorrect. Undesirable solutions were later discussed by the assessors and categorised as either correct or incorrect. This evaluative classification formed the basis for a qualitative analysis of the MTs. A third assessor with an equivalent profile in terms of source/target-language and source/target-culture proficiency reviewed the quality estimations of the first two assessors.

1 The selection of criteria for binary error analysis was influenced by this awareness of the translation task's inherent embedding within a specific historical framework. In other words, the selection was based on the fact that the TT was intended for a Dutch-speaking audience in the early 2020s, wishing to read a literary text that reflects the style and content of the ST.
In the second part of the study, BLEU scores were calculated for the respective MTs.
BLEU is a metric that has long been used to make statements about machine translation quality (Papineni et al., 2002). BLEU scores are said to correlate with human evaluation of MT output; essentially, the metric provides an indication of the formal (n-gram) similarity between a translation and one or more reference texts. The 10 MTs used for this study were compared separately with the HT, after which the BLEU scores were set against the findings of the human assessors, more specifically their findings related to accuracy and fluency.
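By way of illustration, such a comparison can be sketched as follows. This is a minimal example using the sacrebleu implementation of BLEU; the file names, and the assumption that the MT output and the HT are aligned line by line, are hypothetical rather than a description of the study's actual pipeline.

```python
# Minimal sketch: scoring one MT version against the human reference (HT)
# with BLEU, in the spirit of the second phase of the study.
# File names are hypothetical; sacrebleu is one common implementation.
import sacrebleu

def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

hypothesis = load_lines("deepl_2020.nl.txt")   # one of the 10 MT versions
reference = load_lines("ht_reference.nl.txt")  # the literary translator's HT

# corpus_bleu takes the hypothesis segments and a list of reference
# streams (here: a single human reference translation).
bleu = sacrebleu.corpus_bleu(hypothesis, [reference])
print(f"BLEU: {bleu.score:.2f}")  # 0-100; higher = more overlap with the HT
```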
In the third part, the literariness of the target texts (TTs) was determined using a literariness algorithm recently developed by members of "The Riddle of Literary Quality" project (Van Cranenburgh et al., 2019; see also Koolen et al., 2020; Van Dalen-Oskam, 2021). Based on a large reader survey, which yielded almost 14,000 responses, the project members have rolled out a supervised model that can predict (human-informed) literary quality ratings from textual factors quite successfully (see Van Cranenburgh et al., 2019). The textual features that laid the foundation of this literariness algorithm are basic stylometric variables, such as word frequency, the density and positioning of specific sets of words in relation to their context, and sentence length (see Koolen et al., 2020). For this study, literariness scores were calculated for the MTs as well as for the HT. By dint of comparison (with findings from the qualitative analysis), the results of this analysis were scrutinised with a view to evaluating the usefulness of the algorithm.
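For the sake of clarity, the general logic of such a feature-based literariness model can be illustrated with a deliberately simplified sketch. The features, training texts, frequency table and regression model below are hypothetical stand-ins: the actual algorithm of Van Cranenburgh et al. (2019) is considerably more sophisticated.

```python
# Illustrative sketch only: predicting a literariness rating from basic
# stylometric features (mean sentence length, mean word frequency, lexical
# variety), in the spirit of the supervised model described above.
import re
import numpy as np
from sklearn.linear_model import Ridge

def stylometric_features(text, freq_lookup):
    # Split into rough sentences and words.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    mean_sent_len = np.mean([len(re.findall(r"\w+", s)) for s in sentences])
    # Mean corpus frequency of the words used (unseen words get a tiny value).
    mean_word_freq = np.mean([freq_lookup.get(w, 1e-6) for w in words])
    type_token_ratio = len(set(words)) / len(words)
    return [mean_sent_len, mean_word_freq, type_token_ratio]

# Toy stand-ins for the survey data (~14,000 reader responses) and a
# word-frequency table; all values below are hypothetical.
freq_lookup = {"de": 0.07, "het": 0.05, "een": 0.04}
train_texts = [
    "Een korte zin. Nog een korte zin.",
    "Dit is een aanzienlijk langere en meanderende zin, die maar doorgaat.",
]
train_ratings = [3.2, 5.8]  # mean reader ratings on a 1-7 literariness scale

X = [stylometric_features(t, freq_lookup) for t in train_texts]
model = Ridge().fit(X, train_ratings)

# Once trained, the model scores any Dutch text without a reference text.
mt_output = "Schreef een brief aan de president van de maan."
print(model.predict([stylometric_features(mt_output, freq_lookup)]))
```

The key design point, as noted above, is that such a model needs no reference translation: it can score the MTs and the HT on the same footing.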

Results
In order to assess the quality of the machine output, an attempt was made to pinpoint the literary qualities of the ST. A short review of the literature on Donald Barthelme's writings (see Gordon, 1981, Couturier & Durand, 1982, Molesworth, 1982, Roe, 1992) showed that there is a general consensus about the textual features that make up the literary quality of his short stories. His postmodern short stories feature elements of the absurd to comment on the condition humaine. His stories are described as playful and unconventional in their narrative structure, while his characters are often disillusioned (see Taylor, 1977 and McCaffery, 1980).
"Wrote a letter…" fits perfectly within Barthelme's oeuvre.The premise of the story is whimsical: the protagonist of the story corresponds with the President of the moon, using a range of unconventional communication means ("moonbeaming", "flights of angels").
The tone employed in the story is light and humorous, and the topics discussed in the correspondence are far from unusual (mental health, a Honda that has been towed away, apartment rentals). This constant juxtaposition of, or clash between, the surreal and the banal adds to the otherworldly atmosphere of the story. The clash is also reflected in colloquial uses of language ("You ever seen them…"). The story's thematic absurdity is carefully constructed: the various means of communication used for the correspondence become increasingly outlandish, which builds up a sense of disorientation or unease in the reader.
A high-quality literary rendition of this ST should therefore mimic Barthelme's sharp wit and the text's ability to challenge the reader. At the same time, the criteria that apply in more general quality assessment also apply here: it is imperative that the translation accurately reflects the meaning of the original, while making it understandable to its new audience.
Table 2 shows the ST elements that not only highlight the literary features requiring attention in a qualitative analysis, but also the textual features that may pose problems for the accuracy and fluency of a TT. In total, 28 ST features were selected, but it is important to note that there is a striking imbalance in the distribution of the selected items: stylistic features account for the vast majority (16 out of 28), while only 7 items were used to assess fluency and 5 to gauge accuracy. This disparity was due to the study's primary focus on literariness. Nonetheless, the evaluators acknowledged that some items categorised as stylistic features were also inextricably linked to either fluency or accuracy. The second element presented in Table 2 is an overview of how the ST features have been rendered in the HT (later used as a yardstick for the BLEU score calculation). In Tables 3, 4, 5 and 6, the overall results of a critical interrogation of the automated rendition of the same features are presented (for a full overview of items, see Annex A).
Evaluators were asked to perform a dichotomous assessment of the translated items, considering the criteria used for the assessment of the source text items: they had to indicate whether an item had been translated correctly or incorrectly. In cases where evaluators experienced doubts as to the correctness of a solution (undesirable but not necessarily incorrect solutions), they were instructed to flag the respective item. Next, the evaluators engaged in a discussion until they reached a consensus regarding the correctness of solutions.
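The tallying behind Tables 3-6 can be illustrated with a small sketch; the item verdicts below are hypothetical placeholders, as the full list of items and judgments is provided in Annex A.

```python
# Minimal sketch of the per-criterion tallying: each of the 28 items
# carries a criterion (accuracy, fluency or style) and a consensus verdict.
# The verdicts shown here are hypothetical placeholders.
from collections import Counter

items = [
    {"id": 2, "criterion": "accuracy", "correct": False},  # "towaway zones"
    {"id": 3, "criterion": "fluency", "correct": True},    # "and I didn't like it"
    {"id": 19, "criterion": "style", "correct": True},     # "Drumming fiercely on..."
]

totals = Counter(i["criterion"] for i in items if i["correct"])
print(f"Accuracy: {totals['accuracy']}/5, "
      f"Fluency: {totals['fluency']}/7, "
      f"Style: {totals['style']}/16, "
      f"Total: {sum(totals.values())}/28")
```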
The tables below provide an overview of how well the 4 MT engines performed on the task. They do so in a highly insightful way, because they reveal the strengths and weaknesses of each engine and demonstrate whether any considerable progress has been made in terms of output quality. In other words, these tables allow the reader to gauge the extent to which the MT engines have been able to accurately capture relevant features of the ST, and they provide a solid basis for a qualitative comparison of engines.
The results of the DeepL evaluation presented in Table 3 show that DeepL's overall performance is far from ideal, but that it improved considerably from 2020 to 2021 and remained stable, without noticeable improvement, in 2022. The breakdown into categories provides additional insights into the system's strengths and weaknesses. It is abundantly clear that there is quite some room for improvement in accuracy, fluency and style, with scores of 1 out of 5 for accuracy, 1 and (later) 2 out of 7 for fluency, and 4 and (later) 6 out of 16 for style. Despite this room for improvement, the performance in rendering Barthelme's literary style is worthy of note. The stylistic prowess of DeepL is predominantly linked to the representation of the absurd. It is important to note that accuracy, as can be inferred from Table 2, has proven to be crucial for conveying absurdism effectively (see item 10 ["I cabled him"] and item 19 ["Drumming fiercely on… the moon frequency"]). In other words, the scores for accuracy are somewhat skewed.
One evaluator even stated that DeepL's ability to convey meaning remains relatively acceptable compared to that of other systems. Still, in all three years, output quality was consistently lower than anticipated: it was expected that scores for accuracy and fluency would be somewhat acceptable, i.e., that the system would attain scores of >50% in both categories.
Systran failed miserably at producing accurate, fluent and stylistically appropriate solutions. What stands out most in Table 4 is the fact that the system never came up with a single correct solution for items that tested fluency. Poorly translated items that stood out in this category were items 9 ("Which I would gladly carry up there") and 26 ("it looked … hand up there"). Scores for accuracy were hardly better: in 2022, Systran opted for a borrowing of the source element "Space Shuttle Hurry-up Fund", which made sense to the evaluators. Still, the consistently low scores in all categories clearly suggest that producing a high-quality target text is not within immediate reach, let alone a high-quality literary translation.
The output of Google was also evaluated on the three criteria. Although Table 5 shows that, generally speaking, the engine performed poorly in 2020, with no more than 8 items solved correctly, it was actually top of the class in that year. In the following two years, the overall output quality did not get any better, but there are some differences between the three Google versions that are worthy of note. In 2021 and 2022, the engine demonstrated consistently better performance within the domain of style (7 out of 16).
Google NMT appears to have managed to gain a firmer handle on the literariness of the source text. Particularly interesting is the rendition of colloquialism: good cases in point are the translations of the ellipsis in item 4 ("Cost me …, plus") and the redundancy in item 5 ("tiny little cars"). Absurdism is also a stylistic feature that the system managed to highlight in the automated output in 2021 and 2022 (see items 19 ["Drumming fiercely on… the moon frequency"] and 23 ["by means of … my Apple computer"]). As mentioned earlier, these items also require an accurate understanding of the source text. Again, this could lead us to believe that the scores for accuracy are somewhat skewed. Paradoxically, however, the scores for accuracy dropped from 1 to 0: Google NMT never came up with a correct solution for seemingly simple items like items 2 ("towaway zones") and 8 ("A bucket of ribs"). In sum, the total score of 8 out of 28 indicates that the system is competitive with DeepL, that the system even seems better at prioritising style, but also that it still has a very long way to go if it is to produce usable output in a literary context.

An interesting new player on the market is the custom-built MT engine. Table 6 indicates that the system, although it was still found lacking, performed moderately well on the translation task, with a total score of 11 out of 28. With this score, it even outperformed all the other systems. Errors related to the basic output quality measures "accuracy" and "fluency" were still quite common in the text. A score of 2 out of 7 for fluency suggests that the text contains some awkward or unnatural phrasing, as in the case of items 9 ("Which I would gladly carry up there") and 12 ("it was bad"). However, it produced a couple of very fluent solutions, as in the case of item 3 ("and I didn't like it"). The same can be said about accuracy: items like item 2 ("towaway zones") prompted a completely incomprehensible noun in the Dutch version ("Daarsleepzones"), but the custom MT was the only system that produced an accurate solution for item 15 ("root cellar"). Comparable to Google, the system was able to capture the literariness of the ST to a certain extent, with 7 items that were solved correctly. Interestingly, it proved to be inconsistent in its rendering of colloquialism, but it did a splendid job depicting Barthelme's absurdism. The customised system therefore shows potential as a tool for translating literary texts, but the future will tell whether further training and tweaking will yield better results or whether progress will be brought to a halt.
The qualitative evaluation of the MT engines for literary translation clearly revealed that there is still significant room for improvement. Not only is an impetus required for the literariness of the output; the accuracy and fluency of the automated renditions also continue to be problematic. Based on this case study, it can safely be stated that automation in literary translation is not an immediate prospect.

BLEU
In the next phase of the study, the performance of the MT engines was compared using BLEU scores. BLEU is a widely used metric for evaluating the quality of MT output. It is designed to measure the similarity between one or more MTs and one or more reference translations made by human translators. BLEU scores run from 0 to 100, where a higher score is indicative of a more accurate and fluent translation (see Table 7; see also Rekha et al., 2022: 45). Scores are grouped into ranges: scores of less than 10 are awarded to texts that are considered useless; scores between 10 and 19 denote a higher likelihood that the gist of the target text is difficult to grasp; scores between 20 and 29 denote that the gist is clear but that there are considerable errors in the output, and so on. As a rule of thumb, MTs with scores of 30 and upwards are generally seen as understandable or even good translations. BLEU scores are usually presented as useful because they provide a quantitative way of assessing the quality of MT systems and because they correlate with human judgment. BLEU scores were calculated for the 10 MTs used for this study.

As can be inferred from Table 8, DeepL clearly outperformed the other two off-the-shelf systems in 2020, scoring nearly 4 percentage points higher than Google NMT and 5 percentage points higher than Systran. In the following two years, DeepL stayed in the lead, but both Google NMT and Systran managed to narrow the gap. Systran showed some improvement: in 2020, it produced a TT that, according to BLEU score interpretation, did not even allow readers to glean the gist, with a meagre score of 19.33, but in the following years it performed marginally better, narrowing the gap with DeepL to 4 percentage points. Google NMT also seemed to be lagging in the first year (21.25), but it made steady progress in 2021 (with a score of 27.24). It fell back a bit in 2022 (27.01). In 2022, the customised engine was also put to the test for the first time: with a BLEU score of 27.25, it did not seem to challenge DeepL just yet, but it did a splendid job of outperforming the other systems.

Looking at these results, it is tempting to state that the qualitative evaluation and the BLEU scores are fairly consistent. As in the qualitative analysis, Systran clearly seems to be missing the mark in many respects. Still, the observed improvements in the output quality of Systran in 2021 and 2022 are clearly at odds with the findings of the qualitative analysis: in terms of accuracy and fluency, Systran made no significant strides. Another thing that stands out is the striking disparity between the qualitative and the quantitative analysis of Google NMT. Based on the qualitative evaluation, one would expect Google to be more or less on a par with DeepL, at least in 2020 (see Tables 3 and 5). However, according to the BLEU calculation, Google only managed to catch up with DeepL in 2021, at a time when its scores for accuracy and fluency dropped. A possible explanation for the sudden improvement in estimated output quality in 2021 and 2022 seems to lie in Google's rendition of lengthy stylistic items (e.g., items 19 and 23). A system that has so far been passed over in silence in the diachronic comparison is the custom-built engine. As in the qualitative analysis, the system outpaced Systran and Google, but it remains remarkable that the custom MT system lost out against DeepL, which clearly carries the day in this first quantitative analysis of "overall" output quality. According to the qualitative evaluation, no MT system came near the custom MT in terms of overall quality.
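The rough interpretation bands summarised in Table 7 lend themselves to a simple mapping. The sketch below applies them to three of the scores reported in Table 8; the cut-off points follow the rule-of-thumb guideline cited above, not an official standard.

```python
# Sketch of the rough BLEU interpretation bands (after Rekha et al., 2022);
# the cut-offs are guidelines, not standards.
def interpret_bleu(score: float) -> str:
    if score < 10:
        return "output considered useless"
    if score < 20:
        return "gist of the target text difficult to grasp"
    if score < 30:
        return "gist is clear, but considerable errors"
    return "understandable or even good translation"

# Three of the scores reported in this study (see Table 8).
for system, score in [("Systran 2020", 19.33),
                      ("Google NMT 2020", 21.25),
                      ("Custom MT 2022", 27.25)]:
    print(f"{system}: {score} -> {interpret_bleu(score)}")
```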
At first glance, there seems to be no striking disparity between the results of the qualitative analysis and this first quantitative analysis, which might be interpreted as an indication that BLEU does correlate with human judgment. This can be accounted for quite easily: both analyses showed that all 4 MT systems still struggle to meet the passing threshold for an "understandable or even good translation". However, upon closer inspection, it becomes clear that BLEU does nothing but measure similarities: it fails to come to grips with subtle but meaningful changes in output quality, and it in no way reflects improvement or deterioration over time. As such, it was never designed to measure the literariness of MT output. In the next phase, an attempt was therefore made to evaluate the literary qualities of the MT versions of Barthelme's text, using automatic metrics developed to predict the literariness of a text. One such metric ensued from the "The Riddle of Literary Quality" project (see Koolen et al., 2020). The literariness algorithm relies on word embeddings and attempts to evaluate (or predict) the "literary quality" of Dutch texts by looking at the frequency, density and positioning of specific sets of words in relation to their context, as well as sentence length. The biggest advantage of this algorithm is that it does not require a reference text to calculate literariness. In other words, this specific literariness algorithm could be used to measure the literary qualities of the MTs as well as of the HT that served as the reference translation in the BLEU experiment. The results of these measurements are shown in Table 9.
What stands out in Table 9 is that the DeepL MTs consistently score the highest, although their literariness does decline slightly over time (4.41 in 2020 and 4.37 in 2022). Despite the decline, DeepL remains unthreatened: in terms of literariness, the MT engine even structurally outperforms the HT (3.8), according to the algorithm. The customised engine also seems to deliver when it comes to literariness: with a score of 4.17, it is the runner-up in the (2022) series. The literary quality scores of Google NMT are remarkable: they show good prospects in 2020 (4.08), beating the HT by almost 0.3 points, but the system experiences a free fall in the following two years (3.27 in 2022). A similar effect is seen in the Systran scores; however, the decline in literariness is limited to less than 0.2 and the system seems to keep pace with the HT. These results are remarkable, but they become more peculiar when considered in relation to the results obtained in previous paragraphs. In the qualitative analysis, it was found that while DeepL does seem to represent stylistic elements of the ST in a reasonably accurate manner, the system completely missed the mark when it came to colloquialism and fluency. The high rating of its literary quality was therefore quite unexpected. Google, on the other hand, proved to be quite capable of representing the colloquial characteristics of the ST, and it even showed improvement in 2021 and 2022. However, the literariness algorithm offers a rather bleak perspective on the literary qualities of Google's output in those years. The relative stability of Systran is also highly problematic: in terms of literary quality, Systran fell short in all versions, yet its literariness scores remained on a par with those of the HT.

In this study, human evaluation served as a foundation for the analysis of the automated metrics. Initially, the BLEU scores appeared to align reasonably well with the assessments recorded in the human evaluation: scores consistently remained below 30 ("the gist is clear, but [the MT] has significant errors"). However, it is noteworthy that the custom MT engine, which ranked best in the human evaluation, did not hold the top position according to BLEU scoring. Additionally, the distinction between the weakest system, Systran, and the best-performing systems was not very pronounced.
What is more, BLEU failed to capture subtle or pronounced differences in the quality of the MT versions across the different years. Based on these findings, it can be asserted that BLEU does not provide sufficient insights into MT quality, particularly in a literary context.
The discussion then transitioned towards assessing the literariness of MT output, a task not originally within BLEU's scope. This part of the study attempted to evaluate the literary qualities of the MTs as well as the (human) translator's version of Barthelme's text, using automated metrics designed to predict literariness. Interestingly, the human translation did not receive the highest score. The literariness algorithm identified DeepL as the most literary translation, with the custom-built engine securing the second position. The literariness of Google NMT, which human assessors found reasonably adept at capturing Barthelme's colloquial style, was paradoxically rated very low by the algorithm.
Surprisingly, the Google NMT versions from 2021 and 2022 even scored much lower on literary quality than Systran, which, in terms of literariness, performed similarly to the human translation. These findings cast doubt on the literariness algorithm's ability to accurately gauge literary quality.

Conclusion
While there has been increasing optimism about improvements in MT quality, including in literary contexts (Chu et al., 2017, Speerstra, 2018, Jia et al., 2019, Kosmaczewska and Train, 2019, Voigt and Jurafsky, 2012, Toral and Way, 2014, Toral and Way, 2018, Guerberof and Toral, 2022), claims about the effectiveness of MT seem to be based on inadequate measures of quality. This study has set out to critically interrogate the possibilities and limitations of using metrics (such as BLEU and the recently developed literariness algorithm) to (re)assess the quality of MT output in a literary context, with emphasis on the evaluation of stylistic and narratological features of said translations.
This study has shown that BLEU was unable to capture subtle changes in output quality.
This became all the more apparent when BLEU was assessed diachronically. The literariness algorithm also fell short, as it failed to pick up on the literariness of the HT and shamelessly overpraised the literary qualities of MT output. This study therefore serves as a corrective countervoice to potential hypes surrounding MT, especially for use in literary translation contexts, and might encourage experts in MT to reconsider the use of metrics to assess (literary) MT quality or, at least, to (further) refine existing models that purport to "measure" MT quality (see Do Carmo, 2022, Marie, 2022a, 2022b).
At the same time, it should be noted that this study has some limitations. It does not seem superfluous to point out that the research was limited to a single source text and 10 MT versions. In other words, far more research is needed to gain a more complete understanding of the ability of MT to represent accuracy, fluency and style (and, more specifically, literary style), or the lack thereof. Adding more source material and automated renditions to the equation might help. Additionally, research with multilingual material and texts from different genres and literary movements seems in order, so as to further expand our comprehension of the state of MT. However, the main goal of this article is to draw attention to the inherent reductionism of automated measurements and their limitations in capturing the full complexity of literary translation.
Its research design may serve as a blueprint for further research and help fuel the meta-methodological debate.

Table 1.
Overview of research materials

Table 2.
Categorisation of ST items, flanked by HT solutions

Table 3.
Quality evaluation of DeepL, based on human assessment

Table 4.
Quality evaluation of Systran, based on human assessment

Table 5.
Quality evaluation of Google NMT, based on human assessment

Table 6.
Quality evaluation of the custom-built MT engine, based on human assessment

Table 7.
Rough guideline for interpreting BLEU scores

Table 8.
BLEU scores of MTs