Evaluating the effectiveness of machine translation of audio description: the results of two pilot studies in the English-Dutch language pair
Revista Tradumàtica 2021, Núm. 19

The field of translation is undergoing profound changes. On the one hand, it is being thoroughly reshaped by the advent and constant improvement of new technologies. On the other hand, new forms of translation are emerging in the wake of social and legal developments that require products and content to be accessible to everybody. One of these new forms of translation is audio description (AD), a service aimed at making audiovisual content accessible to people with sight loss. New legislation requires that this content be accessible by 2025, which constitutes a tremendous task given the limited number of people who are currently trained as audio describers. A possible solution would be to use machine translation to translate existing audio descriptions into different languages. Since AD is characterized by short sentences and simple, concrete language, it could be a good candidate for machine translation. In the present study, we want to test this hypothesis for the English-Dutch language pair. Three 30-minute AD excerpts of different Dutch movies that were originally audio described in English were translated into Dutch using DeepL. The translations were analysed using the harmonized DQF-MQM error typology, taking into account the specific multimodal nature of the source text and the intersemiotic dimension of the original audio description process. The analysis showed that the MT output had a relatively high error rate, particularly in the categories of Accuracy – mistranslation and Fluency – grammar. This seems to indicate that extensive post-editing will be needed before the text can be used in a professional context.


1. Introduction
Language technologies have had a profound impact on the field of Translation Studies. Globalization and digitization have made society at large ever more aware of the role of technology in the translation process, particularly in (digital) media and audio-visual products. The introduction of machine translation systems has been one of the major driving forces in this development. Since the turn of the millennium the advent of machine translation (MT) has significantly changed the way in which we translate (Bywood et al., 2017; O'Hagan, 2019; 2020). Over the last few years, concerns about MT as a threat to the translator's profession have given way to a more appropriate recognition of the active mediating role this technology takes in the translation process (O'Hagan, 2020).
Indeed, the question is no longer whether or not we will accept MT as an alternative to translation from scratch, but how we can integrate it into our workflows and how it can improve both the quality and efficiency of the translation process (O'Hagan, 2019; 2020).
Given the growing body of legislation that makes it mandatory for broadcasters and other providers to make their audio-visual productions accessible, the question of the usability of MT is gaining relevance in the field of media accessibility as well. A key factor in the discussion for (commercial) market players is the challenge of balancing speed, quality and cost. Media service providers have to increase access to their content through, for instance, audio description (AD) in order to offer people with a visual impairment the equitable access to information and entertainment stipulated by a growing body of European and local legislation. In March 2019, for instance, the European Union adopted a new Accessibility Act (EAA) and an update of the European Audiovisual Media Services Directive (AVMSD). The EAA requires companies to make their websites, software and apps accessible within five years from the adoption of the act, which includes accessibility for people with sight loss, while the AVMSD states that member states have to make media services accessible to people with sight or hearing loss. Although these directives do not directly mandate the provision of AD and only time will tell how each member state translates the directives into stringent national legislation, it is clear that over the next few years public and private bodies will have to drastically step up their efforts to make audio-visual content in Europe more accessible for people with sight loss, by providing services such as AD. This constitutes a major challenge, particularly for smaller languages such as Dutch, for which AD is still a relatively new phenomenon (Reviers, 2016).
One way to meet these new quantitative demands is by translating existing ADs. Fernández-Torné and Matamala (2016) observe that ADs are generally still created as intersemiotic translations of the original audio-visual text, and interlingual translations of existing descriptions are very rare exceptions to that rule. This may seem surprising since there are still more (audio-visual) translators trained in interlingual translation than in AD, which would make translation a logical choice over description to rapidly increase the amount of audio described audio-visual content. Translating ADs not only offers potential for meeting new market demands quickly; it is also a feasible solution for private broadcasters and companies in regions like Flanders and the Netherlands who distribute a lot of non-Dutch content with subtitles but no Dutch AD. In such cases, people with sight loss largely remain excluded from social and cultural life since they can only access the audio subtitles provided, i.e., a spoken version of the on-screen subtitles. Translations of non-Dutch ADs that can be used in combination with audio subtitles are therefore a crucial consideration.
In other words, it would be interesting to explore the idea of translating ADs in combination with the use of machine translation systems. However, very little scientific research and systematic evaluation are currently available to support the application of MT for AD. While there have been several studies on machine translation for subtitling (Matusov et al., 2019; Matamala and Ortiz-Boix, 2016; Álvarez Muniain et al., 2016; Etchegoyhen et al., 2014; Del Pozo, 2013) and several machine translation subtitling systems are slowly finding application in the industry, machine translation systems for media accessibility and AD in particular have rarely been developed and have not been studied as yet. Only a handful of publications so far report on research into the translation of ADs (see section 2) and, to the best of our knowledge, only two small-scale exploratory studies, focusing on the Spanish-Catalan and English-Catalan language pairs, have published results on MT for AD (Fernández-Torné and Matamala, 2016; Matamala and Ortiz-Boix, 2016).
Against this background, this paper addresses the use of machine translation for AD. It reports on a case study conducted in 2019 by two students in the Master in Translation of the University of Antwerp (Uiterwijk, 2019; Bryssinck, 2019) and replicated by the authors of this study in 2020. In both the initial case study and the replicated study, the machine translations of three excerpts from English AD scripts into Dutch were manually evaluated. For this study, the neural MT solution of DeepL was used. The focus of both case studies was on identifying the types of errors that occur in the Dutch MT and evaluating the extent to which the most common types of errors correlate with or can be explained by the idiosyncrasies of AD as a text type. This paper starts with a discussion of the current state of the art in (machine) translation of AD. Then, we will present the analytical framework for the case studies and the methodology adopted for error categorization. Finally, we will present a thorough discussion of the results and correlate them to the specific characteristics of the AD text type.
2. State of the art
AD translation, including MT of AD, is an area that has not received a lot of academic attention so far, given its still limited application in practice. Only a handful of publications address research into the human translation of ADs. The studies by Jankowska (2015) and Lopez Vera (2006) both focused on evaluating the efficiency and effectiveness of human AD translation as compared to creating an AD from scratch. Both studies reported positively about AD translation as an alternative to AD creation: it is less time-consuming and can help to increase the offer of AD quickly and cost-efficiently. More importantly, AD translation seems to generate good quality results. Jankowska (2015) and Herrador Molina (2006) tested the reception of translations of ADs with users in Poland and Spain respectively. In both studies the translations from English were evaluated positively overall. As a result, Jankowska concludes that "the scripts created as a result of the strategy of translating can be at least equal in quality to those which are the results of the strategy of writing." (Jankowska, 2015, p. 117).
A series of other studies (Herrador Molina, 2006; Bourne and Hurtado, 2007; Jankowska, 2015; Remael and Vercauteren, 2010; Jankowska et al., 2017; Liu, Tor-Carroggio, Rovira-Esteva and Casas-Tost, 2021) have looked into the actual translation process, discussing a series of AD-specific translation problems in different language pairs, namely from Polish into English and from English into Spanish, Dutch and Chinese. These preliminary studies have flagged a few potential translation crisis points (Pedersen, 2008, p. 101), i.e., problematic passages that require active decision-making on the part of the translator, such as linguistic, syntactical and cultural differences. While it is still too early to develop any general frameworks based on the above studies, they do point to potential AD-specific translation problems that are relevant for both the study of human translation and machine translation. These are issues AD translators should be aware of when they adapt their translation to the respective target audience. In the case of MT, these issues would have to be checked during the post-editing process.
First, there are linguistic and stylistic differences in the way AD is formulated between language pairs. An example is sentence length and complexity. Bourne and Hurtado (2007), Herrador Molina (2006) and Remael and Vercauteren (2010) all mention that complex English sentences from the source text were often adapted in the human-translated target text (TT) into coordinated sentences or a series of simple sentences. Remael and Vercauteren (2010) also noticed the frequent use of the present participle in English AD, a grammatical form that cannot be easily transferred to Dutch, where it is used in a different way. To give one final example, Liu, Tor-Carroggio, Rovira-Esteva and Casas-Tost (2020) noticed differences between English and Chinese scripts in terms of the level of explicitness, the way characters are named and described and how much information is conveyed in the AD.
Second, the audio described text that is translated is part of a larger multimodal text with which it interacts on many levels (Reviers, 2018b). Indeed, the audio-visual product that constitutes the source text of the AD is a multimodal construct that, in addition to images and sound effects, generally contains dialogues in the source language. As such it seems logical that ADs are directly created in the same language as these dialogues rather than in a different one (by people who are not native speakers of that original language), in order to guarantee maximum inter- and multimodal coherence between the sound effects, dialogues and descriptions. The above studies mention that this might be an issue for AD translation. Cultural references are one example. As Remael and Vercauteren (2010) point out, particularly in the case of AD, at least part of the translation problem of cultural references is closely related to the visual context/source. The way a cultural reference can be translated depends on the way in which it is simultaneously depicted on screen or audible in the soundtrack or in the audio subtitles (AST) or dubbed dialogue. However, the multimodal nature of the source text is also likely to have an impact on levels of AD translation other than cultural references alone. Previous research into the language of AD (such as Reviers, 2018a) shows the close interaction between an AD and the sounds and dialogues with which it is combined. This suggests that when translating ADs, this multimodal cohesion is a key feature to keep in mind as it needs to remain intact in the translated version as well.
A final issue is timing. The number of words used in the source text and its translation may differ from one language to another: for example, a Dutch translation of an AD may be a few words longer than the original English version. While this may not constitute a problem in more traditional instances of interlingual translation, it is a crucial element in the multi- and intersemiotic context of AD translation, since ADs, like other forms of audio-visual translation such as subtitling and dubbing, always have to be adapted to the time available between dialogues and sound effects. Depending on the language pair, this could mean that the translation of the existing description has to be shortened and that, in some cases, information may have to be omitted.
To the best of our knowledge, two studies on MT of AD have been published so far […] through keyboard and mouse interaction, and cognitive effort as measured through pause to word ratio, "seem to be less demanding in post editing" (Fernández-Torné and Matamala, 2016, p. 80). On the other hand, subjective indicators suggested that post-editing was perceived to be the most demanding task when compared to creating AD from scratch and translating existing ADs. In the interviews conducted after the experiment, participants indicated that they felt their creativity was impaired when they had to perform post-editing of MT output.
Matamala and Ortiz-Boix (2016) subsequently conducted a study focusing on the effectiveness of machine translation for the translation of AD scripts for the Catalan-Spanish language pair. Their corpus consisted of the Catalan AD from the first episode of a Catalan television series and the Catalan AD of a movie, resulting in approximately 90 minutes and 4,384 words of AD (Matamala and Ortiz-Boix, 2016, p. 17). They opted to carry out a subjective evaluation performed by a human, based on a list of error types which would be looked at (Matamala and Ortiz-Boix, 2016). These errors comprised missing words, untranslated words, extra words, wrong word order, wrong agreement, incorrect words and mistranslated words. The errors, in descending order of frequency, were wrong word order, wrong agreement, incorrect words, mistranslated words, untranslated words and missing words. They also reported that about half of the output sentences contained at least one error (Matamala and Ortiz-Boix, 2016).
To conclude, the previous studies are inconclusive when it comes to both the efficiency and the effectiveness of MT for AD. Matamala and Ortiz-Boix (2016) suggest that the post-editing effort required to bring MT of AD to an acceptable level of quality is considerable, given the high number of errors. More research is required into the post-editing process, the types of errors most frequently encountered and the reasons for these errors.

3. Methodology
Since the main aim of the present study is to obtain preliminary insight into the types of mistakes that can be found in machine translations of English ADs into Dutch, various specific parameters in terms of materials and assessment procedure were taken into account when designing the methodology. The aspects discussed below apply to both the initial case study of 2019 by Uiterwijk and Bryssinck and its replication by the authors of the present paper in 2020.

Materials
The existing corpus of Dutch ADs that are translations of ADs originally created in another language is still very limited. As mentioned in the introduction, ADs are mainly created from scratch in the same language as the audio-visual product. In the case of Dutch, that means that Dutch ADs are created only for Dutch productions. ADs from the UK or the US are rarely translated into Dutch, even though the amount of content with AD from these countries far exceeds the numbers of ADs created in Flanders and the Netherlands combined (Reviers, 2016).
For the purpose of our analysis, three Dutch feature films that were originally described in English were selected, namely Blind (Van den Dop, 2007), Zwartboek (Verhoeven, 2006) and Het leven is vurrukkulluk (Weisz, 2018). At the time of the study, these were virtually the only translated ADs available for this language pair. For each of the three films, the English AD of the first 30 minutes was transcribed and segmented into individual AD blocks, yielding 131 AD blocks or 2,600 words for Blind (Van den Dop, 2007), 120 blocks or 2,051 words for Zwartboek (Verhoeven, 2006) and 86 blocks or 1,400 words for Het leven is vurrukkulluk (Weisz, 2018). Since a pre-analysis of the MT of the AD of Zwartboek (Verhoeven, 2006) indicated that there were more than 150 errors in the first 30 minutes of the AD, it was decided to limit the case study to three 30-minute excerpts to keep the workload involved in the human quality assessment feasible for this first pilot study. In addition, it resulted in a data set that was comparable to that of the earlier case study by Matamala and Ortiz-Boix (2016) in terms of AD time and word count.

MT Engine
For the machine translation of the source data, it was decided to use the general and freely available Neural MT engine DeepL for both case studies. Several reasons guided this choice. First, we wanted a system that was freely accessible online for reasons of replicability and availability. The main freeware engines available are DeepL and Google Translate. A pre-analysis of the MT of one of the three films selected did not generate significant differences in terms of the number of errors between DeepL and Google Translate for the text selected. Given that it was beyond the scope of our study to perform an in-depth comparison between the two engines, we opted for DeepL, but for the future it may be worthwhile to compare Google Translate and DeepL, for example in terms of types of errors made and post-editing effort required, to see if one is more suitable for MT of AD than the other. Second, we opted for a Neural Machine Translation system (NMT), because these are quickly becoming the standard in the industry over rule-based (RBMT) or statistical MT (SMT) systems. In NMT systems, one large neural network (NN) is trained on a vast amount of data consisting of full sentences and their translations. In these systems, the encoder maps the source sentence into a vector representation, and the decoder predicts the target sentence using that representation. In order to do this, the encoder-decoder network is jointly trained to maximize the probability of a correct target sentence, given a source sentence (Cho et al., 2014). This is different for SMT systems, which consist of many small sub-components that are tuned separately (Bahdanau et al., 2015). These systems use data to train a probabilistic model and choose the translation with the highest probability, given a certain source phrase. While these two systems heavily depend on large corpora, RBMT systems use extensive dictionaries and linguistic rules to translate sentences.
Third, at the time of writing, an MT system that was specifically trained for the translation of AD was not yet available. Furthermore, research on the linguistic and stylistic specificity of AD and its differences and similarities with other text types (such as novels, subtitling, spoken and written language) is too scarce to be able to arrive at any solid hypotheses about what type of content would be most similar to AD (Reviers, 2018a; Arma, 2011) and, therefore, what would be most appropriate content to use as MT input. Some scholars have put forward the hypothesis that, due to the characteristics of the language of AD, general engines might be a good candidate for automatic translation (Fernández-Torné, 2016; Salway, 2004). Based on an analysis of existing guidelines, for instance, Vercauteren (2007) concludes that the language of AD "should sound natural and that unusual vocabulary or formal phrasing have to be avoided […], sentences should be kept simple […] and complex sentence structures including many subordinate clauses must be avoided" (p. 144). Similarly, Salway (2004) […]
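The joint training of the encoder-decoder network described in this section can be stated compactly. The formula below is a sketch that follows the general form given by Cho et al. (2014); the symbols (model parameters θ, a training set of N sentence pairs, source sentence x_n and target sentence y_n) are introduced here for illustration only and say nothing about how DeepL itself is trained.

```latex
% Training objective of an encoder-decoder NMT system:
% choose the parameters that maximize the average log-probability
% of each reference translation given its source sentence.
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\!\left(\mathbf{y}_n \mid \mathbf{x}_n\right)
```

Because a single set of parameters θ is optimized for the whole network, there are no separately tuned sub-components, which is the contrast with SMT systems drawn above.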

Translation Quality Assessment
As demonstrated by Castilho, Doherty, Gaspari, and Moorkens (2018), "MT quality can be assessed in a wide range of different ways, and no single approach or metric is sufficient to address all evaluation purposes and scenarios" (p. 24). Automatic systems, such as BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005) or chrF3 (Popović, 2015) are regularly used in various fields, ranging from technical texts to audio-visual content and literary MT, to assess large scale productions and to analyse translation quality when the time that can be devoted to assessment is limited. While automatic quality assessment is said to be objective and tends to be inexpensive, it has also been claimed "that it is less comprehensive than manual evaluation and does not readily indicate the type of problems that the translated text contains" (Castilho, Doherty et al., 2018, p. 25). In light of this latter observation and since manual evaluation offers the means to obtain a fine-grained overview of the error types encountered in the translation (e.g., Popović, 2018; Lommel, 2018), which is relevant for the present study, we decided to evaluate the raw MT output generated by DeepL manually.
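To make the limitation of automatic metrics concrete, the following is a minimal sketch of a BLEU-style score: clipped n-gram precision combined with a brevity penalty. It is a simplification for illustration (one reference, whitespace tokenization, no smoothing, n-grams up to length 2) and not the official BLEU implementation. The function names and the Dutch reference translation in the usage lines are hypothetical; no automatic metric was used in the study.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: one reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipping: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(sum(log_precisions) / max_n)


# MT output taken from example 8; the reference is a hypothetical human translation.
mt_output = "Als de bommenwerper overvliegt laat hij een andere bom vallen"
reference = "Terwijl de bommenwerper overvliegt laat hij nog een bom vallen"
score = simple_bleu(mt_output, reference)
```

The score is a single number between 0 and 1. It registers that some words differ from the reference, but it does not say that the difference between "Als" and "Terwijl" is a shift in meaning, which is the kind of information a manual error annotation provides.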
Earlier studies have resorted to both amateur evaluators and professionally trained evaluators. […] point out that while "professional evaluators can be assumed to provide more reliable results, amateurs may be equally helpful in some TQA [Translation Quality Assessment] tasks" (p. 23). Moreover, Lommel (2018) indicates that an analytic evaluation (such as the one we are performing in this study) "is time-consuming and requires training for evaluators to apply consistently" (p. 122). Since there was no time to train evaluators for this study and since the main aim of the present study was to obtain a general overview of the main error types rather than an exact account of the number of errors in each category, the evaluation for our case study was carried out by the research team. The analysis was first undertaken by an MA student in Translation at the University of Antwerp and then replicated by the authors of the present study: it was checked by a PhD student in Translation Studies, specialising in AD, after which it was validated by two researchers specializing in AD.

Error analysis method
As discussed above, the English AD of the first 30 minutes of three films was transcribed and segmented into individual AD blocks, adding up to a total of 343 AD blocks (see table 1 below). The length of these AD blocks in the source text (ST) varied between two and ninety words, with an average of 17.5 words per AD block. The segmentation was based on the timecodes provided with the written AD scripts that were used to record the AD. A block of AD can consist of one word or several sentences and usually ends when (a) a character in the audio-visual production begins to speak, (b) there is a significant sound that cannot be covered by AD, or (c) there is a significant pause after a sequence of AD. Each of these AD blocks was then translated using DeepL by copying and pasting several blocks of the scripts at a time, up to the maximum allowed in the free version of the online engine.
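The copy-and-paste procedure described above amounts to grouping consecutive AD blocks into batches that stay under a size limit. The sketch below shows one way this could be scripted. The function name, the blank-line separator and the `max_chars` parameter are assumptions made for illustration; the text does not state the exact limit of the free online engine, so no value is hard-coded.

```python
def batch_blocks(blocks, max_chars, separator="\n\n"):
    """Group consecutive AD blocks into batches whose joined text fits max_chars.

    A single block that is longer than max_chars still gets a batch of its own,
    so no block is ever split or dropped.
    """
    batches, current, current_len = [], [], 0
    for block in blocks:
        # Cost of adding this block: its text plus a separator if the batch is non-empty.
        extra = len(block) + (len(separator) if current else 0)
        if current and current_len + extra > max_chars:
            batches.append(current)
            current, current_len = [], 0
            extra = len(block)
        current.append(block)
        current_len += extra
    if current:
        batches.append(current)
    return batches


# Usage with two source blocks quoted in the examples of this paper.
blocks = [
    "As the bomber flies overhead it drops another bomb",
    "Catherine eyes her through a lorgnette then nods and stands aside to let Marie in.",
]
batches = batch_blocks(blocks, max_chars=100)
```

Keeping blocks whole and in their original order means the MT output can be realigned with the source blocks afterwards, which the spreadsheet-based analysis relies on.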
Given the novelty of (machine) translation of AD, there are no frameworks as yet that are commonly used to evaluate the quality of the translated output. For their analysis of the translation of AD from Catalan into Spanish, Matamala and Ortiz-Boix (2016) created their own typology, as described in section 2. However, this typology is limited and does not fully take into account the specific multimodal context in which the translation of AD takes place. Therefore, for the translation quality assessment and error classification in our study, the harmonized DQF-MQM error typology (DQF-MQM) was used. This error typology is the result of the integration of two similar frameworks, i.e. the TAUS Dynamic Quality Framework (DQF) and the Multidimensional Quality Metrics (MQM) framework, developed by DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz GmbH), into one hierarchic typology containing eight error categories and a total of 50 issue types (Lommel, 2018). For this study, two subcategories were added to category 8 (Other) of the DQF-MQM typology in order to accommodate the nature of the source text: 8.1 Mistake in source; and 8.2 Unnecessary translation. This comprehensive harmonized error typology, which was designed to evaluate both human and machine translations, allows for practical and easy error categorization and is one of the standards for quality assessment in both research and industry. For these reasons and with a view to replication in the future, this framework was preferred over the creation of an error typology from scratch or using the one developed by Matamala and Ortiz-Boix (2016).
The analysis was conducted in an Excel spreadsheet for each film. The first columns contain the two parallel texts, i.e. the English AD script divided into AD blocks aligned with the Dutch NMT, followed by columns containing the errors and the error annotation within the DQF-MQM framework.
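For readers who want to replicate the annotation outside a spreadsheet, the sketch below models one annotated error as a record and shows how such records could be tallied. The class, field and function names are hypothetical, and only categories and subtypes named in this paper are used. The two sample annotations are based on examples 8 and 10; the block count passed in the usage line is a toy value.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorAnnotation:
    """One annotated error; mirrors one spreadsheet row (hypothetical layout)."""
    film: str
    block_id: str
    category: str   # e.g. "Accuracy", "Fluency", "Other"
    subtype: str    # e.g. "Mistranslation", "Grammar", "Mistake in source"
    note: str = ""


def tally(annotations):
    """Count errors per (category, subtype) pair."""
    return Counter((a.category, a.subtype) for a in annotations)


def share_of_blocks_with_errors(annotations, total_blocks):
    """Fraction of AD blocks that contain at least one annotated error."""
    flagged = {(a.film, a.block_id) for a in annotations}
    return len(flagged) / total_blocks


annotations = [
    ErrorAnnotation("Zwartboek", "24", "Accuracy", "Mistranslation", "'as' rendered as 'als'"),
    ErrorAnnotation("Blind", "1-24", "Accuracy", "Mistranslation", "non-existent word for 'lorgnette'"),
]
counts = tally(annotations)
share = share_of_blocks_with_errors(annotations, total_blocks=4)  # toy total
```

Storing one record per error rather than one per block keeps blocks with several errors countable both ways: per error type and per affected block.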

4. Analysis and discussion
In this section, we will discuss the results of the translation of the three AD scripts by DeepL and give an overview of the most common types of errors found in the output. With 520 marked errors on a target text (TT) of 6,374 words and 69.7% of all translated AD blocks containing errors (see table 1), it can be said that NMT does not deliver an output that is ready for use without thorough revision and post-editing. At sentence level, this error rate is considerably higher than the error rates of the rule-based MT (RBMT) and statistical MT (SMT) used in the study by Matamala and Ortiz-Boix (2016), who found that 57.8% and 42.2% respectively of the machine translated sentences in their study contained at least one mistake. This marked difference can be explained by two factors. Firstly, we did not work with individual sentences but rather with AD blocks that often consisted of more than one sentence. Secondly, the experiment by Matamala and Ortiz-Boix involved a pair of more closely related languages (Catalan-Spanish) than our language pair (English-Dutch). A new analysis would have to be conducted to assess how our results compare with those of Matamala and Ortiz-Boix (2016) at the level of individual sentences. At word level, however, our error rate of 8.16%, with 520 marked errors among 6,374 translated words, is in line with the findings of Matamala and Ortiz-Boix (ibid.), who found 11% for the RBMT and 5.56% for the SMT engine, averaging an error rate of 8%.
[…] than only the one discussed as an example. This also means that in some of the blocks in the examples below, the error discussed is the only error present, which may lead one to believe the number of errors is generally rather low. On average, however, the AD blocks contain 2.3 errors, and only 30.3% of the AD blocks in this case study contain no errors at all. 
In other words, a considerable number of AD blocks contain more than one error, which is illustrated in example 18 in which five errors occur: two spelling errors (apostrophe s in "Rachel's" and "vader's"), an omission ("concealed"), an unidiomatic translation ("gouden staven" for "gold ingots" instead of the idiomatic "goudstaven"), and the awkward construction that is discussed in the example itself.
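The percentages reported at the start of this section can be re-derived from the figures given in the text. The short sketch below does so; the variable and function names are ours, and all input numbers are taken from the paragraphs above.

```python
def percentage(part, whole):
    """Return part/whole expressed as a percentage."""
    return part / whole * 100


errors = 520          # marked errors in the MT output
target_words = 6374   # words in the target text

# Word-level error rate of the present study.
word_error_rate = percentage(errors, target_words)

# Share of AD blocks without errors, given that 69.7% contain at least one.
error_free_share = 100 - 69.7

# Mean of the RBMT (11%) and SMT (5.56%) rates reported by Matamala and Ortiz-Boix (2016).
previous_mean = (11.0 + 5.56) / 2

print(f"word-level error rate: {word_error_rate:.2f}%")        # 8.16%
print(f"error-free AD blocks: {error_free_share:.1f}%")        # 30.3%
print(f"mean of RBMT and SMT rates: {previous_mean:.2f}%")     # 8.28%
```

The last value is the unrounded figure behind the "8%" average quoted for the earlier study.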
As mentioned above, the MT often misinterprets the multiple meanings of homonyms and returns a translation that is correct for the word itself but incorrect in the immediate context of the word, as in the example below. The word "corporal" can be a noun referring to the military function of a person as well as an adjective referring to the human body. In the context of the film Zwartboek, "corporal" refers to a character, but it was erroneously translated into Dutch as "lichamelijk", referring to the human body. The verb "sniffs" in the same sentence was translated incorrectly as well, by the Dutch noun "snuffels", which refers to the sound that is made by the nose while sniffing.

EN (back translation)
The physical snuffles

In some cases, the mistranslation is not as apparent as in the previous example, as illustrated in example 2 below. Typical for AD is the multimodality of the medium (see section 2). Certain errors were not immediately identified as a mistranslation by the evaluators, but only later on while consulting the multimodal context. In the example below, "jumps" in the original text refers to a sudden movement caused by surprise. The MT, however, contains a Dutch word referring to one of the verb's other meanings, i.e., to jump in the air. In addition to this mistranslation, the noun "pace" has been omitted in the MT.

EN (original)
Boelie steps into a bedroom and sees his mother passed out on the bed, an empty bottle of wine beside her.

EN (back translation)
Boelie steps into a bedroom and sees his mother passing out on the bed, an empty bottle of wine beside her.
Distinctive for the language of English AD is the use of the present participle to describe simultaneous actions (Salway, 2007; Remael and Vercauteren, 2010). While the use of the present participle is also common in Dutch AD, it is rather uncommon in general texts or speech (Reviers, 2018a) […] where DeepL mistranslated the simultaneity with, for instance, a wrong nominalization (example 5), or using a different word category (in example 6, the present participle is translated as a preposition).

EN (back translation)
Beyond a mirror she sees herself and averts her eyes.
However, DeepL also translated several sentences with the present participle correctly, finding an equivalent construction in Dutch that does not contain the present participle. In example 7, the verb has been replaced by a locative adverb, resulting in a natural-sounding Dutch phrase.
For Zwartboek (Verhoeven, 2006), Vercauteren and Remael (2010) found that constructions in the English AD containing 'as' to express simultaneity were often translated incorrectly, too. This seems to be no different in machine translation. Most frequently, DeepL translated 'as' by 'als', which entails a shift in meaning from simultaneity to conditionality (example 8). However, in some AD blocks 'as' was translated correctly by 'terwijl', which according to Reviers (2018a) occurs with a high frequency in Dutch AD (example 9).

8
Zwartboek 24
EN (original)
As the bomber flies overhead it drops another bomb
NL (MT)
Als de bommenwerper overvliegt laat hij een andere bom vallen.

EN (back translation)
If the bomber flies overhead, it drops another bomb.

9
Het leven is vurrukkulluk 77
EN (original)
As they wait for Boelie, Mees leans in to give Panda a kiss, but she seems distracted by something.

EN (back translation)
While they are waiting for Boelie, she leans in to give Panda a kiss, but she seems distracted by something.
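The 'as' to 'als' shift described above lends itself to a simple automatic check that flags candidate blocks for the post-editor. The following is a minimal sketch, not part of the study's method: the function name `flag_as_als` is hypothetical, and the second test pair is invented for illustration and does not come from the corpus. The check only looks at block-initial 'As' and 'Als', so it yields candidates for human review rather than confirmed errors.

```python
import re

def flag_as_als(source_en: str, target_nl: str) -> bool:
    """Return True when the English block opens with 'As' and the Dutch
    output opens with 'Als', i.e. a possible shift from simultaneity to
    a conditional reading. Hypothetical helper, for illustration only."""
    opens_with_as = re.match(r"\s*as\b", source_en, flags=re.IGNORECASE) is not None
    opens_with_als = re.match(r"\s*als\b", target_nl, flags=re.IGNORECASE) is not None
    return opens_with_as and opens_with_als

# Example 8 from the analysis (source and MT output as reported).
example_8 = (
    "As the bomber flies overhead it drops another bomb",
    "Als de bommenwerper overvliegt laat hij een andere bom vallen.",
)

# Invented pair (NOT from the corpus) showing an unflagged 'terwijl' rendering.
invented_pair = (
    "As they wait, she leans in.",
    "Terwijl ze wachten, buigt ze naar voren.",
)

for source, target in (example_8, invented_pair):
    print(flag_as_als(source, target), "|", target)
```

A flagged block still needs human judgement, since block-initial 'as' is not always temporal and 'als' can be appropriate in other contexts.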
A final, less frequent but nevertheless remarkable type of mistranslation is the creation of non-existent words by DeepL. This can be attributed to the characteristics of NMT, as NMT systems can operate on the level of subword units, as opposed to word-level MT models (Macken et al., 2020).
10
Blind 1-24
EN (original)
Catherine eyes her through a lorgnette then nods and stands aside to let Marie in.

EN (back translation)
Catherine looks her through a lorry net then nods and stands aside to let Marie in.
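To illustrate what operating on subword units means, the sketch below splits a rare word into smaller pieces using greedy longest-match-first segmentation. It is a toy example under stated assumptions: the vocabulary `TOY_VOCAB` is invented, and the code does not reproduce DeepL's actual (undisclosed) tokenization. It only shows how a word that is missing from the vocabulary as a whole can still be processed piece by piece, which is one way a system may end up assembling a target form that does not exist.

```python
def segment(word, vocab):
    """Greedy longest-match-first segmentation of a word into subword units."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical subword vocabulary, invented for this illustration.
TOY_VOCAB = {"lor", "gnet", "te", "net"}

print(segment("lorgnette", TOY_VOCAB))
```

With this toy vocabulary the word is never seen as a single unit, so any translation has to be built up from its fragments.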

Fluency
All errors referring to issues related to the form or content of a text that are not directly related to the accuracy of its translation were classified under the category Fluency (DQF-MQM). As shown in table 4, most annotated fluency errors belong to the subtype grammar (79.4%), as compared to spelling (8.2%) and punctuation (12.4%). Therefore, we will limit the examples to those with grammatical errors, such as word order errors, errors of subject-verb agreement, missing/incorrect preposition, wrong articles, and incorrect possessive case of names, which all undermine the fluency of the target text.
Revista Tradumàtica 2021, Núm. 19
In this section, each example consists of an example number, the title of the movie, the AD block, the original English sentence, the Dutch MT, and the corrected version of the MT.
Rachel zwemt door de duisternis langs de oever van de rivier.
Another example is wrong articles or anaphora. In example 15, the Dutch article "de" is used instead of "het", while in example 16 the form "het" is used where it should be "hem".
Hij staat op en schudt Rachels hand.
After wrong word order, wrong agreement was the dominant error category in the research of Matamala and Ortiz-Boix (2016). However, in their study, wrong articles and prepositions were categorized as wrong agreement as well. Moreover, Matamala and Ortiz-Boix's study (2016) focused on a language pair where grammatical gender is more pronounced than in the English-Dutch language pair.

Style and Other
Two smaller error types are Style and Other, with 44 and 24 errors respectively. The Style error category contains the awkward and unidiomatic translation error subtypes, which were reported by the evaluators to be hard to distinguish and often rather subjective. The concatenation of two identical possessive structures in example 18 is not necessarily incorrect, disregarding the spelling errors, but is unusual and awkward in Dutch. A combination of different possessive structures, e.g., a possessive 's combined with "van" (English: "of") + owner, would be more suitable here in Dutch. The unidiomatic translation in example 19 is a correct albeit literal translation of "judge his reaction".
However, the Dutch language has an idiomatic verb plus noun combination to express this, which is why "beoordelen" has been replaced by "peilen" in the post-edited sentence.

18
Zwartboek 82
EN (original)
Rachel's father's wallet is taken and some gold ingots concealed in her mother's clothing are taken too.
Marie kijkt naar Ruben om zijn reactie te peilen.
We grouped the subtypes mistake in source and unnecessary translation in Other, the error category for miscellaneous errors. Mistake in source contains errors caused by the translation of an incorrect word in the source text. In example 20, "gets up" was misspelled in the original English text as "gest up", which was then translated into Dutch based on the English word "gestures", leading to a translation error as a result of a mistake in the source. Unnecessary translations were prevalent in the opening credits, where names, (broadcasting) companies and international terminology were part of the AD script. In example 21, this occurred with the erroneous translation of the name "Marina" as "Jachthaven", a Dutch word for a yacht harbour.

Conclusion
The present article contributes to research on the machine translation of AD. An overview of the existing literature clearly shows that limited research in this field has been undertaken to date, but that it may prove very useful in supporting the future development of this practice. AD is a text type with unique features, such as the multimodal nature of the source text and the intersemiotic dimension underlying the initial translation of that text. The impact of such characteristics on the machine translation output of AD is yet to be explored in detail. The present case study has highlighted several relevant issues that could form a basis to stimulate further research in this area.
A preliminary observation is that whereas the idiosyncratic language of AD with its relatively short and simple sentences was initially thought to be a good candidate for NMT, this case study has brought to light several challenges. First, the NMT output demonstrates a significant error rate: 520 marked errors on a target text of 6,374 words, with 69.7% of all translated AD blocks containing one or more errors and an average of 2.3 errors per AD block. This points to a clear need for post-editing to bring the text up to the required quality standards. The extent to which post-editing is necessary and the effort required to correct the mistakes will have to be studied through experimental research, but the present case study already highlights that the multimodal context in which the translation is used might impact and potentially complicate the post-editing process. Certain mistakes are related to the multimodal context of the AD and can only be determined when also consulting the original images. The extent to which and how the original images are/can be consulted in the AD post-editing process, still needs to be studied.
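The reported totals can be turned into a length-normalised error density, which makes the figure easier to compare with samples of a different size. The short sketch below is our own arithmetic on the two numbers given above (520 errors, 6,374 target words); the function name `error_density` is hypothetical. Note that a single annotated error can span several words, so the result is an error count per 100 words, not the share of incorrect words.

```python
def error_density(errors: int, words: int) -> tuple[float, float]:
    """Return (errors per 100 target words, target words per error)."""
    per_100_words = 100 * errors / words
    words_per_error = words / errors
    return per_100_words, words_per_error

# Totals reported for the case study.
per_100, per_error = error_density(errors=520, words=6374)
print(round(per_100, 2))    # 8.16 errors per 100 target words
print(round(per_error, 2))  # one annotated error roughly every 12.26 words
```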
A second observation is that the present study offers a first glimpse into the types of errors that occur most frequently. The categories Accuracy/Mistranslation and Fluency/Grammar were the most prominent. As mentioned in the analysis, the main types of identified errors seem to suggest that, in the course of the translation process, the NMT system often does not or cannot take sufficient account of the immediate context and, as a result, fails to arrive at adequate translations. Research into context-aware models for neural machine translation is progressing (see for example Tiedemann and Scherrer, 2017; Bawden et al., 2018; Voita et al., 2018). The present case study identified various errors that can be attributed to a lack of context for the translation, leading to mistranslations (by misinterpreting the multiple meanings of homonyms, for instance) and fluency errors (such as verb tense misinterpretations, incorrect anaphora, missing subject-verb agreement and errors in word order). Further research will have to examine whether such problems can be solved when more context-aware models are integrated into NMT systems. Furthermore, in the case of AD, the multimodal context in which the translation operates (sound effects and dialogues) and the original images of the film on which the AD is based need to be taken into account. As the case study indicates, not all mistranslations seem incorrect at first glance. A specific type of mistranslation is mistranslation due to the multimodal context, which is an extra dimension of context that the MT cannot (yet) consider in the translation process. The MT may output a correct sentence (cf. Panda puts the photo on top of the cabinet - Panda legt de foto bovenop de kast), but the visual content that is part of the source text is necessary to determine the exact meaning, and consequently the correct translation of the sentence, in this case "zet" (used in Dutch when something is put somewhere vertically) rather than "legt" (used in Dutch when something is put somewhere horizontally). Previous studies also mentioned the translation of cultural references in this respect (see section 2). While this was not a frequent issue in the present case study (which might be due to the limited size of the sample), we did encounter some examples of translation errors of cultural references due to the lack of multimodal context. In this respect, researchers and developers are exploring the new domain of Multimodal Machine Translation (see Sulubacak et al. (2020) for a recent overview), which could prove very useful for improving MT performance for multimodal text types such as AD, since these systems would be able to take the multimodal context into account.
A third observation is that the MT system has not been trained for AD specifically. Several types of errors result from the occurrence of AD-specific linguistic constructions (such as the frequent use of the present participle to express simultaneity, which research has shown is a typical feature of AD language; Reviers, 2018a; Salway, 2007). Particularly in the language pair English-Dutch, the use of the present participle and "as" to express simultaneity poses problems for MT. In Dutch speech and (non-AD) texts for general purposes, the use of the present participle is rather uncommon, which in this case study had a significant influence on the NMT engine and its performance when translating English AD into Dutch. An interesting area for further research, therefore, would be the development of MT systems trained for AD and an evaluation of the extent to which this improves the MT's performance.
Finally, previous studies also underlined the issues of timing and sentence length. Translations into some languages may be longer than the original. In AD, this becomes problematic when the translated AD no longer fits into the pauses between dialogue and sound effects. This issue did not occur in this case study, which might be due to the limited sample size or might indicate that the issue is less apparent in the English-Dutch language combination than in other combinations, such as English and Spanish.
While the present article points to several interesting avenues for further research, it is only a preliminary case study with various limitations. First, the same error-type analyses should be conducted on larger samples and compared with other language pairs to corroborate the results. In addition, this case study included fiction drama films only, and the MT performance may differ for other types (documentary, corporate videos) and other genres (horror, comedy).
Furthermore, MT output could be compared to human translations to identify differences not only in the types of errors but also in terms of style and norms. As scholars have mentioned (see section 2), AD norms and guidelines differ across countries and languages and, as a result, an MT translation might be correct but not acceptable to the target audience. An analysis of the MT output's acceptability with target users is therefore a crucial consideration. Finally, more studies need to be conducted on measuring the actual post-editing effort, in particular for the language pair English-Dutch, to complement existing findings and evaluate the actual effort and the impact of the multimodal context.
The case study presented in this article was a pilot to a four-year PhD project in Translation Studies, funded by the University of Antwerp (2020-2024), in which these issues will be explored in more detail. In the first phase, the project will analyse the types of errors in a corpus of Dutch ADs, comparing the MT with the human translations of the same text. In the second phase, an experiment will be conducted to measure and compare the post-editing effort with translation of the original AD. This way we hope to shed more light on this new form of machine translation that will grow as more and more countries will be obliged to make more audio-visual content accessible to people with sight loss.