Sustaining Disruption? The Transition from Statistical to Neural Machine Translation

If statistical machine translation (SMT) was a disruptive technology, then neural machine translation (NMT) is probably a sustaining technology, continuing on a trajectory already established by SMT, and initially evaluated in much the same way as its predecessor. Seeing NMT in this light may be a useful corrective to the hype that has surrounded its introduction.


Introduction
In 1997 Clayton M. Christensen published what was to become one of the most influential business books of the era. The Innovator's Dilemma is included in most major media lists of 'best business books', including those published by Time Magazine and The Economist, for example, and statistics provided by Kilkki et al. (2018) show how the terms used by Christensen gained widespread currency after the book's publication. In The Innovator's Dilemma Christensen set out to explain why "great" companies sometimes fail. Well-managed companies sometimes falter, he argued, not because other companies come along who can offer better products, but because they are displaced by new entrants who, in the short term at least, offer products that result in worse performance than those of the incumbent. The explanation was originally based on the idea of 'disruptive technologies.' Disruptive technologies, according to Christensen (1997: xv) [...] (Christensen and Raynor, 2003: 66). The shift to the term 'disruptive innovation' is also an acknowledgement of the fact that technologies by themselves are not inherently disruptive. Rather it is the combination of business model and product, among other things, that can disrupt. The approach is thus generally consistent with that adopted in science and technology studies, which is beginning to make its mark in translation circles (Olohan, 2017), and which rejects the idea of autonomous technologies that are capable of independently determining social outcomes.
But even 'disruptive innovation' quickly became prone to "loose" usage, and disruption theory risks, according to its author, becoming a victim of its own success (Christensen, Raynor and McDonald, 2015: 46), with some commentators arguing that "the word 'disruption' is now bandied about so much that it is losing all meaning" (The Economist, 2015). Notwithstanding the fact that contemporary understandings of 'disruption' have become diffuse, and even if Christensen's own approach is not without its detractors (see Kilkki et al., ibid.: 276), there is much food for thought in his original conception [...] between disruptive and other kinds of innovation, can, according to Christensen, help managers to manage innovation or defend their companies in the face of certain challenges. Likewise, looking at changes in the translation market, and especially changes linked to the increasing use of machine translation (MT), through a Christensen-inspired disruption lens, might help providers of human and machine translation services alike to understand better the nature of the changes that accompany new types of MT and to find solutions that are appropriate to their own contexts. Of particular value are: the distinction between sustaining and disruptive innovations; the distinction between low-end and new-market disruption; the recognition that disruptive innovations are often accompanied by a shift in the main metric used to assess a product or service; and the classification of strategies to deal with disruption. These elements are dealt with in turn below, after a brief overview of how disruption proceeds in general.
Before moving on, however, it is worth mentioning some of the criticisms of Christensen's work. Jill Lepore, one of his most strident critics, has accused Christensen of "hand-picking" examples, and claims that his sources are often "dubious" and his logic "questionable" (Lepore, 2014). Lepore also decries the whole rhetoric of disruption as "a language of panic, fear, asymmetry, and disorder" (ibid.), claiming that disruption is a theory born of anxiety. It is this claim that is of particular interest in the current context. There is no doubt that the frenzied discourse of disruption subsequently encountered in some quarters could induce anxiety, especially among those who risk being 'disrupted', but I would argue that Christensen's original approach to disruption is anything but frenzied. By comparison to the kinds of sources to which Lepore (ibid.) alludes, Christensen's writing is positively staid, and provides a welcome antidote to anxiety-inducing hype.
In what follows, Christensen's major concepts will be applied to the rise of rule-based, statistical and neural machine translation. Statistical machine translation (SMT) was state of the art in machine translation from the mid-2000s through to 2015, when it was displaced by neural machine translation (NMT) (see Bentivogli et al., 2016a, 2016b). The hype surrounding NMT in particular has since reached fever pitch, with industry sources claiming that NMT has already attained 'parity' with human translation (Hassan et al., 2018) and that it now has the potential to replace human translators (Shoshan, 2018a), and both industry and academic commentators writing about 'disruption' on a scale not seen before in translation circles (see Shoshan, 2018b), or that at least risks being underestimated (Way, 2018). A re-engagement with Christensen's concept of disruption thus seems timely.

Disruptive Innovation
Christensen's (1997) model distinguishes between sustaining and disruptive technologies. The former "improve the performance of established products, along the dimensions of performance that mainstream customers in major markets have historically valued" (ibid.: xv). In one of his best-known case studies, that of the disk drive industry, Christensen argues that technological innovations that resulted in improvements in total capacity and recording density (the latter measured in megabits per square inch) initially sustained the position of the leading manufacturers in the industry. When the upheaval came that toppled those leaders, it came in the form of disk drives that performed worse than the incumbents on these established metrics. The 5.25-inch drive, for example, was inferior to the 8-inch drive it eventually displaced, from the point of view of capacity, cost per megabyte and access time (ibid.: 15), but it was small and lightweight and thus appealed to the emerging market for personal desktop computers.
But, as already noted, disruption is not just about new technologies; it is also about business models and how a company's customers respond to changing environments. In their 2015 reprise of disruption theory, Christensen, Raynor and McDonald summarize the process as follows:

"Disruption" describes a process whereby a smaller company with fewer resources is able to successfully challenge established incumbent businesses. Specifically, as incumbents focus on improving their products and services for their most demanding (and usually most profitable) customers, they exceed the needs of some segments and ignore the needs of others. Entrants that prove disruptive begin by successfully targeting those overlooked segments, gaining a foothold by delivering more-suitable functionality, frequently at a lower price. Incumbents, chasing higher profitability in more-demanding segments, tend not to respond vigorously. Entrants then move upmarket, delivering the performance that incumbents' mainstream customers require, while preserving the advantages that drove their early success. When mainstream customers start adopting the entrants' offerings in volume, disruption has occurred. (Christensen, Raynor and McDonald, 2015: 46)

Christensen's model also attempts to capture the different strategies that incumbents can use to respond to disruption. In short, they can choose (among other options) to migrate upward (and risk becoming uncompetitive with their existing customers) or downward, although he acknowledges that building a cogent case "for entering small, poorly defined low-end markets that offer only lower profitability" does not come easily to "[r]ational managers" (1997: 77).

Disruption and Machine Translation
An admittedly loose analogy can be made with the translation industry. For decades MT was not seen as a competitive threat by human translators because its linguistic quality could not compare with that of human translation. But when criteria such as 'ubiquity, mobility, connectivity, and immediacy' (Enríquez Raído, 2013) came to be increasingly valued in the networked economy, and MT could effectively be delivered for free online, it became clear that a particular group of consumers, who were perhaps not likely to have paid for a translation in the first place, could live with sometimes unreliable linguistic quality. Performance metrics began to change. A key moment came when the Altavista search engine first made rule-based machine translation (RBMT) available to its users in 1997. Ten years later, another shift occurred when Google moved from RBMT to SMT. It is not self-evident that RBMT 'disrupted' human translation in Christensen's use of the term: there was no great blood-letting in human translation because of it, and it can be argued that translation memory had a much greater impact than RBMT on human translation in the 1990s.
But in both the introduction of online RBMT and the subsequent rise of SMT, there were elements of Christensen-style disruptions. Free online RBMT competed against 'non-consumption.' In other words, it did not displace incumbents (e.g. human translators with their translation memory tools), but rather helped create a new market.
The 'low-end' market thus created was subsequently taken over by SMT. In the latter case, the new entrant technology was initially considerably inferior to the incumbent (RBMT) along well-established performance dimensions. When SMT was first presented as an alternative to RBMT by pioneers at IBM, for example, they were not even able to provide "actual results of French/English translation" (Brown et al., 1988: 1), and early models were extremely naïve (see Koehn, 2010). But the upstart technology eventually disrupted the incumbent along all of the dimensions recognised by Kilkki et al. (2018).
In particular, when 'mainstream' customers started adopting SMT, disruption can be said to have occurred within Christensen's framework. (A full account of the recent history of machine translation is beyond the scope of this article. Interested readers are referred to brief treatments in Kenny (2018) and Poibeau (2017).)
In building his arguments, Christensen plots mainstream performance metrics against time to show the trajectories of sustaining and disruptive technologies. He also inserts dotted lines to represent the average performance of incumbent technologies in different markets or market segments. This average performance then serves as a proxy measure for the performance demanded in that market. As we have already seen, one of Christensen's main arguments is that leading companies are often so focused on serving their high-end customers' needs that over time they actually exceed those needs, while simultaneously ignoring the needs of less demanding customers.
If Christensen's model were transferred to the translation industry, and the 'technologies' in question were classified as human translation in combination with computer-aided translation (HT/CAT), RBMT, SMT and NMT, and dotted lines were further added to depict performance demanded in the different 'segments' of the market, the graph might look something like Figure 1. We would also need to agree on the precise metric to be depicted on the y-axis. This is easier said than done. Should we prioritize speed, cost or quality, for example? Or do we need a graph for each of these criteria? If we are most interested in the quality of target texts produced using different technologies, then which metric would best capture this?

Quality Metrics
Translation Quality Assessment (TQA) is a vast area that has been studied for decades. There are multiple ways to evaluate both human and machine translation, discussion of which goes beyond the scope of this article (for an overview of approaches, see Castilho et al., 2018), and metrics tend to differ between academia, commercial production environments, and MT research and development laboratories.
In production environments, for example, flexible frameworks like MQM-DQF (Lommel, Uszkoreit and Burchardt, 2014; QT21, 2015) are often preferred and are used in the evaluation of both human and machine translation, while in MT labs, reference-based automatic metrics, of which there is a multitude, have tended to prevail in the evaluation of machine output. (González, Giménez and Girona Salgado (2014), for example, list some 60 automatic evaluation metrics, or variations thereof, based on similarity with a reference translation.)
Having said that, and focusing solely on the MT research and development community, if one metric stands out as having risen to predominance in the first part of the current millennium, it has to be BLEU (Papineni et al., 2002). BLEU was introduced as SMT gained ground, and computer scientists needed quick, inexpensive, language-independent methods to evaluate the effects of iterative, sometimes daily, changes to their own MT systems, and to compare multiple systems against each other. The requirement was for an automatic metric that would correlate highly with human evaluation and incur little marginal cost per run, and BLEU, an n-gram precision-based metric, was born. Despite known problems with BLEU (see, e.g., Way, 2018: 168), it remains "by some distance the most reported metric in papers involving MT experiments" (ibid.), and has been described as the "de facto standard for most research purposes" (Castilho et al., 2018: 26) and the "primary" metric used to rank competing MT systems in shared task evaluations (Bentivogli et al., 2016a: 16). Most notably, when NMT broke through in shared task evaluations in 2015 after a period in which it was "too computationally costly and resource demanding to compete with state-of-the-art Phrase-Based [Statistical] MT" (Bentivogli et al., 2016b: 1), its success was heralded in terms of a BLEU score, one that was better than that of the incumbent technology by "a large margin" (+5.3 BLEU points, to be precise), thus anticipating, according to Bentivogli et al. (ibid.), "what, most likely, will be the new NMT era." (The BLEU score in question, incidentally, was reported for English-German, which was known to be "a difficult language pair" (ibid.).)
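For readers unfamiliar with what an n-gram precision-based metric involves, the following is a minimal sketch in Python, not the reference implementation. It assumes a single reference translation, whitespace-tokenised input and no smoothing, whereas Papineni et al. (2002) define the metric over a whole test corpus and allow multiple references. The function names (`ngrams`, `modified_precision`, `simple_bleu`) are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=4):
    """Toy sentence-level score (assumption: one reference, no smoothing):
    geometric mean of clipped precisions for n = 1..max_n, multiplied by
    a brevity penalty that punishes candidates shorter than the reference."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
score = simple_bleu(candidate, reference, max_n=2)
```

Here the candidate shares five of its six unigrams and three of its five bigrams with the reference, so `score` is the geometric mean of 5/6 and 3/5. The sketch also shows why such a metric is cheap to rerun after every system change, and why it rewards surface overlap with a reference rather than adequacy or fluency as a human evaluator would judge them.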
So NMT took on the mantle of SMT without a shift in metrics.The breakthrough might indeed be era-defining, but from Christensen's point of view, NMT is probably a sustaining rather than a disruptive technology.Its entry in 2015 on the fictional graph in Figure 1 is also thus depicted using a sustaining rather than a disruptive trajectory.
Offering an incremental improvement on SMT (which had already begun to plateau) from the outset, it was initially judged on the same terms as SMT, as already noted. It has already begun to take over from SMT, in a neat demonstration of classic intersecting technology S-curves (Foster 1986; Christensen 1997), and in the way of all sustaining technologies. This is not to say that the rise of NMT will not be followed eventually by a change in the preferred metric for ranking systems within the academic MT R&D community, or that other metrics, including human evaluation metrics, have not also been applied to NMT. MT specialists are already clamouring for BLEU to be abandoned (Way, 2018), partly because it cannot do justice to NMT, and several papers have already been published that apply other, especially human, evaluation metrics to NMT (see, for example, Bentivogli et al., 2016b; Castilho et al., 2017). The point being made is merely that NMT was first crowned the new state of the art in MT using BLEU scores as evidence. The attendant hype was such that industry source slator.com saw fit to explain BLEU to its readers, who were no doubt more used to evaluating translation using metrics more commonly used in production environments (Pan, 2016).

The Translator's Dilemma
As in other areas, if there has, indeed, been disruption in the translation market (or markets), and whether or not it has been disruption in the Christensen mould, then the incumbents are faced with a decision: should they hold steady, or attempt to migrate upwards or downwards? And there is no shortage of advice to hand: speaking against a background of widespread automation anxiety, Google's Hal Varian (quoted in Brynjolfsson and McAfee, 2014: 200) counsels human workers in general to "seek to be an indispensable complement to something that's getting cheap and plentiful". In the context of machine translation, it is easiest to interpret such advice as meaning that translators should become post-editors, and Anthony Pym (2013: 488) has confidently predicted that "statistical-based MT, along with its many hybrids, is destined to turn most translators into posteditors one day, perhaps soon". Indeed, post-editing MT was recognised as one of the fastest growing segments of the language industry even before NMT had gained widespread use (see, for example, Common Sense Advisory, 2016). But given concerns among post-editors about remuneration and boredom in particular (see Moorkens and O'Brien 2017), such a move risks being seen as a downward migration, and unappealing to many 'rational' incumbents. Meanwhile a strategy of upward migration, or holding steady if one is already serving top-end clients, is promoted by translators' organisations such as the Institute of Translation and Interpreting (ITI) in the United Kingdom and authors such as Moorkens (2017). This position has come under fire from some quarters, however, including from Pym (2016), who appears to denigrate the ambitions of those who seek to serve high-end customers, and some translators have found themselves obliged to attest to the very existence of 'high-end' or 'premium' markets (see, for example, Sardon, 2017). A third way that combines upward migration with the use of machine translation is perhaps discernible in ideas about 'augmented translation' as envisaged, for example, by Lommel (2017). Here the human translator is presented not at the end of the translation chain, fixing errors in MT output, but rather at the centre of translation activity, drawing on key technologies, including adaptive NMT, that amplify his/her abilities and speed up the process of translation.

Conclusions
This article has argued that while it is possible to see SMT as a disruptive technology, as described by Christensen (1997), the more recently popular NMT, although greeted as a revolutionary achievement of 'electronic brains' by vendors, bloggers and the press alike, might be better viewed as a sustaining technology. Offering an incremental improvement on SMT from the outset, it was initially judged on the same terms as SMT.

Figure 1: Fictional depiction of trajectories of performance demanded vs performance supplied by different translation technologies (inspired by Christensen 1997)