Automated MT evaluation metrics and their limitations

Bogdan Babych

doi:10.5565/rev/tradumatica.70

Authors

Bogdan Babych, School of Computing, Faculty of Engineering, University of Leeds

Abstract

This article gives an overview of the main classes of methods for automatic evaluation of Machine Translation (MT) quality, their limitations, and their value for both professional translators and MT developers. Automatic MT evaluation characterises the performance of MT systems on specific texts or corpora. Automatic scores are expected to correlate with the parameters of MT quality that human evaluators judge, such as the adequacy or fluency of a translation. Automatic evaluation is now part of the MT development cycle, and it also helps to advance fundamental research on MT and to improve MT technology.

Keywords

machine translation, evaluation, automated methods, future perspectives

