Towards Machine Speech-to-speech Translation

Authors

  • Satoshi Nakamura, Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan. https://orcid.org/0000-0001-6956-3803
  • Katsuhito Sudoh, Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan. https://orcid.org/0000-0002-2122-9846
  • Sakriani Sakti, Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan

Abstract

There has been a good deal of research on machine speech-to-speech translation (S2ST) in Japan, and this article surveys that work together with our own recent research on automatic simultaneous speech translation. An S2ST system is basically composed of three modules: large-vocabulary continuous automatic speech recognition (ASR), machine text-to-text translation (MT), and text-to-speech synthesis (TTS). All of these modules need to be multilingual in nature and thus require multilingual speech and text corpora for training models. S2ST performance has been drastically improved by deep learning and large training corpora, but many issues still remain, such as simultaneity, paralinguistics, context and situation dependency, intention, and cultural dependency. This article presents ongoing research and discusses these issues with a view to next-generation speech-to-speech translation.
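The three-module cascade described above can be sketched as a simple function composition. This is a minimal illustration only: the function names (`asr`, `mt`, `tts`) and the toy lexicon are hypothetical stand-ins, not any real system's API, and a real pipeline would plug in trained neural models at each stage.

```python
# Hypothetical sketch of a cascade S2ST pipeline: ASR -> MT -> TTS.
# Each stage is a stub standing in for a trained model.

def asr(audio):
    # Large-vocabulary continuous speech recognition: audio -> source text.
    # Stub: assume the transcript is attached to the input for this demo.
    return audio["transcript"]

def mt(source_text):
    # Machine text-to-text translation: source text -> target text.
    # Toy word-substitution lexicon, for illustration only.
    toy_lexicon = {"konnichiwa": "hello"}
    return " ".join(toy_lexicon.get(word, word) for word in source_text.split())

def tts(target_text):
    # Text-to-speech synthesis: target text -> waveform.
    # Stub: return a label instead of an actual audio signal.
    return {"waveform_for": target_text}

def speech_to_speech(audio):
    # The cascade: each module's output feeds the next module's input.
    return tts(mt(asr(audio)))

result = speech_to_speech({"transcript": "konnichiwa"})
print(result["waveform_for"])  # hello
```

The composition makes one limitation of the cascade visible: each module must finish before the next starts, which is exactly the simultaneity issue the abstract raises for incremental ASR, MT, and TTS.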

Keywords

Speech-to-speech translation, automatic speech recognition, machine text-to-text translation, text-to-speech synthesis

References

Chousa, K.; Sudoh, K.; Nakamura, S. (2019). Simultaneous Neural Machine Translation using Connectionist Temporal Classification. ArXiv Preprint, 1911.11933. Retrieved from http://arxiv.org/abs/1911.11933

Do, Q. T.; Sakti, S.; Nakamura, S. (2018). Sequence-to-Sequence Models for Emphasis Speech Translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, v. 26, n. 10, pp. 1873–1883. https://doi.org/10.1109/TASLP.2018.2846402

Kano, T.; Sakti, S.; Nakamura, S. (2017). Structured-Based Curriculum Learning for End-to-End English-Japanese Speech Translation, in: Proceedings of Interspeech 2017, pp. 2630–2634. https://doi.org/10.21437/Interspeech.2017-944

Mizuno, A. (2016). Simultaneous Interpreting and Cognitive Constraints. Journal of College of Literature, Aoyama Gakuin University, n. 58, pp. 1–28. https://www.agulin.aoyama.ac.jp/repo/repository/1000/19723/

Novitasari, S.; Tjandra, A.; Sakti, S.; Nakamura, S. (2019). Sequence-to-Sequence Learning via Attention Transfer for Incremental Speech Recognition, in: Proceedings of Interspeech 2019, pp. 3835–3839. https://doi.org/10.21437/Interspeech.2019-2985

Yanagita, T.; Sakti, S.; Nakamura, S. (2019). Neural iTTS: Toward Synthesizing Speech in Real-time with End-to-end Neural Text-to-Speech Framework, in: Proceedings of the 10th ISCA Speech Synthesis Workshop, pp. 183–188. https://doi.org/10.21437/SSW.2019-33

Published

2023-03-07
