Abstract
Speech-to-speech translation (S2ST) converts input speech to speech in another language. A challenge of delivering S2ST in real time is the delay accumulated between the translation and speech synthesis modules. While incremental text-to-speech (iTTS) models have recently shown large quality improvements, they typically require additional future text inputs to reach optimal performance. In this work, we minimize the initial waiting time of iTTS by adapting the upstream speech translator to generate a high-quality pseudo lookahead for the speech synthesizer. After mitigating the initial delay, we demonstrate that the duration of the synthesized speech also plays a crucial role in latency. We formalize this as a latency metric and then present a simple yet effective duration-scaling approach for latency reduction. Our approaches consistently reduce latency by 0.2-0.5 seconds without sacrificing speech translation quality.
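The abstract does not spell out the latency metric or the scaling rule, but the idea can be illustrated with a hypothetical sketch: if each translated chunk arrives at some time and can only be played after the previous chunk finishes, then shortening the synthesized durations reduces when the last chunk finishes. All names and the latency definition below are illustrative assumptions, not the paper's formulation.

```python
# Hypothetical sketch of duration-aware latency and duration scaling.
# The latency definition here (finish time of the last played chunk)
# is an assumption for illustration, not the paper's exact metric.

def utterance_latency(arrival_times, durations):
    """Finish time of the last chunk, where each chunk starts playing
    only after it has arrived AND the previous chunk has finished."""
    play_end = 0.0
    for arrive, dur in zip(arrival_times, durations):
        start = max(play_end, arrive)  # wait for arrival and for prior audio
        play_end = start + dur
    return play_end

def scale_durations(durations, factor):
    """Uniformly scale synthesized speech durations (factor < 1 speeds speech up)."""
    return [d * factor for d in durations]

arrivals = [0.5, 1.2, 2.0]   # translation chunk arrival times (s), made up
durs = [0.8, 0.9, 0.7]       # synthesized durations per chunk (s), made up

base = utterance_latency(arrivals, durs)
faster = utterance_latency(arrivals, scale_durations(durs, 0.9))
```

In this toy setting, scaling durations by 0.9 moves the finish time of the final chunk earlier, which is the effect the duration-scaling approach exploits.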
Original language | English |
---|---|
Title of host publication | Proceedings of INTERSPEECH 2022 |
Publisher | International Speech Communication Association (ISCA) |
Pages | 1771-1775 |
Number of pages | 5 |
Volume | 2022-September |
DOIs | |
Publication status | Published - 2022 |
Event | 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, Republic of. Duration: 18 Sept 2022 → 22 Sept 2022. Conference number: 23. https://www.interspeech2022.org/ |
Publication series
Series | Interspeech |
---|---|
ISSN | 2308-457X |
Conference
Conference | 23rd Annual Conference of the International Speech Communication Association |
---|---|
Abbreviated title | INTERSPEECH 2022 |
Country/Territory | Korea, Republic of |
City | Incheon |
Period | 18/09/22 → 22/09/22 |
Internet address | https://www.interspeech2022.org/ |
Keywords
- speech translation
- text-to-speech
- low-latency