Baidu Research Announces Breakthrough
in Simultaneous Translation
By Baidu Research
October 24, 2018
Today, we are excited to announce STACL
(Simultaneous Translation with Anticipation and Controllable
Latency), the first simultaneous machine translation system
with anticipation capabilities and controllable latency. It
is an automated system that is able to conduct high quality
translation concurrently between two languages. STACL
represents a major breakthrough in natural language
processing due in large part to the challenges presented by
word order differences between the source and target
languages, and the latency requirements in real-world
applications of simultaneous translation or interpretation.
Historically, there have been two types of interpretation:
1. Consecutive interpretation refers to a practice where the translator waits until the speaker pauses (usually at sentence boundaries) to start translating, thus doubling the time needed.
2. Simultaneous interpretation is when the translator begins translating just a few seconds into the speaker's speech and finishes just a few seconds after the speaker ends.
Thanks to its speed, simultaneous interpretation has been widely used for intergovernmental summits, multilateral negotiations and many other occasions. The benefits of simultaneous translation have created a huge demand for this service, but there are not nearly enough simultaneous interpreters, and each interpreter can only last for 20-30 minutes, after which their error rates grow exponentially. That's why simultaneous interpreters work in teams of two or three and usually alternate with each other every 20-30 minutes.
Therefore, there is a critical need to develop automated systems that expand access to simultaneous translation.
Creating an automated system for reliable simultaneous translation has been a long-standing challenge in the field, especially due to the word order differences between the source and the target languages. For example, in the Chinese sentence Bùshí zǒngtǒng zài Mòsīkē yǔ Éluósī zǒngtǒng Pǔjīng huìwù, which means “President Bush meets with Russian President Putin in Moscow”, the Chinese verb huìwù (“meet”) appears at the very end, similar to a German or Japanese verb. In the English translation, however, the verb “meets” appears much earlier. This variance in word order across human languages has been a major hindrance to both human simultaneous interpreters and the development of reliable simultaneous machine translation systems. As a result, virtually all commercial “real-time” translation systems still use conventional full-sentence (i.e., non-simultaneous) translation methods, causing an undesirable latency of at least one sentence and leaving the user out of sync with the speaker.
STACL tackles this challenge using an idea inspired by human simultaneous
interpreters, who routinely anticipate or predict material
that the speaker is about to cover a few seconds into the
future. However, unlike human interpreters, our
model does not predict the source language words in the
speaker’s speech but instead directly predicts the target
language words in the translation, and more importantly, it
seamlessly fuses translation and anticipation in a single
“wait-k” model. In this model the translation is always k
words behind the speaker’s speech to allow some context for
prediction. We train our model to use the available prefix
of the source sentence at each step (along with the
translation so far) to decide the next word in translation.
In the aforementioned example, given the Chinese prefix
Bùshí zǒngtǒng zài Mòsīkē (“Bush President in Moscow”) and
the English translation so far “President Bush”, which is k=2
words behind the Chinese, our system accurately predicts that
the next translation word must be “meet” because Bush is
likely "meeting" someone (e.g., Putin) in Moscow, long
before the Chinese verb appears. Just as human interpreters
need to get familiar with the speaker’s topic and style
beforehand, our model also needs to be trained on vast
amounts of training data with similar sentence
structures in order to anticipate with reasonable accuracy.
STACL is also
flexible in terms of the latency-quality trade-off: the user
can specify an arbitrary latency requirement
(e.g., a one-word or five-word delay). Between closely
related languages such as French and Spanish, the latency
can be set lower because even word-by-word translation works
very well. However, for distant languages such as English
and Chinese, and for languages with different word order such
as English and German, the latency should be set higher to
cope with the word order differences. Translation quality
commonly suffers under low latency requirements,
but our system incurs only a small loss in quality
compared to conventional full-sentence (i.e.,
non-simultaneous) translation. We are continuing to improve
translation quality under low latency requirements.
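The wait-k idea described above can be illustrated with a minimal Python sketch. It is not the STACL implementation: the names wait_k_schedule, wait_k_translate, EOS, and canned_predictor are hypothetical, and the trained model is replaced by a toy predictor that simply replays the reference translation. The sketch only shows how much of the source is visible when each target word is produced, and how the latency parameter k changes that.

```python
from typing import Callable, List, Sequence

EOS = "</s>"  # hypothetical end-of-sentence marker


def wait_k_schedule(source_len: int, target_len: int, k: int) -> List[int]:
    """Number of source words read before each target word is emitted.

    Target word t (0-based) is produced after k + t source words,
    capped at the full source length once the speaker has finished.
    """
    return [min(k + t, source_len) for t in range(target_len)]


def wait_k_translate(
    source: Sequence[str],
    k: int,
    predict_next: Callable[[Sequence[str], Sequence[str]], str],
    max_len: int = 100,
) -> List[str]:
    """Emit target words one by one, each from a source prefix only.

    `predict_next` stands in for a trained model: it receives the
    visible source prefix and the translation so far, and returns the
    next target word (or EOS). Slicing a complete sentence here only
    simulates words arriving over time.
    """
    target: List[str] = []
    while len(target) < max_len:
        visible = source[: min(k + len(target), len(source))]
        word = predict_next(visible, target)
        if word == EOS:
            break
        target.append(word)
    return target


# The example sentence from the text, split into words.
SOURCE = "Bùshí zǒngtǒng zài Mòsīkē yǔ Éluósī zǒngtǒng Pǔjīng huìwù".split()
REFERENCE = "President Bush meets with Russian President Putin in Moscow".split()


def canned_predictor(visible: Sequence[str], target: Sequence[str]) -> str:
    """Toy stand-in for the model: replays the reference translation."""
    return REFERENCE[len(target)] if len(target) < len(REFERENCE) else EOS


low_latency = wait_k_schedule(len(SOURCE), len(REFERENCE), k=2)
high_latency = wait_k_schedule(len(SOURCE), len(REFERENCE), k=5)
translation = wait_k_translate(SOURCE, 2, canned_predictor)
```

With k=2, the third target word ("meets") is requested when only the first four Chinese words are visible, which is the situation in the example above: the verb huìwù has not yet been heard, so a real model would have to anticipate it. Raising k to 5 gives the predictor more context for every word at the cost of a longer delay.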
While the best human simultaneous interpreters are reported to cover about 60% of the source material (with a delay of about three seconds), the new simultaneous translation system from Baidu is about 3.4 BLEU points below conventional full-sentence translation, where BLEU is the standard metric for evaluating translation quality by comparing a machine translation result with a human reference translation. Specifically, in Chinese-to-English simultaneous translation with a wait-5-words model (where the English translation lags behind the Chinese speech by 5 Chinese words, or about 3 seconds), the translation quality is 3.4 BLEU points lower than full-sentence (non-simultaneous) translation.
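To give a rough sense of what a BLEU score measures, here is a simplified, sentence-level sketch in Python. It is an illustration only, and it is not how the figures above were computed: reported BLEU scores are normally calculated over a whole test set with standard tooling, often with multiple references. The function names ngram_counts and simple_bleu are hypothetical, and the sketch assumes a single reference and applies no smoothing.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams (as tuples) occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Simplified BLEU on a 0-100 scale for one candidate and one reference.

    Geometric mean of clipped n-gram precisions (n = 1..max_n) multiplied
    by a brevity penalty. Returns 0.0 if any n-gram order has no match.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return 100.0 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "President Bush meets with Russian President Putin in Moscow".split()
candidate = "President Bush meets Russian President Putin in Moscow".split()
score = simple_bleu(candidate, reference)
```

A candidate identical to the reference scores 100, and one that shares no words with it scores 0; the candidate above, which drops a single word, falls in between.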
Even with the latest advancement, we are fully aware of the many limitations of a simultaneous machine translation system. The release of STACL is not intended to replace human interpreters, who will continue to be depended upon for their professional services for many years to come, but rather to make simultaneous translation more accessible.