Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning: A Technical Overview
Authors: Yexing Du, Youcheng Pan, Ziyang Ma, Bo Yang, Yifan Yang, Keqi Deng, Xie Chen, Yang Xiang, Ming Liu, Bing Qin
arXiv: 2409.19510v2
Introduction
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance in Speech-to-Text Translation (S2TT) tasks, especially in English-centric scenarios. However, scaling these models to many-to-many translation directions is hampered by the scarcity of parallel speech-text data for most language pairs. This paper introduces a three-stage curriculum learning strategy that leverages the machine translation (MT) capabilities of LLMs and adapts them for S2TT, enabling robust performance even in low-resource settings.
Problem Statement
Traditional S2TT systems typically use a cascaded approach:
- ASR (Automatic Speech Recognition): Transcribe speech to text.
- MT (Machine Translation): Translate the transcribed text to the target language.
While effective, this pipeline suffers from error propagation and increased latency. End-to-end MLLMs can mitigate these issues but require large-scale parallel S2TT data, which is often unavailable for many language pairs.
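For concreteness, the two-step cascade above might look like the following minimal sketch using Hugging Face `transformers` pipelines; the model names and prompt wording are illustrative assumptions, not the paper's exact baseline configuration.

```python
from transformers import pipeline

# Step 1 (ASR): transcribe the source-language speech.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
transcript = asr("speech.wav")["text"]

# Step 2 (MT): translate the transcript with a text-only LLM.
# Any ASR error in `transcript` propagates directly into the translation,
# which is the error-propagation weakness noted above.
mt = pipeline("text-generation", model="Qwen/Qwen2.5-3B-Instruct")
prompt = f"Translate the following English sentence into Chinese:\n{transcript}\nTranslation:"
translation = mt(prompt, max_new_tokens=128, return_full_text=False)[0]["generated_text"]
print(translation)
```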
Proposed Solution: Curriculum Learning for S2TT
The authors propose reframing S2TT as a Speech Recognition and Translation (SRT) task, where the model is trained to output both the transcription and the translation from speech input. The core innovation is a three-stage curriculum learning strategy:
1. ASR Stage
- Objective: Train the MLLM for multimodal alignment and transcription.
- Input: Speech + instruction.
- Output: Transcription.
2. Speech-Aided Machine Translation (SMT) Stage
- Objective: Enhance cross-lingual capabilities by providing both speech and its transcription as input, prompting the model to generate translations.
- Input: Speech + transcription + instruction.
- Output: Translation.
3. SRT Stage
- Objective: Finalize the model for S2TT by training it to generate both transcription and translation from speech alone.
- Input: Speech + instruction.
- Output: Transcription + translation.
Each stage resumes from the checkpoint of the previous one, ensuring knowledge transfer and stable optimization; the prompt/target formats for the three stages are sketched below.
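Viewed as data formatting, the three stages share the same (speech, transcription, translation) triples and differ only in how prompt and target are assembled. The helper below is a hypothetical sketch of that assembly; the tag strings and field layout are assumptions, not the paper's exact templates.

```python
def build_example(stage, speech, transcription, translation,
                  src_tag="<|eng|>", tgt_tag="<|zho|>"):
    """Return a (prompt, target) pair for one curriculum stage.

    Illustrative sketch only: `speech` stands for the audio features that the
    speech adapter will inject, and the tag strings are assumed, minimalist
    task markers in the spirit of the paper's instruction design.
    """
    if stage == "asr":      # Stage 1: speech -> transcription
        prompt = [speech, src_tag]
        target = transcription
    elif stage == "smt":    # Stage 2: speech + transcription -> translation
        prompt = [speech, src_tag, transcription, tgt_tag]
        target = translation
    elif stage == "srt":    # Stage 3: speech -> transcription + translation
        prompt = [speech, src_tag + tgt_tag]
        target = transcription + tgt_tag + translation
    else:
        raise ValueError(f"unknown stage: {stage}")
    return prompt, target
```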
Model Architecture: LLM-SRT
The LLM-SRT model consists of:
- Speech Encoder: A frozen Whisper encoder extracts high-dimensional features from audio.
- Speech Adapter: A Q-Former compresses and projects speech features to match the LLM’s hidden dimension, enabling efficient multimodal fusion.
- LLM Backbone: Qwen2.5 (3B, 7B, 32B) serves as the language model, processing concatenated speech and text embeddings.
Instruction Design: Minimalist, language-tagged instructions (e.g., <|eng|><|zho|>) are used to distinguish tasks and segment outputs.
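A skeletal PyTorch forward pass consistent with this description might look as follows; the query count, hidden sizes, and the single cross-attention layer standing in for the Q-Former are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Q-Former-style adapter sketch: a fixed set of learned queries
    cross-attends to frozen encoder features, then is projected to the
    LLM's hidden size. All sizes here are assumed for illustration."""
    def __init__(self, enc_dim=1280, llm_dim=3584, num_queries=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, enc_dim))
        self.cross_attn = nn.MultiheadAttention(enc_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_feats):                        # (B, T, enc_dim)
        q = self.queries.unsqueeze(0).expand(enc_feats.size(0), -1, -1)
        fused, _ = self.cross_attn(q, enc_feats, enc_feats)
        return self.proj(fused)                          # (B, num_queries, llm_dim)

def forward_srt(whisper_encoder, adapter, llm, mel, text_ids):
    """Speech embeddings are prepended to the instruction embeddings and fed
    to the decoder-only LLM; the Whisper encoder stays frozen."""
    with torch.no_grad():
        enc_feats = whisper_encoder(mel).last_hidden_state
    speech_emb = adapter(enc_feats)
    text_emb = llm.get_input_embeddings()(text_ids)
    inputs = torch.cat([speech_emb, text_emb], dim=1)
    return llm(inputs_embeds=inputs)
```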
Experimental Setup
- Datasets:
- FLEURS: 102 languages, ~10 hours of speech per language (low-resource).
- CoVoST-2: Large-scale multilingual S2TT corpus (high-resource).
- Baselines:
- Cascaded ASR+MT systems (Whisper + Qwen2.5).
- End-to-end models: SeamlessM4T-V2, Qwen2-Audio.
- Metrics:
- WER for ASR.
- BLEU for S2TT (a small evaluation sketch follows this list).
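Both metrics can be computed with standard tooling; the snippet below is an illustrative example using `jiwer` for WER and `sacrebleu` for BLEU (the toy sentences and tokenizer choice are assumptions, not the paper's evaluation scripts).

```python
import jiwer
import sacrebleu

# Hypothetical system outputs and references, for illustration only.
asr_refs = ["the weather is nice today"]
asr_hyps = ["the weather is nice to day"]
print("WER:", jiwer.wer(asr_refs, asr_hyps))            # lower is better

s2tt_refs = ["今天天气很好"]
s2tt_hyps = ["今天的天气很好"]
bleu = sacrebleu.corpus_bleu(s2tt_hyps, [s2tt_refs], tokenize="zh")
print("BLEU:", bleu.score)                              # higher is better
```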
Results
Low-Resource Setting (FLEURS)
- LLM-SRT-3B achieves a BLEU score of 20.6 (vs. 8.6 for baseline-3B and 20.2 for SeamlessM4T-V2).
- LLM-SRT-32B achieves 24.6 BLEU, outperforming all baselines.
- The model supports 15 × 14 = 210 translation directions (15 languages, each translated into the other 14) with less than 10 hours of speech per language.
High-Resource Setting (CoVoST-2)
- LLM-SRT models scale well with more data, achieving competitive or superior BLEU scores compared to state-of-the-art models.
Ablation Studies
- Removing any stage from the curriculum (ASR, SMT, or SRT) leads to significant performance drops, confirming the necessity of the three-stage approach.
- Training only the speech adapter (freezing the LLM) yields strong results; unfreezing the LLM (e.g., via LoRA, sketched below) further improves performance.
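Unfreezing the LLM via LoRA can be done with the `peft` library roughly as sketched below; the rank, alpha, and target modules are assumed values for illustration, not the paper's reported hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

lora_cfg = LoraConfig(
    r=16,                         # low-rank dimension (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only LoRA weights (plus the adapter) are trained
```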
Inference Speed
- The optimized speech adapter enables up to 3x faster inference compared to Qwen2-Audio, with significant reductions in input token length.
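The token-length reduction can be pictured with a back-of-the-envelope estimate: Whisper's encoder emits 1500 frame features per 30-second window, while a Q-Former-style adapter passes only a small fixed set of query tokens to the LLM. The query count below is an assumption for illustration, not a figure from the paper.

```python
# Whisper's encoder produces 1500 frame features per 30 s window (50 features/s).
encoder_frames = 1500

# Assumed number of adapter query tokens; the actual value depends on the
# Q-Former configuration and is not taken from the paper.
adapter_queries = 64

print(f"Speech tokens fed to the LLM shrink by ~{encoder_frames / adapter_queries:.0f}x")
```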
Key Takeaways
- Curriculum learning is highly effective for adapting LLMs to many-to-many S2TT, especially in low-resource scenarios.
- The SRT formulation bridges the gap between MT and S2TT, leveraging the LLM’s translation capabilities.
- Scaling laws hold: larger LLMs yield better S2TT performance.
- The approach is robust across both low- and high-resource settings, and supports a wide range of languages.
Limitations
- The method’s performance is bounded by the underlying LLM’s MT capabilities and language coverage.
- Languages not well-supported by the LLM or with poor MT performance remain challenging.
Conclusion
This work demonstrates that curriculum learning can unlock the many-to-many S2TT potential of MLLMs, even with limited parallel data. By systematically transferring MT capabilities to S2TT via ASR, SMT, and SRT stages, the proposed LLM-SRT model achieves state-of-the-art results across a wide range of languages and data regimes. The code and models are available at github.com/yxduir/LLM-SRT.
References:
For a full list of references and technical details, see the original paper.