SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations Paper • 2510.25955 • Published Oct 29, 2025
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation Paper • 2510.14664 • Published Oct 16, 2025
Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration Paper • 2509.19928 • Published Sep 24, 2025
StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling Paper • 2506.12570 • Published Jun 14, 2025
Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR Paper • 2409.08797 • Published Sep 13, 2024
Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling Paper • 2505.19669 • Published May 26, 2025
VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining Paper • 2505.21527 • Published May 23, 2025
EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting Paper • 2504.12867 • Published Apr 17, 2025
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis Paper • 2504.10352 • Published Apr 14, 2025
CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought Paper • 2409.19510 • Published Sep 29, 2024
LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR Paper • 2406.06619 • Published Jun 7, 2024
FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching Paper • 2502.11128 • Published Feb 16, 2025
SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training Paper • 2412.15649 • Published Dec 20, 2024
Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers Paper • 2412.16102 • Published Dec 20, 2024
k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning Paper • 2411.17100 • Published Nov 26, 2024
LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization Paper • 2409.00819 • Published Sep 1, 2024
Zipformer: A faster and better encoder for automatic speech recognition Paper • 2310.11230 • Published Oct 17, 2023
VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech Paper • 2401.14321 • Published Jan 25, 2024