VibeVoice-Realtime-0.5B Fine-tuned (French SIWIS)
A fine-tuned version of microsoft/VibeVoice-Realtime-0.5B, trained on the French SIWIS dataset to improve French text-to-speech (TTS) quality.
Training Details
- Base model: microsoft/VibeVoice-Realtime-0.5B
- Training data: SIWIS French Speech Synthesis Database (~9,200 samples; the 500 benchmark phrases were held out of training)
- Training type: Full fine-tuning of the TTS language model (434M parameters)
- Frozen components: Acoustic tokenizer (VAE), prediction head (diffusion), language encoder (Qwen2.5, 4 layers)
Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch size | 4 |
| Gradient accumulation | 4 |
| Effective batch size | 16 |
| Learning rate | 5e-5 |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Precision | bf16 |
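The numbers in the table above imply a rough training schedule, sketched below. The sample count is taken as exactly 9,200 (the card only says "~9,200"), and the linear warmup shape is an assumption, since the card gives the number of warmup steps but not the scheduler.

```python
import math

# Values from the hyperparameter table.
per_device_batch = 4
grad_accum_steps = 4
epochs = 10
warmup_steps = 500
peak_lr = 5e-5

# Assumption: exactly 9,200 training samples (the card says "~9,200").
num_samples = 9_200

# One optimizer step consumes batch size x gradient accumulation samples.
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs


def warmup_lr(step: int) -> float:
    """Linear warmup to peak_lr (assumed shape), held constant afterwards."""
    return peak_lr * min(1.0, step / warmup_steps)


print(effective_batch, steps_per_epoch, total_steps)
print(warmup_lr(250), warmup_lr(warmup_steps))
```

Under these assumptions the run takes about 5,750 optimizer steps, so the 500 warmup steps cover a little under 9% of training.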
Hardware
- GPU: NVIDIA RTX 6000 Ada (48 GB)
Benchmark Results (500 SIWIS French phrases)
| Metric | Value |
|---|---|
| WER (mean) | 35.0% |
| WER (median) | 22.9% |
| RTF (mean) | 0.416 |
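For readers unfamiliar with the two metrics, here is a minimal sketch of their standard definitions. It is not the evaluation script used for this card: the ASR system that transcribed the generated audio and any text normalization are not stated, so the lowercasing and whitespace tokenization below are assumptions, and the function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time spent generating divided by the duration of the generated audio."""
    return synthesis_seconds / audio_seconds


# One substituted word out of six reference words.
print(word_error_rate("le chat dort sur le lit", "le chat dort sous le lit"))
# An RTF below 1.0 means generation is faster than playback.
print(real_time_factor(0.416, 1.0))
```

Read this way, a mean RTF of 0.416 corresponds to roughly 0.42 seconds of compute per second of audio. The gap between the mean and median WER suggests that a minority of phrases with very high error rates pulls the mean up.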
Usage
import torch
import soundfile as sf
from vibevoice.modular.modeling_vibevoice_streaming_inference import (
    VibeVoiceStreamingForConditionalGenerationInference,
)

# Load the fine-tuned checkpoint in bf16 and move it to the GPU
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "Rcarvalo/vibevoice",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Generate French speech
audio = model.generate(text="Bonjour, comment allez-vous aujourd'hui?")

# NumPy has no bfloat16 dtype, so cast to float32 before converting; write at 24 kHz
sf.write("output.wav", audio.float().cpu().numpy(), 24000)
License
MIT (same as base model)