VibeVoice-Realtime-0.5B Fine-tuned (French SIWIS)

A fine-tuned version of microsoft/VibeVoice-Realtime-0.5B, trained on the French SIWIS dataset to improve French text-to-speech (TTS) quality.

Training Details

Base model: microsoft/VibeVoice-Realtime-0.5B
Training data: SIWIS French Speech Synthesis Database (~9,200 samples, with the 500 benchmark phrases excluded from training)
Training type: Full fine-tuning of TTS language model (434M params)
Frozen components: acoustic tokenizer (VAE), prediction head (diffusion), language encoder (4 Qwen2.5 layers)

Hyperparameters

Parameter	Value
Epochs	10
Batch size	4
Gradient accumulation	4
Effective batch size	16
Learning rate	5e-5
Weight decay	0.01
Warmup steps	500
Precision	bf16
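
As a rough sanity check, the table values can be turned into a step count. This is a sketch, not the training script. It assumes exactly 9,200 samples (the card says "~9,200") and that a final partial batch is kept. The linear warmup shape and the constant rate after warmup are also assumptions, since the card does not name the scheduler.

```python
import math

# Values from the hyperparameter table; NUM_SAMPLES is approximate (~9,200).
NUM_SAMPLES = 9_200
EPOCHS = 10
BATCH_SIZE = 4
GRAD_ACCUM = 4
PEAK_LR = 5e-5
WARMUP_STEPS = 500

# One optimizer step consumes BATCH_SIZE * GRAD_ACCUM samples.
effective_batch = BATCH_SIZE * GRAD_ACCUM
steps_per_epoch = math.ceil(NUM_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def warmup_lr(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumption: linear warmup to PEAK_LR, then constant. The real run
    may have used a decay schedule after warmup.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


print(effective_batch, steps_per_epoch, total_steps)  # 16 575 5750
```

Under these assumptions the 500 warmup steps cover a little under 9% of the roughly 5,750 optimizer steps.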

Hardware

GPU: NVIDIA RTX 6000 Ada (48 GB)

Benchmark Results (500 SIWIS French phrases)

Metric	Value
WER (mean)	35.0%
WER (median)	22.9%
RTF (real-time factor, mean)	0.416
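
The card does not describe the evaluation pipeline, so the sketch below only illustrates how these metrics are conventionally defined. WER is the word-level edit distance divided by the reference length, and RTF is synthesis time divided by the duration of the generated audio. The sentence pairs are made up for illustration, and any text normalization applied before scoring in the real benchmark is unknown.

```python
from statistics import mean, median


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means audio is generated faster than it plays back."""
    return synthesis_seconds / audio_seconds


# Hypothetical (reference, ASR transcript) pairs, not taken from the benchmark.
pairs = [
    ("bonjour tout le monde", "bonjour tout le monde"),
    ("il fait beau aujourd'hui", "il fait aujourd'hui"),
    ("merci beaucoup", "merci beau coup"),
]
scores = [word_error_rate(ref, hyp) for ref, hyp in pairs]
print(scores, mean(scores), median(scores))
```

In this toy set a single badly transcribed utterance pulls the mean well above the median. A similar skew would be one plausible reading of the gap between the 35.0% mean and the 22.9% median reported above.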

Usage

import torch
import soundfile as sf
from vibevoice.modular.modeling_vibevoice_streaming_inference import (
    VibeVoiceStreamingForConditionalGenerationInference,
)

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "Rcarvalo/vibevoice",
    torch_dtype=torch.bfloat16,
).to("cuda")

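# Generate French speech
audio = model.generate(text="Bonjour, comment allez-vous aujourd'hui?")
# NumPy has no bfloat16 dtype, so cast to float32 before converting.
# 24000 is the output sample rate in Hz.
sf.write("output.wav", audio.float().cpu().numpy(), 24000)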

License

MIT (same as base model)
