VibeVoice-Realtime-0.5B Fine-tuned (French SIWIS)

A fine-tuned version of microsoft/VibeVoice-Realtime-0.5B, trained on the French SIWIS dataset to improve French text-to-speech (TTS) quality.

Training Details

  • Base model: microsoft/VibeVoice-Realtime-0.5B
  • Training data: SIWIS French Speech Synthesis Database (~9,200 samples; the 500 benchmark phrases were excluded from training)
  • Training type: Full fine-tuning of the TTS language model (434M parameters)
  • Frozen components: Acoustic tokenizer (VAE), prediction head (diffusion), language encoder (4 Qwen2.5 layers)

Hyperparameters

Parameter               Value
----------------------  -----
Epochs                  10
Batch size              4
Gradient accumulation   4
Effective batch size    16
Learning rate           5e-5
Weight decay            0.01
Warmup steps            500
Precision               bf16
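
The relationship between these values can be checked with a little arithmetic. The sketch below assumes a single GPU, exactly 9,200 training samples (the card only gives an approximate count), and that a final partial batch still counts as a step. The resulting step counts are estimates, not figures reported for this training run.

```python
import math

# Values taken from the table above
batch_size = 4
grad_accum = 4
epochs = 10
warmup_steps = 500

# Assumption: the card says "~9,200 samples"; the exact count may differ
num_samples = 9_200

# One optimizer step consumes batch_size * grad_accum samples
effective_batch = batch_size * grad_accum

# Estimated optimizer steps (assumes one GPU, partial batch kept)
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs

# Share of training spent in learning-rate warmup
warmup_fraction = warmup_steps / total_steps

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch (est.): {steps_per_epoch}")
print(f"total steps (est.): {total_steps}")
print(f"warmup share (est.): {warmup_fraction:.1%}")
```

Under these assumptions, warmup covers a bit less than one epoch of the run.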

Hardware

  • GPU: NVIDIA RTX 6000 Ada (48 GB)

Benchmark Results (500 SIWIS French phrases)

Metric          Value
--------------  -----
WER (mean)      35.0%
WER (median)    22.9%
RTF (mean)      0.416
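
For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly computed. It is not the evaluation script used for this card: the card does not say which ASR model or text normalization produced the WER figures, so the lowercase-and-split normalization below is an assumption. RTF is taken in its usual sense of synthesis time divided by audio duration, where a value below 1 means faster than real time.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    # Assumption: simple normalization (lowercase, whitespace split)
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row at a time
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time spent generating divided by the duration of the generated audio."""
    return synthesis_seconds / audio_seconds


# One substituted word out of three reference words
print(word_error_rate("le chat dort", "le chien dort"))
# 2 s of compute for 5 s of audio is faster than real time
print(real_time_factor(2.0, 5.0))
```

A mean WER well above the median, as in the table, usually indicates that a minority of phrases with very high error rates pull the average up.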

Usage

import torch
import soundfile as sf
from vibevoice.modular.modeling_vibevoice_streaming_inference import (
    VibeVoiceStreamingForConditionalGenerationInference,
)

# Load the fine-tuned weights in bf16 and move the model to the GPU
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "Rcarvalo/vibevoice",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Generate French speech
audio = model.generate(text="Bonjour, comment allez-vous aujourd'hui?")

# NumPy has no bfloat16 dtype, so cast to float32 before converting
sf.write("output.wav", audio.float().cpu().numpy(), 24000)  # 24 kHz sample rate

License

MIT (same as base model)
