---
title: Voice Cloning Backend
emoji: 🎀
colorFrom: purple
colorTo: blue
sdk: docker
app_file: backend/wsgi.py
pinned: false
---
# Real-Time Voice Cloning (RTVC) - Backend API
A complete full-stack voice cloning application with a React frontend and a PyTorch backend that can synthesize speech in anyone's voice from just a few seconds of reference audio.
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.5+-red.svg)](https://pytorch.org/)
[![React](https://img.shields.io/badge/React-18.0+-61dafb.svg)](https://reactjs.org/)
[![TypeScript](https://img.shields.io/badge/TypeScript-5.0+-blue.svg)](https://www.typescriptlang.org/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
## Features
- **Full Stack Application**: Modern React UI + Flask API + PyTorch backend
- **Voice Enrollment**: Record or upload voice samples directly in the browser
- **Speech Synthesis**: Generate cloned speech with intuitive interface
- **Voice Cloning**: Clone any voice with just 3-10 seconds of audio
- **Real-Time Generation**: Generate speech at 2-3x real-time speed on CPU
- **High Quality**: Natural-sounding synthetic speech using state-of-the-art models
- **Easy to Use**: Beautiful UI with 3D visualizations and audio waveforms
- **Multiple Formats**: Supports WAV, MP3, M4A, FLAC input audio
- **Multi-Language**: Supports English and Hindi text-to-speech
## Table of Contents
- [Demo](#demo)
- [Quick Start (Full Stack)](#quick-start-full-stack)
- [Deployment](#deployment)
- [How It Works](#how-it-works)
- [Installation](#installation)
- [Project Structure](#project-structure)
- [Usage Examples](#usage-examples)
- [Troubleshooting](#troubleshooting)
- [Technical Details](#technical-details)
- [Credits](#credits)
## Demo
- **Frontend UI**: Modern React interface with 3D visualizations
- **Voice Enrollment**: Record or upload voice samples → backend saves them to the database
- **Speech Synthesis**: Select a voice and enter text → backend generates cloned speech
- **Playback**: Listen to the generated audio directly in the browser or download it
## Quick Start (Full Stack)
### Option 1: Using the Startup Script (Easiest)
```powershell
# Windows PowerShell
cd rtvc
.\start_app.ps1
```
This will:
1. Start the Backend API server (port 5000)
2. Start the Frontend dev server (port 8080)
3. Open your browser to http://localhost:8080
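If the browser does not open automatically, you can confirm that both servers are listening with a quick port check (a minimal sketch; it only verifies that the ports accept connections, not that the apps are healthy):
```python
import socket

# Probe the backend (5000) and frontend (8080) ports from the Quick Start above
for port, name in [(5000, "backend API"), (8080, "frontend dev server")]:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(2)
        status = "up" if s.connect_ex(("localhost", port)) == 0 else "down"
        print(f"{name} (port {port}): {status}")
```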
### Option 2: Manual Start
**Terminal 1 - Backend API:**
```bash
cd rtvc
python api_server.py
```
**Terminal 2 - Frontend:**
```bash
cd "rtvc/Frontend Voice Cloning"
npm run dev
```
Then open http://localhost:8080 in your browser.
## Deployment
### Production Deployment Stack
- **Frontend**: Netlify (free tier)
- **Backend**: Render (free tier)
- **Models**: HuggingFace Hub (free)
See [DEPLOYMENT.md](DEPLOYMENT.md) for complete deployment guide.
#### Quick Deployment
1. **Deploy Backend to Render**
- Push to GitHub
- Connect Render to GitHub repo
- Use `render.yaml` configuration
- Models auto-download on first deploy (~10 minutes)
2. **Deploy Frontend to Netlify**
- Connect Netlify to GitHub repo
- Set base directory: `frontend`
- Environment: `VITE_API_URL=your-render-backend-url`
3. **Test**
- Visit your Netlify URL
- API calls automatically route to Render backend
**Pricing**: Free tier for both (with optional paid upgrades)
### Using the Application
1. **Enroll a Voice**:
- Go to "Voice Enrollment" section
- Enter a voice name
- Record audio (3-10 seconds) or upload a file
- Click "Enroll Voice"
2. **Generate Speech**:
- Go to "Speech Synthesis" section
- Select your enrolled voice
- Enter text to synthesize
- Click "Generate Speech"
- Play or download the result
For detailed integration information, see [INTEGRATION_GUIDE.md](INTEGRATION_GUIDE.md).
## How It Works
The system uses a 3-stage pipeline based on the SV2TTS (Speaker Verification to Text-to-Speech) architecture:
```
Reference Audio → [Encoder] → Speaker Embedding (256-d vector)
                                        ↓
Text Input → [Synthesizer (Tacotron)] → Mel-Spectrogram
                                        ↓
                   [Vocoder (WaveRNN)] → Audio Output
```
### Pipeline Stages:
1. **Speaker Encoder** - Extracts a unique voice "fingerprint" from reference audio
2. **Synthesizer** - Generates mel-spectrograms from text conditioned on speaker embedding
3. **Vocoder** - Converts mel-spectrograms to high-quality audio waveforms
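In code, the three stages chain together roughly as follows. This is a minimal sketch: the module paths come from the project structure below, but the exact function names are assumptions based on the original Real-Time-Voice-Cloning project's interfaces, so check `encoder/inference.py`, `synthesizer/inference.py`, and `vocoder/inference.py` for the definitive API.
```python
from pathlib import Path

import soundfile as sf

# Assumed interfaces (see the note above); verify against this repo's code
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Stage 1: extract a 256-d speaker embedding from the reference audio
encoder.load_model(Path("models/default/encoder.pt"))
wav = encoder.preprocess_wav(Path("sample/your_voice.mp3"))
embedding = encoder.embed_utterance(wav)

# Stage 2: generate a mel-spectrogram from text, conditioned on the embedding
synthesizer = Synthesizer(Path("models/default/synthesizer.pt"))
spec = synthesizer.synthesize_spectrograms(["Hello, world!"], [embedding])[0]

# Stage 3: convert the mel-spectrogram into an audio waveform
vocoder.load_model(Path("models/default/vocoder.pt"))
audio = vocoder.infer_waveform(spec)

sf.write("outputs/cloned_voice.wav", audio, Synthesizer.sample_rate)
```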
## Installation
### Prerequisites
- Python 3.11 or higher
- Windows/Linux/macOS
- ~2 GB disk space for models and dependencies
- 4 GB RAM minimum (8 GB recommended)
### Step 1: Clone the Repository
```bash
git clone https://github.com/yourusername/rtvc.git
cd rtvc
```
### Step 2: Install Dependencies
```bash
pip install torch numpy librosa scipy soundfile webrtcvad tqdm unidecode inflect matplotlib numba
```
Or install PyTorch with CUDA for GPU acceleration:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install numpy librosa scipy soundfile webrtcvad tqdm unidecode inflect matplotlib numba
```
### Step 3: Download Pretrained Models
Download the pretrained models from [Google Drive](https://drive.google.com/drive/folders/1fU6umc5uQAVR2udZdHX-lDgXYzTyqG_j):
| Model | Size | Description |
|-------|------|-------------|
| encoder.pt | 17 MB | Speaker encoder model |
| synthesizer.pt | 370 MB | Tacotron synthesizer model |
| vocoder.pt | 53 MB | WaveRNN vocoder model |
Place all three files in the `models/default/` directory.
### Step 4: Verify Installation
```bash
python clone_my_voice.py
```
If you see errors about missing models, check that all three `.pt` files are in `models/default/`.
## Quick Start (Standalone Scripts)
### Method 1: Simple Script (Recommended)
1. Open `clone_my_voice.py`
2. Edit these lines:
```python
# Your voice sample file
VOICE_FILE = r"sample\your_voice.mp3"
# The text you want to be spoken
TEXT_TO_CLONE = """
Your text here. Can be multiple sentences or even paragraphs!
"""
# Output location
OUTPUT_FILE = r"outputs\cloned_voice.wav"
```
3. Run it:
```bash
python clone_my_voice.py
```
### Method 2: Command Line
```bash
python run_cli.py --voice "path/to/voice.wav" --text "Text to synthesize" --out "output.wav"
```
### Method 3: Advanced Runner Script
```bash
python run_voice_cloning.py
```
Edit the paths and text inside the script before running.
## Project Structure
```
rtvc/
├── clone_my_voice.py            # Simple script - EDIT THIS to clone your voice!
├── run_cli.py                   # Command-line interface
│
├── encoder/                     # Speaker Encoder Module
│   ├── __init__.py
│   ├── audio.py                 # Audio preprocessing for encoder
│   ├── inference.py             # Encoder inference functions
│   ├── model.py                 # SpeakerEncoder neural network
│   ├── params_data.py           # Data hyperparameters
│   └── params_model.py          # Model hyperparameters
│
├── synthesizer/                 # Tacotron Synthesizer Module
│   ├── __init__.py
│   ├── audio.py                 # Audio processing for synthesizer
│   ├── hparams.py               # All synthesizer hyperparameters
│   ├── inference.py             # Synthesizer inference class
│   │
│   ├── models/
│   │   └── tacotron.py          # Tacotron 2 architecture
│   │
│   └── utils/
│       ├── cleaners.py          # Text cleaning functions
│       ├── numbers.py           # Number-to-text conversion
│       ├── symbols.py           # Character/phoneme symbols
│       └── text.py              # Text-to-sequence conversion
│
├── vocoder/                     # WaveRNN Vocoder Module
│   ├── audio.py                 # Audio utilities for vocoder
│   ├── display.py               # Progress display utilities
│   ├── distribution.py          # Probability distributions
│   ├── hparams.py               # Vocoder hyperparameters
│   ├── inference.py             # Vocoder inference functions
│   │
│   └── models/
│       └── fatchord_version.py  # WaveRNN architecture
│
├── utils/
│   └── default_models.py        # Model download utilities
│
├── models/
│   └── default/                 # Pretrained models go here
│       ├── encoder.pt           # (17 MB)
│       ├── synthesizer.pt       # (370 MB) - Must download!
│       └── vocoder.pt           # (53 MB)
│
├── sample/                      # Put your voice samples here
│   └── your_voice.mp3
│
└── outputs/                     # Generated audio outputs
    └── cloned_voice.wav
```
### Key Files Explained
| File | Purpose |
|------|---------|
| `clone_my_voice.py` | **START HERE** - Simplest way to clone your voice |
| `run_cli.py` | Command-line tool for voice cloning |
| `encoder/inference.py` | Loads encoder and extracts speaker embeddings |
| `synthesizer/inference.py` | Loads synthesizer and generates mel-spectrograms |
| `vocoder/inference.py` | Loads vocoder and generates waveforms |
| `**/hparams.py` | Configuration files for each module |
## Usage Examples
### Example 1: Basic Voice Cloning
```bash
python clone_my_voice.py
```
Edit `clone_my_voice.py` first:
```python
VOICE_FILE = r"sample\my_voice.mp3"
TEXT_TO_CLONE = "Hello, this is my cloned voice!"
```
### Example 2: Multiple Outputs
```bash
# Generate first output
python run_cli.py --voice "voice.wav" --text "First message" --out "output1.wav"
# Generate second output with same voice
python run_cli.py --voice "voice.wav" --text "Second message" --out "output2.wav"
```
### Example 3: Long Text
```bash
python run_cli.py --voice "voice.wav" --text "This is a very long text that spans multiple sentences. The voice cloning system will synthesize all of it in the reference voice. You can make it as long as you need."
```
### Example 4: Different Voice Samples
```bash
# Clone voice A
python run_cli.py --voice "person_a.wav" --text "Message from person A"
# Clone voice B
python run_cli.py --voice "person_b.wav" --text "Message from person B"
```
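To batch several generations, a small wrapper over `run_cli.py` avoids retyping the command (a sketch using the flags documented above; adjust paths to your setup):
```python
import subprocess

# Generate one output file per message, all in the same reference voice
messages = ["Message from person A", "Message from person B"]
for i, text in enumerate(messages, start=1):
    subprocess.run(
        ["python", "run_cli.py",
         "--voice", "voice.wav",
         "--text", text,
         "--out", f"output{i}.wav"],
        check=True,
    )
```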
## Troubleshooting
### Common Issues
#### "Model file not found"
**Solution**: Download the models from Google Drive and place them in `models/default/`:
- https://drive.google.com/drive/folders/1fU6umc5uQAVR2udZdHX-lDgXYzTyqG_j
Verify file sizes:
```bash
# Windows
dir models\default\*.pt
# Linux/Mac
ls -lh models/default/*.pt
```
Expected sizes:
- encoder.pt: 17,090,379 bytes (17 MB)
- synthesizer.pt: 370,554,559 bytes (370 MB) - Most common issue!
- vocoder.pt: 53,845,290 bytes (53 MB)
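Or run a small cross-platform check against the byte sizes listed above (a sketch; adjust the directory if your models live elsewhere):
```python
from pathlib import Path

# Expected model sizes in bytes, from the list above
expected = {
    "encoder.pt": 17_090_379,
    "synthesizer.pt": 370_554_559,
    "vocoder.pt": 53_845_290,
}

for name, size in expected.items():
    path = Path("models/default") / name
    if not path.exists():
        print(f"MISSING:    {path}")
    elif path.stat().st_size != size:
        print(f"WRONG SIZE: {path} ({path.stat().st_size:,} bytes, expected {size:,})")
    else:
        print(f"OK:         {path}")
```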
#### "Reference voice file not found"
**Solution**: Use absolute paths or check current directory:
```python
# Use absolute path
VOICE_FILE = r"C:\Users\YourName\Desktop\voice.mp3"
# Or relative from project root
VOICE_FILE = r"sample\voice.mp3"
```
#### Output sounds robotic or unclear
**Solutions**:
- Use a higher quality voice sample (16kHz+ sample rate)
- Ensure voice sample is 3-10 seconds long
- Remove background noise from voice sample
- Speak clearly and naturally in the reference audio
#### "AttributeError: module 'numpy' has no attribute 'cumproduct'"
**Solution**: `numpy.cumproduct` was removed in NumPy 2.0 in favor of `numpy.cumprod`; this is already fixed in the code. If you still see this error, make sure your copy of the code is current and upgrade NumPy:
```bash
pip install --upgrade numpy
```
#### Slow generation on CPU
**Solutions**:
- Generation at 2-3x real-time on a modern CPU is expected, not a bug
- For faster generation, install PyTorch with CUDA:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118
```
Then the system will automatically use GPU if available.
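You can confirm that PyTorch actually sees the GPU:
```python
import torch

# If this prints False, the CPU-only build of PyTorch is installed
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```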
### Getting Help
If you encounter other issues:
1. Check the `HOW_TO_RUN.md` file for detailed instructions
2. Verify all models are downloaded correctly
3. Ensure Python 3.11+ is installed
4. Check that all dependencies are installed
## Technical Details
### Audio Specifications
| Parameter | Value |
|-----------|-------|
| Sample Rate | 16,000 Hz |
| Channels | Mono |
| Bit Depth | 16-bit |
| FFT Size | 800 samples (50ms) |
| Hop Size | 200 samples (12.5ms) |
| Mel Channels | 80 (synthesizer/vocoder), 40 (encoder) |
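For illustration, loading a clip at this native format with `librosa` (from the dependency list) looks like the following. This is only a sketch; the repo's own `audio.py` modules define the canonical preprocessing.
```python
import librosa

# Load the reference clip as 16 kHz mono, per the table above
wav, sr = librosa.load("sample/your_voice.mp3", sr=16000, mono=True)

# An 80-channel mel-spectrogram with the same frame settings
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=800, hop_length=200, n_mels=80
)
print(mel.shape)  # (80, n_frames)
```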
### Model Architectures
#### Speaker Encoder
- **Type**: LSTM + Linear Projection
- **Input**: 40-channel mel-spectrogram
- **Output**: 256-dimensional speaker embedding
- **Parameters**: ~5M
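A simplified PyTorch sketch of this shape (the layer count is an assumption; see `encoder/model.py` for the real implementation):
```python
import torch
import torch.nn as nn

class SpeakerEncoderSketch(nn.Module):
    """LSTM + linear projection producing an L2-normalized 256-d embedding."""

    def __init__(self, n_mels=40, hidden=256, embed=256, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, layers, batch_first=True)
        self.proj = nn.Linear(hidden, embed)
        self.relu = nn.ReLU()

    def forward(self, mels):  # mels: (batch, frames, n_mels)
        _, (hidden, _) = self.lstm(mels)
        embeds = self.relu(self.proj(hidden[-1]))
        # Normalize so embeddings lie on the unit hypersphere
        return embeds / torch.norm(embeds, dim=1, keepdim=True)

# Ten utterances of 160 frames with 40 mel channels -> ten 256-d embeddings
print(SpeakerEncoderSketch()(torch.randn(10, 160, 40)).shape)  # torch.Size([10, 256])
```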
#### Synthesizer (Tacotron 2)
- **Encoder**: CBHG (Convolution Bank + Highway + GRU)
- **Decoder**: Attention-based LSTM
- **PostNet**: 5-layer Residual CNN
- **Parameters**: ~31M
#### Vocoder (WaveRNN)
- **Type**: Recurrent Neural Vocoder
- **Mode**: Raw 9-bit with mu-law
- **Upsample Factors**: (5, 5, 8), whose product (200) matches the hop size
- **Parameters**: ~4.5M
### Text Processing
The system includes sophisticated text normalization:
- **Numbers**: "123" → "one hundred twenty-three"
- **Currency**: "$5.50" → "five dollars, fifty cents"
- **Ordinals**: "1st" → "first"
- **Abbreviations**: "Dr." → "doctor"
- **Unicode**: Automatic transliteration to ASCII
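Two of these steps map directly onto the `inflect` and `unidecode` packages from the dependency list; a quick illustration (a sketch; see `synthesizer/utils/cleaners.py` and `numbers.py` for the actual rules):
```python
import inflect
from unidecode import unidecode

p = inflect.engine()
print(p.number_to_words(123, andword=""))  # one hundred twenty-three
print(p.number_to_words("1st"))            # first
print(unidecode("Grüße, café"))            # Grusse, cafe
```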
### Performance
| Hardware | Generation Speed |
|----------|------------------|
| CPU (Intel i7) | 2-3x real-time |
| GPU (GTX 1060) | 10-15x real-time |
| GPU (RTX 3080) | 30-50x real-time |
Example: Generating 10 seconds of audio takes ~3-5 seconds on CPU.
## How to Use for Different Applications
### Podcast/Narration
```python
TEXT_TO_CLONE = """
Welcome to today's episode. In this podcast, we'll be discussing
the fascinating world of artificial intelligence and voice synthesis.
Let's dive right in!
"""
```
### Audiobook
```python
TEXT_TO_CLONE = """
Chapter One: The Beginning.
It was a dark and stormy night when everything changed.
The old house stood alone on the hill, its windows dark and unwelcoming.
"""
```
### Voiceover
```python
TEXT_TO_CLONE = """
Introducing the all-new product that will change your life.
With advanced features and intuitive design, it's the perfect solution.
"""
```
### Multiple Languages
The pretrained models support English out of the box. For other languages:
1. Use English transliteration for best results
2. Or modify `synthesizer/utils/cleaners.py` for your language
## Comparison with Other Methods
| Method | Quality | Speed | Setup |
|--------|---------|-------|-------|
| Traditional TTS | Low | Fast | Easy |
| Commercial APIs | High | Fast | API Key Required |
| **This Project** | High | Medium | One-time Setup |
| Training from Scratch | High | Slow | Very Complex |
## Best Practices
### For Best Voice Quality:
1. **Reference Audio**:
- 3-10 seconds long
- Clear speech, no background noise
- Natural speaking tone (not reading/singing)
- 16kHz+ sample rate if possible
2. **Text Input**:
- Use proper punctuation for natural pauses
- Break very long texts into paragraphs
- Avoid excessive special characters
3. **Output**:
- Generate shorter clips for better quality
- Concatenate multiple clips if needed (see the sketch after this list)
- Post-process with audio editing software for polish
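Building on the points above, one way to handle long inputs is to split the text on sentence boundaries, synthesize each chunk, and concatenate the audio. A sketch, where `synthesize` is a hypothetical placeholder for whichever generation method you use:
```python
import re

import numpy as np

def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# synthesize() is a hypothetical placeholder for your chosen pipeline:
# wavs = [synthesize(chunk) for chunk in chunk_text(long_text)]
# full_audio = np.concatenate(wavs)
```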
## Known Limitations
- Works best with English text
- Requires good quality reference audio
- May not perfectly capture highly distinctive voice characteristics
- Background noise in reference affects output quality
- Very short reference audio (<3 seconds) may produce inconsistent results
## Future Improvements
- [ ] Add GUI interface
- [ ] Support for multiple languages
- [ ] Real-time streaming mode
- [ ] Voice mixing/morphing capabilities
- [ ] Fine-tuning on custom datasets
- [ ] Mobile app version
## Credits
This implementation is based on:
- **SV2TTS**: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
- **Tacotron 2**: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- **WaveRNN**: Efficient Neural Audio Synthesis
Original research papers:
- [SV2TTS Paper](https://arxiv.org/abs/1806.04558)
- [Tacotron 2 Paper](https://arxiv.org/abs/1712.05884)
- [WaveRNN Paper](https://arxiv.org/abs/1802.08435)
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## Show Your Support
If this project helped you, please give it a star!
## Contact
For questions or support, please open an issue on GitHub.
---
**Made with love by the Voice Cloning Community**
*Last Updated: October 30, 2025*