---
title: Voice Cloning Backend
emoji: 🎤
colorFrom: purple
colorTo: blue
sdk: docker
app_file: backend/wsgi.py
pinned: false
---

# Real-Time Voice Cloning (RTVC) - Backend API

A complete full-stack voice cloning application with a React frontend and a PyTorch backend that can synthesize speech in anyone's voice from just a few seconds of reference audio.

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.5+-red.svg)](https://pytorch.org/)
[![React](https://img.shields.io/badge/React-18.0+-61dafb.svg)](https://reactjs.org/)
[![TypeScript](https://img.shields.io/badge/TypeScript-5.0+-blue.svg)](https://www.typescriptlang.org/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

## Features

- **Full-Stack Application**: Modern React UI + Flask API + PyTorch backend
- **Voice Enrollment**: Record or upload voice samples directly in the browser
- **Speech Synthesis**: Generate cloned speech with an intuitive interface
- **Voice Cloning**: Clone any voice from just 3-10 seconds of audio
- **Real-Time Generation**: Generate speech at 2-3x real-time speed on CPU
- **High Quality**: Natural-sounding synthetic speech using state-of-the-art models
- **Easy to Use**: Beautiful UI with 3D visualizations and audio waveforms
- **Multiple Formats**: Supports WAV, MP3, M4A, and FLAC input audio
- **Multi-Language**: Supports English and Hindi text-to-speech

## Table of Contents

- [Demo](#demo)
- [Quick Start (Full Stack)](#quick-start-full-stack)
- [Deployment](#deployment)
- [How It Works](#how-it-works)
- [Installation](#installation)
- [Project Structure](#project-structure)
- [Usage Examples](#usage-examples)
- [Troubleshooting](#troubleshooting)
- [Technical Details](#technical-details)
- [Credits](#credits)

## Demo

- **Frontend UI**: Modern React interface with 3D visualizations
- **Voice Enrollment**: Record/upload voice samples → backend saves them to the database
- **Speech Synthesis**: Select a voice + enter text → backend generates cloned speech
- **Playback**: Listen to the generated audio directly in the browser, or download it

## Quick Start (Full Stack)

### Option 1: Using the Startup Script (Easiest)

```powershell
# Windows PowerShell
cd rtvc
.\start_app.ps1
```

This will:

1. Start the backend API server (port 5000)
2. Start the frontend dev server (port 8080)
3. Open your browser to http://localhost:8080

### Option 2: Manual Start

**Terminal 1 - Backend API:**

```bash
cd rtvc
python api_server.py
```

**Terminal 2 - Frontend:**

```bash
cd "rtvc/Frontend Voice Cloning"
npm run dev
```

Then open http://localhost:8080 in your browser.

## Deployment

### Production Deployment Stack

- **Frontend**: Netlify (free tier)
- **Backend**: Render (free tier)
- **Models**: HuggingFace Hub (free)

See [DEPLOYMENT.md](DEPLOYMENT.md) for the complete deployment guide.

#### Quick Deployment

1. **Deploy the backend to Render**
   - Push to GitHub
   - Connect Render to the GitHub repo
   - Use the `render.yaml` configuration
   - Models auto-download on the first deploy (~10 minutes)

2. **Deploy the frontend to Netlify**
   - Connect Netlify to the GitHub repo
   - Set the base directory: `frontend`
   - Environment: `VITE_API_URL=your-render-backend-url`

3. **Test**
   - Visit your Netlify URL
   - API calls automatically route to the Render backend

**Pricing**: Free tier for both (with optional paid upgrades)

### Using the Application

1. **Enroll a Voice**:
   - Go to the "Voice Enrollment" section
   - Enter a voice name
   - Record audio (3-10 seconds) or upload a file
   - Click "Enroll Voice"

2. **Generate Speech**:
   - Go to the "Speech Synthesis" section
   - Select your enrolled voice
   - Enter the text to synthesize
   - Click "Generate Speech"
   - Play or download the result

For detailed integration information, see [INTEGRATION_GUIDE.md](INTEGRATION_GUIDE.md).
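The same enroll-then-synthesize flow can also be scripted against the backend directly. The sketch below uses the `requests` library (`pip install requests`); the endpoint paths and JSON field names are hypothetical placeholders, so substitute the actual routes from `api_server.py` (or INTEGRATION_GUIDE.md) before using it.

```python
# Minimal sketch of scripting the enroll -> synthesize flow against the
# local Flask API. NOTE: the endpoint paths and field names below are
# hypothetical placeholders, not confirmed routes of api_server.py.
import requests

API = "http://localhost:5000"

# Enroll a voice by uploading a 3-10 second reference sample.
with open("sample/my_voice.wav", "rb") as f:
    resp = requests.post(
        f"{API}/api/voices",  # hypothetical endpoint
        files={"audio": f},
        data={"name": "my-voice"},
    )
resp.raise_for_status()
voice_id = resp.json()["id"]  # hypothetical response field

# Synthesize speech with the enrolled voice.
resp = requests.post(
    f"{API}/api/synthesize",  # hypothetical endpoint
    json={"voice_id": voice_id, "text": "Hello, this is my cloned voice!"},
)
resp.raise_for_status()

# Save the returned WAV bytes.
with open("outputs/cloned_voice.wav", "wb") as out:
    out.write(resp.content)
```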
## How It Works

The system uses a 3-stage pipeline based on the SV2TTS (Speaker Verification to Text-to-Speech) architecture:

```
Reference Audio → [Encoder] → Speaker Embedding (256-d vector)
                                        ↓
Text Input → [Synthesizer (Tacotron)] → Mel-Spectrogram
                                        ↓
                  [Vocoder (WaveRNN)] → Audio Output
```

### Pipeline Stages

1. **Speaker Encoder** - Extracts a unique voice "fingerprint" from the reference audio
2. **Synthesizer** - Generates mel-spectrograms from text, conditioned on the speaker embedding
3. **Vocoder** - Converts mel-spectrograms into high-quality audio waveforms
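The same pipeline can be driven from Python using the three inference modules in this repository. A minimal sketch, assuming `encoder/inference.py`, `synthesizer/inference.py`, and `vocoder/inference.py` keep the interfaces of the original Real-Time-Voice-Cloning project (verify the signatures in those files before relying on this):

```python
# End-to-end sketch of the 3-stage pipeline. Assumes the inference modules
# expose the same API as the original Real-Time-Voice-Cloning codebase.
from pathlib import Path

import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load the three pretrained models from models/default/.
encoder.load_model(Path("models/default/encoder.pt"))
synthesizer = Synthesizer(Path("models/default/synthesizer.pt"))
vocoder.load_model(Path("models/default/vocoder.pt"))

# Stage 1: speaker encoder -> 256-d embedding from the reference audio.
wav = encoder.preprocess_wav(Path("sample/your_voice.mp3"))
embedding = encoder.embed_utterance(wav)

# Stage 2: synthesizer -> mel-spectrogram conditioned on the embedding.
mels = synthesizer.synthesize_spectrograms(
    ["Hello, this is my cloned voice!"], [embedding]
)

# Stage 3: vocoder -> waveform from the mel-spectrogram.
audio = vocoder.infer_waveform(mels[0])

sf.write("outputs/cloned_voice.wav", audio, Synthesizer.sample_rate)
```

This is essentially the flow that `clone_my_voice.py` and `run_cli.py` wrap for you.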
## Installation

### Prerequisites

- Python 3.11 or higher
- Windows/Linux/macOS
- ~2 GB disk space for models
- 4 GB RAM minimum (8 GB recommended)

### Step 1: Clone the Repository

```bash
git clone https://github.com/yourusername/rtvc.git
cd rtvc
```

### Step 2: Install Dependencies

```bash
pip install torch numpy librosa scipy soundfile webrtcvad tqdm unidecode inflect matplotlib numba
```

Or install PyTorch with CUDA for GPU acceleration:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install numpy librosa scipy soundfile webrtcvad tqdm unidecode inflect matplotlib numba
```

### Step 3: Download Pretrained Models

Download the pretrained models from [Google Drive](https://drive.google.com/drive/folders/1fU6umc5uQAVR2udZdHX-lDgXYzTyqG_j):

| Model | Size | Description |
|-------|------|-------------|
| encoder.pt | 17 MB | Speaker encoder model |
| synthesizer.pt | 370 MB | Tacotron synthesizer model |
| vocoder.pt | 53 MB | WaveRNN vocoder model |

Place all three files in the `models/default/` directory.

### Step 4: Verify Installation

```bash
python clone_my_voice.py
```

If you see errors about missing models, check that all three `.pt` files are in `models/default/`.

## Quick Start

### Method 1: Simple Script (Recommended)

1. Open `clone_my_voice.py`
2. Edit these lines:

```python
# Your voice sample file
VOICE_FILE = r"sample\your_voice.mp3"

# The text you want to be spoken
TEXT_TO_CLONE = """
Your text here. Can be multiple sentences or even paragraphs!
"""

# Output location
OUTPUT_FILE = r"outputs\cloned_voice.wav"
```

3. Run it:

```bash
python clone_my_voice.py
```

### Method 2: Command Line

```bash
python run_cli.py --voice "path/to/voice.wav" --text "Text to synthesize" --out "output.wav"
```

### Method 3: Advanced Runner Script

```bash
python run_voice_cloning.py
```

Edit the paths and text inside the script before running.

## Project Structure

```
rtvc/
├── clone_my_voice.py            # Simple script - EDIT THIS to clone your voice!
├── run_cli.py                   # Command-line interface
│
├── encoder/                     # Speaker Encoder Module
│   ├── __init__.py
│   ├── audio.py                 # Audio preprocessing for encoder
│   ├── inference.py             # Encoder inference functions
│   ├── model.py                 # SpeakerEncoder neural network
│   ├── params_data.py           # Data hyperparameters
│   └── params_model.py          # Model hyperparameters
│
├── synthesizer/                 # Tacotron Synthesizer Module
│   ├── __init__.py
│   ├── audio.py                 # Audio processing for synthesizer
│   ├── hparams.py               # All synthesizer hyperparameters
│   ├── inference.py             # Synthesizer inference class
│   │
│   ├── models/
│   │   └── tacotron.py          # Tacotron 2 architecture
│   │
│   └── utils/
│       ├── cleaners.py          # Text cleaning functions
│       ├── numbers.py           # Number-to-text conversion
│       ├── symbols.py           # Character/phoneme symbols
│       └── text.py              # Text-to-sequence conversion
│
├── vocoder/                     # WaveRNN Vocoder Module
│   ├── audio.py                 # Audio utilities for vocoder
│   ├── display.py               # Progress display utilities
│   ├── distribution.py          # Probability distributions
│   ├── hparams.py               # Vocoder hyperparameters
│   ├── inference.py             # Vocoder inference functions
│   │
│   └── models/
│       └── fatchord_version.py  # WaveRNN architecture
│
├── utils/
│   └── default_models.py        # Model download utilities
│
├── models/
│   └── default/                 # Pretrained models go here
│       ├── encoder.pt           # (17 MB)
│       ├── synthesizer.pt       # (370 MB) - Must download!
│       └── vocoder.pt           # (53 MB)
│
├── sample/                      # Put your voice samples here
│   └── your_voice.mp3
│
└── outputs/                     # Generated audio outputs
    └── cloned_voice.wav
```

### Key Files Explained

| File | Purpose |
|------|---------|
| `clone_my_voice.py` | **START HERE** - Simplest way to clone your voice |
| `run_cli.py` | Command-line tool for voice cloning |
| `encoder/inference.py` | Loads the encoder and extracts speaker embeddings |
| `synthesizer/inference.py` | Loads the synthesizer and generates mel-spectrograms |
| `vocoder/inference.py` | Loads the vocoder and generates waveforms |
| `**/hparams.py` | Configuration files for each module |

## Usage Examples

### Example 1: Basic Voice Cloning

Edit `clone_my_voice.py` first:

```python
VOICE_FILE = r"sample\my_voice.mp3"
TEXT_TO_CLONE = "Hello, this is my cloned voice!"
```

Then run:

```bash
python clone_my_voice.py
```

### Example 2: Multiple Outputs

```bash
# Generate the first output
python run_cli.py --voice "voice.wav" --text "First message" --out "output1.wav"

# Generate a second output with the same voice
python run_cli.py --voice "voice.wav" --text "Second message" --out "output2.wav"
```

### Example 3: Long Text

```bash
python run_cli.py --voice "voice.wav" --text "This is a very long text that spans multiple sentences. The voice cloning system will synthesize all of it in the reference voice. You can make it as long as you need."
```

### Example 4: Different Voice Samples

```bash
# Clone voice A
python run_cli.py --voice "person_a.wav" --text "Message from person A"

# Clone voice B
python run_cli.py --voice "person_b.wav" --text "Message from person B"
```
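To generate many clips in one run, you can drive `run_cli.py` from a small batch script. The sketch below uses only the documented `--voice`, `--text`, and `--out` flags:

```python
# Batch-generate several clips with the same reference voice by invoking
# run_cli.py once per line of text.
import subprocess
import sys

VOICE = "sample/my_voice.wav"
LINES = [
    "Welcome to the show.",
    "Today we are talking about voice cloning.",
    "Thanks for listening!",
]

for i, text in enumerate(LINES, start=1):
    out = f"outputs/clip_{i:02d}.wav"
    subprocess.run(
        [sys.executable, "run_cli.py",
         "--voice", VOICE, "--text", text, "--out", out],
        check=True,  # stop on the first failed synthesis
    )
    print(f"Wrote {out}")
```

Each invocation reloads the models, so for long batches the Python pipeline shown in [How It Works](#how-it-works) will generally be faster.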
## Troubleshooting

### Common Issues

#### "Model file not found"

**Solution**: Download the models from Google Drive and place them in `models/default/`:

- https://drive.google.com/drive/folders/1fU6umc5uQAVR2udZdHX-lDgXYzTyqG_j

Verify the file sizes:

```bash
# Windows
dir models\default\*.pt

# Linux/macOS
ls -lh models/default/*.pt
```

Expected sizes:

- encoder.pt: 17,090,379 bytes (17 MB)
- synthesizer.pt: 370,554,559 bytes (370 MB) - the most common one to be missing!
- vocoder.pt: 53,845,290 bytes (53 MB)

#### "Reference voice file not found"

**Solution**: Use an absolute path, or check your current directory:

```python
# Use an absolute path
VOICE_FILE = r"C:\Users\YourName\Desktop\voice.mp3"

# Or a path relative to the project root
VOICE_FILE = r"sample\voice.mp3"
```

#### Output sounds robotic or unclear

**Solutions**:

- Use a higher quality voice sample (16 kHz+ sample rate)
- Ensure the voice sample is 3-10 seconds long
- Remove background noise from the voice sample
- Speak clearly and naturally in the reference audio

#### "AttributeError: module 'numpy' has no attribute 'cumproduct'"

**Solution**: This is already fixed in the code. If you still see it:

```bash
pip install --upgrade numpy
```

#### Slow generation on CPU

**Solutions**:

- Normal speed is 2-3x real-time on modern CPUs
- For faster generation, install PyTorch with CUDA:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118
```

The system will then automatically use the GPU if one is available.

### Getting Help

If you encounter other issues:

1. Check the `HOW_TO_RUN.md` file for detailed instructions
2. Verify all models are downloaded correctly
3. Ensure Python 3.11+ is installed
4. Check that all dependencies are installed

## Technical Details

### Audio Specifications

| Parameter | Value |
|-----------|-------|
| Sample Rate | 16,000 Hz |
| Channels | Mono |
| Bit Depth | 16-bit |
| FFT Size | 800 samples (50 ms) |
| Hop Size | 200 samples (12.5 ms) |
| Mel Channels | 80 (synthesizer/vocoder), 40 (encoder) |

### Model Architectures

#### Speaker Encoder

- **Type**: LSTM + linear projection
- **Input**: 40-channel mel-spectrogram
- **Output**: 256-dimensional speaker embedding
- **Parameters**: ~5M

#### Synthesizer (Tacotron 2)

- **Encoder**: CBHG (Convolution Bank + Highway + GRU)
- **Decoder**: Attention-based LSTM
- **PostNet**: 5-layer residual CNN
- **Parameters**: ~31M

#### Vocoder (WaveRNN)

- **Type**: Recurrent neural vocoder
- **Mode**: Raw 9-bit with mu-law
- **Upsample Factors**: (5, 5, 8)
- **Parameters**: ~4.5M

### Text Processing

The system includes sophisticated text normalization:

- **Numbers**: "123" → "one hundred twenty three"
- **Currency**: "$5.50" → "five dollars, fifty cents"
- **Ordinals**: "1st" → "first"
- **Abbreviations**: "Dr." → "doctor"
- **Unicode**: Automatic transliteration to ASCII

### Performance

| Hardware | Generation Speed |
|----------|------------------|
| CPU (Intel i7) | 2-3x real-time |
| GPU (GTX 1060) | 10-15x real-time |
| GPU (RTX 3080) | 30-50x real-time |

Example: generating 10 seconds of audio takes ~3-5 seconds on CPU.

## How to Use for Different Applications

### Podcast/Narration

```python
TEXT_TO_CLONE = """
Welcome to today's episode. In this podcast, we'll be discussing
the fascinating world of artificial intelligence and voice synthesis.
Let's dive right in!
"""
```

### Audiobook

```python
TEXT_TO_CLONE = """
Chapter One: The Beginning.
It was a dark and stormy night when everything changed.
The old house stood alone on the hill, its windows dark and unwelcoming.
"""
```

### Voiceover

```python
TEXT_TO_CLONE = """
Introducing the all-new product that will change your life.
With advanced features and intuitive design, it's the perfect solution.
"""
```

### Multiple Languages

The system supports English out of the box. For other languages:

1. Use an English transliteration for best results, or
2. Modify `synthesizer/utils/cleaners.py` for your language (see the sketch below)
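As a starting point for the second option, here is a minimal sketch of a custom cleaner. It assumes `synthesizer/utils/cleaners.py` follows the original Real-Time-Voice-Cloning convention, where a cleaner is a plain text-to-text function selected by name in the synthesizer hyperparameters; the name `my_language_cleaners` is a hypothetical example.

```python
# Sketch of a custom cleaner for synthesizer/utils/cleaners.py. Assumes the
# module keeps the original RTVC convention of plain text -> text functions.
# "my_language_cleaners" is a hypothetical example name.
import re

from unidecode import unidecode  # already among the project's dependencies

_whitespace_re = re.compile(r"\s+")


def my_language_cleaners(text: str) -> str:
    """Basic pipeline for a new language: transliterate, lowercase, collapse whitespace."""
    text = unidecode(text)                # map non-ASCII characters to ASCII
    text = text.lower()                   # the symbol set is lowercase-only
    text = _whitespace_re.sub(" ", text)  # collapse runs of whitespace
    return text.strip()
```

You would then reference the new cleaner wherever the synthesizer hyperparameters select cleaners by name (in `synthesizer/hparams.py`, assuming the original layout).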
## Comparison with Other Methods

| Method | Quality | Speed | Setup |
|--------|---------|-------|-------|
| Traditional TTS | Low | Fast | Easy |
| Commercial APIs | High | Fast | API key required |
| **This Project** | High | Medium | One-time setup |
| Training from Scratch | High | Slow | Very complex |

## Best Practices

### For Best Voice Quality

1. **Reference Audio**:
   - 3-10 seconds long
   - Clear speech with no background noise
   - Natural speaking tone (not reading or singing)
   - 16 kHz+ sample rate if possible

2. **Text Input**:
   - Use proper punctuation for natural pauses
   - Break very long texts into paragraphs
   - Avoid excessive special characters

3. **Output**:
   - Generate shorter clips for better quality
   - Concatenate multiple clips if needed
   - Post-process with audio editing software for polish

## Known Limitations

- Works best with English text
- Requires good quality reference audio
- May not perfectly capture very distinctive voice characteristics
- Background noise in the reference affects output quality
- Very short reference audio (<3 seconds) may produce inconsistent results

## Future Improvements

- [ ] Add GUI interface
- [ ] Support for multiple languages
- [ ] Real-time streaming mode
- [ ] Voice mixing/morphing capabilities
- [ ] Fine-tuning on custom datasets
- [ ] Mobile app version

## Credits

This implementation is based on:

- **SV2TTS**: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
- **Tacotron 2**: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- **WaveRNN**: Efficient Neural Audio Synthesis

Original research papers:

- [SV2TTS Paper](https://arxiv.org/abs/1806.04558)
- [Tacotron 2 Paper](https://arxiv.org/abs/1712.05884)
- [WaveRNN Paper](https://arxiv.org/abs/1802.08435)

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Show Your Support

If this project helped you, please give it a star!

## Contact

For questions or support, please open an issue on GitHub.

---

**Made with love by the Voice Cloning Community**

*Last Updated: October 30, 2025*