---
title: Voice Cloning Backend
emoji: 🤖
colorFrom: purple
colorTo: blue
sdk: docker
app_file: backend/wsgi.py
pinned: false
---
# Real-Time Voice Cloning (RTVC) - Backend API

A complete full-stack voice cloning application with a React frontend and a PyTorch backend that can synthesize speech in anyone's voice from just a few seconds of reference audio.
[Python 3.11+](https://www.python.org/downloads/) · [PyTorch](https://pytorch.org/) · [React](https://reactjs.org/) · [TypeScript](https://www.typescriptlang.org/) · [MIT License](LICENSE)
## Features

- **Full-Stack Application**: Modern React UI + Flask API + PyTorch backend
- **Voice Enrollment**: Record or upload voice samples directly in the browser
- **Speech Synthesis**: Generate cloned speech through an intuitive interface
- **Voice Cloning**: Clone any voice from just 3-10 seconds of audio
- **Real-Time Generation**: Generates speech at 2-3x real-time speed on CPU
- **High Quality**: Natural-sounding synthetic speech using state-of-the-art models
- **Easy to Use**: Beautiful UI with 3D visualizations and audio waveforms
- **Multiple Formats**: Supports WAV, MP3, M4A, and FLAC input audio
- **Multi-Language**: Supports English and Hindi text-to-speech
## Table of Contents

- [Demo](#demo)
- [Quick Start (Full Stack)](#quick-start-full-stack)
- [Deployment](#deployment)
- [How It Works](#how-it-works)
- [Installation](#installation)
- [Project Structure](#project-structure)
- [Usage Examples](#usage-examples)
- [API Documentation](#api-documentation)
- [Troubleshooting](#troubleshooting)
- [Technical Details](#technical-details)
- [Credits](#credits)
## Demo

**Frontend UI**: Modern React interface with 3D visualizations

**Voice Enrollment**: Record/upload voice samples → backend saves them to the database

**Speech Synthesis**: Select a voice + enter text → backend generates cloned speech

**Playback**: Listen to generated audio directly in the browser, or download it
## Quick Start (Full Stack)

### Option 1: Using the Startup Script (Easiest)

```powershell
# Windows PowerShell
cd rtvc
.\start_app.ps1
```

This will:

1. Start the Backend API server (port 5000)
2. Start the Frontend dev server (port 8080)
3. Open your browser to http://localhost:8080
### Option 2: Manual Start

**Terminal 1 - Backend API:**

```bash
cd rtvc
python api_server.py
```

**Terminal 2 - Frontend:**

```bash
cd "rtvc/Frontend Voice Cloning"
npm run dev
```

Then open http://localhost:8080 in your browser.
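Once both servers are running, you can sanity-check the backend from Python. This is a minimal sketch using `requests` (install it separately with `pip install requests`); the root route is an assumption, so substitute whichever endpoint your `api_server.py` actually exposes.

```python
# Minimal backend smoke test. NOTE: the "/" route is an assumption --
# replace it with an endpoint your api_server.py actually defines.
import requests

resp = requests.get("http://localhost:5000/")
print(resp.status_code)  # any HTTP response means the Flask API is up
```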
## Deployment

### Production Deployment Stack

**Frontend**: Netlify (free tier)
**Backend**: Render (free tier)
**Models**: HuggingFace Hub (free)

See [DEPLOYMENT.md](DEPLOYMENT.md) for the complete deployment guide.

#### Quick Deployment

1. **Deploy the Backend to Render**
   - Push to GitHub
   - Connect Render to the GitHub repo
   - Use the `render.yaml` configuration
   - Models auto-download on first deploy (~10 minutes)
2. **Deploy the Frontend to Netlify**
   - Connect Netlify to the GitHub repo
   - Set the base directory: `frontend`
   - Environment: `VITE_API_URL=your-render-backend-url`
3. **Test**
   - Visit your Netlify URL
   - API calls automatically route to the Render backend

**Pricing**: Free tier for both (with optional paid upgrades)
### Using the Application

1. **Enroll a Voice**:
   - Go to the "Voice Enrollment" section
   - Enter a voice name
   - Record audio (3-10 seconds) or upload a file
   - Click "Enroll Voice"
2. **Generate Speech**:
   - Go to the "Speech Synthesis" section
   - Select your enrolled voice
   - Enter the text to synthesize
   - Click "Generate Speech"
   - Play or download the result

For detailed integration information, see [INTEGRATION_GUIDE.md](INTEGRATION_GUIDE.md).
## How It Works

The system uses a three-stage pipeline based on the SV2TTS (Speaker Verification to Text-to-Speech) architecture:

```
Reference Audio → [Encoder] → Speaker Embedding (256-d vector)
                                      ↓
Text Input → [Synthesizer (Tacotron)] → Mel-Spectrogram
                                      ↓
                    [Vocoder (WaveRNN)] → Audio Output
```

### Pipeline Stages

1. **Speaker Encoder** - Extracts a unique voice "fingerprint" from the reference audio
2. **Synthesizer** - Generates mel-spectrograms from text, conditioned on the speaker embedding
3. **Vocoder** - Converts mel-spectrograms to high-quality audio waveforms
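The same three stages can be driven directly from Python. The sketch below follows the module layout of this repo; the function names mirror the SV2TTS reference implementation, so verify them against `encoder/inference.py`, `synthesizer/inference.py`, and `vocoder/inference.py` in your checkout.

```python
# End-to-end pipeline sketch. Function names follow the SV2TTS reference
# implementation this repo is based on -- check your local inference modules.
from pathlib import Path

import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# 1. Load the three pretrained models
encoder.load_model(Path("models/default/encoder.pt"))
synthesizer = Synthesizer(Path("models/default/synthesizer.pt"))
vocoder.load_model(Path("models/default/vocoder.pt"))

# 2. Encoder: reference audio -> 256-d speaker embedding
wav = encoder.preprocess_wav(Path("sample/your_voice.mp3"))
embedding = encoder.embed_utterance(wav)

# 3. Synthesizer: text + embedding -> mel-spectrogram
specs = synthesizer.synthesize_spectrograms(["Hello, world!"], [embedding])

# 4. Vocoder: mel-spectrogram -> waveform, saved at the synthesizer's rate
generated = vocoder.infer_waveform(specs[0])
sf.write("outputs/cloned_voice.wav", generated, Synthesizer.sample_rate)
```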
## Installation

### Prerequisites

- Python 3.11 or higher
- Windows/Linux/macOS
- ~2 GB of disk space for models
- 4 GB RAM minimum (8 GB recommended)

### Step 1: Clone the Repository

```bash
git clone https://github.com/yourusername/rtvc.git
cd rtvc
```

### Step 2: Install Dependencies

```bash
pip install torch numpy librosa scipy soundfile webrtcvad tqdm unidecode inflect matplotlib numba
```

Or install PyTorch with CUDA for GPU acceleration:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install numpy librosa scipy soundfile webrtcvad tqdm unidecode inflect matplotlib numba
```
### Step 3: Download Pretrained Models

Download the pretrained models from [Google Drive](https://drive.google.com/drive/folders/1fU6umc5uQAVR2udZdHX-lDgXYzTyqG_j):

| Model | Size | Description |
|-------|------|-------------|
| encoder.pt | 17 MB | Speaker encoder model |
| synthesizer.pt | 370 MB | Tacotron synthesizer model |
| vocoder.pt | 53 MB | WaveRNN vocoder model |

Place all three files in the `models/default/` directory.

### Step 4: Verify Installation

```bash
python clone_my_voice.py
```

If you see errors about missing models, check that all three `.pt` files are in `models/default/`.
## Quick Start

### Method 1: Simple Script (Recommended)

1. Open `clone_my_voice.py`
2. Edit these lines:

```python
# Your voice sample file
VOICE_FILE = r"sample\your_voice.mp3"

# The text you want to be spoken
TEXT_TO_CLONE = """
Your text here. Can be multiple sentences or even paragraphs!
"""

# Output location
OUTPUT_FILE = r"outputs\cloned_voice.wav"
```

3. Run it:

```bash
python clone_my_voice.py
```

### Method 2: Command Line

```bash
python run_cli.py --voice "path/to/voice.wav" --text "Text to synthesize" --out "output.wav"
```

### Method 3: Advanced Runner Script

```bash
python run_voice_cloning.py
```

Edit the paths and text inside the script before running.
## Project Structure

```
rtvc/
├── clone_my_voice.py        # Simple script - EDIT THIS to clone your voice!
├── run_cli.py               # Command-line interface
│
├── encoder/                 # Speaker Encoder Module
│   ├── __init__.py
│   ├── audio.py             # Audio preprocessing for encoder
│   ├── inference.py         # Encoder inference functions
│   ├── model.py             # SpeakerEncoder neural network
│   ├── params_data.py       # Data hyperparameters
│   └── params_model.py      # Model hyperparameters
│
├── synthesizer/             # Tacotron Synthesizer Module
│   ├── __init__.py
│   ├── audio.py             # Audio processing for synthesizer
│   ├── hparams.py           # All synthesizer hyperparameters
│   ├── inference.py         # Synthesizer inference class
│   │
│   ├── models/
│   │   └── tacotron.py      # Tacotron 2 architecture
│   │
│   └── utils/
│       ├── cleaners.py      # Text cleaning functions
│       ├── numbers.py       # Number-to-text conversion
│       ├── symbols.py       # Character/phoneme symbols
│       └── text.py          # Text-to-sequence conversion
│
├── vocoder/                 # WaveRNN Vocoder Module
│   ├── audio.py             # Audio utilities for vocoder
│   ├── display.py           # Progress display utilities
│   ├── distribution.py      # Probability distributions
│   ├── hparams.py           # Vocoder hyperparameters
│   ├── inference.py         # Vocoder inference functions
│   │
│   └── models/
│       └── fatchord_version.py  # WaveRNN architecture
│
├── utils/
│   └── default_models.py    # Model download utilities
│
├── models/
│   └── default/             # Pretrained models go here
│       ├── encoder.pt       # (17 MB)
│       ├── synthesizer.pt   # (370 MB) - Must download!
│       └── vocoder.pt       # (53 MB)
│
├── sample/                  # Put your voice samples here
│   └── your_voice.mp3
│
└── outputs/                 # Generated audio outputs
    └── cloned_voice.wav
```
### Key Files Explained

| File | Purpose |
|------|---------|
| `clone_my_voice.py` | **START HERE** - Simplest way to clone your voice |
| `run_cli.py` | Command-line tool for voice cloning |
| `encoder/inference.py` | Loads the encoder and extracts speaker embeddings |
| `synthesizer/inference.py` | Loads the synthesizer and generates mel-spectrograms |
| `vocoder/inference.py` | Loads the vocoder and generates waveforms |
| `**/hparams.py` | Configuration files for each module |
## Usage Examples

### Example 1: Basic Voice Cloning

Edit `clone_my_voice.py` first:

```python
VOICE_FILE = r"sample\my_voice.mp3"
TEXT_TO_CLONE = "Hello, this is my cloned voice!"
```

Then run:

```bash
python clone_my_voice.py
```

### Example 2: Multiple Outputs

```bash
# Generate the first output
python run_cli.py --voice "voice.wav" --text "First message" --out "output1.wav"

# Generate a second output with the same voice
python run_cli.py --voice "voice.wav" --text "Second message" --out "output2.wav"
```

### Example 3: Long Text

```bash
python run_cli.py --voice "voice.wav" --text "This is a very long text that spans multiple sentences. The voice cloning system will synthesize all of it in the reference voice. You can make it as long as you need."
```

### Example 4: Different Voice Samples

```bash
# Clone voice A
python run_cli.py --voice "person_a.wav" --text "Message from person A"

# Clone voice B
python run_cli.py --voice "person_b.wav" --text "Message from person B"
```
## Troubleshooting

### Common Issues

#### "Model file not found"

**Solution**: Download the models from [Google Drive](https://drive.google.com/drive/folders/1fU6umc5uQAVR2udZdHX-lDgXYzTyqG_j) and place them in `models/default/`.

Verify the file sizes:

```bash
# Windows
dir models\default\*.pt

# Linux/Mac
ls -lh models/default/*.pt
```

Expected sizes:

- encoder.pt: 17,090,379 bytes (17 MB)
- synthesizer.pt: 370,554,559 bytes (370 MB) - most common issue!
- vocoder.pt: 53,845,290 bytes (53 MB)
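Equivalently, here is a small cross-platform check in Python (a sketch; the expected byte counts are taken from the list above):

```python
# Verify the downloaded checkpoints match the expected sizes.
from pathlib import Path

expected = {
    "encoder.pt": 17_090_379,
    "synthesizer.pt": 370_554_559,
    "vocoder.pt": 53_845_290,
}

for name, size in expected.items():
    path = Path("models/default") / name
    if not path.exists():
        print(f"MISSING: {path}")
    elif path.stat().st_size != size:
        print(f"CORRUPT: {path} is {path.stat().st_size} bytes, expected {size}")
    else:
        print(f"OK: {path}")
```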
| #### "Reference voice file not found" | |
| **Solution**: Use absolute paths or check current directory: | |
| ```python | |
| # Use absolute path | |
| VOICE_FILE = r"C:\Users\YourName\Desktop\voice.mp3" | |
| # Or relative from project root | |
| VOICE_FILE = r"sample\voice.mp3" | |
| ``` | |
| #### Output sounds robotic or unclear | |
| **Solutions**: | |
| - Use a higher quality voice sample (16kHz+ sample rate) | |
| - Ensure voice sample is 3-10 seconds long | |
| - Remove background noise from voice sample | |
| - Speak clearly and naturally in the reference audio | |
| #### "AttributeError: module 'numpy' has no attribute 'cumproduct'" | |
| **Solution**: This is already fixed in the code. If you see this: | |
| ```bash | |
| pip install --upgrade numpy | |
| ``` | |
#### Slow generation on CPU

**Solutions**:

- Normal speed is 2-3x real-time on modern CPUs
- For faster generation, install PyTorch with CUDA:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118
```

The system will then automatically use the GPU if available.
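To confirm the GPU is actually visible to PyTorch:

```python
import torch

# The inference code selects its device from this same check.
if torch.cuda.is_available():
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    print("CUDA not available; running on CPU")
```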
### Getting Help

If you encounter other issues:

1. Check the `HOW_TO_RUN.md` file for detailed instructions
2. Verify all models are downloaded correctly
3. Ensure Python 3.11+ is installed
4. Check that all dependencies are installed
## Technical Details

### Audio Specifications

| Parameter | Value |
|-----------|-------|
| Sample Rate | 16,000 Hz |
| Channels | Mono |
| Bit Depth | 16-bit |
| FFT Size | 800 samples (50 ms) |
| Hop Size | 200 samples (12.5 ms) |
| Mel Channels | 80 (synthesizer/vocoder), 40 (encoder) |
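For illustration, these parameters map directly onto a standard `librosa` mel-spectrogram call (the repo's own pipeline does this inside `synthesizer/audio.py` with its hparams; this standalone sketch just shows the numbers in action):

```python
# Compute an 80-channel mel-spectrogram with the parameters from the table.
import librosa

wav, sr = librosa.load("sample/your_voice.mp3", sr=16000, mono=True)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=800, hop_length=200, n_mels=80
)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (80, n_frames), one frame every 12.5 ms
```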
### Model Architectures

#### Speaker Encoder

- **Type**: LSTM + linear projection
- **Input**: 40-channel mel-spectrogram
- **Output**: 256-dimensional speaker embedding
- **Parameters**: ~5M

#### Synthesizer (Tacotron 2)

- **Encoder**: CBHG (Convolution Bank + Highway + GRU)
- **Decoder**: Attention-based LSTM
- **PostNet**: 5-layer residual CNN
- **Parameters**: ~31M

#### Vocoder (WaveRNN)

- **Type**: Recurrent neural vocoder
- **Mode**: Raw 9-bit with mu-law
- **Upsample Factors**: (5, 5, 8)
- **Parameters**: ~4.5M
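"Raw 9-bit with mu-law" means the vocoder predicts one of 2^9 = 512 mu-law-companded sample classes per step. Here is a hedged standalone sketch of that companding (the repo's own version lives in `vocoder/audio.py` and `vocoder/distribution.py`):

```python
# 9-bit mu-law companding: floats in [-1, 1] <-> integer classes [0, 511].
import numpy as np

def mulaw_encode(x: np.ndarray, bits: int = 9) -> np.ndarray:
    """Compress samples in [-1, 1] to integer classes in [0, 2**bits - 1]."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(y: np.ndarray, bits: int = 9) -> np.ndarray:
    """Invert mulaw_encode back to floats in [-1, 1]."""
    mu = 2 ** bits - 1
    x = 2 * y.astype(np.float64) / mu - 1
    return np.sign(x) / mu * ((1 + mu) ** np.abs(x) - 1)

samples = np.array([-0.5, 0.0, 0.25, 1.0])
print(mulaw_decode(mulaw_encode(samples)))  # close to the original values
```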
### Text Processing

The system includes sophisticated text normalization:

- **Numbers**: "123" → "one hundred twenty three"
- **Currency**: "$5.50" → "five dollars, fifty cents"
- **Ordinals**: "1st" → "first"
- **Abbreviations**: "Dr." → "doctor"
- **Unicode**: Automatic transliteration to ASCII
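The repo's own cleaners live in `synthesizer/utils/`; as a rough standalone illustration, the same kind of number and ordinal expansion can be done with `inflect`, which is already in the dependency list (exact output wording differs slightly from the examples above):

```python
# Number/ordinal expansion with inflect, one of the listed dependencies.
import inflect

engine = inflect.engine()

print(engine.number_to_words(123))                # one hundred and twenty-three
print(engine.number_to_words(engine.ordinal(1)))  # first
```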
### Performance

| Hardware | Generation Speed |
|----------|------------------|
| CPU (Intel i7) | 2-3x real-time |
| GPU (GTX 1060) | 10-15x real-time |
| GPU (RTX 3080) | 30-50x real-time |

Example: generating 10 seconds of audio takes ~3-5 seconds on a CPU.
## How to Use for Different Applications

### Podcast/Narration

```python
TEXT_TO_CLONE = """
Welcome to today's episode. In this podcast, we'll be discussing
the fascinating world of artificial intelligence and voice synthesis.
Let's dive right in!
"""
```

### Audiobook

```python
TEXT_TO_CLONE = """
Chapter One: The Beginning.
It was a dark and stormy night when everything changed.
The old house stood alone on the hill, its windows dark and unwelcoming.
"""
```

### Voiceover

```python
TEXT_TO_CLONE = """
Introducing the all-new product that will change your life.
With advanced features and intuitive design, it's the perfect solution.
"""
```
### Multiple Languages

The system supports English out of the box. For other languages:

1. Use English transliteration for best results
2. Or modify `synthesizer/utils/cleaners.py` for your language (see the sketch below)
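A hypothetical custom cleaner might look like the following; the function name is an illustration, and the existing English cleaners in `synthesizer/utils/cleaners.py` are the pattern to follow. It uses `unidecode` (a listed dependency), which is also what powers the automatic ASCII transliteration mentioned above.

```python
# Hypothetical custom cleaner sketch for synthesizer/utils/cleaners.py.
from unidecode import unidecode

def my_language_cleaner(text: str) -> str:
    """Transliterate to ASCII, lowercase, and collapse whitespace."""
    text = unidecode(text)         # strip accents/characters the symbol set lacks
    text = text.lower()
    text = " ".join(text.split())  # collapse runs of whitespace
    return text

print(my_language_cleaner("  Grüß   Gott!  "))  # "gruss gott!"
```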
## Comparison with Other Methods

| Method | Quality | Speed | Setup |
|--------|---------|-------|-------|
| Traditional TTS | Low | Fast | Easy |
| Commercial APIs | High | Fast | API key required |
| **This Project** | High | Medium | One-time setup |
| Training from Scratch | High | Slow | Very complex |
## Best Practices

### For Best Voice Quality

1. **Reference Audio**:
   - 3-10 seconds long
   - Clear speech, no background noise
   - Natural speaking tone (not reading/singing)
   - 16 kHz+ sample rate if possible
2. **Text Input**:
   - Use proper punctuation for natural pauses
   - Break very long texts into paragraphs
   - Avoid excessive special characters
3. **Output**:
   - Generate shorter clips for better quality
   - Concatenate multiple clips if needed (see the sketch after this list)
   - Post-process with audio editing software for polish
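Here is a hedged sketch of the concatenation step using `numpy` and `soundfile` (both listed dependencies). The `part{i}.wav` file names are hypothetical placeholders for clips you have already generated.

```python
# Join several generated clips with a short silence gap between them.
import numpy as np
import soundfile as sf

SAMPLE_RATE = 16000  # matches the pipeline's output rate (see the spec table)

# Hypothetical file names -- substitute your own generated clips.
clips = [sf.read(f"outputs/part{i}.wav")[0] for i in range(3)]
gap = np.zeros(int(0.3 * SAMPLE_RATE))  # 300 ms pause between clips

joined = np.concatenate([np.concatenate([clip, gap]) for clip in clips])
sf.write("outputs/combined.wav", joined, SAMPLE_RATE)
```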
## Known Limitations

- Works best with English text
- Requires good-quality reference audio
- May not perfectly capture very distinctive voice characteristics
- Background noise in the reference affects output quality
- Very short reference audio (<3 seconds) may produce inconsistent results

## Future Improvements

- [ ] Add a GUI interface
- [ ] Support for multiple languages
- [ ] Real-time streaming mode
- [ ] Voice mixing/morphing capabilities
- [ ] Fine-tuning on custom datasets
- [ ] Mobile app version
## Credits

This implementation is based on:

- **SV2TTS**: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
- **Tacotron 2**: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- **WaveRNN**: Efficient Neural Audio Synthesis

Original research papers:

- [SV2TTS Paper](https://arxiv.org/abs/1806.04558)
- [Tacotron 2 Paper](https://arxiv.org/abs/1712.05884)
- [WaveRNN Paper](https://arxiv.org/abs/1802.08435)
## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Show Your Support

If this project helped you, please give it a star!

## Contact

For questions or support, please open an issue on GitHub.

---

**Made with love by the Voice Cloning Community**

*Last Updated: October 30, 2025*