PP-VAE: Post-Processing VAE for ACE-Step Music Generation
A GGUF conversion of the AutoencoderOobleck VAE from Tencent AI Lab's SongGeneration (LeVo) project, repurposed as a post-processing audio re-encoder for the ACE-Step 1.5 music generation pipeline.
TL;DR: Run your generated audio through this VAE's encode→decode round-trip to get spectral cleanup, reduced artifacts, and improved tonal coherence, like a neural audio polish pass.
What Does PP-VAE Do?
PP-VAE (Post-Processing VAE) performs a full encode→decode cycle on already-generated audio:
Generated WAV → PP-VAE Encoder → Latent Space → PP-VAE Decoder → Cleaned WAV
Because this VAE was trained on high-quality music (1.32M steps by Tencent AI Lab), the round-trip acts as a learned spectral filter: the encoder projects audio into a compressed latent representation, and the decoder reconstructs it, implicitly smoothing synthesis artifacts and tightening spectral coherence.
Why Use a Second VAE?
ACE-Step already has its own VAE for the primary generation pipeline. The PP-VAE is a different model from a different training regime (SongGeneration/LeVo), and using it as a post-processing step provides:
- Artifact reduction: smooths out spectral irregularities from the primary VAE decode
- Tonal coherence: the LeVo VAE was trained on a large, diverse music corpus, so its latent space provides a natural "spectral prior" that can regularise generated audio
- Configurable blending: a wet/dry mix control lets you dial in exactly how much PP-VAE processing to apply (0% = fully processed, 100% = fully original)
- RMS gain matching: automatic loudness normalisation scales the output so its overall RMS level matches the input's
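The blend control can be read as a crossfade between the original and the re-encoded signal. The following is a minimal sketch, assuming a plain per-sample linear mix; the engine's actual mixing code is not shown in this card, and `blend_audio` is a hypothetical helper name.

```python
def blend_audio(original, processed, blend):
    """Linearly mix two equal-length sample sequences.

    blend = 0.0 returns the processed signal and blend = 1.0 the original,
    matching the convention stated in this card. The linear mix itself is
    an assumption made for illustration.
    """
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must be in [0, 1]")
    if len(original) != len(processed):
        raise ValueError("signals must have the same length")
    return [blend * o + (1.0 - blend) * p for o, p in zip(original, processed)]


original = [0.5, -0.5, 0.25]
processed = [0.4, -0.6, 0.2]
print(blend_audio(original, processed, 0.0))  # same values as `processed`
print(blend_audio(original, processed, 1.0))  # same values as `original`
```

Intermediate values such as 0.3 keep 30% of the original signal, which is one way to retain some transient detail while still applying most of the cleanup.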
What It Won't Do
PP-VAE is not a magical quality enhancer. It's a lossy round-trip through a neural codec, so:
- Ultra-high-frequency content (>16 kHz) may be slightly attenuated
- Very short transients may be subtly softened
- The blend control exists specifically so you can balance these trade-offs
Files
| File | Format | Precision | Size | Notes |
|---|---|---|---|---|
| pp-vae-F32.gguf | GGUF | F32 (full) | 644 MB | Best quality; recommended for post-processing |
| pp-vae-BF16.gguf | GGUF | BF16 | 322 MB | Good balance of quality and size |
| pp-vae-F16.gguf | GGUF | F16 | 322 MB | Alternative half-precision |
Recommended: Use pp-vae-F32.gguf for post-processing. Since PP-VAE runs once per song (not iteratively like DiT), the extra precision is worth the memory cost.
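The listed file sizes follow roughly from the parameter count multiplied by the bytes per weight. A quick check, assuming the ~168.7M parameter figure quoted in the architecture table, sizes expressed in binary megabytes (MiB), and no allowance for GGUF metadata or alignment overhead:

```python
PARAMS = 168.7e6  # approximate parameter count quoted in this card

BYTES_PER_WEIGHT = {"F32": 4, "BF16": 2, "F16": 2}


def approx_size_mib(precision):
    """Estimate tensor payload size in MiB, ignoring metadata and alignment."""
    return PARAMS * BYTES_PER_WEIGHT[precision] / (1024 ** 2)


for name in BYTES_PER_WEIGHT:
    print(f"{name}: ~{approx_size_mib(name):.0f} MiB")
```

Under these assumptions the estimates land on about 644 and 322, in line with the table, so the half-precision files save roughly 320 MB of memory.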
Usage
acestep.cpp / HOT-Step CPP
Place the GGUF file in your models directory:
models/
├── acestep-v15-turbo-BF16.gguf # DiT
├── acestep-5Hz-lm-BF16.gguf # LM
├── Qwen3-Embedding-BF16.gguf # Text encoder
├── vae-BF16.gguf # Primary VAE (for generation)
└── pp-vae-F32.gguf # ← PP-VAE (for post-processing)
The engine auto-detects PP-VAE models by their GGUF architecture tag (pp-vae). In HOT-Step CPP, enable PP-VAE Re-encode in the Post-Processing panel and adjust the blend slider.
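Before pointing the engine at a downloaded file, it can help to confirm that the file is at least a structurally valid GGUF container. The following is a minimal sketch, assuming the little-endian GGUF v2+ header layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count). `read_gguf_header` is a hypothetical helper, and it does not parse the metadata, so it does not verify the pp-vae architecture tag.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes the little-endian GGUF v2+ header: magic b"GGUF", uint32 version,
    uint64 tensor count, uint64 metadata key/value count (24 bytes in total).
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", header[4:])
    return version, n_tensors, n_kv
```

For example, `read_gguf_header("models/pp-vae-F32.gguf")` raises `ValueError` on a truncated or mislabelled download and otherwise returns the three header fields.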
Standalone API
The C++ engine exposes a synchronous endpoint:
# Full PP-VAE processing
curl -X POST http://localhost:8085/pp-vae-reencode \
--data-binary @input.wav \
-H "Content-Type: audio/wav" \
-o output.wav
# With 30% original blend
curl -X POST "http://localhost:8085/pp-vae-reencode?blend=0.3" \
--data-binary @input.wav \
-H "Content-Type: audio/wav" \
-o output.wav
Blend parameter: 0.0 = fully PP-VAE processed, 1.0 = fully original audio.
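The same endpoint can be called from Python without extra dependencies. The following is a sketch using only the standard library, assuming the engine is running on localhost:8085 as in the curl examples; the helper names `build_reencode_request` and `reencode_file` are hypothetical and not part of the engine.

```python
import urllib.parse
import urllib.request


def build_reencode_request(wav_bytes, blend=None, base_url="http://localhost:8085"):
    """Build a POST request for the /pp-vae-reencode endpoint."""
    url = f"{base_url}/pp-vae-reencode"
    if blend is not None:
        # blend: 0.0 = fully PP-VAE processed, 1.0 = fully original audio
        url += "?" + urllib.parse.urlencode({"blend": blend})
    return urllib.request.Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


def reencode_file(in_path, out_path, blend=None):
    """Send a WAV file to a running engine and write the response body."""
    with open(in_path, "rb") as f:
        request = build_reencode_request(f.read(), blend)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

Calling `reencode_file("input.wav", "output.wav", blend=0.3)` mirrors the second curl command. Because the endpoint is synchronous, the call blocks until processing finishes.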
Python (PyTorch)
If you want to use the original checkpoint directly in Python:
import torch
# `reencode` must be importable; it provides the Oobleck modules and WAV helpers used below
from reencode import OobleckEncoder, OobleckDecoder, load_wav, save_wav
# Download original checkpoint from:
# https://huggingface.co/tencent/SongGeneration/resolve/main/ckpt/vae/autoencoder_music_1320k.ckpt
# weights_only=False unpickles arbitrary objects, so only load a checkpoint you trust
ckpt = torch.load("autoencoder_music_1320k.ckpt", map_location="cpu", weights_only=False)
sd = ckpt["state_dict"]
# Models and input must share one device and dtype (float32 here)
device = "cuda" if torch.cuda.is_available() else "cpu"
# Build encoder (input: 2ch audio → 128-dim output, split into 64-dim mean + 64-dim logvar)
encoder = OobleckEncoder(in_channels=2, channels=128, latent_dim=128,
                         c_mults=[1,2,4,8,16], strides=[2,4,4,6,10])
# Strip only the leading "encoder." prefix from each key
enc_sd = {k[len("encoder."):]: v for k, v in sd.items() if k.startswith("encoder.")}
encoder.load_state_dict(enc_sd)
encoder.to(device).eval()
# Build decoder (input: 64-dim latent → 2ch audio)
decoder = OobleckDecoder(out_channels=2, channels=128, latent_dim=64,
                         c_mults=[1,2,4,8,16], strides=[2,4,4,6,10])
dec_sd = {k[len("decoder."):]: v for k, v in sd.items() if k.startswith("decoder.")}
decoder.load_state_dict(dec_sd)
decoder.to(device).eval()
# Re-encode
audio = load_wav("input.wav")  # [1, 2, T] @ 48kHz
with torch.no_grad():
    latent = encoder(audio.to(device))
    mean, _ = latent.chunk(2, dim=1)  # Use mean only (no sampling)
    output = decoder(mean).float().cpu()
# RMS gain match and save (clamp avoids division by zero on silent output)
input_rms = audio.pow(2).mean().sqrt()
output_rms = output.pow(2).mean().sqrt()
output = output * (input_rms / output_rms.clamp_min(1e-8))
save_wav("output.wav", output)
Architecture
PP-VAE uses the AutoencoderOobleck architecture, the same family used by ACE-Step and Stable Audio.
| Parameter | Value |
|---|---|
| Architecture | AutoencoderOobleck |
| Source checkpoint | autoencoder_music_1320k.ckpt |
| Training steps | 1,320,000 |
| Audio channels | 2 (stereo) |
| Sample rate | 48,000 Hz |
| Encoder latent dim | 128 (split → 64 mean + 64 logvar) |
| Decoder latent dim | 64 |
| Base channels | 128 |
| Channel multipliers | [1, 2, 4, 8, 16] |
| Downsampling ratios | [2, 4, 4, 6, 10] |
| Total compression ratio | 1920× |
| Activation | Snake (α, β in log-space) |
| Weight normalisation | Yes (weight_v + weight_g parametrisation) |
| Parameters | ~168.7M (encoder + decoder) |
| GGUF architecture tag | pp-vae |
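Several values in the table can be derived from one another. The short check below uses only numbers from the table; `latent_frames` is a hypothetical helper that ignores any edge padding the real encoder may apply.

```python
import math

SAMPLE_RATE = 48_000
STRIDES = [2, 4, 4, 6, 10]
BASE_CHANNELS = 128
CHANNEL_MULTS = [1, 2, 4, 8, 16]

# Product of the strides gives the total compression ratio.
compression = math.prod(STRIDES)  # 1920
# Latent frames produced per second of audio.
latent_rate_hz = SAMPLE_RATE / compression  # 25.0
# Channel width of each of the five stages.
stage_channels = [BASE_CHANNELS * m for m in CHANNEL_MULTS]  # [128, 256, 512, 1024, 2048]


def latent_frames(duration_s):
    """Approximate latent frame count for a clip, ignoring edge padding."""
    return int(duration_s * SAMPLE_RATE) // compression


print(compression, latent_rate_hz, stage_channels)
print(latent_frames(180))  # a 3-minute song gives 4500 frames
```

The widest stage has 2048 channels, which matches the 2048-channel input of the encoder's final convolution and the 2048-channel output of the decoder's first.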
Network Structure
Encoder:
Conv1d(2→128, k=7) → 5× [3×ResUnit + Snake + StridedConv] → Snake → Conv1d(2048→128, k=3)
Strides: [2, 4, 4, 6, 10] → total 1920× downsampling
Decoder:
Conv1d(64→2048, k=7) → 5× [Snake + ConvTranspose1d + 3×ResUnit] → Snake → Conv1d(128→2, k=7)
Strides: [10, 6, 4, 4, 2] → total 1920× upsampling
ResUnit:
Snake → DilatedConv1d(k=7) → Snake → Conv1d(k=1) → + residual
Dilations per block: [1, 3, 9]
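The Snake activation that appears throughout the network can be sketched for a single sample. This assumes the form used in common AutoencoderOobleck implementations, x + sin²(αx)/β, with both parameters stored as logarithms. This card does not confirm the exact formula used by the checkpoint's code, and the real layers apply it per channel with learned parameters.

```python
import math


def snake(x, log_alpha, log_beta, eps=1e-9):
    """Snake activation for one sample: x + sin^2(alpha * x) / beta.

    alpha and beta are stored in log-space, so they are exponentiated first.
    eps guards against division by zero (an assumed, conventional safeguard).
    """
    alpha = math.exp(log_alpha)
    beta = math.exp(log_beta)
    return x + (math.sin(alpha * x) ** 2) / (beta + eps)


# With log_alpha = log_beta = 0, alpha = beta = 1.
print(snake(0.0, 0.0, 0.0))          # 0.0
print(snake(math.pi / 2, 0.0, 0.0))  # about pi/2 + 1
```

Storing the parameters in log-space keeps α and β positive after exponentiation, whatever values the stored tensors take.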
Processing Pipeline
The PP-VAE re-encode pipeline applies these steps:
- Encode: Run input audio through the encoder with tiled processing (chunk=256, overlap=64 latent frames) to manage GPU memory
- Latent extraction: Take the mean half of the encoder output and discard logvar, giving deterministic re-encoding with no sampling
- Decode: Run latents through the decoder with tiled processing
- RMS gain match: Scale output to match input RMS level, capped at input peak to prevent clipping
- Blend: Mix original and processed audio according to the blend parameter
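The tiling parameters in the first step can be made concrete by listing the windows they would produce. The following is a sketch under one stated assumption: consecutive windows share `overlap` latent frames, so the hop is chunk minus overlap. How the engine actually merges overlapping regions is not described in this card, and `plan_tiles` is a hypothetical helper.

```python
COMPRESSION = 1920   # audio samples per latent frame
SAMPLE_RATE = 48_000


def plan_tiles(n_frames, chunk=256, overlap=64):
    """Return (start, end) latent-frame windows covering n_frames.

    Assumption: consecutive windows share `overlap` frames, so the hop is
    chunk - overlap. Merging of the overlapping regions is not modelled.
    """
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be in [0, chunk)")
    hop = chunk - overlap
    tiles = []
    start = 0
    while True:
        end = min(start + chunk, n_frames)
        tiles.append((start, end))
        if end == n_frames:
            break
        start += hop
    return tiles


tiles = plan_tiles(4500)  # roughly a 3-minute song at 25 latent frames/s
print(len(tiles), tiles[:2], tiles[-1])
print(256 * COMPRESSION / SAMPLE_RATE)  # seconds of audio per full chunk
```

With these defaults each full chunk covers 10.24 seconds of audio, so peak memory depends on the chunk size instead of the song length.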
Compatibility
- ✅ acestep.cpp: GGUF inference engine
- ✅ HOT-Step CPP: Full music generation UI with PP-VAE toggle
- ✅ HOT-Step 9000: Python pipeline (via original .ckpt)
- ✅ Any AutoencoderOobleck-compatible pipeline (PyTorch/Diffusers)
- ✅ Works with all ACE-Step DiT checkpoints (standard, turbo, SFT, XL)
- ✅ Works alongside all LoRA/LoKR adapters
License
This model is derived from Tencent AI Lab's SongGeneration project and is subject to the SongGeneration License.
The license restricts usage to academic, research, and education purposes only. Commercial or production use is explicitly prohibited.
The GGUF conversion is a format transformation of the original autoencoder_music_1320k.ckpt weights; no retraining or weight modification was performed. All original license terms apply to these converted files.
Copyright (C) 2025 Tencent. All rights reserved.
GGUF Conversion
The GGUF files were produced by a custom converter that:
- Loads the PyTorch Lightning checkpoint (autoencoder_music_1320k.ckpt)
- Maps LeVo's encoder.layers.*/decoder.layers.* key naming to the acestep.cpp GGUF tensor naming convention
- Reshapes Snake activation parameters from [C] to [1, C, 1]
- Exports both encoder and decoder in a single GGUF file with architecture tag pp-vae
- Supports F32, BF16, and F16 precision variants
Citation
If you use PP-VAE in your work, please cite the original SongGeneration project:
@article{levo2025,
title={SongGeneration: A Song Generation System with Lyrics and Accompaniment},
author={Tencent AI Lab},
year={2025},
url={https://github.com/tencent-ailab/SongGeneration}
}
Acknowledgements
- Tencent AI Lab / SongGeneration: Original VAE model and training
- ACE-Step: Music generation pipeline
- acestep.cpp: C++ inference engine with GGUF support
- stable-audio-tools: AutoencoderOobleck architecture
- HOT-Step: Integration and GGUF conversion