PP-VAE — Post-Processing VAE for ACE-Step Music Generation

A GGUF conversion of the AutoencoderOobleck VAE from Tencent AI Lab's SongGeneration (LeVo) project, repurposed as a post-processing audio re-encoder for the ACE-Step 1.5 music generation pipeline.

TL;DR: Run your generated audio through this VAE's encode→decode round-trip to get spectral cleanup, reduced artifacts, and improved tonal coherence — like a neural audio polish pass.

What Does PP-VAE Do?

PP-VAE (Post-Processing VAE) performs a full encode→decode cycle on already-generated audio:

Generated WAV → PP-VAE Encoder → Latent Space → PP-VAE Decoder → Cleaned WAV

Because this VAE was trained on high-quality music (1.32M steps by Tencent AI Lab), the round-trip acts as a learned spectral filter — the encoder projects audio into a compressed latent representation, and the decoder reconstructs it, implicitly smoothing synthesis artifacts and tightening spectral coherence.

Why Use a Second VAE?

ACE-Step already has its own VAE for the primary generation pipeline. The PP-VAE is a different model from a different training regime (SongGeneration/LeVo), and using it as a post-processing step provides:

Artifact reduction — smooths out spectral irregularities from the primary VAE decode
Tonal coherence — the LeVo VAE was trained on a large, diverse music corpus, so its latent space provides a natural "spectral prior" that can regularise generated audio
Configurable blending — a wet/dry mix control lets you dial in exactly how much PP-VAE processing to apply (0% = fully processed, 100% = fully original)
RMS gain matching — the output is automatically rescaled so its overall RMS level matches the input's (a single gain factor; the dynamic envelope itself is not reshaped)

What It Won't Do

PP-VAE is not a magical quality enhancer. It's a lossy round-trip through a neural codec, so:

Ultra-high-frequency content (>16kHz) may be slightly attenuated
Very short transients may be subtly softened
The blend control exists specifically so you can balance these trade-offs
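
One way to check the high-frequency trade-off on your own material is to compare the energy above 16 kHz before and after the round-trip. The sketch below is not part of PP-VAE: `band_energy_db` and `hf_loss_db` are hypothetical helper names, the signals are mono NumPy arrays, and the "processed" signal is a synthetic stand-in rather than real PP-VAE output.

```python
import numpy as np

def band_energy_db(signal, sample_rate, f_lo, f_hi):
    """Return the energy (in dB) of a mono signal between f_lo and f_hi Hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    energy = np.sum(np.abs(spectrum[mask]) ** 2)
    return 10.0 * np.log10(energy + 1e-12)

def hf_loss_db(original, processed, sample_rate=48_000, cutoff_hz=16_000):
    """Positive values mean the processed signal lost energy above the cutoff."""
    nyquist = sample_rate / 2
    return (band_energy_db(original, sample_rate, cutoff_hz, nyquist)
            - band_energy_db(processed, sample_rate, cutoff_hz, nyquist))

# Synthetic demo: a 1 kHz tone plus an 18 kHz tone, with the 18 kHz part halved
t = np.arange(48_000) / 48_000
low = np.sin(2 * np.pi * 1_000 * t)
high = np.sin(2 * np.pi * 18_000 * t)
original = low + high
processed = low + 0.5 * high  # stand-in for a PP-VAE output, NOT real output
loss = hf_loss_db(original, processed)  # about 6 dB, since amplitude is halved
```

For stereo files, the same measurement can be applied to each channel separately. If the loss is larger than you like, raising the blend value brings back more of the original signal.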

Files

File	Format	Precision	Size	Notes
`pp-vae-F32.gguf`	GGUF	F32 (full)	644 MB	Best quality — recommended for post-processing
`pp-vae-BF16.gguf`	GGUF	BF16	322 MB	Good balance of quality and size
`pp-vae-F16.gguf`	GGUF	F16	322 MB	Alternative half-precision

Recommended: Use pp-vae-F32.gguf for post-processing. Since PP-VAE runs once per song (not iteratively like DiT), the extra precision is worth the memory cost.
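
The file sizes can be sanity-checked against the parameter count. This is a rough sketch that assumes the sizes in the table are mebibytes (2^20 bytes), that the ~168.7M parameter figure is accurate, and that GGUF metadata overhead is negligible; none of these is stated explicitly.

```python
PARAMS = 168.7e6  # approximate parameter count (encoder + decoder)
BYTES_PER_PARAM = {"F32": 4, "BF16": 2, "F16": 2}

def estimated_size_mib(precision, params=PARAMS):
    """Estimate tensor payload size in MiB, ignoring GGUF metadata overhead."""
    return params * BYTES_PER_PARAM[precision] / 2**20

sizes = {p: round(estimated_size_mib(p)) for p in BYTES_PER_PARAM}
print(sizes)  # {'F32': 644, 'BF16': 322, 'F16': 322}
```

The same arithmetic gives a lower bound on the memory needed just to hold the weights; activations during tiled encode and decode come on top of that.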

Usage

acestep.cpp / HOT-Step CPP

Place the GGUF file in your models directory:

models/
├── acestep-v15-turbo-BF16.gguf    # DiT
├── acestep-5Hz-lm-BF16.gguf       # LM
├── Qwen3-Embedding-BF16.gguf      # Text encoder
├── vae-BF16.gguf                  # Primary VAE (for generation)
└── pp-vae-F32.gguf                # ← PP-VAE (for post-processing)

The engine auto-detects PP-VAE models by their GGUF architecture tag (pp-vae). In HOT-Step CPP, enable PP-VAE Re-encode in the Post-Processing panel and adjust the blend slider.
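
To confirm which architecture tag a file carries, you can read the GGUF header directly. The sketch below assumes the tag is stored under the standard `general.architecture` metadata key and that the file uses the little-endian GGUF v2/v3 layout; the engine's actual detection code is not shown here and may differ. `read_architecture` is a hypothetical helper name.

```python
import struct

GGUF_STRING, GGUF_ARRAY = 8, 9
# Byte widths of the fixed-size GGUF metadata value types
FIXED_WIDTH = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}

def _read_string(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")

def _skip_value(f, vtype):
    """Advance past one metadata value of the given type."""
    if vtype == GGUF_STRING:
        _read_string(f)
    elif vtype == GGUF_ARRAY:
        elem_type, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            _skip_value(f, elem_type)
    else:
        f.seek(FIXED_WIDTH[vtype], 1)

def read_architecture(f):
    """Return the general.architecture string from an open GGUF file, or None."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version, _n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    for _ in range(n_kv):
        key = _read_string(f)
        (vtype,) = struct.unpack("<I", f.read(4))
        if key == "general.architecture" and vtype == GGUF_STRING:
            return _read_string(f)
        _skip_value(f, vtype)
    return None
```

Opening `models/pp-vae-F32.gguf` in binary mode and passing the file object to `read_architecture` should, under the assumptions above, return `pp-vae`.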

Standalone API

The C++ engine exposes a synchronous endpoint:

# Full PP-VAE processing
curl -X POST http://localhost:8085/pp-vae-reencode \
  --data-binary @input.wav \
  -H "Content-Type: audio/wav" \
  -o output.wav

# With 30% original blend
curl -X POST "http://localhost:8085/pp-vae-reencode?blend=0.3" \
  --data-binary @input.wav \
  -H "Content-Type: audio/wav" \
  -o output.wav

Blend parameter: 0.0 = fully PP-VAE processed, 1.0 = fully original audio.
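
The same calls can be made from Python with only the standard library. This is a minimal sketch mirroring the curl commands above: `build_reencode_request` and `reencode` are hypothetical helper names, the 0.0 to 1.0 range check and the timeout are assumptions, and the server's error responses are not handled.

```python
import urllib.parse
import urllib.request

def build_reencode_request(wav_bytes, blend=None, base_url="http://localhost:8085"):
    """Build (but do not send) a POST request for the /pp-vae-reencode endpoint."""
    url = f"{base_url}/pp-vae-reencode"
    if blend is not None:
        if not 0.0 <= blend <= 1.0:  # assumed valid range
            raise ValueError("blend must be between 0.0 and 1.0")
        url += "?" + urllib.parse.urlencode({"blend": blend})
    return urllib.request.Request(
        url,
        data=wav_bytes,
        method="POST",
        headers={"Content-Type": "audio/wav"},
    )

def reencode(wav_bytes, blend=None, timeout=600):
    """Send the request and return the processed WAV bytes (blocks until done)."""
    request = build_reencode_request(wav_bytes, blend)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Because the endpoint is synchronous, `reencode` blocks until the whole file has been processed, so a generous timeout is a sensible default for long songs.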

Python (PyTorch)

If you want to use the original checkpoint directly in Python:

import torch
from reencode import OobleckEncoder, OobleckDecoder, load_wav, save_wav

# Download original checkpoint from:
# https://huggingface.co/tencent/SongGeneration/resolve/main/ckpt/vae/autoencoder_music_1320k.ckpt

# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust
ckpt = torch.load("autoencoder_music_1320k.ckpt", map_location="cpu", weights_only=False)
sd = ckpt["state_dict"]

# Build encoder (input: 2ch audio → 128-dim latent, split to 64-dim mean)
encoder = OobleckEncoder(in_channels=2, channels=128, latent_dim=128,
                         c_mults=[1,2,4,8,16], strides=[2,4,4,6,10])
enc_sd = {k.replace("encoder.", ""): v for k, v in sd.items() if k.startswith("encoder.")}
encoder.load_state_dict(enc_sd)

# Build decoder (input: 64-dim latent → 2ch audio)
decoder = OobleckDecoder(out_channels=2, channels=128, latent_dim=64,
                         c_mults=[1,2,4,8,16], strides=[2,4,4,6,10])
dec_sd = {k.replace("decoder.", ""): v for k, v in sd.items() if k.startswith("decoder.")}
decoder.load_state_dict(dec_sd)

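# Re-encode (the models must be on the same device and dtype as the input)
encoder = encoder.half().cuda().eval()
decoder = decoder.half().cuda().eval()
audio = load_wav("input.wav")  # [1, 2, T] @ 48kHz
with torch.no_grad():
    latent = encoder(audio.half().cuda())
    mean, _ = latent.chunk(2, dim=1)  # Use mean only (no sampling)
    output = decoder(mean).float().cpu()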

# RMS gain match and save
input_rms = audio.pow(2).mean().sqrt()
output_rms = output.pow(2).mean().sqrt()
output = output * (input_rms / output_rms)
save_wav("output.wav", output)

Architecture

PP-VAE uses the AutoencoderOobleck architecture — the same family used by ACE-Step and Stable Audio.

Parameter	Value
Architecture	AutoencoderOobleck
Source checkpoint	`autoencoder_music_1320k.ckpt`
Training steps	1,320,000
Audio channels	2 (stereo)
Sample rate	48,000 Hz
Encoder latent dim	128 (split → 64 mean + 64 logvar)
Decoder latent dim	64
Base channels	128
Channel multipliers	[1, 2, 4, 8, 16]
Downsampling ratios	[2, 4, 4, 6, 10]
Total compression ratio	1920×
Activation	Snake (α, β in log-space)
Weight normalisation	Yes (weight_v + weight_g parametrisation)
Parameters	~168.7M (encoder + decoder)
GGUF architecture tag	`pp-vae`
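
The compression ratio follows directly from the downsampling ratios, and it fixes how audio length maps to latent frames. The sketch below works through that arithmetic; rounding up to a whole frame is an assumption, since the engine's padding behaviour is not described here.

```python
from math import prod

SAMPLE_RATE = 48_000
STRIDES = [2, 4, 4, 6, 10]

hop = prod(STRIDES)              # 1920 samples per latent frame
latent_rate = SAMPLE_RATE / hop  # 25.0 latent frames per second

def latent_frames(num_samples):
    """Latent frames needed to cover num_samples (ceil division, assumed)."""
    return -(-num_samples // hop)

def frames_to_seconds(frames):
    """Duration of audio covered by a number of latent frames."""
    return frames * hop / SAMPLE_RATE

three_minutes = latent_frames(180 * SAMPLE_RATE)  # 4500 frames
tile_seconds = frames_to_seconds(256)             # 10.24 s per 256-frame tile
overlap_seconds = frames_to_seconds(64)           # 2.56 s per 64-frame overlap
```

So a three-minute stereo track becomes 4,500 latent frames of 64 channels each, and the 256-frame tiles used by the tiled processing each cover roughly ten seconds of audio.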

Network Structure

Encoder:
  Conv1d(2→128, k=7) → 5× [3×ResUnit + Snake + StridedConv] → Snake → Conv1d(2048→128, k=3)
  Strides: [2, 4, 4, 6, 10] → total 1920× downsampling

Decoder:
  Conv1d(64→2048, k=7) → 5× [Snake + ConvTranspose1d + 3×ResUnit] → Snake → Conv1d(128→2, k=7)
  Strides: [10, 6, 4, 4, 2] → total 1920× upsampling

ResUnit:
  Snake → DilatedConv1d(k=7) → Snake → Conv1d(k=1) → + residual
  Dilations per block: [1, 3, 9]
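
For reference, the Snake activation with log-space parameters can be sketched in NumPy as follows. This assumes the SnakeBeta formulation used by stable-audio-tools, x + sin²(αx) / β with α and β stored as logarithms; it is an illustration, not code taken from the engine or the checkpoint.

```python
import numpy as np

def snake_beta(x, log_alpha, log_beta, eps=1e-9):
    """Snake activation with per-channel log-space parameters.

    x:         array of shape [batch, channels, time]
    log_alpha: array of shape [1, channels, 1] (frequency, stored as log)
    log_beta:  array of shape [1, channels, 1] (magnitude, stored as log)
    """
    alpha = np.exp(log_alpha)
    beta = np.exp(log_beta)
    # Broadcasting applies each channel's alpha/beta across batch and time
    return x + np.sin(alpha * x) ** 2 / (beta + eps)

# With log-parameters of zero, alpha = beta = 1 and the result is x + sin(x)^2
x = np.full((1, 2, 4), np.pi / 2)
zeros = np.zeros((1, 2, 1))
y = snake_beta(x, zeros, zeros)
```

Storing the parameters in log-space keeps α and β strictly positive after exponentiation, which avoids a division by zero or a sign flip in the periodic term.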

Processing Pipeline

The PP-VAE re-encode pipeline applies these steps:

Encode — Run input audio through the encoder with tiled processing (chunk=256, overlap=64 latent frames) to manage GPU memory
Latent extraction — Take the mean of the encoder output (discard logvar — deterministic re-encoding, no sampling)
Decode — Run latents through the decoder with tiled processing
RMS gain match — Scale output to match input RMS level, capped at input peak to prevent clipping
Blend — Mix original and processed audio according to the blend parameter
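
The last two steps can be sketched in NumPy as below. This is an interpretation rather than the engine's code: it reads "capped at input peak" as limiting the gain so the processed peak never exceeds the input peak, assumes a linear blend, and trims both signals to a common length. `finish` is a hypothetical helper name.

```python
import numpy as np

def finish(original, processed, blend=0.0, eps=1e-12):
    """Gain-match processed audio to the original, cap the peak, then blend."""
    n = min(original.shape[-1], processed.shape[-1])  # align lengths (assumed)
    original, processed = original[..., :n], processed[..., :n]

    # RMS gain match: one scalar gain for the whole signal
    rms_in = np.sqrt(np.mean(original ** 2))
    rms_out = np.sqrt(np.mean(processed ** 2))
    gain = rms_in / max(rms_out, eps)

    # Cap the gain so the processed peak never exceeds the input peak
    peak_in = np.max(np.abs(original))
    peak_out = np.max(np.abs(processed))
    gain = min(gain, peak_in / max(peak_out, eps))

    matched = processed * gain
    # blend = 0.0 -> fully processed, blend = 1.0 -> fully original
    return blend * original + (1.0 - blend) * matched
```

Applying the cap before blending means the mixed output cannot exceed the input peak for any blend value between 0.0 and 1.0, since it is a weighted average of two signals that both respect that limit.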

Compatibility

✅ acestep.cpp — GGUF inference engine
✅ HOT-Step CPP — Full music generation UI with PP-VAE toggle
✅ HOT-Step 9000 — Python pipeline (via original .ckpt)
✅ Any AutoencoderOobleck-compatible pipeline (PyTorch/Diffusers)
✅ Works with all ACE-Step DiT checkpoints (standard, turbo, SFT, XL)
✅ Works alongside all LoRA/LoKR adapters

License

This model is derived from Tencent AI Lab's SongGeneration project and is subject to the SongGeneration License.

The license restricts usage to academic, research, and education purposes only. Commercial or production use is explicitly prohibited.

The GGUF conversion is a format transformation of the original autoencoder_music_1320k.ckpt weights — no retraining or weight modification was performed. All original license terms apply to these converted files.

GGUF Conversion

The GGUF files were produced by a custom converter that:

Loads the PyTorch Lightning checkpoint (autoencoder_music_1320k.ckpt)
Maps LeVo's encoder.layers.* / decoder.layers.* key naming to the acestep.cpp GGUF tensor naming convention
Reshapes Snake activation parameters from [C] to [1, C, 1]
Exports both encoder and decoder in a single GGUF file with architecture tag pp-vae
Supports F32, BF16, and F16 precision variants
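
Two of the tensor manipulations involved can be illustrated in NumPy. The Snake reshape matches the step listed above. The weight-norm fold is shown only for reference: the checkpoint uses the weight_v + weight_g parametrisation, but whether the converter folds the pair into one kernel or stores both tensors is not stated, so treat that part as an assumption. Both function names are hypothetical.

```python
import numpy as np

def reshape_snake_param(param):
    """Reshape a per-channel Snake parameter from [C] to [1, C, 1]."""
    return param.reshape(1, -1, 1)

def fold_weight_norm(weight_g, weight_v, eps=1e-12):
    """Rebuild a conv kernel from its weight-norm pair: w = g * v / ||v||.

    weight_v: [dim0, dim1, kernel]
    weight_g: [dim0, 1, 1]
    The norm is taken over every axis except the first, which is the
    default (dim=0) behaviour of torch.nn.utils.weight_norm.
    """
    norm = np.sqrt(np.sum(weight_v ** 2, axis=(1, 2), keepdims=True))
    return weight_g * weight_v / (norm + eps)

# Small demo with random values (not real checkpoint tensors)
rng = np.random.default_rng(0)
v = rng.standard_normal((4, 3, 7))
g = np.arange(1.0, 5.0).reshape(4, 1, 1)
w = fold_weight_norm(g, v)
alpha = reshape_snake_param(np.zeros(8))
```

After folding, each slice of `w` along the first axis has a norm equal to the corresponding entry of `g`, which is a quick way to verify a conversion.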

Citation

If you use PP-VAE in your work, please cite the original SongGeneration project:

@article{levo2025,
  title={SongGeneration: A Song Generation System with Lyrics and Accompaniment},
  author={Tencent AI Lab},
  year={2025},
  url={https://github.com/tencent-ailab/SongGeneration}
}

Acknowledgements

Tencent AI Lab / SongGeneration — Original VAE model and training
ACE-Step — Music generation pipeline
acestep.cpp — C++ inference engine with GGUF support
stable-audio-tools — AutoencoderOobleck architecture
HOT-Step — Integration and GGUF conversion
