PP-VAE: Post-Processing VAE for ACE-Step Music Generation

A GGUF conversion of the AutoencoderOobleck VAE from Tencent AI Lab's SongGeneration (LeVo) project, repurposed as a post-processing audio re-encoder for the ACE-Step 1.5 music generation pipeline.

TL;DR: Run your generated audio through this VAE's encode→decode round-trip to get spectral cleanup, reduced artifacts, and improved tonal coherence — like a neural audio polish pass.

What Does PP-VAE Do?

PP-VAE (Post-Processing VAE) performs a full encode→decode cycle on already-generated audio:

Generated WAV → PP-VAE Encoder → Latent Space → PP-VAE Decoder → Cleaned WAV

Because this VAE was trained on high-quality music (1.32M steps by Tencent AI Lab), the round-trip acts as a learned spectral filter: the encoder projects audio into a compressed latent representation, and the decoder reconstructs it, implicitly smoothing synthesis artifacts and tightening spectral coherence.

Why Use a Second VAE?

ACE-Step already has its own VAE for the primary generation pipeline. The PP-VAE is a different model from a different training regime (SongGeneration/LeVo), and using it as a post-processing step provides:

  • Artifact reduction: smooths out spectral irregularities from the primary VAE decode
  • Tonal coherence: the LeVo VAE was trained on a large, diverse music corpus, so its latent space provides a natural "spectral prior" that can regularise generated audio
  • Configurable blending: a wet/dry mix control lets you dial in exactly how much PP-VAE processing to apply (0% = fully processed, 100% = fully original)
  • RMS gain matching: automatic level matching scales the output to the input's RMS level, so the processed audio is not louder or quieter than the original
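
The blend and gain-matching behaviour described above can be sketched in NumPy. This is an illustration only: the function names `rms_match` and `blend` are hypothetical and are not taken from acestep.cpp, and the toy signal stands in for real generated audio. The blend convention (0.0 = fully processed, 1.0 = fully original) follows this README.

```python
import numpy as np

def rms_match(processed, original, eps=1e-12):
    """Scale `processed` so its overall RMS equals that of `original`."""
    rms_in = np.sqrt(np.mean(original ** 2))
    rms_out = np.sqrt(np.mean(processed ** 2))
    return processed * (rms_in / max(rms_out, eps))

def blend(original, processed, amount):
    """amount = 0.0 -> fully processed, 1.0 -> fully original."""
    amount = float(np.clip(amount, 0.0, 1.0))
    return amount * original + (1.0 - amount) * processed

# Toy stereo signal standing in for generated audio (1 second at 48 kHz)
t = np.linspace(0.0, 1.0, 48000, endpoint=False)
original = np.stack([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t)])
processed = 0.5 * original            # stand-in for a quieter round-trip output

matched = rms_match(processed, original)
mixed = blend(original, matched, 0.3)  # keep 30% of the original signal
```

Matching the level before blending means the blend amount changes only the character of the sound, not its loudness.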

What It Won't Do

PP-VAE is not a magical quality enhancer. It's a lossy round-trip through a neural codec, so:

  • Ultra-high-frequency content (above 16 kHz) may be slightly attenuated
  • Very short transients may be subtly softened
  • The blend control exists specifically so you can balance these trade-offs

Files

File               Format   Precision    Size     Notes
pp-vae-F32.gguf    GGUF     F32 (full)   644 MB   Best quality; recommended for post-processing
pp-vae-BF16.gguf   GGUF     BF16         322 MB   Good balance of quality and size
pp-vae-F16.gguf    GGUF     F16          322 MB   Alternative half-precision

Recommended: Use pp-vae-F32.gguf for post-processing. Since PP-VAE runs once per song (not iteratively like DiT), the extra precision is worth the memory cost.
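
As a rough sanity check, the file sizes follow from the parameter count in the Architecture table (~168.7M) multiplied by the bytes per weight. The sketch below assumes the sizes above are reported in MiB and that the file is dominated by tensor data; GGUF metadata overhead is ignored.

```python
PARAMS = 168.7e6  # approximate parameter count (encoder + decoder)
BYTES_PER_PARAM = {"F32": 4, "BF16": 2, "F16": 2}

def size_mib(precision):
    """Approximate tensor payload in MiB (ignores GGUF metadata)."""
    return PARAMS * BYTES_PER_PARAM[precision] / 2**20

for name in BYTES_PER_PARAM:
    print(f"{name}: ~{size_mib(name):.0f} MiB")
# F32: ~644 MiB
# BF16: ~322 MiB
# F16: ~322 MiB
```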

Usage

acestep.cpp / HOT-Step CPP

Place the GGUF file in your models directory:

models/
├── acestep-v15-turbo-BF16.gguf    # DiT
├── acestep-5Hz-lm-BF16.gguf       # LM
├── Qwen3-Embedding-BF16.gguf      # Text encoder
├── vae-BF16.gguf                  # Primary VAE (for generation)
└── pp-vae-F32.gguf                # ← PP-VAE (for post-processing)

The engine auto-detects PP-VAE models by their GGUF architecture tag (pp-vae). In HOT-Step CPP, enable PP-VAE Re-encode in the Post-Processing panel and adjust the blend slider.

Standalone API

The C++ engine exposes a synchronous endpoint:

# Full PP-VAE processing
curl -X POST http://localhost:8085/pp-vae-reencode \
  --data-binary @input.wav \
  -H "Content-Type: audio/wav" \
  -o output.wav

# With 30% original blend
curl -X POST "http://localhost:8085/pp-vae-reencode?blend=0.3" \
  --data-binary @input.wav \
  -H "Content-Type: audio/wav" \
  -o output.wav

Blend parameter: 0.0 = fully PP-VAE processed, 1.0 = fully original audio.
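
The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl calls above; the helper names `build_reencode_request` and `reencode_file` are hypothetical, and the range check on `blend` is a client-side convenience, not documented server behaviour.

```python
import urllib.parse
import urllib.request

def build_reencode_request(wav_bytes, blend=None, base_url="http://localhost:8085"):
    """Build a POST request for /pp-vae-reencode; `blend` is optional."""
    url = base_url.rstrip("/") + "/pp-vae-reencode"
    if blend is not None:
        if not 0.0 <= blend <= 1.0:
            raise ValueError("blend must be between 0.0 and 1.0")
        url += "?" + urllib.parse.urlencode({"blend": blend})
    return urllib.request.Request(
        url, data=wav_bytes, headers={"Content-Type": "audio/wav"}, method="POST"
    )

def reencode_file(src_path, dst_path, blend=None):
    """Send a WAV file to a running engine and write the response to disk."""
    with open(src_path, "rb") as f:
        request = build_reencode_request(f.read(), blend)
    with urllib.request.urlopen(request) as response, open(dst_path, "wb") as out:
        out.write(response.read())
```

With the engine running locally, `reencode_file("input.wav", "output.wav", blend=0.3)` corresponds to the second curl example.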

Python (PyTorch)

If you want to use the original checkpoint directly in Python:

import torch
from reencode import OobleckEncoder, OobleckDecoder, load_wav, save_wav

# Download original checkpoint from:
# https://huggingface.co/tencent/SongGeneration/resolve/main/ckpt/vae/autoencoder_music_1320k.ckpt

ckpt = torch.load("autoencoder_music_1320k.ckpt", map_location="cpu", weights_only=False)
sd = ckpt["state_dict"]

# Build encoder (input: 2ch audio → 128-dim latent, split to 64-dim mean)
encoder = OobleckEncoder(in_channels=2, channels=128, latent_dim=128,
                         c_mults=[1,2,4,8,16], strides=[2,4,4,6,10])
enc_sd = {k.replace("encoder.", ""): v for k, v in sd.items() if k.startswith("encoder.")}
encoder.load_state_dict(enc_sd)

# Build decoder (input: 64-dim latent → 2ch audio)
decoder = OobleckDecoder(out_channels=2, channels=128, latent_dim=64,
                         c_mults=[1,2,4,8,16], strides=[2,4,4,6,10])
dec_sd = {k.replace("decoder.", ""): v for k, v in sd.items() if k.startswith("decoder.")}
decoder.load_state_dict(dec_sd)

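# Move both models to the GPU in half precision so they match the inputs below
encoder = encoder.half().cuda().eval()
decoder = decoder.half().cuda().eval()

# Re-encode
audio = load_wav("input.wav")  # [1, 2, T] @ 48kHz
with torch.no_grad():
    latent = encoder(audio.half().cuda())
    mean, _ = latent.chunk(2, dim=1)  # Use mean only (no sampling)
    output = decoder(mean).float().cpu()

# Trim to a common length in case the round-trip changes the sample count
n = min(audio.shape[-1], output.shape[-1])
audio, output = audio[..., :n], output[..., :n]

# RMS gain match (guarding against a silent output) and save
input_rms = audio.pow(2).mean().sqrt()
output_rms = output.pow(2).mean().sqrt()
output = output * (input_rms / output_rms.clamp_min(1e-8))
save_wav("output.wav", output)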

Architecture

PP-VAE uses the AutoencoderOobleck architecture, the same family used by ACE-Step and Stable Audio.

Parameter                 Value
Architecture              AutoencoderOobleck
Source checkpoint         autoencoder_music_1320k.ckpt
Training steps            1,320,000
Audio channels            2 (stereo)
Sample rate               48,000 Hz
Encoder latent dim        128 (split → 64 mean + 64 logvar)
Decoder latent dim        64
Base channels             128
Channel multipliers       [1, 2, 4, 8, 16]
Downsampling ratios       [2, 4, 4, 6, 10]
Total compression ratio   1920×
Activation                Snake (α, β in log-space)
Weight normalisation      Yes (weight_v + weight_g parametrisation)
Parameters                ~168.7M (encoder + decoder)
GGUF architecture tag     pp-vae
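
Several of these values follow from one another. The short calculation below derives the 1920× ratio, the resulting latent frame rate, and the per-stage channel widths from the strides and multipliers. The `latent_frames` helper is illustrative and ignores any padding the engine may apply at the edges.

```python
from math import prod

SAMPLE_RATE = 48_000
STRIDES = [2, 4, 4, 6, 10]
BASE_CHANNELS = 128
CHANNEL_MULTS = [1, 2, 4, 8, 16]

hop = prod(STRIDES)                     # audio samples per latent frame
latent_rate = SAMPLE_RATE / hop         # latent frames per second
widths = [BASE_CHANNELS * m for m in CHANNEL_MULTS]

def latent_frames(seconds):
    """Number of whole latent frames covering `seconds` of audio."""
    return int(seconds * SAMPLE_RATE) // hop

print(hop, latent_rate, widths, latent_frames(180))
# 1920 25.0 [128, 256, 512, 1024, 2048] 4500
```

A three-minute song therefore maps to about 4500 latent frames of 64 values each.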

Network Structure

Encoder:
  Conv1d(2→128, k=7) → 5× [3×ResUnit + Snake + StridedConv] → Snake → Conv1d(2048→128, k=3)
  Strides: [2, 4, 4, 6, 10] → total 1920× downsampling

Decoder:
  Conv1d(64→2048, k=7) → 5× [Snake + ConvTranspose1d + 3×ResUnit] → Snake → Conv1d(128→2, k=7)
  Strides: [10, 6, 4, 4, 2] → total 1920× upsampling

ResUnit:
  Snake → DilatedConv1d(k=7) → Snake → Conv1d(k=1) → + residual
  Dilations per block: [1, 3, 9]
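
For readers unfamiliar with the Snake activation, here is a NumPy sketch of the per-channel form commonly used in Oobleck-style autoencoders, with α and β stored in log-space. It is an assumption that acestep.cpp computes exactly this expression; the function name `snake_beta` and the `eps` guard are illustrative.

```python
import numpy as np

def snake_beta(x, log_alpha, log_beta, eps=1e-9):
    """x: [B, C, T]; log_alpha, log_beta: [1, C, 1], stored in log-space."""
    alpha = np.exp(log_alpha)   # per-channel frequency of the periodic term
    beta = np.exp(log_beta)     # per-channel scale of the periodic term
    return x + (1.0 / (beta + eps)) * np.sin(alpha * x) ** 2

x = np.random.default_rng(0).standard_normal((1, 4, 16))
zeros = np.zeros((1, 4, 1))     # log-space zeros give alpha = beta = 1
y = snake_beta(x, zeros, zeros)
```

Storing the parameters in log-space keeps α and β positive after exponentiation, whatever values the optimiser reaches.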

Processing Pipeline

The PP-VAE re-encode pipeline applies these steps:

  1. Encode: run input audio through the encoder with tiled processing (chunk=256, overlap=64 latent frames) to manage GPU memory
  2. Latent extraction: take the mean of the encoder output and discard logvar (deterministic re-encoding, no sampling)
  3. Decode: run latents through the decoder with tiled processing
  4. RMS gain match: scale output to match input RMS level, capped at input peak to prevent clipping
  5. Blend: mix original and processed audio according to the blend parameter
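
The tiling in steps 1 and 3 can be illustrated with a small NumPy sketch that processes overlapping windows and crossfades the seams. How acestep.cpp actually stitches tiles is not described here, so the linear crossfade is an assumption, and `fn` is a length-preserving stand-in for the encoder or decoder (the real decoder expands each latent frame to 1920 samples).

```python
import numpy as np

def tiled_apply(x, fn, chunk=256, overlap=64):
    """Apply `fn` to overlapping windows of x[..., T] and crossfade the seams."""
    if not 0 < overlap < chunk:
        raise ValueError("require 0 < overlap < chunk")
    total = x.shape[-1]
    step = chunk - overlap
    out = np.zeros_like(x, dtype=np.float64)
    weight = np.zeros(total)
    ramp = np.linspace(0.0, 1.0, overlap + 2)[1:-1]   # strictly positive fade
    for start in range(0, total, step):
        end = min(start + chunk, total)
        n = end - start
        w = np.ones(n)
        if start > 0:                  # fade in over the left overlap
            k = min(overlap, n)
            w[:k] *= ramp[:k]
        if end < total:                # fade out over the right overlap
            w[-overlap:] *= ramp[::-1]
        out[..., start:end] += fn(x[..., start:end]) * w
        weight[start:end] += w
        if end == total:
            break
    return out / weight                # normalise by the summed window weights

latents = np.random.default_rng(0).standard_normal((64, 1000))
roundtrip = tiled_apply(latents, lambda tile: tile)   # identity stand-in
```

Only one window is processed at a time, so peak memory depends on the chunk size, not on the song length.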

Compatibility

  • ✅ acestep.cpp: GGUF inference engine
  • ✅ HOT-Step CPP: full music generation UI with PP-VAE toggle
  • ✅ HOT-Step 9000: Python pipeline (via original .ckpt)
  • ✅ Any AutoencoderOobleck-compatible pipeline (PyTorch/Diffusers)
  • ✅ Works with all ACE-Step DiT checkpoints (standard, turbo, SFT, XL)
  • ✅ Works alongside all LoRA/LoKR adapters

License

This model is derived from Tencent AI Lab's SongGeneration project and is subject to the SongGeneration License.

The license restricts usage to academic, research, and education purposes only. Commercial or production use is explicitly prohibited.

The GGUF conversion is a format transformation of the original autoencoder_music_1320k.ckpt weights; no retraining or weight modification was performed. All original license terms apply to these converted files.

Copyright (C) 2025 Tencent. All rights reserved.

GGUF Conversion

The GGUF files were produced by a custom converter that:

  1. Loads the PyTorch Lightning checkpoint (autoencoder_music_1320k.ckpt)
  2. Maps LeVo's encoder.layers.* / decoder.layers.* key naming to the acestep.cpp GGUF tensor naming convention
  3. Reshapes Snake activation parameters from [C] to [1, C, 1]
  4. Exports both encoder and decoder in a single GGUF file with architecture tag pp-vae
  5. Supports F32, BF16, and F16 precision variants
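
Step 3 is a plain reshape that makes each per-channel Snake parameter broadcast against [batch, channels, time] activations. The sketch below shows the idea in NumPy; the helper name `reshape_snake_param` and the 128-channel example are hypothetical, and the converter's actual code is not reproduced here.

```python
import numpy as np

def reshape_snake_param(param):
    """Turn a per-channel vector [C] into a broadcastable [1, C, 1] tensor."""
    param = np.asarray(param)
    if param.ndim != 1:
        raise ValueError(f"expected shape [C], got {param.shape}")
    return param.reshape(1, -1, 1)

alpha = np.zeros(128, dtype=np.float32)                 # hypothetical 128-channel parameter
alpha_3d = reshape_snake_param(alpha)
activations = np.ones((1, 128, 10), dtype=np.float32)   # [batch, channels, time]
scaled = activations * np.exp(alpha_3d)                 # broadcasts over batch and time
```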

Citation

If you use PP-VAE in your work, please cite the original SongGeneration project:

@article{levo2025,
  title={SongGeneration: A Song Generation System with Lyrics and Accompaniment},
  author={Tencent AI Lab},
  year={2025},
  url={https://github.com/tencent-ailab/SongGeneration}
}

Acknowledgements
