---
base_model:
- Qwen/Qwen3-4B-Instruct-2507
library_name: transformers
pipeline_tag: text-generation
tags:
- audio
- speech
- audio-codec
- neural-audio-codec
- spoken-language-modeling
- codec-superb
- qwen3
datasets:
- librispeech_asr
metrics:
- perplexity
- pesq
- stoi
---
# LLM-Codec
LLM-Codec is a neural audio codec checkpoint trained to produce discrete audio
tokens that reconstruct faithfully and are easier for autoregressive language
models to predict.
Model: https://huggingface.co/voidful/llm-codec
Code: https://github.com/voidful/llm-codec
Usage reference: https://github.com/voidful/Codec-SUPERB
## Model Description
Most neural audio codecs are trained for waveform reconstruction. Spoken
language models, however, consume codec tokens with a next-token prediction
objective. This mismatch can make acoustically valid variation appear as token
uncertainty to the language model.
LLM-Codec adapts a codec with language-model-facing objectives while keeping the
deployed codec interface unchanged. The model is trained with:
- Future Token Prediction (FTP): Medusa-style heads predict future audio tokens
from frozen-LLM hidden states.
- Semantic Alignment (SA): audio-induced hidden states are aligned with paired
text hidden states inside a frozen LLM.
- Differentiable Gumbel bridge: hard Gumbel-Softmax keeps the forward-pass
tokens discrete while letting gradients flow back to the codec encoder (see
the sketch below).
- Reconstruction losses: mel, multi-scale mel, multi-resolution STFT, complex
STFT, VQ, GAN, and feature matching losses.
The deployed codec does not require the auxiliary FTP heads.
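The Gumbel bridge above is standard straight-through Gumbel-Softmax. A minimal
PyTorch sketch of the mechanism, where the shapes and the downstream loss are
stand-ins rather than this repository's actual module code:
```python
import torch
import torch.nn.functional as F

# Stand-in shapes: 8 encoder frames, a 20,480-entry codebook, 256-dim embeddings.
logits = torch.randn(8, 20480, requires_grad=True)  # per-frame encoder logits
codebook = torch.randn(20480, 256)                  # token embedding table

# hard=True emits one-hot tokens in the forward pass, while the backward pass
# uses the soft Gumbel-Softmax gradient (a straight-through estimator).
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
token_ids = one_hot.argmax(dim=-1)  # discrete ids, e.g. indices of <CODEC_*>
frame_emb = one_hot @ codebook      # differentiable embedding lookup

frame_emb.pow(2).mean().backward()  # stand-in downstream loss
print(token_ids.shape, logits.grad.shape)  # gradients reached the encoder side
```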
## Intended Use
This model is intended for research and development in:
- audio tokenization for spoken language modeling
- codec reconstruction experiments
- token-level speech LM training
- Codec-SUPERB style codec evaluation
- speech token analysis and ablation studies
It is not a full text-to-speech system by itself. For speech generation, use the
codec as the tokenizer/decoder inside a separate speech language modeling
pipeline.
## Out-of-Scope Use
Do not use this model for:
- impersonation or unauthorized voice cloning
- surveillance or speaker tracking without consent
- high-stakes speaker, language, or identity decisions
- generating deceptive audio content
## Installation
The easiest inference path is through the Codec-SUPERB `SoundCodec` interface.
```bash
git clone https://github.com/voidful/Codec-SUPERB.git
cd Codec-SUPERB
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```
If your environment supports editable installs, installing from the repository
root is also convenient:
```bash
pip install -e .
```
## Quick Start
Load LLM-Codec through the Codec-SUPERB codec registry:
```python
from SoundCodec import codec
print(codec.list_codec())
model = codec.load_codec("llmcodec")
```
Encode and reconstruct one audio file:
```python
from SoundCodec import codec
import torchaudio
import soundfile as sf

model = codec.load_codec("llmcodec")

# Load a waveform and wrap its first channel in the Codec-SUPERB data format.
waveform, sample_rate = torchaudio.load("sample_audio.wav")
data_item = {
    "audio": {
        "array": waveform.numpy()[0],
        "sampling_rate": sample_rate,
    }
}

# Discrete codec units for the file.
units = model.extract_unit(data_item).unit
print("Unit shape:", units.shape)

# Round-trip reconstruction through the codec.
result = model.synth(data_item, local_save=False)
reconstructed = result["audio"]["array"]
reconstructed_sr = result["audio"].get("sampling_rate", sample_rate)
sf.write("reconstructed.wav", reconstructed, reconstructed_sr)
```
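Note that `torchaudio.load` returns audio at the file's native sample rate. If
that differs from the rate the codec expects, resample before building
`data_item`; the target rate below is an assumption, so check the codec's
configuration:
```python
import torchaudio.functional as AF

target_sr = 16000  # hypothetical target rate; verify what the codec expects
if sample_rate != target_sr:
    waveform = AF.resample(waveform, orig_freq=sample_rate, new_freq=target_sr)
    sample_rate = target_sr
```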
## Batch Usage
Codec-SUPERB also provides batch APIs:
```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

# Build one data item per file in the Codec-SUPERB format.
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
data_list = []
for path in audio_files:
    waveform, sample_rate = torchaudio.load(path)
    data_list.append({
        "id": path,
        "audio": {
            "array": waveform.numpy()[0],
            "sampling_rate": sample_rate,
        },
    })

# Batched unit extraction, decoding, and full synthesis.
batch_units = model.batch_extract_unit(data_list)
batch_audio = model.batch_decode_unit(batch_units)
results = model.batch_synth(data_list, local_save=False)
for item in results:
    print(item["unit"].shape, item["audio"]["array"].shape)
```
For better throughput, group audio samples with similar lengths before batching.
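As a rough sketch of that tip (a hypothetical helper, not part of the
Codec-SUPERB API), sorting by sample count before chunking keeps per-batch
padding small:
```python
def length_bucketed_batches(data_list, batch_size=8):
    """Yield batches whose items have similar audio lengths."""
    # Sorting by sample count keeps the padding within each batch small.
    ordered = sorted(data_list, key=lambda d: len(d["audio"]["array"]))
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

for batch in length_bucketed_batches(data_list):
    batch_units = model.batch_extract_unit(batch)
```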
## Codec-SUPERB Evaluation
To evaluate LLM-Codec with Codec-SUPERB-tiny:
```bash
PYTHONPATH=. python3 scripts/dataset_creator.py \
    --dataset voidful/codec-superb-tiny
PYTHONPATH=. python3 scripts/benchmarking.py \
    --dataset datasets/voidful/codec-superb-tiny_synth \
    --models llmcodec
```
## Model Files
The model repository provides:
- codec weights as `llm-codec.pt`
- a tokenizer extended with `<CODEC_*>` audio tokens
- Qwen-compatible model artifacts containing trained audio-token embeddings
The codec uses 20,480 audio tokens with the canonical token format:
```text
<CODEC_0>, <CODEC_1>, ..., <CODEC_20479>
```
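To feed these units to a text LM, map unit ids onto the `<CODEC_*>` strings and
tokenize. A minimal sketch with standard `transformers` calls, assuming the
extended tokenizer published in this repository loads via `AutoTokenizer` and
that the example unit ids are arbitrary:
```python
from transformers import AutoTokenizer

# Assumes the tokenizer in this repo already carries the <CODEC_*> tokens.
tokenizer = AutoTokenizer.from_pretrained("voidful/llm-codec")

unit_ids = [17, 4096, 20479]  # example codec unit ids
audio_text = "".join(f"<CODEC_{u}>" for u in unit_ids)
token_ids = tokenizer(audio_text, add_special_tokens=False)["input_ids"]
print(token_ids)  # one id per unit if the audio tokens are in the vocabulary
```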
## Training Data
The codec was trained on LibriSpeech `train-clean-100` with paired transcripts.
The validation split used during training is LibriSpeech `validation`.
Because training is speech-centric and transcript-supervised, performance may be
weaker on non-English speech, conversational speech, music, environmental audio,
or audio with strong noise and overlap.
## Training Procedure
Base components:
- Base codec: AUV
- Frozen LLM backbone: Qwen3-4B-Instruct-2507
- Token rate: 50 Hz
- Audio vocabulary size: 20,480
- Segment length: 4 seconds
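At a 50 Hz token rate, each 4-second training segment therefore corresponds to
4 × 50 = 200 audio tokens, drawn from the 20,480-entry audio vocabulary.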
Losses:
- reconstruction mel loss
- multi-scale mel loss
- multi-resolution STFT loss
- complex STFT loss with phase term
- VQ commitment loss
- Gumbel bridge cross entropy
- Future Token Prediction loss
- Semantic Alignment cosine loss
- Semantic Alignment contrastive loss with memory bank
- MPD/MSD GAN and feature matching losses
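For illustration only, multi-term objectives like this are typically reduced to
a weighted sum; the term names and weights below are hypothetical placeholders,
not the published training configuration:
```python
import torch

# Hypothetical weights; the actual training configuration may differ.
LOSS_WEIGHTS = {"mel": 1.0, "stft": 1.0, "vq": 0.25, "ftp": 1.0, "sa": 1.0}

def total_loss(losses: dict[str, torch.Tensor]) -> torch.Tensor:
    """Weighted sum over the individual loss terms."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())
```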
## Evaluation Results
### Token Learnability
SALMon speech coherence accuracy after token-level LM training:
| Tokenizer | Overall accuracy (%) |
| --- | ---: |
| WavTok-L | 48.3 |
| BigCodec | 49.4 |
| UniCodec | 50.1 |
| AUV | 49.4 |
| LLM-Codec | 61.6 |
Token-level perplexity on LibriSpeech after 3 epochs of LM training:
| Tokenizer | Eval loss | Perplexity |
| --- | ---: | ---: |
| WavTok-L | 11.91 | 148,122 |
| UniCodec | 11.92 | 150,197 |
| BigCodec | 11.96 | 156,448 |
| AUV | 11.98 | 159,768 |
| LLM-Codec | 8.44 | 4,617 |
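The perplexity column is consistent with exp(eval loss): exp(8.44) ≈ 4.6 × 10³
for LLM-Codec, versus exp(11.91) ≈ 1.5 × 10⁵ for WavTok-L.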
### Reconstruction Quality
Codec-SUPERB-tiny speech reconstruction:
| Model | Mel (↓) | STFT (↓) | PESQ (↑) | STOI (↑) |
| --- | ---: | ---: | ---: | ---: |
| AUV base | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.724 | 1.599 | 2.102 | 0.859 |
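For scale, PESQ is typically reported on a roughly 1.0–4.5 scale and STOI on a
0–1 scale; LLM-Codec matches or slightly improves on the AUV base across all
four metrics.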
## Limitations
- The semantic alignment objective depends on paired speech and text.
- The model is primarily validated on read speech.
- Downstream generation quality depends on the separate speech language model.
- The model may preserve speaker identity information present in the input.
- The Hugging Face `transformers` artifacts are not a standalone text chatbot;
they accompany the codec/tokenizer workflow.
## Citation
```bibtex
@article{chung2026llm,
title={LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
author={Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
journal={arXiv preprint arXiv:2604.17852},
note={Model and code available at https://github.com/voidful/llm-codec},
year={2026}
}
```
If you use the Codec-SUPERB interface or benchmark, please also cite
Codec-SUPERB:
```bibtex
@inproceedings{wu-etal-2024-codec,
title = {Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
author = {Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander and Lee, Hung-yi},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
year = {2024},
url = {https://aclanthology.org/2024.findings-acl.616},
doi = {10.18653/v1/2024.findings-acl.616},
pages = {10330--10348}
}
```