---
base_model:
- Qwen/Qwen3-4B-Instruct-2507
library_name: transformers
pipeline_tag: text-generation
tags:
- audio
- speech
- audio-codec
- neural-audio-codec
- spoken-language-modeling
- codec-superb
- qwen3
datasets:
- librispeech_asr
metrics:
- perplexity
- pesq
- stoi
---

# LLM-Codec


LLM-Codec is a neural audio codec checkpoint trained to produce discrete audio
tokens that are both reconstructable and easier for autoregressive language
models to predict.


Model: https://huggingface.co/voidful/llm-codec


Code: https://github.com/voidful/llm-codec


Usage reference: https://github.com/voidful/Codec-SUPERB


## Model Description


Most neural audio codecs are trained for waveform reconstruction. Spoken
language models, however, consume codec tokens with a next-token prediction
objective. This mismatch can make acoustically valid variation appear as token
uncertainty to the language model.


LLM-Codec adapts a codec with language-model-facing objectives while keeping the
deployed codec interface unchanged. The model is trained with:


- Future Token Prediction (FTP): Medusa-style heads predict future audio tokens
  from frozen-LLM hidden states.
- Semantic Alignment (SA): audio-induced hidden states are aligned with paired
  text hidden states inside a frozen LLM.
- Differentiable Gumbel bridge: hard Gumbel-Softmax keeps the forward tokens
  discrete while letting gradients flow back to the codec encoder (sketched
  below).
- Reconstruction losses: mel, multi-scale mel, multi-resolution STFT, complex
  STFT, VQ, GAN, and feature matching losses.


The deployed codec does not require the auxiliary FTP heads.
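

As a rough illustration of the Gumbel bridge, the straight-through trick keeps
hard one-hot tokens in the forward pass while the backward pass uses the soft
Gumbel-Softmax relaxation. This is a minimal sketch rather than the training
code; `logits`, `codebook`, and the shapes are illustrative.


```python
import torch
import torch.nn.functional as F


def gumbel_bridge(logits, codebook, tau=1.0):
    """Straight-through Gumbel-Softmax over codec token logits (sketch).

    logits: (batch, frames, vocab) scores over audio tokens.
    codebook: (vocab, dim) audio-token embedding table.
    """
    # hard=True yields one-hot samples in the forward pass; the backward
    # pass uses the soft relaxation, so gradients reach the encoder
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    token_ids = one_hot.argmax(dim=-1)  # discrete tokens seen by the LM
    embeddings = one_hot @ codebook     # differentiable path to the encoder
    return token_ids, embeddings


logits = torch.randn(2, 200, 20480, requires_grad=True)
codebook = torch.randn(20480, 512)
ids, emb = gumbel_bridge(logits, codebook)
emb.sum().backward()  # gradients flow to `logits`
```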


## Intended Use


This model is intended for research and development in:


- audio tokenization for spoken language modeling
- codec reconstruction experiments
- token-level speech LM training
- Codec-SUPERB style codec evaluation
- speech token analysis and ablation studies


It is not a full text-to-speech system by itself. For speech generation, use the
codec as the tokenizer/decoder inside a separate speech language modeling
pipeline.
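

A minimal sketch of that division of labor: the codec tokenizes and decodes,
while a separate token-level LM produces the continuation. `speech_lm_generate`
below is a hypothetical stand-in, and the single-item `decode_unit` call assumes
a counterpart to the `batch_decode_unit` API shown later in this card.


```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

waveform, sr = torchaudio.load("prompt.wav")
prompt_item = {"audio": {"array": waveform.numpy()[0], "sampling_rate": sr}}

# 1) tokenize prompt audio into discrete units
units = model.extract_unit(prompt_item).unit


# 2) hypothetical stand-in for a separately trained speech LM;
#    here it just echoes the prompt instead of continuing it
def speech_lm_generate(unit_seq):
    return unit_seq


continuation = speech_lm_generate(units)

# 3) decode tokens back to audio (assumes a single-item `decode_unit`
#    counterpart to `batch_decode_unit`)
audio_out = model.decode_unit(continuation)
```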


## Out-of-Scope Use


Do not use this model for:


- impersonation or unauthorized voice cloning
- surveillance or speaker tracking without consent
- high-stakes speaker, language, or identity decisions
- generating deceptive audio content


## Installation


The easiest inference path is through the Codec-SUPERB `SoundCodec` interface.


```bash
git clone https://github.com/voidful/Codec-SUPERB.git
cd Codec-SUPERB
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```


If your environment supports editable installs, this also works and removes the
need for the `PYTHONPATH` export:


```bash
pip install -e .
```


## Quick Start


Load LLM-Codec through the Codec-SUPERB codec registry:


```python
from SoundCodec import codec

print(codec.list_codec())
model = codec.load_codec("llmcodec")
```


Encode and reconstruct one audio file:


```python
from SoundCodec import codec
import torchaudio
import soundfile as sf

model = codec.load_codec("llmcodec")

# load audio and keep the first channel (mono input)
waveform, sample_rate = torchaudio.load("sample_audio.wav")
data_item = {
    "audio": {
        "array": waveform.numpy()[0],
        "sampling_rate": sample_rate,
    }
}

# discrete codec tokens for the clip
units = model.extract_unit(data_item).unit
print("Unit shape:", units.shape)

# full encode/decode round trip
result = model.synth(data_item, local_save=False)
reconstructed = result["audio"]["array"]
reconstructed_sr = result["audio"].get("sampling_rate", sample_rate)

sf.write("reconstructed.wav", reconstructed, reconstructed_sr)
```


## Batch Usage


Codec-SUPERB also provides batch APIs:


```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
data_list = []

for path in audio_files:
    waveform, sample_rate = torchaudio.load(path)
    data_list.append({
        "id": path,
        "audio": {
            "array": waveform.numpy()[0],
            "sampling_rate": sample_rate,
        },
    })

# tokenize and detokenize as separate steps
batch_units = model.batch_extract_unit(data_list)
batch_audio = model.batch_decode_unit(batch_units)

# or run the full round trip in one call
results = model.batch_synth(data_list, local_save=False)
for item in results:
    print(item["unit"].shape, item["audio"]["array"].shape)
```


For better throughput, group audio samples with similar lengths before batching.
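

One simple way to do that grouping, reusing `model` and `data_list` from the
example above: sort by clip length, then slice fixed-size batches so each batch
holds similar durations and padding overhead stays small.


```python
# sort by sample count so each batch holds similar-length clips
data_list.sort(key=lambda item: len(item["audio"]["array"]))

batch_size = 8
for start in range(0, len(data_list), batch_size):
    batch = data_list[start:start + batch_size]
    batch_units = model.batch_extract_unit(batch)
```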


## Codec-SUPERB Evaluation


To evaluate LLM-Codec with Codec-SUPERB-tiny:


```bash
PYTHONPATH=. python3 scripts/dataset_creator.py \
  --dataset voidful/codec-superb-tiny

PYTHONPATH=. python3 scripts/benchmarking.py \
  --dataset datasets/voidful/codec-superb-tiny_synth \
  --models llmcodec
```


## Model Files


The model repository provides:


- codec weights as `llm-codec.pt`
- a tokenizer extended with `<CODEC_*>` audio tokens
- Qwen-compatible model artifacts containing trained audio-token embeddings


The codec uses 20,480 audio tokens with the canonical token format:


```text
<CODEC_0>, <CODEC_1>, ..., <CODEC_20479>
```
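

The mapping between codec units and text tokens is direct: unit id `i` becomes
`<CODEC_i>`. A minimal round-trip sketch, assuming the extended tokenizer in
this repo loads through `AutoTokenizer`:


```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("voidful/llm-codec")

unit_ids = [17, 4093, 20479]                       # example codec unit IDs
audio_tokens = [f"<CODEC_{i}>" for i in unit_ids]  # unit id i -> <CODEC_i>
token_ids = tokenizer.convert_tokens_to_ids(audio_tokens)
print(audio_tokens, token_ids)

# round-trip LM token ids back to codec unit IDs
recovered = [
    int(tok[len("<CODEC_"):-1])
    for tok in tokenizer.convert_ids_to_tokens(token_ids)
]
assert recovered == unit_ids
```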


## Training Data


The codec was trained on LibriSpeech `train-clean-100` with paired transcripts;
validation during training used the LibriSpeech `validation` split.


Because training is speech-centric and transcript-supervised, performance may be
weaker on non-English speech, conversational speech, music, environmental audio,
or audio with strong noise and overlap.


## Training Procedure


Base components:


- Base codec: AUV
- Frozen LLM backbone: Qwen3-4B-Instruct-2507
- Token rate: 50 Hz
- Audio vocabulary size: 20,480
- Segment length: 4 seconds (200 token frames at 50 Hz)


Losses:


- reconstruction mel loss
- multi-scale mel loss
- multi-resolution STFT loss
- complex STFT loss with phase term
- VQ commitment loss
- Gumbel bridge cross entropy
- Future Token Prediction loss
- Semantic Alignment cosine loss
- Semantic Alignment contrastive loss with memory bank
- MPD/MSD GAN and feature matching losses
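

These terms enter training as a weighted sum. The grouping and the lambda
weights below are illustrative placeholders, not the values actually used:


```latex
\mathcal{L} =
    \lambda_{\text{rec}} \, \mathcal{L}_{\text{rec}}
  + \lambda_{\text{vq}} \, \mathcal{L}_{\text{vq}}
  + \lambda_{\text{gum}} \, \mathcal{L}_{\text{gumbel}}
  + \lambda_{\text{ftp}} \, \mathcal{L}_{\text{FTP}}
  + \lambda_{\text{sa}} \, \mathcal{L}_{\text{SA}}
  + \lambda_{\text{adv}} \, \mathcal{L}_{\text{adv}}
```


Here `\mathcal{L}_{\text{rec}}` stands for the combined mel and STFT terms and
`\mathcal{L}_{\text{adv}}` for the GAN and feature-matching terms.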


## Evaluation Results


### Token Learnability


SALMon speech coherence accuracy after token-level LM training:


| Tokenizer | Overall accuracy (%) |
| --- | ---: |
| WavTok-L | 48.3 |
| BigCodec | 49.4 |
| UniCodec | 50.1 |
| AUV | 49.4 |
| LLM-Codec | 61.6 |


Token-level perplexity on LibriSpeech after 3 epochs of LM training:


| Tokenizer | Eval loss | Perplexity |
| --- | ---: | ---: |
| WavTok-L | 11.91 | 148,122 |
| UniCodec | 11.92 | 150,197 |
| BigCodec | 11.96 | 156,448 |
| AUV | 11.98 | 159,768 |
| LLM-Codec | 8.44 | 4,617 |
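

Perplexity here is `exp(eval loss)`, so the two columns cross-check directly;
the small gaps come from the loss being rounded to two decimals:


```python
import math

# perplexity = exp(eval loss); compare with the table values
print(round(math.exp(8.44)))   # ~4629   vs. 4,617 for LLM-Codec
print(round(math.exp(11.91)))  # ~148745 vs. 148,122 for WavTok-L
```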


### Reconstruction Quality


Codec-SUPERB-tiny speech reconstruction (↓ lower is better, ↑ higher is better):


| Model | Mel ↓ | STFT ↓ | PESQ ↑ | STOI ↑ |
| --- | ---: | ---: | ---: | ---: |
| AUV base | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.724 | 1.599 | 2.102 | 0.859 |


## Limitations


- The semantic alignment objective depends on paired speech and text.
- The model is primarily validated on read speech.
- Downstream generation quality depends on the separate speech language model.
- The model may preserve speaker identity information present in the input.
- The Hugging Face `transformers` artifacts are not a standalone text chatbot;
  they accompany the codec/tokenizer workflow.


## Citation


```bibtex
@article{chung2026llm,
  title={LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
  author={Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2604.17852},
  note={Model and code available at https://github.com/voidful/llm-codec},
  year={2026}
}
```


If you use the Codec-SUPERB interface or benchmark, please also cite
Codec-SUPERB:


```bibtex
@inproceedings{wu-etal-2024-codec,
  title = {Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
  author = {Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander and Lee, Hung-yi},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
  year = {2024},
  url = {https://aclanthology.org/2024.findings-acl.616},
  doi = {10.18653/v1/2024.findings-acl.616},
  pages = {10330--10348}
}
```