---
base_model:
- Qwen/Qwen3-4B-Instruct-2507
library_name: transformers
pipeline_tag: text-generation
tags:
- audio
- speech
- audio-codec
- neural-audio-codec
- spoken-language-modeling
- codec-superb
- qwen3
datasets:
- librispeech_asr
metrics:
- perplexity
- pesq
- stoi
---

# LLM-Codec


LLM-Codec is a neural audio codec checkpoint trained to produce discrete audio
tokens that are both reconstructable and easier for autoregressive language
models to predict.


Model: https://huggingface.co/voidful/llm-codec


Code: https://github.com/voidful/llm-codec


Usage reference: https://github.com/voidful/Codec-SUPERB


## Model Description


Most neural audio codecs are trained for waveform reconstruction. Spoken
language models, however, consume codec tokens with a next-token prediction
objective. This mismatch can make acoustically valid variation appear as token
uncertainty to the language model.


LLM-Codec adapts a codec with language-model-facing objectives while keeping the
deployed codec interface unchanged. The model is trained with:


- Future Token Prediction (FTP): Medusa-style heads predict future audio tokens
  from frozen-LLM hidden states.
- Semantic Alignment (SA): audio-induced hidden states are aligned with paired
  text hidden states inside a frozen LLM.
- Differentiable Gumbel bridge: hard Gumbel-Softmax keeps the forward tokens
  discrete while letting gradients flow back to the codec encoder (sketched
  below).
- Reconstruction losses: mel, multi-scale mel, multi-resolution STFT, complex
  STFT, VQ, GAN, and feature matching losses.


The deployed codec does not require the auxiliary FTP heads.
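

As a rough illustration of the Gumbel bridge, the straight-through trick keeps
hard one-hot tokens in the forward pass while the backward pass uses the soft
Gumbel-Softmax relaxation. This is a minimal sketch rather than the training
code; `logits`, `codebook`, and the shapes are illustrative.


```python
import torch
import torch.nn.functional as F


def gumbel_bridge(logits, codebook, tau=1.0):
    """Straight-through Gumbel-Softmax over codec token logits (sketch).

    logits: (batch, frames, vocab) scores over audio tokens.
    codebook: (vocab, dim) audio-token embedding table.
    """
    # hard=True yields one-hot samples in the forward pass; the backward
    # pass uses the soft relaxation, so gradients reach the encoder
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    token_ids = one_hot.argmax(dim=-1)  # discrete tokens seen by the LM
    embeddings = one_hot @ codebook     # differentiable path to the encoder
    return token_ids, embeddings


logits = torch.randn(2, 200, 20480, requires_grad=True)
codebook = torch.randn(20480, 512)
ids, emb = gumbel_bridge(logits, codebook)
emb.sum().backward()  # gradients flow to `logits`
```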


## Intended Use


This model is intended for research and development in:


- audio tokenization for spoken language modeling
- codec reconstruction experiments
- token-level speech LM training
- Codec-SUPERB style codec evaluation
- speech token analysis and ablation studies


It is not a full text-to-speech system by itself. For speech generation, use the
codec as the tokenizer/decoder inside a separate speech language modeling
pipeline.
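

A minimal sketch of that division of labor: the codec tokenizes and decodes,
while a separate token-level LM produces the continuation. `speech_lm_generate`
below is a hypothetical stand-in, and the single-item `decode_unit` call assumes
a counterpart to the `batch_decode_unit` API shown later in this card.


```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

waveform, sr = torchaudio.load("prompt.wav")
prompt_item = {"audio": {"array": waveform.numpy()[0], "sampling_rate": sr}}

# 1) tokenize prompt audio into discrete units
units = model.extract_unit(prompt_item).unit


# 2) hypothetical stand-in for a separately trained speech LM;
#    here it just echoes the prompt instead of continuing it
def speech_lm_generate(unit_seq):
    return unit_seq


continuation = speech_lm_generate(units)

# 3) decode tokens back to audio (assumes a single-item `decode_unit`
#    counterpart to `batch_decode_unit`)
audio_out = model.decode_unit(continuation)
```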


## Out-of-Scope Use


Do not use this model for:


- impersonation or unauthorized voice cloning
- surveillance or speaker tracking without consent
- high-stakes speaker, language, or identity decisions
- generating deceptive audio content


## Installation


The easiest inference path is through the Codec-SUPERB `SoundCodec` interface.


```bash
git clone https://github.com/voidful/Codec-SUPERB.git
cd Codec-SUPERB
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```


If your environment supports editable installs, this also works and removes the
need for the `PYTHONPATH` export:


```bash
pip install -e .
```


## Quick Start


Load LLM-Codec through the Codec-SUPERB codec registry:


```python
from SoundCodec import codec

print(codec.list_codec())
model = codec.load_codec("llmcodec")
```


Encode and reconstruct one audio file:


```python
from SoundCodec import codec
import torchaudio
import soundfile as sf

model = codec.load_codec("llmcodec")

# load audio and keep the first channel (mono input)
waveform, sample_rate = torchaudio.load("sample_audio.wav")
data_item = {
    "audio": {
        "array": waveform.numpy()[0],
        "sampling_rate": sample_rate,
    }
}

# discrete codec tokens for the clip
units = model.extract_unit(data_item).unit
print("Unit shape:", units.shape)

# full encode/decode round trip
result = model.synth(data_item, local_save=False)
reconstructed = result["audio"]["array"]
reconstructed_sr = result["audio"].get("sampling_rate", sample_rate)

sf.write("reconstructed.wav", reconstructed, reconstructed_sr)
```


## Batch Usage


Codec-SUPERB also provides batch APIs:


```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
data_list = []

for path in audio_files:
    waveform, sample_rate = torchaudio.load(path)
    data_list.append({
        "id": path,
        "audio": {
            "array": waveform.numpy()[0],
            "sampling_rate": sample_rate,
        },
    })

# tokenize and detokenize as separate steps
batch_units = model.batch_extract_unit(data_list)
batch_audio = model.batch_decode_unit(batch_units)

# or run the full round trip in one call
results = model.batch_synth(data_list, local_save=False)
for item in results:
    print(item["unit"].shape, item["audio"]["array"].shape)
```


For better throughput, group audio samples with similar lengths before batching.
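

One simple way to do that grouping, reusing `model` and `data_list` from the
example above: sort by clip length, then slice fixed-size batches so each batch
holds similar durations and padding overhead stays small.


```python
# sort by sample count so each batch holds similar-length clips
data_list.sort(key=lambda item: len(item["audio"]["array"]))

batch_size = 8
for start in range(0, len(data_list), batch_size):
    batch = data_list[start:start + batch_size]
    batch_units = model.batch_extract_unit(batch)
```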


## Codec-SUPERB Evaluation


To evaluate LLM-Codec with Codec-SUPERB-tiny:


```bash
PYTHONPATH=. python3 scripts/dataset_creator.py \
  --dataset voidful/codec-superb-tiny

PYTHONPATH=. python3 scripts/benchmarking.py \
  --dataset datasets/voidful/codec-superb-tiny_synth \
  --models llmcodec
```


## Model Files


The model repository provides:


- codec weights as `llm-codec.pt`
- a tokenizer extended with `<CODEC_*>` audio tokens
- Qwen-compatible model artifacts containing trained audio-token embeddings


The codec uses 20,480 audio tokens with the canonical token format:


```text
<CODEC_0>, <CODEC_1>, ..., <CODEC_20479>
```
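

The mapping between codec units and text tokens is direct: unit id `i` becomes
`<CODEC_i>`. A minimal round-trip sketch, assuming the extended tokenizer in
this repo loads through `AutoTokenizer`:


```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("voidful/llm-codec")

unit_ids = [17, 4093, 20479]                       # example codec unit IDs
audio_tokens = [f"<CODEC_{i}>" for i in unit_ids]  # unit id i -> <CODEC_i>
token_ids = tokenizer.convert_tokens_to_ids(audio_tokens)
print(audio_tokens, token_ids)

# round-trip LM token ids back to codec unit IDs
recovered = [
    int(tok[len("<CODEC_"):-1])
    for tok in tokenizer.convert_ids_to_tokens(token_ids)
]
assert recovered == unit_ids
```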


## Training Data


The codec was trained on LibriSpeech `train-clean-100` with paired transcripts;
validation during training used the LibriSpeech `validation` split.


Because training is speech-centric and transcript-supervised, performance may be
weaker on non-English speech, conversational speech, music, environmental audio,
or audio with strong noise and overlap.


## Training Procedure


Base components:


- Base codec: AUV
- Frozen LLM backbone: Qwen3-4B-Instruct-2507
- Token rate: 50 Hz
- Audio vocabulary size: 20,480
- Segment length: 4 seconds (200 token frames at 50 Hz)


Losses:


- reconstruction mel loss
- multi-scale mel loss
- multi-resolution STFT loss
- complex STFT loss with phase term
- VQ commitment loss
- Gumbel bridge cross entropy
- Future Token Prediction loss
- Semantic Alignment cosine loss
- Semantic Alignment contrastive loss with memory bank
- MPD/MSD GAN and feature matching losses
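

These terms enter training as a weighted sum. The grouping and the lambda
weights below are illustrative placeholders, not the values actually used:


```latex
\mathcal{L} =
    \lambda_{\text{rec}} \, \mathcal{L}_{\text{rec}}
  + \lambda_{\text{vq}} \, \mathcal{L}_{\text{vq}}
  + \lambda_{\text{gum}} \, \mathcal{L}_{\text{gumbel}}
  + \lambda_{\text{ftp}} \, \mathcal{L}_{\text{FTP}}
  + \lambda_{\text{sa}} \, \mathcal{L}_{\text{SA}}
  + \lambda_{\text{adv}} \, \mathcal{L}_{\text{adv}}
```


Here `\mathcal{L}_{\text{rec}}` stands for the combined mel and STFT terms and
`\mathcal{L}_{\text{adv}}` for the GAN and feature-matching terms.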


## Evaluation Results


### Token Learnability


SALMon speech coherence accuracy after token-level LM training:


| Tokenizer | Overall accuracy (%) |
| --- | ---: |
| WavTok-L | 48.3 |
| BigCodec | 49.4 |
| UniCodec | 50.1 |
| AUV | 49.4 |
| LLM-Codec | 61.6 |


Token-level perplexity on LibriSpeech after 3 epochs of LM training:


| Tokenizer | Eval loss | Perplexity |
| --- | ---: | ---: |
| WavTok-L | 11.91 | 148,122 |
| UniCodec | 11.92 | 150,197 |
| BigCodec | 11.96 | 156,448 |
| AUV | 11.98 | 159,768 |
| LLM-Codec | 8.44 | 4,617 |
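

Perplexity here is `exp(eval loss)`, so the two columns cross-check directly;
the small gaps come from the loss being rounded to two decimals:


```python
import math

# perplexity = exp(eval loss); compare with the table values
print(round(math.exp(8.44)))   # ~4629   vs. 4,617 for LLM-Codec
print(round(math.exp(11.91)))  # ~148745 vs. 148,122 for WavTok-L
```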


### Reconstruction Quality


Codec-SUPERB-tiny speech reconstruction (↓ lower is better, ↑ higher is better):


| Model | Mel ↓ | STFT ↓ | PESQ ↑ | STOI ↑ |
| --- | ---: | ---: | ---: | ---: |
| AUV base | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.724 | 1.599 | 2.102 | 0.859 |


## Limitations


- The semantic alignment objective depends on paired speech and text.
- The model is primarily validated on read speech.
- Downstream generation quality depends on the separate speech language model.
- The model may preserve speaker identity information present in the input.
- The Hugging Face `transformers` artifacts are not a standalone text chatbot;
  they accompany the codec/tokenizer workflow.


## Citation


```bibtex
@article{chung2026llm,
  title={LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
  author={Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2604.17852},
  note={Model and code available at https://github.com/voidful/llm-codec},
  year={2026}
}
```


If you use the Codec-SUPERB interface or benchmark, please also cite
Codec-SUPERB:


```bibtex
@inproceedings{wu-etal-2024-codec,
  title = {Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
  author = {Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander and Lee, Hung-yi},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
  year = {2024},
  url = {https://aclanthology.org/2024.findings-acl.616},
  doi = {10.18653/v1/2024.findings-acl.616},
  pages = {10330--10348}
}
```