# ALIA-40b-instruct-2512 (NVFP4)
This repository provides an NVFP4-quantized checkpoint of BSC-LT/ALIA-40b-instruct-2512, produced with NVIDIA TensorRT Model Optimizer (ModelOpt) and intended for high-throughput, memory-efficient inference using TensorRT-LLM on NVIDIA GPUs.
- Base model: BSC-LT/ALIA-40b-instruct-2512
- Quantization: NVFP4 (post-training quantization via NVIDIA TensorRT Model Optimizer)
- Primary runtime: TensorRT-LLM
- License: Apache-2.0
⚠️ This is a quantized inference checkpoint. It is not intended for further fine-tuning.
## Model Overview

### Description
ALIA-40b-instruct-2512 is an instruction-tuned, multilingual large language model from the ALIA family, designed for high-quality instruction following and conversational use, with a focus on Iberian and English languages.
This repository contains an NVFP4-quantized variant of the model, optimized for deployment with TensorRT-LLM; quantization significantly reduces the memory footprint and improves inference throughput compared to BF16/FP16 weights.
### Languages
Primary languages include:
- Spanish
- Catalan
- Basque
- Galician
- English
### Architecture (from the base model)
- Parameters: ~40B
- Layers: 48
- Hidden size: 8192
- Attention heads: 64
- Context length: up to 163,840 tokens
- Vocabulary size: 256,000
- Base precision: bfloat16
## What is NVFP4?
NVFP4 is NVIDIA's 4-bit floating-point format for efficient inference: the weights and activations of linear layers inside transformer blocks are stored as 4-bit (E2M1) values with fine-grained per-block scale factors.
This model was produced using NVIDIA TensorRT Model Optimizer (ModelOpt) and is intended to be consumed by TensorRT-LLM; a sketch of that workflow follows the list below.
Practical implications:
- Substantially reduced VRAM usage
- Higher tokens/sec throughput
- Slight quality degradation possible versus BF16
- Best suited for production inference, not training
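As context, a typical ModelOpt post-training quantization workflow looks like the following sketch. This is not the exact recipe used for this checkpoint (the calibration data is not specified in this repository), and `calib_dataloader` is a hypothetical placeholder:

```python
# Minimal ModelOpt NVFP4 PTQ sketch; not the exact recipe used for this
# checkpoint. `calib_dataloader` is a hypothetical placeholder.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "BSC-LT/ALIA-40b-instruct-2512", torch_dtype="bfloat16"
)

def forward_loop(model):
    # Run a small calibration set through the model so ModelOpt can
    # collect the activation statistics used to pick NVFP4 scales.
    for batch in calib_dataloader:
        model(**batch)

# Quantize the linear layers to NVFP4 with the default config.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```

The quantized model is then exported to a checkpoint format TensorRT-LLM can load (for example via `modelopt.torch.export`).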
## Intended Use

### Direct Use
- Multilingual chat assistants
- Question answering
- Summarization
- Translation
- Retrieval-augmented generation (RAG) pipelines
### Out-of-Scope Use
- Malicious or harmful applications
- High-risk decision making without human oversight
- Any use that violates applicable laws or ethical standards
## How to Use

### Recommended: TensorRT-LLM
This model is designed for use with TensorRT-LLM, either via the Python API or through a serving endpoint.
#### Example: TensorRT-LLM Python API
```python
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM

prompts = [
    # "Write a short story in Catalan about three twin sisters."
    "Escriu un petit conte en català que parli de tres germanes bessones."
]

sampling_params = SamplingParams(
    max_tokens=256,
    temperature=0.1,
)

llm = LLM(
    model="langtech-innovation/ALIA-40b-instruct-2512_nvfp4",
    tensor_parallel_size=1,  # adjust to the number of available GPUs
    backend="pytorch",
)

outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```
### Serving via HTTP
You can also deploy this model using the TensorRT-LLM server inside an NVIDIA container (e.g. DGX Spark workflows), mounting the model directory and exposing an OpenAI-compatible or custom API.
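For example, assuming the server exposes an OpenAI-compatible endpoint at `http://localhost:8000/v1` (the URL, API key, and served model name below are assumptions to adapt to your deployment), a minimal client looks like:

```python
# Minimal client sketch; the endpoint URL and served model name are
# assumptions that depend on how the server was launched.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="langtech-innovation/ALIA-40b-instruct-2512_nvfp4",
    messages=[
        {"role": "user", "content": "Summarize what NVFP4 quantization is in two sentences."}
    ],
    max_tokens=128,
    temperature=0.1,
)
print(response.choices[0].message.content)
```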
## Prompting / Chat Template
The base model uses a ChatML-style prompt format, applied via `tokenizer.apply_chat_template(...)`.
For best results, ensure your serving stack preserves the original chat template semantics from the base model.
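As a minimal sketch (assuming this repository ships the base model's tokenizer and chat template), the prompt can be rendered with `transformers`:

```python
# Render the ChatML-style prompt with the model's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "langtech-innovation/ALIA-40b-instruct-2512_nvfp4"
)

messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "Explain post-training quantization briefly."},
]

# tokenize=False returns the rendered prompt string;
# add_generation_prompt=True appends the assistant turn header.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```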
## Hardware & Compatibility
- Designed for NVIDIA GPUs supported by TensorRT-LLM
- Recommended for Blackwell-class GPUs with native FP4 tensor-core support (e.g. B200 or DGX Spark environments)
- Kernel availability and performance depend on your TensorRT-LLM and CUDA versions
## Limitations and Safety
Quantization may slightly affect:
- Formatting fidelity
- Multilingual edge cases
- Numerical reasoning
The model may also reflect biases present in its training data.
Deployers should implement:
- Output filtering
- Prompt-injection defenses
- Monitoring and human review for sensitive use cases
## Training and Quantization Details
- Base training: See the BSC-LT/ALIA-40b-instruct-2512 model card for full training, alignment, and data details.
- Quantization: Post-training NVFP4 quantization using NVIDIA TensorRT Model Optimizer.
- Calibration data: Not specified in this repository.
## Evaluation
No additional evaluation is provided for the NVFP4 checkpoint.
Users are encouraged to benchmark the NVFP4 checkpoint against the BF16 base model on:
- Instruction-following quality
- Spanish/Catalan performance
- Safety and refusal behavior
- Latency, throughput, and VRAM usage
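As a rough starting point for the latency/throughput comparison, the sketch below times a small batch with the Python API. The prompts and settings are illustrative, and the `token_ids` field on completion outputs is assumed to be populated by your TensorRT-LLM version:

```python
# Rough tokens/sec sketch; illustrative, not a rigorous benchmark.
import time

from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM

llm = LLM(model="langtech-innovation/ALIA-40b-instruct-2512_nvfp4")
params = SamplingParams(max_tokens=256, temperature=0.0)
prompts = ["Summarize the plot of Don Quixote in one paragraph."] * 8

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across the batch and report throughput.
generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s")
```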
## License
This model is released under the Apache License 2.0, consistent with the base model.
## Acknowledgements
- BSC Language Technologies Lab – upstream ALIA model
- NVIDIA – TensorRT-LLM and Model Optimizer tooling