ALIA-40b-instruct-2512 (NVFP4)

This repository provides an NVFP4-quantized checkpoint of BSC-LT/ALIA-40b-instruct-2512, produced with NVIDIA TensorRT Model Optimizer (ModelOpt) and intended for high-throughput, memory-efficient inference using TensorRT-LLM on NVIDIA GPUs.

  • Base model: BSC-LT/ALIA-40b-instruct-2512
  • Quantization: NVFP4 (post-training quantization via NVIDIA Model Optimizer)
  • Primary runtime: TensorRT-LLM
  • License: Apache-2.0

⚠️ This is a quantized inference checkpoint. It is not intended for further fine-tuning.


Model Overview

Description

ALIA-40b-instruct-2512 is an instruction-tuned, multilingual large language model from the ALIA family, designed for high-quality instruction following and conversational use, with a focus on the languages of the Iberian Peninsula and English.

This repository contains an NVFP4-quantized variant of the model, optimized for deployment with TensorRT-LLM, significantly reducing memory footprint and improving inference throughput compared to BF16/FP16 weights.

Languages

Primary languages include:

  • Spanish
  • Catalan
  • Basque
  • Galician
  • English

Architecture (from the base model)

  • Parameters: ~40B
  • Layers: 48
  • Hidden size: 8192
  • Attention heads: 64
  • Context length: up to 163,840 tokens
  • Vocabulary size: 256,000
  • Base precision: bfloat16

What is NVFP4?

NVFP4 is NVIDIA's 4-bit floating-point (FP4) format for low-precision inference. In this checkpoint, the weights and activations of the linear layers inside the transformer blocks are quantized to FP4 with fine-grained scaling factors, enabling efficient execution on GPUs with native FP4 support.

This model was produced using NVIDIA Model Optimizer (ModelOpt) and is intended to be consumed by TensorRT-LLM.

Practical implications:

  • Substantially reduced VRAM usage (see the rough estimate after this list)
  • Higher tokens/sec throughput
  • Slight quality degradation possible versus BF16
  • Best suited for production inference, not training
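As a rough, hedged illustration of the VRAM point above (weights only; it ignores the KV cache, activations, scaling-factor overhead, and any layers kept in higher precision):

import math

# Back-of-envelope weight-memory estimate for a ~40B-parameter model.
params = 40e9
bf16_gib = params * 2 / 2**30      # BF16: 2 bytes per parameter
nvfp4_gib = params * 0.5 / 2**30   # NVFP4: ~0.5 bytes per parameter (4 bits)
print(f"BF16 weights:  ~{bf16_gib:.0f} GiB")   # ~75 GiB
print(f"NVFP4 weights: ~{nvfp4_gib:.0f} GiB")  # ~19 GiB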

Intended Use

Direct Use

  • Multilingual chat assistants
  • Question answering
  • Summarization
  • Translation
  • Retrieval-augmented generation (RAG) pipelines

Out-of-Scope Use

  • Malicious or harmful applications
  • High-risk decision making without human oversight
  • Any use that violates applicable laws or ethical standards

How to Use

Recommended: TensorRT-LLM

This model is designed for use with TensorRT-LLM, either via the Python API or through a serving endpoint.

Example: TensorRT-LLM Python API

from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM

# Catalan prompt: "Write a short story in Catalan about three twin sisters."
prompts = [
    "Escriu un petit conte en català que parli de tres germanes bessones."
]

sampling_params = SamplingParams(
    max_tokens=256,
    temperature=0.1
)

llm = LLM(
    model="langtech-innovation/ALIA-40b-instruct-2512_nvfp4",
    tensor_parallel_size=1,   # adjust to number of GPUs
    backend="pytorch"
)

outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)

Serving via HTTP

You can also deploy this model using the TensorRT-LLM server inside an NVIDIA container (e.g. DGX Spark workflows), mounting the model directory and exposing an OpenAI-compatible or custom API.
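As a minimal client-side sketch, assuming an OpenAI-compatible TensorRT-LLM server (for example, one launched with trtllm-serve in recent TensorRT-LLM releases) is already running; the host, port, API key, and served model name below are illustrative placeholders, not values fixed by this repository:

from openai import OpenAI

# Assumes an OpenAI-compatible endpoint at this address (placeholder values).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="langtech-innovation/ALIA-40b-instruct-2512_nvfp4",
    messages=[
        {"role": "user", "content": "Summarize the ALIA project in one sentence."}
    ],
    max_tokens=128,
    temperature=0.1,
)
print(response.choices[0].message.content)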


Prompting / Chat Template

The base model uses a ChatML-style prompt format, applied via:

tokenizer.apply_chat_template(...)

For best results, ensure your serving stack preserves the original chat template semantics from the base model.
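As a hedged sketch of what this looks like with Hugging Face Transformers (it assumes the chat template is loaded from the base model repository; this quantized repository may or may not ship its own copy of the tokenizer files):

from transformers import AutoTokenizer

# Load the tokenizer and its chat template from the base model repository.
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/ALIA-40b-instruct-2512")

messages = [
    {"role": "user", "content": "Explain what NVFP4 quantization is in one paragraph."}
]

# Render the conversation into the prompt string the model expects,
# appending the generation prompt so the model answers as the assistant.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)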


Hardware & Compatibility

  • Designed for NVIDIA GPUs supported by TensorRT-LLM
  • Recommended for recent architectures (e.g. B200, GB200, or DGX Spark environments); native FP4 execution targets Blackwell-class GPUs, so support on earlier GPUs such as H100 depends on your TensorRT-LLM version
  • Kernel availability and performance depend on your TensorRT-LLM and CUDA versions

Limitations and Safety

  • Quantization may slightly affect:
    • Formatting fidelity
    • Multilingual edge cases
    • Numerical reasoning
  • The model may reflect biases present in its training data
  • Deployers should implement:
    • Output filtering
    • Prompt-injection defenses
    • Monitoring and human review for sensitive use cases

Training and Quantization Details

  • Base training: See the BSC-LT/ALIA-40b-instruct-2512 model card for full training, alignment, and data details.
  • Quantization: Post-training NVFP4 quantization using NVIDIA Model Optimizer.
  • Calibration data: Not specified in this repository.
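Because the exact recipe and calibration data are not documented here, the following is only a hedged sketch of a typical ModelOpt NVFP4 post-training quantization flow. The config name NVFP4_DEFAULT_CFG and the export helper export_hf_checkpoint are assumed to be available in recent ModelOpt releases, and the calibration set shown is a placeholder.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

base_id = "BSC-LT/ALIA-40b-instruct-2512"
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Placeholder calibration data; a real run would use a few hundred representative samples
# covering the model's target languages.
calib_texts = ["Calibration text in the model's target languages goes here."]

def forward_loop(m):
    # Run calibration batches so ModelOpt can collect activation statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Apply NVFP4 post-training quantization (config name assumed from recent ModelOpt versions).
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-style quantized checkpoint that TensorRT-LLM can load.
export_hf_checkpoint(model, export_dir="ALIA-40b-instruct-2512_nvfp4")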

Evaluation

No additional evaluation is provided for the NVFP4 checkpoint.

Users are encouraged to benchmark:

  • Instruction following quality
  • Spanish/Catalan performance
  • Safety and refusal behavior
  • Latency, throughput, and VRAM usage

against the BF16 base model.
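As a minimal throughput sketch reusing the TensorRT-LLM Python API from above (the prompt set and batch size are illustrative, and this measures end-to-end wall time only, not time-to-first-token):

import time
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM

llm = LLM(
    model="langtech-innovation/ALIA-40b-instruct-2512_nvfp4",
    backend="pytorch",
)

# A small batch of identical prompts; replace with a representative workload.
prompts = ["Explain retrieval-augmented generation in two sentences."] * 8
params = SamplingParams(max_tokens=128, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_chars = sum(len(o.outputs[0].text) for o in outputs)
print(f"Generated {total_chars} characters for {len(prompts)} prompts in {elapsed:.2f}s")

Running the same script against the BF16 base checkpoint gives a direct point of comparison for the metrics listed above.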


License

This model is released under the Apache License 2.0, consistent with the base model.


Acknowledgements

  • BSC Language Technologies Lab – upstream ALIA model
  • NVIDIA – TensorRT-LLM and Model Optimizer tooling
