ALIA-40b-instruct-2512 (NVFP4)

This repository provides an NVFP4-quantized checkpoint of BSC-LT/ALIA-40b-instruct-2512, produced with NVIDIA TensorRT Model Optimizer (ModelOpt) and intended for high-throughput, memory-efficient inference using TensorRT-LLM on NVIDIA GPUs.

  • Base model: BSC-LT/ALIA-40b-instruct-2512
  • Quantization: NVFP4 (post-training quantization via NVIDIA Model Optimizer)
  • Primary runtime: TensorRT-LLM
  • License: Apache-2.0

⚠️ This is a quantized inference checkpoint. It is not intended for further fine-tuning.


Model Overview

Description

ALIA-40b-instruct-2512 is an instruction-tuned, multilingual large language model from the ALIA family, designed for high-quality instruction following and conversational use, with a focus on the languages of the Iberian Peninsula and English.

This repository contains an NVFP4-quantized variant of the model, optimized for deployment with TensorRT-LLM, significantly reducing memory footprint and improving inference throughput compared to BF16/FP16 weights.

Languages

Primary languages include:

  • Spanish
  • Catalan
  • Basque
  • Galician
  • English

Architecture (from the base model)

  • Parameters: ~40B
  • Layers: 48
  • Hidden size: 8192
  • Attention heads: 64
  • Context length: up to 163,840 tokens
  • Vocabulary size: 256,000
  • Base precision: bfloat16

What is NVFP4?

NVFP4 is NVIDIA's 4-bit floating-point (FP4) format for low-precision inference. In this checkpoint, the weights and activations of the linear layers inside the transformer blocks are quantized to FP4 with fine-grained scaling factors, enabling efficient execution on GPUs with native FP4 support.

This model was produced using NVIDIA Model Optimizer (ModelOpt) and is intended to be consumed by TensorRT-LLM.

Practical implications:

  • Substantially reduced VRAM usage (see the rough estimate after this list)
  • Higher tokens/sec throughput
  • Slight quality degradation possible versus BF16
  • Best suited for production inference, not training
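As a rough, hedged illustration of the VRAM point above (weights only; it ignores the KV cache, activations, scaling-factor overhead, and any layers kept in higher precision):

import math

# Back-of-envelope weight-memory estimate for a ~40B-parameter model.
params = 40e9
bf16_gib = params * 2 / 2**30      # BF16: 2 bytes per parameter
nvfp4_gib = params * 0.5 / 2**30   # NVFP4: ~0.5 bytes per parameter (4 bits)
print(f"BF16 weights:  ~{bf16_gib:.0f} GiB")   # ~75 GiB
print(f"NVFP4 weights: ~{nvfp4_gib:.0f} GiB")  # ~19 GiB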

Intended Use

Direct Use

  • Multilingual chat assistants
  • Question answering
  • Summarization
  • Translation
  • Retrieval-augmented generation (RAG) pipelines

Out-of-Scope Use

  • Malicious or harmful applications
  • High-risk decision making without human oversight
  • Any use that violates applicable laws or ethical standards

How to Use

Recommended: TensorRT-LLM

This model is designed for use with TensorRT-LLM, either via the Python API or through a serving endpoint.

Example: TensorRT-LLM Python API

from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM

# Catalan prompt: "Write a short story in Catalan about three twin sisters."
prompts = [
    "Escriu un petit conte en català que parli de tres germanes bessones."
]

sampling_params = SamplingParams(
    max_tokens=256,
    temperature=0.1
)

llm = LLM(
    model="langtech-innovation/ALIA-40b-instruct-2512_nvfp4",
    tensor_parallel_size=1,   # adjust to number of GPUs
    backend="pytorch"
)

outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)

Serving via HTTP

You can also deploy this model using the TensorRT-LLM server inside an NVIDIA container (e.g. DGX Spark workflows), mounting the model directory and exposing an OpenAI-compatible or custom API.
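As a minimal client-side sketch, assuming an OpenAI-compatible TensorRT-LLM server (for example, one launched with trtllm-serve in recent TensorRT-LLM releases) is already running; the host, port, API key, and served model name below are illustrative placeholders, not values fixed by this repository:

from openai import OpenAI

# Assumes an OpenAI-compatible endpoint at this address (placeholder values).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="langtech-innovation/ALIA-40b-instruct-2512_nvfp4",
    messages=[
        {"role": "user", "content": "Summarize the ALIA project in one sentence."}
    ],
    max_tokens=128,
    temperature=0.1,
)
print(response.choices[0].message.content)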


Prompting / Chat Template

The base model uses a ChatML-style prompt format, applied via:

tokenizer.apply_chat_template(...)

For best results, ensure your serving stack preserves the original chat template semantics from the base model.
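As a hedged sketch of what this looks like with Hugging Face Transformers (it assumes the chat template is loaded from the base model repository; this quantized repository may or may not ship its own copy of the tokenizer files):

from transformers import AutoTokenizer

# Load the tokenizer and its chat template from the base model repository.
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/ALIA-40b-instruct-2512")

messages = [
    {"role": "user", "content": "Explain what NVFP4 quantization is in one paragraph."}
]

# Render the conversation into the prompt string the model expects,
# appending the generation prompt so the model answers as the assistant.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)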


Hardware & Compatibility

  • Designed for NVIDIA GPUs supported by TensorRT-LLM
  • Recommended for recent architectures (e.g. B200, GB200, or DGX Spark environments); native FP4 execution targets Blackwell-class GPUs, so support on earlier GPUs such as H100 depends on your TensorRT-LLM version
  • Kernel availability and performance depend on your TensorRT-LLM and CUDA versions

Limitations and Safety

  • Quantization may slightly affect:
    • Formatting fidelity
    • Multilingual edge cases
    • Numerical reasoning
  • The model may reflect biases present in its training data
  • Deployers should implement:
    • Output filtering
    • Prompt-injection defenses
    • Monitoring and human review for sensitive use cases

Training and Quantization Details

  • Base training: See the BSC-LT/ALIA-40b-instruct-2512 model card for full training, alignment, and data details.
  • Quantization: Post-training NVFP4 quantization using NVIDIA Model Optimizer.
  • Calibration data: Not specified in this repository.
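Because the exact recipe and calibration data are not documented here, the following is only a hedged sketch of a typical ModelOpt NVFP4 post-training quantization flow. The config name NVFP4_DEFAULT_CFG and the export helper export_hf_checkpoint are assumed to be available in recent ModelOpt releases, and the calibration set shown is a placeholder.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

base_id = "BSC-LT/ALIA-40b-instruct-2512"
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Placeholder calibration data; a real run would use a few hundred representative samples
# covering the model's target languages.
calib_texts = ["Calibration text in the model's target languages goes here."]

def forward_loop(m):
    # Run calibration batches so ModelOpt can collect activation statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Apply NVFP4 post-training quantization (config name assumed from recent ModelOpt versions).
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-style quantized checkpoint that TensorRT-LLM can load.
export_hf_checkpoint(model, export_dir="ALIA-40b-instruct-2512_nvfp4")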

Evaluation

No additional evaluation is provided for the NVFP4 checkpoint.

Users are encouraged to benchmark:

  • Instruction following quality
  • Spanish/Catalan performance
  • Safety and refusal behavior
  • Latency, throughput, and VRAM usage

against the BF16 base model.
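As a minimal throughput sketch reusing the TensorRT-LLM Python API from above (the prompt set and batch size are illustrative, and this measures end-to-end wall time only, not time-to-first-token):

import time
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM

llm = LLM(
    model="langtech-innovation/ALIA-40b-instruct-2512_nvfp4",
    backend="pytorch",
)

# A small batch of identical prompts; replace with a representative workload.
prompts = ["Explain retrieval-augmented generation in two sentences."] * 8
params = SamplingParams(max_tokens=128, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_chars = sum(len(o.outputs[0].text) for o in outputs)
print(f"Generated {total_chars} characters for {len(prompts)} prompts in {elapsed:.2f}s")

Running the same script against the BF16 base checkpoint gives a direct point of comparison for the metrics listed above.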


License

This model is released under the Apache License 2.0, consistent with the base model.


Acknowledgements

  • BSC Language Technologies Lab – upstream ALIA model
  • NVIDIA – TensorRT-LLM and Model Optimizer tooling
