FBGEMM FP8
With the FBGEMM FP8 quantization method, you can quantize your model in FP8 (W8A8):
- the weights are quantized to 8-bit (FP8) per channel
- the activations are quantized to 8-bit (FP8) per token
It relies on the FBGEMM library, which provides efficient low-precision general matrix multiplication for small batch sizes and supports techniques that minimize accuracy loss, such as row-wise quantization and outlier-aware quantization.
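To make the scheme concrete, here is a minimal sketch in plain PyTorch of what per-channel weight and per-token activation FP8 quantization look like. This is illustrative only, not the FBGEMM kernels themselves; the constant `FP8_E4M3_MAX` and the helper names are assumptions for the example:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_weight_per_channel(weight):
    # One scale per output channel (row of the weight matrix)
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    return (weight / scale).to(torch.float8_e4m3fn), scale

def quantize_activation_per_token(x):
    # One scale per token (row of the activation matrix), computed dynamically
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

w = torch.randn(4096, 4096)
w_fp8, w_scale = quantize_weight_per_channel(w)
# Dequantizing recovers the weights up to per-channel rounding error
print((w - w_fp8.to(torch.float32) * w_scale).abs().max())
```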
You need a GPU with compute capability >= 9.0 (e.g., an H100).
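You can verify this with PyTorch before loading a model:

```python
import torch

# FP8 requires a Hopper-class GPU (compute capability 9.0 or newer)
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")  # prints 9.0 or higher on an H100
```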
Before you begin, make sure the following libraries are installed with their latest version:

```bash
pip install --upgrade accelerate fbgemm-gpu torch
```
If you are having issues with the fbgemm-gpu and torch libraries, you might need to install the nightly release. You can follow the instructions here.
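As a rough sketch of what that typically looks like (the index URL and CUDA tag, `cu121` here, are assumptions that depend on your setup; defer to the linked instructions for the exact command):

```bash
# Nightly wheels from the PyTorch package index; adjust the CUDA tag to your driver
pip install --pre torch fbgemm-gpu --index-url https://download.pytorch.org/whl/nightly/cu121
```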
```python
from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
# Quantize the weights to FP8 on the fly while loading the checkpoint
quantization_config = FbgemmFp8Config()
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

A quantized model can be saved with `save_pretrained` and reused later with `from_pretrained`.
```python
quant_path = "/path/to/save/quantized/model"
quantized_model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
```
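The reloaded model behaves like the one quantized in memory. For instance, reusing the tokenizer and inputs from above:

```python
output = model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```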