You are viewing v4.21.0 version. A newer version v5.8.1 is available.
BERTology
There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (which some call “BERTology”). Some good examples of this field are:
- BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
- Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
- What Does BERT Look At? An Analysis of BERT’s Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
To help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models that give access to the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
- accessing all the hidden-states of BERT/GPT/GPT-2,
- accessing all the attention weights for each head of BERT/GPT/GPT-2,
- retrieving the heads' output values and gradients in order to compute a head importance score and prune heads, as explained in https://arxiv.org/abs/1905.10650.
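As a rough illustration of the first two items, the sketch below requests every hidden state and every attention map from a BERT model in a single forward pass. It is a minimal example rather than the recommended workflow: the tiny configuration values and the token ids are arbitrary assumptions chosen so that the model is randomly initialised and nothing has to be downloaded, and `attn_implementation="eager"` is an assumption aimed at newer releases (on v4.21 it should simply be stored on the config and have no effect).

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny configuration (random weights, no download needed).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
    attn_implementation="eager",  # assumption: only relevant on newer releases
)
model = BertModel(config)
model.eval()  # disable dropout so the attention rows sum to 1

# Arbitrary token ids: one sequence of five tokens.
input_ids = torch.tensor([[1, 5, 9, 12, 2]])

with torch.no_grad():
    outputs = model(
        input_ids,
        output_hidden_states=True,  # return the embeddings and every layer output
        output_attentions=True,     # return the attention weights of every layer
    )

# Tuple with num_hidden_layers + 1 entries (embedding output first),
# each of shape (batch, sequence_length, hidden_size).
hidden_states = outputs.hidden_states

# Tuple with num_hidden_layers entries,
# each of shape (batch, num_heads, sequence_length, sequence_length).
attentions = outputs.attentions

print(len(hidden_states), tuple(hidden_states[-1].shape))
print(len(attentions), tuple(attentions[0].shape))
```

With a pretrained checkpoint the same two keyword arguments apply; only the way the model is loaded changes.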
To help you understand and use these features, we have added a specific example script, bertology.py, which extracts information from and prunes a model pre-trained on GLUE.
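The head-importance and pruning idea can be sketched on a toy model as follows. This is not the script itself: the tiny configuration, the token ids, and the use of the pooled output as a stand-in loss are all assumptions made for illustration, and the importance score here is simply the absolute gradient of that loss with respect to a head mask of ones. It relies on the `head_mask` forward argument and on `prune_heads`, as available in the version this page documents.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)

# Hypothetical tiny configuration (random weights, no download needed).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
    attn_implementation="eager",  # assumption: only relevant on newer releases
)
model = BertModel(config)
model.eval()

input_ids = torch.tensor([[1, 5, 9, 12, 2]])  # arbitrary token ids

# One mask value per (layer, head); all ones keeps every head active.
head_mask = torch.ones(
    config.num_hidden_layers, config.num_attention_heads, requires_grad=True
)

outputs = model(input_ids, head_mask=head_mask)

# Stand-in loss (assumption): a real study would use the task loss instead.
loss = outputs.pooler_output.sum()
loss.backward()

# Gradient magnitude w.r.t. the mask as a simple head importance score.
importance = head_mask.grad.abs()

# Prune the least important head of layer 0.
least = int(importance[0].argmin())
model.prune_heads({0: [least]})

with torch.no_grad():
    pruned_out = model(input_ids)

print(importance)
print(model.encoder.layer[0].attention.self.num_attention_heads)
```

After pruning, layer 0 runs with one head fewer while the output shape of the model is unchanged, and the pruned heads are recorded in `model.config.pruned_heads`.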