🎻 Fine-tune Transformers in PyTorch using 🤗 Transformers

Complete tutorial on how to fine-tune 73 transformer models for text classification, with no code changes necessary!

George Mihaila
Oct 9, 2020
Me trying to look like I’m ‘fine-tuning’ a multimeter.


This notebook is designed to use a pretrained transformers model and fine-tune it on a classification task. The focus of this tutorial will be on the code itself and how to adjust it to your needs.

This notebook uses the AutoClasses functionality from the Hugging Face transformers library. This functionality can infer a model’s configuration, tokenizer and architecture from just the model’s name, which makes the code reusable across a large number of transformers models!

What should I know for this notebook?

I provide enough instructions and comments that you can follow along with minimal Python coding knowledge.

Since I am using PyTorch to fine-tune our transformers models, any knowledge of PyTorch is very useful. Knowing a little bit about the transformers library helps too.

How to use this notebook?

I built this notebook with reusability in mind. The way I load the dataset into the PyTorch Dataset class is pretty standard and can be easily reused for any other dataset.

The only modifications needed to use your own dataset are in the part of the MovieReviewsDataset class (a subclass of PyTorch Dataset) that reads in the data. The DataLoader returns each batch as a dictionary of inputs, so it can be fed straight to the model using the statement: outputs = model(**batch). As long as this statement holds, the rest of the code will work!

What transformers models work with this notebook?

I rarely use a model other than BERT for text classification. But when there is a need to run a different transformer architecture, which ones work with this code?

Since the name of the notebook is finetune_transformers, it should work with more than one type of transformer.

I ran this notebook across all the pretrained models found on Hugging Face Transformers. This way you know ahead of time whether the model you plan to use works with this code without any modifications.

The list of pretrained transformers models that work with this notebook can be found here. There are 73 models that worked 😄 and 33 models that failed to work 😢 with this notebook.


This notebook covers fine-tuning transformers for a binary classification task. I will use the well-known Large Movie Review Dataset, in which movie reviews are labeled as positive or negative.

The description provided on the Stanford website:

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.

Why this dataset? I believe it is an easy dataset to understand and use for classification. I think sentiment data is always fun to work with.


Now let’s do some coding! We will go through each coding cell in the notebook and describe what it does, show the code, and, when relevant, show the output.

I chose this format so it is easy to follow if you decide to run each code cell in your own Python notebook.

When I learn from a tutorial I always try to replicate the results. I believe it’s easy to follow along if you have the code next to the explanations.


Download the Large Movie Review Dataset and unzip it locally.
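A minimal sketch of this step in plain Python, assuming the archive is still hosted at the Stanford URL below. The URL constant and the helper name download_and_extract are assumptions for illustration, not taken from the notebook, which may well use shell commands instead:

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed location of the dataset archive; verify it before relying on it.
DATASET_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def download_and_extract(url, target_dir):
    """Download a .tar.gz archive and unpack it inside target_dir."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    archive_path = target_dir / "archive.tar.gz"
    # urlretrieve copies the remote (or file://) resource to a local file.
    urllib.request.urlretrieve(url, str(archive_path))
    # Only unpack archives you trust: extractall writes the paths they contain.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
    return target_dir
```

Calling download_and_extract(DATASET_URL, "data") should leave an aclImdb folder with train and test subfolders, each holding pos and neg directories of review files.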


  • transformers library needs to be installed to use all the awesome code from Hugging Face. To get the latest version I will install it straight from GitHub.
  • ml_things library used for various machine learning related tasks. I created this library to reduce the amount of code I need to write for each machine learning project. Give it a try!


Import all needed libraries for this notebook.

Declare parameters used for this notebook:

  • set_seed(123) - Always good to set a fixed seed for reproducibility.
  • epochs - Number of training epochs (the BERT authors recommend between 2 and 4).
  • batch_size - Number of examples per batch, chosen depending on the max sequence length and GPU memory. For a sequence length of 512, a batch of 10 USUALLY works without CUDA memory issues. For small sequence lengths you can try a batch of 32 or higher.
  • max_length - Pad or truncate text sequences to a specific length. I will set it to 60 tokens to speed up training.
  • device - Look for gpu to use. I will use cpu by default if no gpu found.
  • model_name_or_path - Name of transformers model - will use already pretrained model. Path of transformer model - will load your own model from local disk. I always like to start off with bert-base-cased: 12-layer, 768-hidden, 12-heads, 109M parameters. Trained on cased English text.
  • labels_ids - Dictionary of labels and their id - this will be used to convert string labels to numbers.
  • n_labels - How many labels are we using in this dataset. This is used to decide size of classification head.
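The parameters above could be declared roughly as follows. This is a sketch, not the notebook's exact cell: epochs and batch_size are inferred from the training log later in this post (4 epochs, 782 batches over 25,000 reviews), the label-to-id mapping is an assumed ordering, and torch.manual_seed stands in for the set_seed(123) call so that the snippet only depends on PyTorch.

```python
import math

import torch

# Stand-in for set_seed(123): this fixes only the PyTorch random seed.
torch.manual_seed(123)

epochs = 4        # inferred: four epochs appear in the training log
batch_size = 32   # inferred: 25,000 reviews / 32 gives 782 batches per epoch
max_length = 60   # pad or truncate every review to 60 tokens
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name_or_path = "bert-base-cased"

# Assumed mapping from label name to numeric id.
labels_ids = {"neg": 0, "pos": 1}
n_labels = len(labels_ids)  # decides the size of the classification head

batches_per_epoch = math.ceil(25_000 / batch_size)
print(device, n_labels, batches_per_epoch)
```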

Helper Functions

I like to keep all Classes and functions that I will use in this notebook under this section to help maintain a clean look of the notebook:


If you have worked with PyTorch before, the MovieReviewsDataset class is pretty standard. We need this class to read in our dataset, parse it, use the tokenizer to transform text into numbers, and get it into a nice format to be fed to the model.

Lucky for us, Hugging Face thought of everything and made the tokenizer do all the heavy lifting (splitting text into tokens, padding, truncating, encoding text into numbers), and it is very easy to use!

In this class I only need to read in the content of each file, use fix_text to fix any Unicode problems and keep track of positive and negative sentiments.

I append all texts and labels to lists. Later I feed the texts to the tokenizer and map the labels through the label ids, so that everything is transformed into numbers.

There are three main parts of this PyTorch Dataset class:

  • __init__() where we read in the dataset and transform text and labels into numbers.
  • __len__() where we need to return the number of examples we read in. This is used when calling len(MovieReviewsDataset()).
  • __getitem__() always takes as input an int value that represents which example to return from our dataset. If a value of 3 is passed, we will return the example from our dataset at position 3. It needs to return an object in a format that can be fed to our model. Luckily our tokenizer does that for us and returns a dictionary of variables ready to be fed to the model in this way: model(**inputs).
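The three parts above can be sketched as follows. This is an illustrative version, not the notebook's exact class: the argument names use_tokenizer and max_sequence_len are assumptions, the folder layout <path>/<label>/ is assumed from the dataset description, and the fix_text clean-up step is left out to avoid an extra dependency.

```python
import os

import torch
from torch.utils.data import Dataset


class MovieReviewsDataset(Dataset):
    """Read reviews from <path>/<label>/ folders and tokenize them once."""

    def __init__(self, path, use_tokenizer, labels_ids, max_sequence_len=None):
        texts, labels = [], []
        for label, label_id in labels_ids.items():
            label_dir = os.path.join(path, label)
            for file_name in sorted(os.listdir(label_dir)):
                file_path = os.path.join(label_dir, file_name)
                with open(file_path, encoding="utf-8") as review_file:
                    texts.append(review_file.read())  # fix_text would go here
                labels.append(label_id)
        # A Hugging Face tokenizer called like this returns a dict-like object
        # of tensors (input_ids, attention_mask, ...), one row per text.
        self.inputs = dict(use_tokenizer(
            texts, padding=True, truncation=True,
            max_length=max_sequence_len, return_tensors="pt"))
        self.inputs["labels"] = torch.tensor(labels)

    def __len__(self):
        return len(self.inputs["labels"])

    def __getitem__(self, item):
        # One example as a dict; batches of these can go to model(**batch).
        return {key: values[item] for key, values in self.inputs.items()}
```

Tokenizing everything once in __init__ keeps __getitem__ trivial, at the cost of holding the whole encoded dataset in memory.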

train(dataloader, optimizer_, scheduler_, device_)

I created this function to perform a full pass through the DataLoader object (the DataLoader object is created from our Dataset type object using the MovieReviewsDataset class). This is basically one epoch of training through the entire dataset.

The dataloader is created from PyTorch DataLoader which takes the object created from MovieReviewsDataset class and puts each example in batches. This way we can feed our model batches of data!

The optimizer_ and scheduler_ are very common in PyTorch. They are required to update the parameters of our model and to update our learning rate during training. There is a lot more to it than that, but I won’t go into details. This can actually be a huge rabbit hole, since A LOT happens behind these functions that we don’t need to worry about. Thank you PyTorch!

In the process we keep track of the actual labels and the predicted labels along with the loss.
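A rough sketch of such a training pass is shown here. It differs from the signature in the heading by taking model as an explicit first argument (an assumption made so the snippet is self-contained), and the gradient clipping value of 1.0 is a common default rather than something stated in this post.

```python
import torch


def train(model, dataloader, optimizer_, scheduler_, device_):
    """Run one training epoch; return true labels, predictions and mean loss."""
    true_labels, predictions_labels = [], []
    total_loss = 0.0
    model.train()  # enable dropout and other training-only behaviour
    for batch in dataloader:
        true_labels += batch["labels"].tolist()
        batch = {key: value.to(device_) for key, value in batch.items()}
        model.zero_grad()
        # With labels in the batch, the model is expected to return
        # the loss first and the logits second.
        loss, logits = model(**batch)[:2]
        total_loss += loss.item()
        loss.backward()
        # Clip gradients to norm 1.0 (assumed value) to avoid exploding updates.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer_.step()   # update the model parameters
        scheduler_.step()   # update the learning rate
        predictions_labels += logits.argmax(dim=-1).tolist()
    return true_labels, predictions_labels, total_loss / len(dataloader)
```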

validation(dataloader, device_)

I implemented this function in a very similar way to train, but without the parameter updates, backward pass and gradient descent part. We don’t need to do all of those VERY computationally intensive tasks because we only care about our model’s predictions.

I use the DataLoader in a similar way as in train to get batches to feed to our model.

In the process I keep track of the actual labels and the predicted labels along with the loss.
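A matching sketch for the evaluation pass follows. As with the training sketch, model is passed in explicitly here, which is an assumption and not part of the signature in the heading.

```python
import torch


def validation(model, dataloader, device_):
    """Evaluate without updating weights; return labels, predictions, mean loss."""
    true_labels, predictions_labels = [], []
    total_loss = 0.0
    model.eval()  # switch off dropout
    for batch in dataloader:
        true_labels += batch["labels"].tolist()
        batch = {key: value.to(device_) for key, value in batch.items()}
        with torch.no_grad():  # skip gradient bookkeeping to save time and memory
            loss, logits = model(**batch)[:2]
        total_loss += loss.item()
        predictions_labels += logits.argmax(dim=-1).tolist()
    return true_labels, predictions_labels, total_loss / len(dataloader)
```

The only differences from a training pass are model.eval(), the torch.no_grad() context, and the absence of backward, optimizer and scheduler calls.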

Load Model and Tokenizer

Here I load the three essential parts of the pretrained transformer: configuration, tokenizer and model. I also need to move the model to the device I’m planning to use (GPU / CPU).

Since I use the AutoClass functionality from Hugging Face, I only need to worry about the model’s name as input and the rest is handled by the transformers library.
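One way this loading step might look with the AutoClasses is sketched below. The helper name load_model_parts is hypothetical and only wraps the three from_pretrained calls; passing a model name downloads the weights, while passing a local folder loads from disk.

```python
import torch
from transformers import (AutoConfig, AutoModelForSequenceClassification,
                          AutoTokenizer)


def load_model_parts(model_name_or_path, n_labels, device):
    """Load configuration, tokenizer and classification model by name or path."""
    # num_labels sets the size of the classification head.
    config = AutoConfig.from_pretrained(model_name_or_path, num_labels=n_labels)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name_or_path, config=config)
    model.to(device)  # move the weights to GPU or CPU
    return config, tokenizer, model
```

For example, load_model_parts("bert-base-cased", n_labels=2, device=torch.device("cpu")) would fetch the BERT checkpoint mentioned earlier; swapping the name is all it should take to try another architecture.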

Dataset and DataLoader

This is where I create the PyTorch Dataset and DataLoader objects that will be used to feed data into our model.

This is where I use the MovieReviewsDataset class and create the dataset variables. Since data is partitioned for both train and test I will create a PyTorch Dataset and PyTorch DataLoader object for train and test. ONLY for simplicity I will use the test as validation. In practice NEVER USE THE TEST DATA FOR VALIDATION!
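To illustrate how a DataLoader turns per-example dictionaries into batch dictionaries, here is a small self-contained sketch. ToyDataset is a hypothetical stand-in for MovieReviewsDataset with made-up token ids, and shuffling only the training loader is a common convention assumed here.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in: 10 examples, each with 6 token ids."""

    def __init__(self):
        self.inputs = {
            "input_ids": torch.arange(60).reshape(10, 6),
            "attention_mask": torch.ones(10, 6, dtype=torch.long),
            "labels": torch.tensor([0, 1] * 5),
        }

    def __len__(self):
        return 10

    def __getitem__(self, item):
        return {key: values[item] for key, values in self.inputs.items()}


# Shuffle only the training data; keep evaluation order fixed.
train_dataloader = DataLoader(ToyDataset(), batch_size=4, shuffle=True)
valid_dataloader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)

# The default collate function stacks each dictionary key across the batch.
batch = next(iter(valid_dataloader))
print({key: tuple(value.shape) for key, value in batch.items()})
print(len(valid_dataloader))  # number of batches: ceil(10 / 4)
```

Each batch keeps the same keys as a single example, which is why it can be unpacked directly with model(**batch).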


I create an optimizer and scheduler that will be used by PyTorch in training.
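As a sketch of what such a pair can look like using only PyTorch: the learning rate, epsilon and the linear decay schedule below are assumptions based on common fine-tuning practice, not values stated in this post, and a small linear layer stands in for the transformer.

```python
import torch

model = torch.nn.Linear(8, 2)  # hypothetical stand-in for the transformer

epochs, batches_per_epoch = 4, 782
total_steps = epochs * batches_per_epoch

# lr and eps are assumed values in the range often used for fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8)

# Linear decay: the learning rate shrinks from its initial value to 0
# by the time total_steps scheduler steps have been taken.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps))

start_lr = scheduler.get_last_lr()[0]
for _ in range(total_steps):
    optimizer.step()   # in real training this follows loss.backward()
    scheduler.step()
final_lr = scheduler.get_last_lr()[0]
print(start_lr, final_lr)
```

The transformers library also provides get_linear_schedule_with_warmup, which produces the same decay shape with an optional warm-up phase at the start.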

I loop through the number of defined epochs and call the train and validation functions.

I will output similar info after each epoch as in Keras: train_loss - val_loss - train_acc - valid_acc.

After training, I plot the train and validation loss and accuracy curves to check how the training went.

100%|██████████| 4/4 [13:49<00:00, 207.37s/it]
Training on batches...
100%|██████████| 782/782 [02:40<00:00, 4.86it/s]
Validation on batches...
train_loss: 0.44816 - val_loss: 0.38655 - train_acc: 0.78372 - valid_acc: 0.81892

Training on batches...
100%|██████████| 782/782 [02:40<00:00, 4.86it/s]
Validation on batches...
100%|██████████| 782/782 [02:13<00:00, 5.88it/s]
train_loss: 0.29504 - val_loss: 0.43493 - train_acc: 0.87352 - valid_acc: 0.82360

Training on batches...
100%|██████████| 782/782 [02:40<00:00, 4.87it/s]
Validation on batches...
100%|██████████| 782/782 [01:43<00:00, 7.58it/s]
train_loss: 0.16901 - val_loss: 0.48433 - train_acc: 0.93544 - valid_acc: 0.82624

Training on batches...
100%|██████████| 782/782 [02:40<00:00, 4.87it/s]
Validation on batches...
train_loss: 0.09816 - val_loss: 0.73001 - train_acc: 0.96936 - valid_acc: 0.82144
[Figure: train and validation loss]
[Figure: validation accuracy]

It looks like a little over one epoch is enough training for this model and dataset.
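That reading can be checked directly against the numbers in the log. The short sketch below copies the four logged loss values and finds the epoch with the lowest validation loss; the variable names are only illustrative.

```python
# Loss values after each of the four epochs, copied from the training log.
train_loss = [0.44816, 0.29504, 0.16901, 0.09816]
val_loss = [0.38655, 0.43493, 0.48433, 0.73001]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Gap between validation and training loss; a growing gap signals overfitting.
gaps = [v - t for v, t in zip(val_loss, train_loss)]

print(best_epoch)
print([round(gap, 3) for gap in gaps])
```

Validation loss is lowest after the first epoch and rises afterwards while training loss keeps falling, which is the usual signature of overfitting.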


When dealing with classification it’s useful to look at precision, recall and F1 score. Another good thing to look at when evaluating the model is the confusion matrix.

Outputs:

100%|██████████| 782/782 [00:46<00:00, 16.77it/s]

              precision    recall  f1-score   support

         neg       0.83      0.81      0.82     12500
         pos       0.81      0.83      0.82     12500

    accuracy                           0.82     25000
   macro avg       0.82      0.82      0.82     25000
weighted avg       0.82      0.82      0.82     25000

[Figure: confusion matrix]

Results are not great, but for this tutorial we are not interested in performance.
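To make the link between the confusion matrix and the per-class scores concrete, here is a dependency-free sketch on a tiny made-up set of labels (the eight labels below are invented for illustration and are not the model's predictions). In practice, scikit-learn's classification_report and confusion_matrix functions compute the same quantities.

```python
def confusion_matrix(y_true, y_pred, n_labels):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_labels for _ in range(n_labels)]
    for true_id, pred_id in zip(y_true, y_pred):
        matrix[true_id][pred_id] += 1
    return matrix


def per_class_scores(matrix, label_id):
    """Precision, recall and F1 for one class, read off the matrix."""
    true_positive = matrix[label_id][label_id]
    predicted = sum(row[label_id] for row in matrix)  # column total
    actual = sum(matrix[label_id])                    # row total
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up example: 0 = neg, 1 = pos.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

matrix = confusion_matrix(y_true, y_pred, n_labels=2)
neg_scores = per_class_scores(matrix, 0)
pos_scores = per_class_scores(matrix, 1)
print(matrix)
print(neg_scores, pos_scores)
```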

Final Note

If you made it this far, congrats 🎊 and thank you 🙏 for your interest in my tutorial!

I’ve been using this code for a while now and I feel it has reached a point where it is nicely documented and easy to follow.

Of course, it is easy for me to follow because I built it. That is why any feedback is welcome: it helps me improve my future tutorials!

If you see something wrong please let me know by opening an issue on my ml_things GitHub repository!

A lot of tutorials out there are mostly a one-time thing and are not being maintained. I plan on keeping my tutorials up to date as much as I can.

Contact 🎣

🦊 GitHub: gmihaila

🌐 Website: gmihaila.github.io

👔 LinkedIn: mihailageorge

📬 Email: georgemihaila@my.unt.edu.com

Originally published on my GitHub website at https://gmihaila.github.io.



George Mihaila

PhD Computer Science 👨‍💻 | Working 🏋️ with love ❤️ on Deep Learning 🤖 & Natural Language Processing 🗣️.