Better Batches with PyTorchText BucketIterator
How to use PyTorchText BucketIterator to sort text data for better batching.
This notebook is a simple tutorial on how to use the powerful PyTorchText BucketIterator functionality to group examples (I use examples and sequences interchangeably) of similar lengths into batches. This allows us to build batches with minimal wasted padding when training models with text data.
Batches with similar-length examples provide a big gain for recurrent models (RNN, GRU, LSTM) and transformer models (BERT, RoBERTa, GPT-2, XLNet, etc.), since padding is kept to a minimum.
Basically, any model that takes variable-length text sequences as input will benefit from this tutorial.
I will not train any models in this notebook! I will release a tutorial where I use this implementation to train a transformer model.
The purpose is to take an example text dataset, batch it using PyTorchText with BucketIterator, and show how it groups text sequences of similar length in batches.
This tutorial has two main parts:
- Using PyTorch Dataset with PyTorchText Bucket Iterator: Here I implement a standard PyTorch Dataset class that reads in the example text dataset and use the PyTorchText Bucket Iterator to group similar-length examples into the same batches. I want to show how easy it is to use this powerful functionality from PyTorchText on a regular PyTorch Dataset workflow you already have set up.
- Using PyTorchText TabularDataset with PyTorchText Bucket Iterator: Here I use the built-in PyTorchText TabularDataset, which reads data straight from local files without the need to create a PyTorch Dataset class. Then I follow the same steps as in the previous part to show how nicely text examples are grouped together.
This notebook is a code adaptation and implementation inspired by a few sources: torchtext_translation_tutorial, the pytorch/text GitHub repository, the torchtext documentation, and A Comprehensive Introduction to Torchtext.
What should I know for this notebook?
Some basic PyTorch knowledge of the Dataset class and DataLoaders. Some knowledge of PyTorchText is helpful but not critical for understanding this tutorial. Using the BucketIterator is similar to applying a DataLoader to a PyTorch Dataset.
How to use this notebook?
The code is made with reusability in mind. It can be easily adapted for other text datasets and other NLP tasks in order to achieve optimal batching.
Comments should provide enough guidance to easily adapt this notebook to your needs.
This code was designed mostly with classification tasks in mind, but it can be adapted for any other Natural Language Processing task where batching text data is needed.
Dataset
I will use the well-known Large Movie Review Dataset of movie reviews labeled positive/negative.
The description provided on the Stanford website:
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.
Why this dataset? I believe it is an easy dataset to understand and use for classification, and sentiment data is always fun to work with.
Coding
Now let's do some coding! We will go through each code cell in the notebook, describe what it does, show the code, and, when relevant, show the output.
I made this format easy to follow if you decide to run each code cell in your own Python notebook.
When I learn from a tutorial I always try to replicate the results. I believe it's easy to follow along if you have the code next to the explanations.
Downloads
Download the IMDB Movie Reviews sentiment dataset and unzip it locally.
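A minimal way to do this from Python, assuming the standard Stanford download URL and extraction into the working directory (on Colab that is /content, matching the paths used later in this notebook):

```python
# Sketch: download and extract the Large Movie Review Dataset with the standard library.
import tarfile
import urllib.request

url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
urllib.request.urlretrieve(url, 'aclImdb_v1.tar.gz')

# Extract into the current directory, creating the aclImdb/ folder used below.
with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
    tar.extractall()
```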
Installs
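The only extra package needed is torchtext (plus tqdm for progress bars). Note that the torchtext.data API used below is the legacy one, so an older torchtext release is assumed here:

```python
# Assumption: pin a torchtext release that still ships the legacy torchtext.data API
# (that API moved to torchtext.legacy starting with torchtext 0.9).
!pip install "torchtext<0.9" tqdm
```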
Imports
Import all needed libraries for this notebook.
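A minimal, assumed set of imports for what follows (your cell may include more):

```python
import io
import os
import torch
import torchtext
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
```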
Declare the basic parameters used in this notebook (set in the code cell below):
- device - Device used by torch: GPU/CPU. I use CPU as default since I will not perform any costly operations.
- train_batch_size - Batch size used on the train data.
- valid_batch_size - Batch size used on the validation data. It is usually greater than train_batch_size since the model only needs to make predictions and no gradient calculations are needed.
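A sketch of these declarations; the values are assumptions, with batch sizes of 10 matching the printouts shown later:

```python
# No costly operations in this notebook, so CPU is enough.
device = torch.device('cpu')

# Batch size used on the train data.
train_batch_size = 10

# Batch size used on the validation data; it could be larger since no gradients are
# computed, but it is kept at 10 here so the comparison printouts line up.
valid_batch_size = 10
```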
Using PyTorch Dataset
This is where I create the PyTorch Dataset objects for training and validation that can be used to feed data into a model. This is standard procedure when using PyTorch.
Dataset Class
Implementation of the PyTorch Dataset class.
The most important components of a PyTorch Dataset class are (a minimal sketch follows below):
- __len__(self) - returns the number of examples in our dataset, which we read in __init__(self). This ensures that len() returns the number of examples.
- __getitem__(self, item) - given an index item, returns the example corresponding to the item position.
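As an illustration, here is a minimal sketch of such a class for the aclImdb folder layout (pos/ and neg/ subfolders of .txt files). The class name MovieReviewsDataset and the reading logic are assumptions and may differ from the original notebook, but each example is returned as a dictionary with text and label keys, which is what the sort_key used later relies on.

```python
import io
import os
from tqdm import tqdm
from torch.utils.data import Dataset

class MovieReviewsDataset(Dataset):
    """Sketch of a Dataset that reads the aclImdb pos/neg review files."""

    def __init__(self, path):
        self.texts = []
        self.labels = []
        # Read every review file from the two sentiment folders.
        for label in ['pos', 'neg']:
            files_path = os.path.join(path, label)
            for file_name in tqdm(os.listdir(files_path), desc=f'{label} Files'):
                content = io.open(os.path.join(files_path, file_name),
                                  mode='r', encoding='utf-8').read()
                self.texts.append(content)
                self.labels.append(label)
        # Number of examples read in.
        self.n_examples = len(self.labels)

    def __len__(self):
        # Lets len(dataset) return the number of examples.
        return self.n_examples

    def __getitem__(self, item):
        # Return the example at position `item` as a dictionary.
        return {'text': self.texts[item], 'label': self.labels[item]}
```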
Train & Validation Datasets
Create PyTorch Dataset for train and validation partitions.
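A minimal sketch, assuming the Colab paths of the extracted archive and the MovieReviewsDataset sketch above; the IMDB test partition is used here as the validation split.

```python
# Paths assume the archive was extracted under /content (as on Colab).
train_dataset = MovieReviewsDataset(path='/content/aclImdb/train')
valid_dataset = MovieReviewsDataset(path='/content/aclImdb/test')

print('Train examples:', len(train_dataset))
print('Valid examples:', len(valid_dataset))
```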
PyTorch DataLoader
In order to group examples from the PyTorch Dataset into batches we use PyTorch DataLoader. This is standard when using PyTorch.
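A minimal sketch; the identity collate_fn is an assumption so that each batch stays a plain list of example dictionaries, without any padding or tensor conversion.

```python
from torch.utils.data import DataLoader

# Keep each batch as a list of example dictionaries (no padding / numericalization here).
torch_train_dataloader = DataLoader(train_dataset, batch_size=train_batch_size,
                                    shuffle=True, collate_fn=lambda batch: batch)
torch_valid_dataloader = DataLoader(valid_dataset, batch_size=valid_batch_size,
                                    shuffle=False, collate_fn=lambda batch: batch)
```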
PyTorchText Bucket Iterator Dataloader
Here is where the magic happens! We pass in the train_dataset and valid_dataset PyTorch Dataset splits into BucketIterator to create the actual batches.
It's very nice that PyTorchText can handle splits! No need to write the same lines of code again for the train and validation splits.
The sort_key parameter is very important! It is used to order text sequences in batches. Since we want to batch sequences of text with similar length, we will use a simple function that returns the length of a data example (len(x['text'])). This function needs to match the format of the PyTorch Dataset we created; in my case each example is a dictionary with a text key.
It is important to keep sort=False and sort_within_batch=True to only sort the examples within each batch and not the examples across the whole dataset!
Find more details in the PyTorchText BucketIterator documentation here; look at the BPTTIterator since it has the same parameters except for the bptt_len argument.
Note: If you want just a single DataLoader, use torchtext.data.BucketIterator instead of torchtext.data.BucketIterator.splits, provide a single PyTorch Dataset instead of a tuple of PyTorch Datasets, and change the batch_sizes parameter and its tuple values to batch_size with a single value: dataloader = torchtext.data.BucketIterator(dataset, batch_size=batch_size).
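Putting the pieces together, here is a minimal sketch of the splits call, assuming the legacy torchtext.data API and the variable names introduced above:

```python
import torchtext

# Build one bucketed dataloader per split from the two PyTorch Datasets.
train_dataloader, valid_dataloader = torchtext.data.BucketIterator.splits(
    (train_dataset, valid_dataset),
    batch_sizes=(train_batch_size, valid_batch_size),
    device=device,
    # Order examples by text length so each batch holds similar-length sequences.
    sort_key=lambda x: len(x['text']),
    # Only sort inside each batch, not across the whole dataset.
    sort_within_batch=True,
    sort=False,
)
```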
Compare DataLoaders
Let's compare the PyTorch DataLoader batches with the PyTorchText BucketIterator batches. We can see how nicely examples of similar length are grouped in the same batch with PyTorchText.
Note: When using the PyTorchText BucketIterator, make sure to call create_batches() before looping through each batch! Otherwise you won't get any output from the iterator.
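A minimal sketch of the comparison loop, assuming the dataloader names used above; it prints the label, length, and a short preview of each example in one batch from each dataloader:

```python
def print_batch(batch, title):
    """Print label, length, and a short text preview for every example in a batch."""
    print(title)
    print('Batch size:', len(batch))
    print('LABEL LENGTH TEXT')
    for example in batch:
        print(example['label'], len(example['text']), example['text'][:50])

# Plain PyTorch DataLoader batch (random lengths).
print_batch(next(iter(torch_train_dataloader)), 'PyTorch DataLoader')

# PyTorchText BucketIterator batch (similar lengths).
# create_batches() must be called before looping, otherwise .batches is empty.
train_dataloader.create_batches()
for batch in train_dataloader.batches:
    print_batch(batch, 'PyTorchText BucketIterator')
    break
```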
PyTorch DataLoader
Batch size: 10
LABEL LENGTH TEXT
pos 1037 Fascinating movie, based on a true story, about an...
neg 1406 Or maybe that's what it feels like. Anyway, "The B...
pos 679 Far by my most second favourite cartoon Spielberg ...
neg 922 This movie reminds me of "Irréversible (2002)", an...
pos 214 There's never a dull moment in this movie. Wonderf...
neg 1288 I don't think any player in Hollywood history last...
pos 605 The thing I remember most about this film is that ...
pos 1411 Fabulous, fantastic, probably Disney's best musica...
neg 604 Just another film that exploits gratuitous frontal...
pos 368 What can i say about the first film ever?<br /><br...
PyTorchText BucketIterator
Batch size: 10
LABEL LENGTH TEXT
pos 609 That's My Bush is a live action project made by So...
neg 610 Terminus Paradis was exceptional, but "Niki ardele...
neg 612 Awesomely improbable and foolish potboiler that at...
pos 613 The events of September 11 2001 do not need extra ...
pos 613 Okay, first of all I got this movie as a Christmas...
neg 617 I have been known to fall asleep during films, but...
pos 625 Fragglerock is excellent in the way that Schindler...
neg 625 Sure I've seen bad movies in my life, but this one...
neg 626 Even 20+ years later, Ninja Mission stands out as ...
pos 626 This film is excellently paced, you never have to ...
Train Loop Examples
Now let's look at what a model training loop would look like. I printed the first 10 batches' lists of example lengths to show how nicely they are grouped throughout the dataset!
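A minimal sketch of such a loop, assuming the train_dataloader built above; no model is actually trained, consistent with the rest of this notebook:

```python
epochs = 1

for epoch in range(epochs):
    # Required on every epoch before looping, otherwise .batches is empty.
    train_dataloader.create_batches()
    for batch in train_dataloader.batches:
        # Each batch is a list of example dictionaries with similar text lengths.
        print('Batch examples lengths:', [len(example['text']) for example in batch])
        # A real training loop would tokenize, pad, and feed the batch to a model here.
```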
Batch examples lengths: [848, 848, 849, 849, 850, 852, 853, 854, 856, 857]
Batch examples lengths: [779, 780, 780, 781, 781, 782, 782, 782, 783, 784]
Batch examples lengths: [2100, 2103, 2104, 2109, 2114, 2135, 2147, 2151, 2158, 2164]
Batch examples lengths: [903, 905, 910, 910, 910, 910, 914, 915, 916, 919]
Batch examples lengths: [968, 968, 970, 970, 971, 972, 973, 975, 981, 982]
Batch examples lengths: [806, 806, 807, 807, 808, 809, 810, 810, 811, 811]
Batch examples lengths: [731, 733, 734, 735, 736, 736, 737, 737, 738, 739]
Batch examples lengths: [357, 357, 358, 361, 362, 362, 362, 364, 366, 371]
Batch examples lengths: [2330, 2335, 2337, 2350, 2351, 2353, 2367, 2374, 2376, 2383]
Batch examples lengths: [1916, 1920, 1921, 1936, 1951, 1953, 1967, 1970, 1981, 1985]
Batch examples lengths: [1395, 1398, 1399, 1402, 1403, 1412, 1412, 1413, 1414, 1414]
Using PyTorchText TabularDataset
Now I will use the TabularDataset functionality, which creates the PyTorch Dataset object right from our local files.
We don't need to create a custom PyTorch Dataset class to load our dataset as long as we have tabular files of our data.
Data to Files
Since our dataset is scattered across multiple files, I created a function files_to_tsv which puts our dataset into a .tsv file (tab-separated values).
Since I'll use the TabularDataset from torchtext.data, I need to pass it tabular-format files.
For text data I find the tab-separated values format easier to deal with.
I will call the files_to_tsv function for each of the two partitions, train and test.
The function returns the name of the saved .tsv file so we can use it later in PyTorchText.
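As an illustration, here is a rough sketch of what such a function could look like; the signature and details of the original files_to_tsv are not shown in this post, so treat this as an assumption:

```python
import io
import os
from tqdm import tqdm

def files_to_tsv(partition_path, save_path='./'):
    """Sketch: write all pos/neg reviews of a partition into one label<TAB>text .tsv file."""
    print(partition_path)
    # Name the output after the partition folder, e.g. train.tsv / test.tsv.
    tsv_name = os.path.basename(partition_path) + '.tsv'
    with io.open(os.path.join(save_path, tsv_name), mode='w', encoding='utf-8') as tsv_file:
        for label in ['pos', 'neg']:
            files_path = os.path.join(partition_path, label)
            for file_name in tqdm(os.listdir(files_path), desc=f'{label} Files'):
                text = io.open(os.path.join(files_path, file_name),
                               mode='r', encoding='utf-8').read()
                # Tabs or newlines inside a review would break the .tsv format.
                text = text.replace('\t', ' ').replace('\n', ' ')
                tsv_file.write(f'{label}\t{text}\n')
    # Return the file name so it can be passed to TabularDataset later.
    return tsv_name

# Convert both partitions.
train_tsv = files_to_tsv('/content/aclImdb/train')
test_tsv = files_to_tsv('/content/aclImdb/test')
```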
/content/aclImdb/train
pos Files: 100%|████████████████████████████████| 12500/12500 [00:34<00:00, 367.26it/s]
neg Files: 100%|████████████████████████████████| 12500/12500 [00:21<00:00, 573.00it/s]

/content/aclImdb/test
pos Files: 100%|████████████████████████████████| 12500/12500 [00:11<00:00, 1075.80it/s]
neg Files: 100%|████████████████████████████████| 12500/12500 [00:12<00:00, 1037.94it/s]
TabularDataset
Here I set up the data fields for PyTorchText. We have to tell the library how to handle each column of the .tsv file. For this we need to create data.Field objects for each column.
text_tokenizer: For this example I don't use an actual tokenizer for the text column, but I need to create one because the Field requires it as input. I created a dummy tokenizer that returns the value unchanged. Depending on the project, this is where you would plug in your own tokenizer. It needs to take text as input and output a list.
label_tokenizer: The label tokenizer is also a dummy tokenizer. This is where you would use an encoder to transform labels into ids.
Since we have two .tsv files, it's great that we can use the .splits function from TabularDataset to handle both files at the same time, one for train and the other for test.
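A minimal sketch under the legacy torchtext.data API; the dummy tokenizers, the Field arguments, and the path where the .tsv files live are assumptions:

```python
import torchtext

# Dummy tokenizers, as described above: they return the value unchanged.
text_tokenizer = lambda x: x    # replace with a real tokenizer that returns a list of tokens
label_tokenizer = lambda x: x   # replace with an encoder that maps labels to ids

# One Field per .tsv column.
TEXT = torchtext.data.Field(tokenize=text_tokenizer, use_vocab=False)
LABEL = torchtext.data.Field(tokenize=label_tokenizer, use_vocab=False, sequential=False)

# Columns are declared in the order they appear in the .tsv files: label, then text.
train_dataset, test_dataset = torchtext.data.TabularDataset.splits(
    path='/content',               # assumed folder where the .tsv files were saved
    train='train.tsv', test='test.tsv',
    format='tsv',
    fields=[('label', LABEL), ('text', TEXT)],
)
```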
Find more details about torchtext.data functionality here.
PyTorchText Bucket Iterator Dataloader
I'm using the same setup as in the PyTorchText Bucket Iterator Dataloader code cell above. The only difference is in the sort_key, since there is a different way to access example attributes (we had a dictionary format before).
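In code, assuming the datasets created above, the only change from the earlier splits call is the attribute-style sort_key:

```python
# TabularDataset examples expose attributes, so the sort key is len(x.text)
# instead of len(x['text']).
train_dataloader, test_dataloader = torchtext.data.BucketIterator.splits(
    (train_dataset, test_dataset),
    batch_sizes=(train_batch_size, valid_batch_size),
    device=device,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    sort=False,
)
```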
Compare DataLoaders
Let's compare the PyTorch DataLoader batches with the PyTorchText BucketIterator batches created with TabularDataset. We can see how nicely examples of similar length are grouped in the same batch with PyTorchText.
Note: When using the PyTorchText BucketIterator, make sure to call create_batches() before looping through each batch! Otherwise you won't get any output from the iterator.
PyTorch DataLoader
Batch size: 10
LABEL LENGTH TEXT
neg 1205 This movie is bad news and I'm really surprised at...
pos 762 Micro-phonies is a classic Stooge short. The guys ...
pos 782 After becoming completely addicted to Six Feet Und...
neg 1708 You do realize that you've been watching the EXACT...
neg 2341 Okay, as a long time Disney fan, I really -hate- d...
neg 705 This movie is simply not worth the time or money s...
neg 1370 I'm sorry to say that there isn't really any way, ...
pos 681 Something about "Paulie" touched my heart as few m...
neg 1401 It really is that bad of a movie. My buddy rented ...
neg 656 my friend bought the movie for 5€ (its is not even...

PyTorchText BucketIterator
Batch size: 10
LABEL LENGTH TEXT
pos 609 That's My Bush is a live action project made by So...
neg 610 Terminus Paradis was exceptional, but "Niki ardele...
neg 612 Awesomely improbable and foolish potboiler that at...
pos 613 Okay, first of all I got this movie as a Christmas...
neg 617 I have been known to fall asleep during films, but...
pos 625 Fragglerock is excellent in the way that Schindler...
neg 625 Sure I've seen bad movies in my life, but this one...
pos 626 This film is excellently paced, you never have to ...
Train Loop Examples
Now let's look at what a model training loop would look like. I printed the first 10 batches' lists of example lengths to show how nicely they are grouped throughout the dataset!
We see that we get the exact same behavior as we did when using the PyTorch Dataset. Now it depends on which way is easier for you to use the PyTorchText BucketIterator: with a PyTorch Dataset or with a PyTorchText TabularDataset.
Batch examples lengths: [848, 848, 849, 849, 850, 852, 853, 854, 856, 857]
Batch examples lengths: [779, 780, 780, 781, 781, 782, 782, 782, 783, 784]
Batch examples lengths: [2100, 2103, 2104, 2109, 2114, 2135, 2147, 2151, 2158, 2164]
Batch examples lengths: [903, 905, 910, 910, 910, 910, 914, 915, 916, 919]
Batch examples lengths: [968, 968, 970, 970, 971, 972, 973, 975, 981, 981]
Batch examples lengths: [806, 806, 807, 807, 808, 809, 810, 810, 811, 811]
Batch examples lengths: [731, 733, 734, 735, 736, 736, 737, 737, 738, 739]
Batch examples lengths: [357, 357, 358, 361, 362, 362, 362, 364, 366, 371]
Batch examples lengths: [2330, 2335, 2337, 2350, 2351, 2353, 2367, 2374, 2376, 2381]
Batch examples lengths: [1916, 1920, 1921, 1936, 1951, 1953, 1967, 1970, 1981, 1985]
Batch examples lengths: [1395, 1398, 1399, 1402, 1403, 1412, 1412, 1413, 1414, 1414]
Final Note
If you made it this far: Congrats! And thank you for your interest in my tutorial!
I've been using this code for a while now and I feel it has gotten to a point where it is nicely documented and easy to follow.
Of course it is easy for me to follow because I built it. That is why any feedback is welcome, and it helps me improve my future tutorials!
If you see something wrong please let me know by opening an issue on my ml_things GitHub repository!
A lot of tutorials out there are mostly a one-time thing and are not being maintained. I plan on keeping my tutorials up to date as much as I can.
GitHub: gmihaila
Website: gmihaila.github.io
LinkedIn: mihailageorge
Email: georgemihaila@my.unt.edu.com
Originally published at https://gmihaila.github.io.