πŸ‡ Better Batches with PyTorchText BucketIterator

How to use PyTorchText BucketIterator to sort text data for better batching.

George Mihaila
11 min read · Nov 13, 2020

This notebook is a simple tutorial on how to use the powerful PyTorchText BucketIterator functionality to group examples (I use examples and sequences interchangeably) of similar lengths into batches. This allows us to build the most efficient batches when training models on text data.

Having batches with similar-length examples provides a lot of gain for recurrent models (RNN, GRU, LSTM) and transformer models (BERT, RoBERTa, GPT-2, XLNet, etc.), where padding will be minimal.

Basically, any model that takes variable-length text sequences as input will benefit from this tutorial.

I will not train any models in this notebook! I will release a tutorial where I use this implementation to train a transformer model.

The purpose is to take an example text dataset, batch it using PyTorchText with BucketIterator, and show how it groups text sequences of similar length into batches.

This tutorial has two main parts:

  • Using PyTorch Dataset with PyTorchText Bucket Iterator: Here I implemented a standard PyTorch Dataset class that reads in the example text dataset and used the PyTorchText Bucket Iterator to group similar-length examples into the same batches. I want to show how easy it is to use this powerful functionality from PyTorchText on a regular PyTorch Dataset workflow you already have set up.
  • Using PyTorchText TabularDataset with PyTorchText Bucket Iterator: Here I use the built-in PyTorchText TabularDataset, which reads data straight from local files without the need to create a PyTorch Dataset class. Then I follow the same steps as in the previous part to show how nicely text examples are grouped together.

This notebook is a code adaptation and implementation inspired by a few sources: torchtext_translation_tutorial, pytorch/text on GitHub, the torchtext documentation and A Comprehensive Introduction to Torchtext.

What should I know for this notebook?

Some basic PyTorch knowledge of the Dataset class and DataLoaders. Some knowledge of PyTorchText is helpful but not critical for understanding this tutorial. Using the BucketIterator is similar to applying a DataLoader to a PyTorch Dataset.

How to use this notebook?

The code is made with reusability in mind. It can be easily adapted for other text datasets and other NLP tasks in order to achieve optimal batching.

Comments should provide enough guidance to easily adapt this notebook to your needs.

This code was designed mostly with classification tasks in mind, but it can be adapted for any other Natural Language Processing task where batching text data is needed.

Dataset

I will use the well-known Large Movie Review Dataset of movie reviews labeled as positive or negative.

The description provided on the Stanford website:

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.

Why this dataset? I believe it is an easy dataset to understand and use for classification, and sentiment data is always fun to work with.

Coding

Now let’s do some coding! We will go through each code cell in the notebook, describe what it does, show the code and, when relevant, show the output.

I made this format easy to follow if you decide to run each code cell in your own Python notebook.

When I learn from a tutorial I always try to replicate the results. I believe it’s easy to follow along if you have the code next to the explanations.

Downloads

Download the IMDB Movie Reviews sentiment dataset and unzip it locally.
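The dataset is hosted on the Stanford AI website as a .tar.gz archive. A minimal sketch of this step (the notebook itself may use shell commands such as wget and tar instead):

import os
import tarfile
import urllib.request

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
archive = "aclImdb_v1.tar.gz"

# Download the archive only if it is not already present locally.
if not os.path.exists(archive):
    urllib.request.urlretrieve(url, archive)

# Extract everything next to the archive; this creates the aclImdb/ folder
# with train/ and test/ partitions, each split into pos/ and neg/ sub-folders.
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall()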

Installs
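The only dependency beyond a standard PyTorch setup is torchtext itself (a pre-0.9 release is assumed here, where the torchtext.data API used below is still available). In a notebook cell:

# Assumption: torchtext is the only extra package this tutorial needs.
!pip install torchtext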

Imports

Import all needed libraries for this notebook.
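A minimal sketch of the imports the rest of this notebook relies on:

import io
import os
import torch
import torchtext
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader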

Declare basic parameters used for this notebook:

  • device - Device to use by torch: GPU/CPU. I use CPU as default since I will not perform any costly operations.
  • train_batch_size - Batch size used on train data.
  • valid_batch_size - Batch size used for validation data. It is usually greater than train_batch_size since the model only needs to make predictions, and no gradient calculations are needed.
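A sketch of these parameters (the exact batch size values are assumptions):

# Device to use by torch; CPU is enough since no model is trained here.
device = torch.device('cpu')

# Batch size used on train data.
train_batch_size = 10

# Batch size used for validation data; it can be larger since no gradients are needed.
valid_batch_size = 20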

Using PyTorch Dataset

This is where I create the PyTorch Dataset objects for training and validation that can be used to feed data into a model. This is standard procedure when using PyTorch.

Dataset Class

Implementation of the PyTorch Dataset class.

Most important components in a PyTorch Dataset class are:

  • __len__(self) returns the number of examples in our dataset, which we read in __init__(self). This ensures that len() returns the number of examples.
  • __getitem__(self, item) returns, for a given index item, the example at the item position.
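A sketch of what such a class could look like for the aclImdb folder layout. The class name MovieReviewsDataset and the dictionary format of the returned examples are illustrative choices, not necessarily the exact ones in the notebook:

class MovieReviewsDataset(Dataset):
    """PyTorch Dataset that reads every pos/neg review file from one partition."""

    def __init__(self, path):
        self.texts = []
        self.labels = []
        # Each sentiment has its own sub-folder full of .txt review files.
        for label in ['pos', 'neg']:
            folder = os.path.join(path, label)
            for file_name in tqdm(os.listdir(folder), desc='%s Files' % label):
                with io.open(os.path.join(folder, file_name), encoding='utf-8') as f:
                    self.texts.append(f.read())
                self.labels.append(label)
        # Number of examples read in; returned by __len__.
        self.n_examples = len(self.labels)

    def __len__(self):
        # Ensures len(dataset) returns the number of examples.
        return self.n_examples

    def __getitem__(self, item):
        # Return the example at position `item` as a dictionary with 'text' and 'label' keys.
        return {'text': self.texts[item], 'label': self.labels[item]}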

Train and Validation Datasets

Create PyTorch Dataset for train and validation partitions.
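Assuming the MovieReviewsDataset sketch above and the default aclImdb folder layout:

train_dataset = MovieReviewsDataset('/content/aclImdb/train')
valid_dataset = MovieReviewsDataset('/content/aclImdb/test')

print('Created `train_dataset` with %d examples.' % len(train_dataset))
print('Created `valid_dataset` with %d examples.' % len(valid_dataset))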

PyTorch DataLoader

In order to group examples from the PyTorch Dataset into batches we use PyTorch DataLoader. This is standard when using PyTorch.
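A sketch of the plain DataLoaders used later as a baseline for comparison. Keeping each example as a dictionary via an identity collate_fn is an assumption of this sketch; it simply keeps the batches easy to inspect:

# Plain PyTorch DataLoaders: batches are formed in (shuffled) file order,
# so example lengths inside a batch can vary wildly.
torch_train_dataloader = DataLoader(train_dataset, batch_size=train_batch_size,
                                    shuffle=True, collate_fn=lambda batch: batch)
torch_valid_dataloader = DataLoader(valid_dataset, batch_size=valid_batch_size,
                                    shuffle=False, collate_fn=lambda batch: batch)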

PyTorchText Bucket Iterator Dataloader

Here is where the magic happens! We pass the train_dataset and valid_dataset PyTorch Dataset splits into BucketIterator to create the actual batches.

It’s very nice that PyTorchText can handle splits! No need to write the same lines of code again for the train and validation splits.

The sort_key parameter is very important! It is used to order text sequences within batches. Since we want to batch sequences of similar length, we will use a simple function that returns the length of a data example (len(x['text'])). This function needs to follow the format of the PyTorch Dataset we created in order to return the length of an example; in my case, I return a dictionary with a text key for each example.

It is important to keep sort=False and sort_within_batch=True to only sort the examples within each batch and not the examples across the whole dataset!
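Putting it together, a sketch of the BucketIterator.splits call (legacy torchtext.data API; the variable names are illustrative):

torchtext_train_dataloader, torchtext_valid_dataloader = torchtext.data.BucketIterator.splits(
    # Tuple of PyTorch Dataset splits: (train, validation).
    (train_dataset, valid_dataset),
    # Matching tuple of batch sizes.
    batch_sizes=(train_batch_size, valid_batch_size),
    device=device,
    # Length of an example; used to group similar-length examples together.
    sort_key=lambda x: len(x['text']),
    # Only sort examples within each batch, not across the whole dataset.
    sort=False,
    sort_within_batch=True,
)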

Find more details in the PyTorchText BucketIterator documentation here (look at BPTTIterator, since it has the same parameters except for the bptt_len argument).

Note: If you want just a single DataLoader, use torchtext.data.BucketIterator instead of torchtext.data.BucketIterator.splits, make sure to provide just one PyTorch Dataset instead of a tuple of PyTorch Datasets, and change the batch_sizes parameter with its tuple of values to a single-value batch_size: dataloader = torchtext.data.BucketIterator(dataset, batch_size=batch_size).

Compare DataLoaders

Let’s compare the PyTorch DataLoader batches with the PyTorchText BucketIterator batches. We can see how nicely examples of similar length are grouped in the same batch with PyTorchText.

Note: When using the PyTorchText BucketIterator, make sure to call create_batches() before looping through each batch! Otherwise you won't get any output from the iterator.
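A sketch of the comparison; only the first batch of each dataloader is printed:

# One batch from the plain PyTorch DataLoader.
print('PyTorch DataLoader')
for batch in torch_train_dataloader:
    for example in batch:
        print(example['label'], len(example['text']), example['text'][:50])
    break

# The BucketIterator needs create_batches() called before looping through batches.
torchtext_train_dataloader.create_batches()
print('PyTorchText BucketIterator')
for batch in torchtext_train_dataloader.batches:
    for example in batch:
        print(example['label'], len(example['text']), example['text'][:50])
    break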

PyTorch DataLoader
Batch size: 10
LABEL LENGTH TEXT
pos 1037 Fascinating movie, based on a true story, about an...
neg 1406 Or maybe that's what it feels like. Anyway, "The B...
pos 679 Far by my most second favourite cartoon Spielberg ...
neg 922 This movie reminds me of "Irréversible (2002)", an...
pos 214 There's never a dull moment in this movie. Wonderf...
neg 1288 I don't think any player in Hollywood history last...
pos 605 The thing I remember most about this film is that ...
pos 1411 Fabulous, fantastic, probably Disney's best musica...
neg 604 Just another film that exploits gratuitous frontal...
pos 368 What can i say about the first film ever?<br /><br...

PyTorchText BucketIterator
Batch size: 10
LABEL LENGTH TEXT
pos 609 That's My Bush is a live action project made by So...
neg 610 Terminus Paradis was exceptional, but "Niki ardele...
neg 612 Awesomely improbable and foolish potboiler that at...
pos 613 The events of September 11 2001 do not need extra ...
pos 613 Okay, first of all I got this movie as a Christmas...
neg 617 I have been known to fall asleep during films, but...
pos 625 Fragglerock is excellent in the way that Schindler...
neg 625 Sure I've seen bad movies in my life, but this one...
neg 626 Even 20+ years later, Ninja Mission stands out as ...
pos 626 This film is excellently paced, you never have to ...

Train Loop Examples

Now let’s look at what a model training loop would look like. I printed the lists of example lengths for the first few batches to show how nicely they are grouped throughout the dataset!
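A sketch of such a loop (no model is actually trained; the body just prints the example lengths of each batch):

epochs = 1

for epoch in range(epochs):
    # Re-create the batches at the start of every epoch.
    torchtext_train_dataloader.create_batches()
    # Loop through each batch of examples; a real forward/backward pass would go here.
    for batch in torchtext_train_dataloader.batches:
        print('Batch examples lengths: %s' % str([len(example['text']) for example in batch]))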

Batch examples lengths: [848, 848, 849, 849, 850, 852, 853, 854, 856, 857] 
Batch examples lengths: [779, 780, 780, 781, 781, 782, 782, 782, 783, 784]
Batch examples lengths: [2100, 2103, 2104, 2109, 2114, 2135, 2147, 2151, 2158, 2164]
Batch examples lengths: [903, 905, 910, 910, 910, 910, 914, 915, 916, 919]
Batch examples lengths: [968, 968, 970, 970, 971, 972, 973, 975, 981, 982]
Batch examples lengths: [806, 806, 807, 807, 808, 809, 810, 810, 811, 811]
Batch examples lengths: [731, 733, 734, 735, 736, 736, 737, 737, 738, 739]
Batch examples lengths: [357, 357, 358, 361, 362, 362, 362, 364, 366, 371]
Batch examples lengths: [2330, 2335, 2337, 2350, 2351, 2353, 2367, 2374, 2376, 2383]
Batch examples lengths: [1916, 1920, 1921, 1936, 1951, 1953, 1967, 1970, 1981, 1985]
Batch examples lengths: [1395, 1398, 1399, 1402, 1403, 1412, 1412, 1413, 1414, 1414]

Using PyTorchText TabularDataset

Now I will use the TabularDataset functionality, which creates the PyTorch Dataset object right from our local files.

We don’t need to create a custom PyTorch Dataset class to load our dataset as long as we have tabular files of our data.

Data to Files

Since our dataset is scattered across multiple files, I created a function files_to_tsv which puts our dataset into a .tsv file (Tab-Separated Values).

Since I’ll use the TabularDataset from torchtext.data, I need to pass it tabular-format files.

For text data I find the Tab Separated Values format easier to deal with.

I will call the files_to_tsv function for each of the two partitions train and test.

The function will return the name of the .tsv file saved so we can use it later in PyTorchText.
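A sketch of what files_to_tsv could look like; the tqdm progress bars match the output below, while the exact details of the original function are assumptions:

def files_to_tsv(partition_path, save_path='./'):
    """Gather all review files from one partition into a single .tsv file.

    Returns the name of the saved .tsv file so it can be passed to TabularDataset.
    """
    print(partition_path)
    rows = []
    for label in ['pos', 'neg']:
        folder = os.path.join(partition_path, label)
        # Progress bar over all files of one sentiment.
        for file_name in tqdm(os.listdir(folder), desc='%s Files' % label):
            with io.open(os.path.join(folder, file_name), encoding='utf-8') as f:
                text = f.read()
            # Tabs or newlines inside a review would break the .tsv format.
            rows.append('%s\t%s' % (label, text.replace('\t', ' ').replace('\n', ' ')))
    # Name the file after the partition: train.tsv / test.tsv.
    tsv_name = os.path.basename(partition_path) + '.tsv'
    with io.open(os.path.join(save_path, tsv_name), 'w', encoding='utf-8') as f:
        f.write('\n'.join(rows))
    return tsv_name

# Create one .tsv file per partition.
train_file = files_to_tsv('/content/aclImdb/train')
test_file = files_to_tsv('/content/aclImdb/test')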

/content/aclImdb/train 
pos Files: 100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12500/12500 [00:34<00:00, 367.26it/s]
neg Files: 100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12500/12500 [00:21<00:00, 573.00it/s]
/content/aclImdb/test
pos Files: 100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12500/12500 [00:11<00:00, 1075.80it/s]
neg Files: 100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12500/12500 [00:12<00:00, 1037.94it/s]

TabularDataset

Here I setup the data fields for PyTorchText. We have to tell the library how to handle each column of the .tsv file. For this we need to create data.Field objects for each column.

text_tokenizer: For this example I don't use an actual tokenizer for the text column, but I need to create one because the text field expects one as input. I created a dummy tokenizer that returns the same value. Depending on the project, this is where you would plug in your own tokenizer. It needs to take text as input and output a list.

label_tokenizer: The label tokenizer is also a dummy tokenizer. This is where you would have an encoder that transforms labels to IDs.

Since we have two .tsv files, it's great that we can use the .splits function from TabularDataset to handle both files at the same time, one for train and the other for test.
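A sketch of the data fields and the TabularDataset.splits call under those assumptions (label column first, text column second, matching the .tsv files created above; the exact Field settings are illustrative):

# Dummy tokenizers that return their input unchanged; fine here because the text
# is never numericalized in this tutorial, only grouped into batches.
text_tokenizer = lambda x: x
label_tokenizer = lambda x: x

# One data.Field per .tsv column.
TEXT = torchtext.data.Field(sequential=True, tokenize=text_tokenizer, lower=False)
LABEL = torchtext.data.Field(sequential=True, tokenize=label_tokenizer, use_vocab=False)

# Column order must match the .tsv files: label first, then text.
datafields = [('label', LABEL), ('text', TEXT)]

# Handle both files at once with .splits.
train_dataset, valid_dataset = torchtext.data.TabularDataset.splits(
    path='.', train=train_file, validation=test_file,
    format='tsv', skip_header=False, fields=datafields,
)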

Find more details about torchtext.data functionality here.

PyTorchText Bucket Iterator Dataloader

I’m using the same setup as in the PyTorchText Bucket Iterator Dataloader code cell section. The only difference is in the sort_key, since there is a different way to access example attributes (we had a dictionary format before).
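The only change from the earlier sketch is the sort_key, which now uses attribute access:

torchtext_train_dataloader, torchtext_valid_dataloader = torchtext.data.BucketIterator.splits(
    (train_dataset, valid_dataset),
    batch_sizes=(train_batch_size, valid_batch_size),
    device=device,
    # TabularDataset examples expose attributes (x.text) instead of dictionary keys (x['text']).
    sort_key=lambda x: len(x.text),
    sort=False,
    sort_within_batch=True,
)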

Compare DataLoaders

Let’s compare the PyTorch DataLoader batches with the PyTorchText BucketIterator batches created with TabularDataset. We can see how nicely examples of similar length are grouped in the same batch with PyTorchText.

Note: When using the PyTorchText BucketIterator, make sure to call create_batches() before looping through each batch! Otherwise you won't get any output from the iterator.

PyTorch DataLoader
Batch size: 10
LABEL LENGTH TEXT
neg 1205 This movie is bad news and I'm really surprised at...
pos 762 Micro-phonies is a classic Stooge short. The guys ...
pos 782 After becoming completely addicted to Six Feet Und...
neg 1708 You do realize that you've been watching the EXACT...
neg 2341 Okay, as a long time Disney fan, I really -hate- d...
neg 705 This movie is simply not worth the time or money s...
neg 1370 I'm sorry to say that there isn't really any way, ...
pos 681 Something about "Paulie" touched my heart as few m...
neg 1401 It really is that bad of a movie. My buddy rented ...
neg 656 my friend bought the movie for 5€ (its is not even...

PyTorchText BucketIterator
Batch size: 10
LABEL LENGTH TEXT
pos 609 That's My Bush is a live action project made by So...
neg 610 Terminus Paradis was exceptional, but "Niki ardele...
neg 612 Awesomely improbable and foolish potboiler that at...
pos 613 Okay, first of all I got this movie as a Christmas...
neg 617 I have been known to fall asleep during films, but...
pos 625 Fragglerock is excellent in the way that Schindler...
neg 625 Sure I've seen bad movies in my life, but this one...
pos 626 This film is excellently paced, you never have to ...

Train Loop Examples

Now let’s look at what a model training loop would look like. I printed the lists of example lengths for the first few batches to show how nicely they are grouped throughout the dataset!

We see that we get exactly the same behavior as we did when using the PyTorch Dataset. Now it depends on which way is easier for you to use the PyTorchText BucketIterator: with a PyTorch Dataset or with a PyTorchText TabularDataset.

Batch examples lengths: [848, 848, 849, 849, 850, 852, 853, 854, 856, 857]
Batch examples lengths: [779, 780, 780, 781, 781, 782, 782, 782, 783, 784]
Batch examples lengths: [2100, 2103, 2104, 2109, 2114, 2135, 2147, 2151, 2158, 2164]
Batch examples lengths: [903, 905, 910, 910, 910, 910, 914, 915, 916, 919]
Batch examples lengths: [968, 968, 970, 970, 971, 972, 973, 975, 981, 981]
Batch examples lengths: [806, 806, 807, 807, 808, 809, 810, 810, 811, 811]
Batch examples lengths: [731, 733, 734, 735, 736, 736, 737, 737, 738, 739]
Batch examples lengths: [357, 357, 358, 361, 362, 362, 362, 364, 366, 371]
Batch examples lengths: [2330, 2335, 2337, 2350, 2351, 2353, 2367, 2374, 2376, 2381]
Batch examples lengths: [1916, 1920, 1921, 1936, 1951, 1953, 1967, 1970, 1981, 1985]
Batch examples lengths: [1395, 1398, 1399, 1402, 1403, 1412, 1412, 1413, 1414, 1414]

Final Note

If you made it this far Congrats! 🎊 and Thank you! πŸ™ for your interest in my tutorial!

I’ve been using this code for a while now and I feel it got to a point where it is nicely documented and easy to follow.

Of course, it is easy for me to follow because I built it. That is why any feedback is welcome, and it helps me improve my future tutorials!

If you see something wrong please let me know by opening an issue on my ml_things GitHub repository!

A lot of tutorials out there are mostly a one-time thing and are not being maintained. I plan on keeping my tutorials up to date as much as I can.

🦊 GitHub: gmihaila

🌐 Website: gmihaila.github.io

πŸ‘” LinkedIn: mihailageorge

πŸ“¬ Email: georgemihaila@my.unt.edu.com

Originally published at https://gmihaila.github.io.

