πŸ‡ Better Batches with PyTorchText BucketIterator

How to use PyTorchText BucketIterator to sort text data for better batching.

  • Using PyTorch Dataset with PyTorchText Bucket Iterator: Here I implement a regular PyTorch Dataset class that reads in the movie reviews and pass it to the PyTorchText BucketIterator to show how nicely text examples of similar length are grouped together.
  • Using PyTorchText TabularDataset with PyTorchText Bucket Iterator: Here I use the built-in PyTorchText TabularDataset that reads data straight from local files without the need to create a PyTorch Dataset class. Then I follow the same steps as in the previous part to show how nicely text examples are grouped together.

What should I know for this notebook?

Some basic PyTorch knowledge of the Dataset class and DataLoaders. Some knowledge of PyTorchText is helpful but not critical for understanding this tutorial; the BucketIterator plays a role similar to a DataLoader applied to a PyTorch Dataset.

How to use this notebook?

The code is made with reusability in mind. It can be easily adapted for other text datasets and other NLP tasks in order to achieve optimal batching.

Dataset

I will use the well-known Large Movie Review Dataset of movie reviews labeled as positive or negative.

Coding

Now let’s do some coding! We will go through each code cell in the notebook, describe what it does, show the code, and, where relevant, show the output.

Downloads

Download the IMDB Movie Reviews sentiment dataset and unzip it locally.
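A minimal download cell might look like the following. This is my own sketch rather than the notebook's exact code; the archive URL is the official one for the Large Movie Review Dataset, and the extraction path simply mirrors the aclImdb folder layout used later.

```python
# Download and extract the Large Movie Review Dataset (sketch, not the notebook's exact cell).
import os
import tarfile
import urllib.request

URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

if not os.path.isdir("aclImdb"):
    urllib.request.urlretrieve(URL, "aclImdb_v1.tar.gz")     # fetch the ~80 MB archive
    with tarfile.open("aclImdb_v1.tar.gz", "r:gz") as tar:
        tar.extractall()                                     # creates aclImdb/train and aclImdb/test
```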

Installs
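Since BucketIterator ships with the older (pre-0.9, "legacy") torchtext API, an install cell along these lines is plausible; the exact version pin is an assumption on my part.

```python
# Plausible install cell (version constraint is an assumption):
# BucketIterator lives in the legacy torchtext API, so an older release is needed.
!pip install "torchtext<0.9"
```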

Imports

Import all needed libraries for this notebook.

  • train_batch_size - Batch size used for the training data.
  • valid_batch_size - Batch size used for the validation data. It is usually larger than train_batch_size since the model only needs to make predictions and no gradient calculations are needed.
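Below is a sketch of the imports and configuration the rest of the notebook relies on. The exact set of imports and the validation batch size are assumptions; the training batch size of 10 matches the outputs shown later.

```python
# Imports and configuration assumed by the cells below (my reconstruction).
import io
import os
from tqdm import tqdm
import torch
from torch.utils.data import Dataset, DataLoader
from torchtext.data import Field, LabelField, TabularDataset, BucketIterator  # legacy torchtext (<0.9)

# Batch sizes: validation can be larger since no gradients need to be stored.
train_batch_size = 10   # matches the "Batch size: 10" outputs below
valid_batch_size = 20   # assumed value
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```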

Using PyTorch Dataset

This is where I create the PyTorch Dataset objects for training and validation that can be used to feed data into a model. This is standard procedure when using PyTorch.

Dataset Class

Implementation of the PyTorch Dataset class.

  • __getitem__(self, item) - given an index item, returns the example at that position.
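A minimal sketch of such a class is shown below. The reading logic is an assumption (the notebook's actual implementation may differ); each example is returned as a dictionary with the raw review text and its label, which matches the dictionary format referenced later for the sort_key.

```python
# Sketch of a PyTorch Dataset for the aclImdb folder layout (assumed implementation).
class MovieReviewsDataset(Dataset):
    def __init__(self, path):
        self.texts = []
        self.labels = []
        # Each partition folder contains a `pos` and a `neg` subfolder of .txt files.
        for label in ["pos", "neg"]:
            folder = os.path.join(path, label)
            for file_name in tqdm(os.listdir(folder), desc=f"{label} Files"):
                with io.open(os.path.join(folder, file_name), encoding="utf-8") as f:
                    self.texts.append(f.read())
                self.labels.append(label)

    def __len__(self):
        # Number of examples in this partition.
        return len(self.labels)

    def __getitem__(self, item):
        # Return the example at position `item` as a dictionary.
        return {"text": self.texts[item], "label": self.labels[item]}
```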

Train β€” Validation Datasets

Create PyTorch Dataset for train and validation partitions.
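A sketch of the two dataset objects, using the extracted folder layout; treating the `test` split as validation is my assumption.

```python
# Build the two partitions from the extracted folders (paths follow the aclImdb layout).
train_dataset = MovieReviewsDataset("aclImdb/train")
valid_dataset = MovieReviewsDataset("aclImdb/test")
print(f"{len(train_dataset)} train examples, {len(valid_dataset)} validation examples")
```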

PyTorch DataLoader

In order to group examples from the PyTorch Dataset into batches we use PyTorch DataLoader. This is standard when using PyTorch.
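A minimal sketch of the plain DataLoaders used for comparison. Since each example is a dictionary of strings, the default collate_fn simply batches them into a dictionary of lists of strings.

```python
# Standard PyTorch DataLoaders, used as the baseline for comparison.
torch.manual_seed(0)  # reproducible shuffling (my addition)
train_dataloader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=valid_batch_size, shuffle=False)
```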

PyTorchText Bucket Iterator Dataloader

Here is where the magic happens! We pass in the train_dataset and valid_dataset PyTorch Dataset splits into BucketIterator to create the actual batches.
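A sketch of that setup with the legacy BucketIterator; arguments other than the dataset, batch size, and sort_key are my assumptions. The sort_key tells the iterator how to measure an example's length so that examples of similar length end up in the same bucket; since our examples are dictionaries, it indexes into x["text"].

```python
# BucketIterator over the plain PyTorch Datasets (legacy torchtext API, sketch).
torchtext_train_dataloader = BucketIterator(
    train_dataset,
    batch_size=train_batch_size,
    sort_key=lambda x: len(x["text"]),  # examples are dictionaries here
    device=device,
    sort_within_batch=True,
    repeat=False,
)
torchtext_valid_dataloader = BucketIterator(
    valid_dataset,
    batch_size=valid_batch_size,
    sort_key=lambda x: len(x["text"]),
    device=device,
    sort_within_batch=True,
    repeat=False,
)
```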

Compare DataLoaders

Let’s compare the PyTorch DataLoader batches with the PyTorchText BucketIterator batches. We can see how nicely examples of similar length are grouped in same batch with PyTorchText.

PyTorch DataLoader
Batch size: 10
LABEL LENGTH TEXT
pos 1037 Fascinating movie, based on a true story, about an...
neg 1406 Or maybe that's what it feels like. Anyway, "The B...
pos 679 Far by my most second favourite cartoon Spielberg ...
neg 922 This movie reminds me of "IrrΓ©versible (2002)", an...
pos 214 There's never a dull moment in this movie. Wonderf...
neg 1288 I don't think any player in Hollywood history last...
pos 605 The thing I remember most about this film is that ...
pos 1411 Fabulous, fantastic, probably Disney's best musica...
neg 604 Just another film that exploits gratuitous frontal...
pos 368 What can i say about the first film ever?<br /><br...

PyTorchText BucketIterator
Batch size: 10
LABEL LENGTH TEXT
pos 609 That's My Bush is a live action project made by So...
neg 610 Terminus Paradis was exceptional, but "Niki ardele...
neg 612 Awesomely improbable and foolish potboiler that at...
pos 613 The events of September 11 2001 do not need extra ...
pos 613 Okay, first of all I got this movie as a Christmas...
neg 617 I have been known to fall asleep during films, but...
pos 625 Fragglerock is excellent in the way that Schindler...
neg 625 Sure I've seen bad movies in my life, but this one...
neg 626 Even 20+ years later, Ninja Mission stands out as ...
pos 626 This film is excellently paced, you never have to ...

Train Loop Examples

Now let’s look at what a model training loop would look like. I printed the list of example lengths for the first batches to show how nicely they are grouped throughout the dataset!
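A minimal sketch of such a loop, assuming the legacy BucketIterator iteration pattern where create_batches() populates .batches with plain lists of examples; the actual model, loss, and optimizer steps are left as comments.

```python
# Sketch of a training loop over the bucketed batches (assumed iteration pattern).
for epoch in range(1):
    torchtext_train_dataloader.create_batches()
    for batch in torchtext_train_dataloader.batches:
        # Each batch is a list of dictionary examples of similar length.
        print("Batch examples lengths: %s" % str(sorted(len(example["text"]) for example in batch)))
        # ... tokenize, pad, run the model, compute the loss, and backprop here.
```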

Batch examples lengths: [848, 848, 849, 849, 850, 852, 853, 854, 856, 857] 
Batch examples lengths: [779, 780, 780, 781, 781, 782, 782, 782, 783, 784]
Batch examples lengths: [2100, 2103, 2104, 2109, 2114, 2135, 2147, 2151, 2158, 2164]
Batch examples lengths: [903, 905, 910, 910, 910, 910, 914, 915, 916, 919]
Batch examples lengths: [968, 968, 970, 970, 971, 972, 973, 975, 981, 982]
Batch examples lengths: [806, 806, 807, 807, 808, 809, 810, 810, 811, 811]
Batch examples lengths: [731, 733, 734, 735, 736, 736, 737, 737, 738, 739]
Batch examples lengths: [357, 357, 358, 361, 362, 362, 362, 364, 366, 371]
Batch examples lengths: [2330, 2335, 2337, 2350, 2351, 2353, 2367, 2374, 2376, 2383]
Batch examples lengths: [1916, 1920, 1921, 1936, 1951, 1953, 1967, 1970, 1981, 1985]
Batch examples lengths: [1395, 1398, 1399, 1402, 1403, 1412, 1412, 1413, 1414, 1414]

Using PyTorchText TabularDataset

Now I will use the TabularDataset functionality, which builds the dataset object directly from our local files.

Data to Files

Since our dataset is scattered across multiple files, I created a function files_to_tsv which puts each partition of our dataset into a .tsv (tab-separated values) file.
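A minimal sketch of what files_to_tsv could look like; the real notebook function may differ. It walks the pos/neg folders of a partition and writes one label<TAB>text row per review, and the output paths below are assumptions.

```python
# Sketch of files_to_tsv: flatten one aclImdb partition into a single .tsv file.
def files_to_tsv(partition_path, save_path="data.tsv"):
    with io.open(save_path, "w", encoding="utf-8") as tsv_file:
        tsv_file.write("label\ttext\n")
        for label in ["pos", "neg"]:
            folder = os.path.join(partition_path, label)
            for file_name in tqdm(os.listdir(folder), desc=f"{label} Files"):
                with io.open(os.path.join(folder, file_name), encoding="utf-8") as f:
                    text = f.read().replace("\t", " ")  # tabs would break the .tsv format
                tsv_file.write(f"{label}\t{text}\n")
    return save_path

# One .tsv per partition.
train_tsv = files_to_tsv("/content/aclImdb/train", "train.tsv")
valid_tsv = files_to_tsv("/content/aclImdb/test", "valid.tsv")
```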

/content/aclImdb/train 
pos Files: 100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12500/12500 [00:34<00:00, 367.26it/s]
neg Files: 100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12500/12500 [00:21<00:00, 573.00it/s]
/content/aclImdb/test
pos Files: 100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12500/12500 [00:11<00:00, 1075.80it/s]
neg Files: 100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12500/12500 [00:12<00:00, 1037.94it/s]

TabularDataset

Here I set up the data fields for PyTorchText. We have to tell the library how to handle each column of the .tsv file. For this we need to create a data.Field object for each column.
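A sketch of the field setup and the TabularDataset creation; the tokenization choices and file names are my assumptions. Each Field describes how one .tsv column should be processed, and the fields list must match the column order written by files_to_tsv above.

```python
# Field definitions and TabularDataset creation (legacy torchtext API, sketch).
TEXT = Field(sequential=True, tokenize=lambda s: s.split(), lower=True)
LABEL = LabelField(dtype=torch.long)

train_tabular, valid_tabular = TabularDataset.splits(
    path="",                                     # .tsv files live in the current directory
    train="train.tsv",
    validation="valid.tsv",
    format="tsv",
    skip_header=True,                            # skip the "label\ttext" header row
    fields=[("label", LABEL), ("text", TEXT)],   # same order as the .tsv columns
)

# Vocabularies would normally be built before numericalizing batches.
TEXT.build_vocab(train_tabular)
LABEL.build_vocab(train_tabular)
```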

PyTorchText Bucket Iterator Dataloader

I’m using the same setup as in the earlier PyTorchText Bucket Iterator Dataloader section. The only difference is the sort_key, since example attributes are accessed differently here (before, examples were in dictionary format).
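The sketch below mirrors the earlier one; the only change is that TabularDataset examples expose attributes (x.text) instead of dictionary keys (x["text"]). Other arguments remain my assumptions.

```python
# BucketIterator over the TabularDataset splits (legacy torchtext API, sketch).
torchtext_train_dataloader = BucketIterator(
    train_tabular,
    batch_size=train_batch_size,
    sort_key=lambda x: len(x.text),   # attribute access instead of dictionary access
    device=device,
    sort_within_batch=True,
    repeat=False,
)
torchtext_valid_dataloader = BucketIterator(
    valid_tabular,
    batch_size=valid_batch_size,
    sort_key=lambda x: len(x.text),
    device=device,
    sort_within_batch=True,
    repeat=False,
)
```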

Compare DataLoaders

Let’s compare the PyTorch DataLoader batches with the PyTorchText BucketIterator batches created with TabularDataset. We can see how nicely examples of similar length are grouped in same batch with PyTorchText.

PyTorch DataLoader
Batch size: 10
LABEL LENGTH TEXT
neg 1205 This movie is bad news and I'm really surprised at...
pos 762 Micro-phonies is a classic Stooge short. The guys ...
pos 782 After becoming completely addicted to Six Feet Und...
neg 1708 You do realize that you've been watching the EXACT...
neg 2341 Okay, as a long time Disney fan, I really -hate- d...
neg 705 This movie is simply not worth the time or money s...
neg 1370 I'm sorry to say that there isn't really any way, ...
pos 681 Something about "Paulie" touched my heart as few m...
neg 1401 It really is that bad of a movie. My buddy rented ...
neg 656 my friend bought the movie for 5€ (its is not even...
PyTorchText BucketIterator
Batch size: 10
LABEL LENGTH TEXT
pos 609 That's My Bush is a live action project made by So...
neg 610 Terminus Paradis was exceptional, but "Niki ardele...
neg 612 Awesomely improbable and foolish potboiler that at...
pos 613 Okay, first of all I got this movie as a Christmas...
neg 617 I have been known to fall asleep during films, but...
pos 625 Fragglerock is excellent in the way that Schindler...
neg 625 Sure I've seen bad movies in my life, but this one...
pos 626 This film is excellently paced, you never have to ...

Train Loop Examples

Now let’s look at what a model training loop would look like. I printed the list of example lengths for the first batches to show how nicely they are grouped throughout the dataset!
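The loop sketch is almost identical to the previous one, adjusted for TabularDataset examples, which are accessed through attributes rather than dictionary keys; the numericalization step is only hinted at in a comment.

```python
# Sketch of the training loop over TabularDataset buckets (assumed iteration pattern).
for epoch in range(1):
    torchtext_train_dataloader.create_batches()
    for batch in torchtext_train_dataloader.batches:
        # Each batch is a list of torchtext Examples of similar length.
        print("Batch examples lengths: %s" % str(sorted(len(example.text) for example in batch)))
        # ... numericalize with TEXT.process / LABEL.process and feed the model here.
```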

Batch examples lengths: [848, 848, 849, 849, 850, 852, 853, 854, 856, 857]
Batch examples lengths: [779, 780, 780, 781, 781, 782, 782, 782, 783, 784]
Batch examples lengths: [2100, 2103, 2104, 2109, 2114, 2135, 2147, 2151, 2158, 2164]
Batch examples lengths: [903, 905, 910, 910, 910, 910, 914, 915, 916, 919]
Batch examples lengths: [968, 968, 970, 970, 971, 972, 973, 975, 981, 981]
Batch examples lengths: [806, 806, 807, 807, 808, 809, 810, 810, 811, 811]
Batch examples lengths: [731, 733, 734, 735, 736, 736, 737, 737, 738, 739]
Batch examples lengths: [357, 357, 358, 361, 362, 362, 362, 364, 366, 371]
Batch examples lengths: [2330, 2335, 2337, 2350, 2351, 2353, 2367, 2374, 2376, 2381]
Batch examples lengths: [1916, 1920, 1921, 1936, 1951, 1953, 1967, 1970, 1981, 1985]
Batch examples lengths: [1395, 1398, 1399, 1402, 1403, 1412, 1412, 1413, 1414, 1414]

Final Note

If you made it this far, congrats! 🎊 and thank you! πŸ™ for your interest in my tutorial!

PhD Computer Science πŸ‘¨β€πŸ’» | Working πŸ‹οΈ with love ❀️ on Deep Learning πŸ€– & Natural Language Processing πŸ—£οΈ.