⚙️ Bert Inner Workings

Let’s look at how an input flows through Bert.

Picture of when I replaced my car’s cylinder head

What should I know for this notebook?

How deep are we going?

Tutorial Structure

Terminology

How to use this notebook?

Dataset

Coding

Installs
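The install cell itself isn't reproduced in the log below, so here is a minimal sketch of what it likely looked like. The "Building wheel for transformers (PEP 517)" line suggests an install from source; the exact git ref is an assumption.

```python
# Minimal sketch of the notebook's install cell. The wheel-building lines in the
# log suggest transformers was installed from source; the exact ref is unknown.
!pip install -q git+https://github.com/huggingface/transformers
```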

Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Building wheel for transformers (PEP 517) ... done
Building wheel for sacremoses (setup.py) ... done

Imports
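A sketch of the imports the rest of the walkthrough relies on; the exact list in the original notebook may differ.

```python
# Core imports assumed by the code sketches below.
import math
import torch
from transformers import BertConfig, BertTokenizer, BertForSequenceClassification
```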

Define Input
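The notebook works with two short sentences and one binary label each; a sketch based on the original text and the `labels` tensor shown below.

```python
# The two example sentences used throughout the notebook, with one label each.
texts = ['I love cats!', 'He hates pineapple pizza.']
labels = [1, 0]
```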

Bert Tokenizer
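A sketch of the tokenization step that produces the printout below, assuming the `bert-base-cased` checkpoint (its cased vocabulary matches the token ids shown).

```python
# Load the tokenizer and encode both sentences, padding to the longest one.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
input_sequences = tokenizer(text=texts, padding='longest', return_tensors='pt')

# Attach the labels so the model can also return a loss later on.
input_sequences.update({'labels': torch.tensor(labels)})
```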

Downloading: 100% |████████████████████████████████| 213k/213k [00:00<00:00, 278kB/s]

PRETTY PRINT OF `input_sequences` UPDATED WITH `labels`:
input_ids : tensor([[ 101, 146, 1567, 11771, 106, 102, 0, 0, 0],
[ 101, 1124, 18457, 10194, 11478, 7136, 13473, 119, 102]])

token_type_ids : tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]])

attention_mask : tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1]])

labels : tensor([1, 0])

ORIGINAL TEXT:
I love cats!
He hates pineapple pizza.

TEXT AFTER USING `BertTokenizer`:
[CLS] I love cats! [SEP] [PAD] [PAD] [PAD]
[CLS] He hates pineapple pizza. [SEP]

Bert Configuration
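A sketch of loading the model configuration and reading the three values printed below.

```python
# The configuration holds the architecture hyperparameters of bert-base-cased.
bert_config = BertConfig.from_pretrained('bert-base-cased')

print('NUMBER OF LAYERS:', bert_config.num_hidden_layers)
print('EMBEDDING SIZE:', bert_config.hidden_size)
print('ACTIVATIONS:', bert_config.hidden_act)
```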

Downloading: 100% |████████████████████████████████| 433/433 [00:00<00:00, 15.5kB/s]

NUMBER OF LAYERS: 12
EMBEDDING SIZE: 768
ACTIVATIONS: gelu

Bert For Sequence Classification

Class Call
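A sketch of loading the pretrained model and running one forward pass on the tokenized batch; because `labels` are included, the output also carries a loss. Calling `model.eval()` to disable dropout is an addition here, so the exact numbers may differ from the original run.

```python
# Load the pretrained encoder with a freshly initialized classification head.
model = BertForSequenceClassification.from_pretrained('bert-base-cased')
model.eval()  # disable dropout for a deterministic forward pass

with torch.no_grad():
    output = model(**input_sequences)

print('FORWARD PASS OUTPUT:', output)
```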

Downloading: 100% |████████████████████████████████| 436M/436M [00:07<00:00, 61.3MB/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ...
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

FORWARD PASS OUTPUT: SequenceClassifierOutput(loss=tensor(0.7454), logits=tensor([[ 0.2661, -0.1774],
[ 0.2223, -0.0847]]), hidden_states=None, attentions=None)

Class Components

BertModel

Bert Embeddings

Bert Embeddings Diagram
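A sketch of the steps inside `BertEmbeddings`, using the submodules of the model loaded above; each line corresponds to one shape in the printout below.

```python
embeddings_layer = model.bert.embeddings

input_ids = input_sequences['input_ids']            # torch.Size([2, 9])
token_type_ids = input_sequences['token_type_ids']  # torch.Size([2, 9])

# One position id per token position, shared by both sequences in the batch.
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

word_embeddings = embeddings_layer.word_embeddings(input_ids)                  # [2, 9, 768]
position_embeddings = embeddings_layer.position_embeddings(position_ids)       # [1, 9, 768]
token_type_embeddings = embeddings_layer.token_type_embeddings(token_type_ids) # [2, 9, 768]

# Sum the three embeddings, then apply layer normalization and dropout.
embeddings = word_embeddings + position_embeddings + token_type_embeddings
embeddings = embeddings_layer.LayerNorm(embeddings)
embeddings = embeddings_layer.dropout(embeddings)
```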
Created Tokens Positions IDs:
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8]])

Tokens IDs:
torch.Size([2, 9])

Tokens Type IDs:
torch.Size([2, 9])

Word Embeddings:
torch.Size([2, 9, 768])

Position Embeddings:
torch.Size([1, 9, 768])

Token Types Embeddings:
torch.Size([2, 9, 768])

Sum Up All Embeddings:
torch.Size([2, 9, 768])

Embeddings Layer Normalization:
torch.Size([2, 9, 768])

Embeddings Dropout Layer:
torch.Size([2, 9, 768])

Bert Encoder

BertSelfAttention Diagram
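Continuing from the embeddings above, a sketch of the core of `BertSelfAttention` for the first layer. The attention mask is omitted for brevity; the shapes match the printout below (batch 2, 12 heads, 9 tokens, head size 64).

```python
self_attention = model.bert.encoder.layer[0].attention.self
hidden_states = embeddings                                    # [2, 9, 768]

# Query, key and value projections keep the [2, 9, 768] shape.
query_layer = self_attention.query(hidden_states)
key_layer = self_attention.key(hidden_states)
value_layer = self_attention.value(hidden_states)

def split_heads(x, num_heads=12, head_size=64):
    # Split 768 features into 12 heads of size 64: [batch, heads, tokens, head_size].
    batch, seq_len, _ = x.size()
    return x.view(batch, seq_len, num_heads, head_size).permute(0, 2, 1, 3)

query, key, value = (split_heads(t) for t in (query_layer, key_layer, value_layer))

# Scaled dot-product attention: scores, softmax and dropout are all [2, 12, 9, 9].
attention_scores = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(64)
attention_probs = torch.softmax(attention_scores, dim=-1)
attention_probs = self_attention.dropout(attention_probs)

# Weighted sum of the values, then merge the heads back into 768 features.
context = torch.matmul(attention_probs, value)                # [2, 12, 9, 64]
context = context.permute(0, 2, 1, 3).reshape(2, 9, 768)      # [2, 9, 768]
```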
Attention Head Size:
64

Combined Attentions Head Size:
768

Hidden States:
torch.Size([2, 9, 768])

Query Linear Layer:
torch.Size([2, 9, 768])

Key Linear Layer:
torch.Size([2, 9, 768])

Value Linear Layer:
torch.Size([2, 9, 768])

Query:
torch.Size([2, 12, 9, 64])

Key:
torch.Size([2, 12, 9, 64])

Value:
torch.Size([2, 12, 9, 64])

Key Transposed:
torch.Size([2, 12, 64, 9])

Attention Scores:
torch.Size([2, 12, 9, 9])

Attention Scores Divided by Scalar:
torch.Size([2, 12, 9, 9])

Attention Probabilities Softmax Layer:
torch.Size([2, 12, 9, 9])

Attention Probabilities Dropout Layer:
torch.Size([2, 12, 9, 9])

Context:
torch.Size([2, 12, 9, 64])

Context Permute:
torch.Size([2, 9, 12, 64])

Context Reshaped:
torch.Size([2, 9, 768])
BertSelfOutput Diagram
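A sketch of `BertSelfOutput`, continuing from the `context` tensor above: a linear projection and dropout, then a residual connection back to the input hidden states followed by layer normalization.

```python
self_output = model.bert.encoder.layer[0].attention.output

attention_output = self_output.dense(context)                        # [2, 9, 768]
attention_output = self_output.dropout(attention_output)
# Residual connection with the hidden states that entered the attention block.
attention_output = self_output.LayerNorm(attention_output + hidden_states)
```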
Hidden States:
torch.Size([2, 9, 768])

Hidden States Linear Layer:
torch.Size([2, 9, 768])

Hidden States Dropout Layer:
torch.Size([2, 9, 768])

Hidden States Normalization Layer:
torch.Size([2, 9, 768])
BertAttention Diagram
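`BertAttention` simply chains the two blocks above, which is why the printout below repeats the same shapes; a sketch of calling the whole module at once.

```python
# Self-attention followed by BertSelfOutput, wrapped into a single module.
attention = model.bert.encoder.layer[0].attention
attention_output = attention(embeddings)[0]   # [2, 9, 768]
```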
Attention Head Size:
64

Combined Attentions Head Size:
768

Hidden States:
torch.Size([2, 9, 768])

Query Linear Layer:
torch.Size([2, 9, 768])

Key Linear Layer:
torch.Size([2, 9, 768])

Value Linear Layer:
torch.Size([2, 9, 768])

Query:
torch.Size([2, 12, 9, 64])

Key:
torch.Size([2, 12, 9, 64])

Value:
torch.Size([2, 12, 9, 64])

Key Transposed:
torch.Size([2, 12, 64, 9])

Attention Scores:
torch.Size([2, 12, 9, 9])

Attention Scores Divided by Scalar:
torch.Size([2, 12, 9, 9])

Attention Probabilities Softmax Layer:
torch.Size([2, 12, 9, 9])

Attention Probabilities Dropout Layer:
torch.Size([2, 12, 9, 9])

Context:
torch.Size([2, 12, 9, 64])

Context Permute:
torch.Size([2, 9, 12, 64])

Context Reshaped:
torch.Size([2, 9, 768])
Hidden States:
torch.Size([2, 9, 768])

Hidden States Linear Layer:
torch.Size([2, 9, 768])

Hidden States Dropout Layer:
torch.Size([2, 9, 768])

Hidden States Normalization Layer:
torch.Size([2, 9, 768])
BertIntermediate Diagram
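A sketch of `BertIntermediate`: a linear layer that expands the 768 features to 3072, followed by the GELU activation.

```python
intermediate = model.bert.encoder.layer[0].intermediate

intermediate_output = intermediate.dense(attention_output)                   # [2, 9, 3072]
intermediate_output = intermediate.intermediate_act_fn(intermediate_output)  # GELU
```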
Hidden States:
torch.Size([2, 9, 768])

Hidden States Linear Layer:
torch.Size([2, 9, 3072])

Hidden States Gelu Activation Function:
torch.Size([2, 9, 3072])
BertOutput Diagram
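A sketch of `BertOutput`: project the 3072 features back to 768, apply dropout, then add the residual from the attention block and normalize.

```python
output_block = model.bert.encoder.layer[0].output

layer_output = output_block.dense(intermediate_output)         # [2, 9, 768]
layer_output = output_block.dropout(layer_output)
layer_output = output_block.LayerNorm(layer_output + attention_output)
```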
Hidden States:
torch.Size([2, 9, 3072])

Hidden States Linear Layer:
torch.Size([2, 9, 768])

Hidden States Dropout Layer:
torch.Size([2, 9, 768])

Hidden States Layer Normalization:
torch.Size([2, 9, 768])
BertLayer Diagram
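A full `BertLayer` is just the attention, intermediate and output blocks chained together; a sketch of calling one layer end to end (the printout below walks through the same shapes as the individual blocks above).

```python
# One complete encoder layer applied to the embeddings.
bert_layer = model.bert.encoder.layer[0]
layer_output = bert_layer(embeddings)[0]      # [2, 9, 768]
```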
Attention Head Size:
64

Combined Attentions Head Size:
768

Hidden States:
torch.Size([2, 9, 768])

Query Linear Layer:
torch.Size([2, 9, 768])

Key Linear Layer:
torch.Size([2, 9, 768])

Value Linear Layer:
torch.Size([2, 9, 768])

Query:
torch.Size([2, 12, 9, 64])

Key:
torch.Size([2, 12, 9, 64])

Value:
torch.Size([2, 12, 9, 64])

Key Transposed:
torch.Size([2, 12, 64, 9])

Attention Scores:
torch.Size([2, 12, 9, 9])

Attention Scores Divided by Scalar:
torch.Size([2, 12, 9, 9])

Attention Probabilities Softmax Layer:
torch.Size([2, 12, 9, 9])

Attention Probabilities Dropout Layer:
torch.Size([2, 12, 9, 9])

Context:
torch.Size([2, 12, 9, 64])

Context Permute:
torch.Size([2, 9, 12, 64])

Context Reshaped:
torch.Size([2, 9, 768])
Hidden States:
torch.Size([2, 9, 768])

Hidden States Linear Layer:
torch.Size([2, 9, 768])

Hidden States Dropout Layer:
torch.Size([2, 9, 768])

Hidden States Normalization Layer:
torch.Size([2, 9, 768])

Hidden States:
torch.Size([2, 9, 768])

Hidden States Linear Layer:
torch.Size([2, 9, 3072])

Hidden States Gelu Activation Function:
torch.Size([2, 9, 3072])

Hidden States:
torch.Size([2, 9, 3072])

Hidden States Linear Layer:
torch.Size([2, 9, 768])

Hidden States Dropout Layer:
torch.Size([2, 9, 768])

Hidden States Layer Normalization:
torch.Size([2, 9, 768])
BertEncoder Diagram
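A sketch of `BertEncoder`: the twelve `BertLayer` modules applied one after another, each reading the previous layer's hidden states. This is why the head-size lines below are printed twelve times, once per layer.

```python
encoder = model.bert.encoder

hidden_states = embeddings
for layer_module in encoder.layer:
    hidden_states = layer_module(hidden_states)[0]   # stays [2, 9, 768]
```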
Attention Head Size:
64

Combined Attentions Head Size:
768

(the two values above are printed once for each of the 12 layers)

----------------- BERT LAYER 1 -----------------

Hidden States:
torch.Size([2, 9, 768])

Query Linear Layer:
torch.Size([2, 9, 768])

Key Linear Layer:
torch.Size([2, 9, 768])

Value Linear Layer:
torch.Size([2, 9, 768])

Query:
torch.Size([2, 12, 9, 64])

Key:
torch.Size([2, 12, 9, 64])

Value:
torch.Size([2, 12, 9, 64])

Key Transposed:
torch.Size([2, 12, 64, 9])

Attention Scores:
torch.Size([2, 12, 9, 9])

Attention Scores Divided by Scalar:
torch.Size([2, 12, 9, 9])

Attention Probabilities Softmax Layer:
torch.Size([2, 12, 9, 9])

Attention Probabilities Dropout Layer:
torch.Size([2, 12, 9, 9])

Context:
torch.Size([2, 12, 9, 64])

Context Permute:
torch.Size([2, 9, 12, 64])

Context Reshaped:
torch.Size([2, 9, 768])
Hidden States:
torch.Size([2, 9, 768])

Hidden States Linear Layer:
torch.Size([2, 9, 768])

Hidden States Dropout Layer:
torch.Size([2, 9, 768])

Hidden States Normalization Layer:
torch.Size([2, 9, 768])

Hidden States:
torch.Size([2, 9, 768])

Hidden States Linear Layer:
torch.Size([2, 9, 3072])

Hidden States Gelu Activation Function:
torch.Size([2, 9, 3072])

Hidden States:
torch.Size([2, 9, 3072])

Hidden States Linear Layer:
torch.Size([2, 9, 768])

Hidden States Dropout Layer:
torch.Size([2, 9, 768])

Hidden States Layer Normalization:
torch.Size([2, 9, 768])

----------------- BERT LAYER 2 -----------------

...

----------------- BERT LAYER 12 -----------------

BertPooler

BertPooler Diagram
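A sketch of `BertPooler`, continuing from the encoder output above: take the hidden state of the first token ([CLS]), then apply a linear layer and a tanh activation.

```python
pooler = model.bert.pooler

first_token = hidden_states[:, 0]                  # [2, 768]
pooled_output = pooler.dense(first_token)          # [2, 768]
pooled_output = pooler.activation(pooled_output)   # tanh, still [2, 768]
```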
Hidden States:
torch.Size([2, 9, 768])

First Token [CLS]:
torch.Size([2, 768])

First Token [CLS] Linear Layer:
torch.Size([2, 768])

First Token [CLS] Tanh Activation Function:
torch.Size([2, 768])

Assemble BertModel

BertModel Diagram
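Calling the assembled `BertModel` runs the embeddings, the twelve encoder layers and the pooler in sequence, which reproduces the printout below; a sketch:

```python
bert_model = model.bert

outputs = bert_model(
    input_ids=input_sequences['input_ids'],
    attention_mask=input_sequences['attention_mask'],
    token_type_ids=input_sequences['token_type_ids'],
)

sequence_output = outputs.last_hidden_state   # [2, 9, 768]
pooled_output = outputs.pooler_output         # [2, 768]
```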
Attention Head Size:
64

Combined Attentions Head Size:
768

(the two values above are printed once for each of the 12 layers)
Created Tokens Positions IDs:
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8]])

Tokens IDs:
torch.Size([2, 9])

Tokens Type IDs:
torch.Size([2, 9])

Word Embeddings:
torch.Size([2, 9, 768])

Position Embeddings:
torch.Size([1, 9, 768])

Token Types Embeddings:
torch.Size([2, 9, 768])

Sum Up All Embeddings:
torch.Size([2, 9, 768])

Embeddings Layer Normalization:
torch.Size([2, 9, 768])

Embeddings Dropout Layer:
torch.Size([2, 9, 768])

----------------- BERT LAYER 1 -----------------

...

----------------- BERT LAYER 12 -----------------

...

Hidden States:
torch.Size([2, 9, 768])

First Token [CLS]:
torch.Size([2, 768])

First Token [CLS] Linear Layer:
torch.Size([2, 768])

First Token [CLS] Tanh Activation Function:
torch.Size([2, 768])

Assemble Components
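A sketch of how `BertForSequenceClassification` puts the pieces together: the pooled [CLS] representation from `BertModel` goes through dropout and a linear classifier, and the cross-entropy loss is computed against the labels defined earlier.

```python
# Classification head on top of the pooled [CLS] representation.
pooled_output = model.dropout(pooled_output)
logits = model.classifier(pooled_output)                        # [2, 2]

loss = torch.nn.functional.cross_entropy(logits, torch.tensor(labels))
```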

Attention Head Size:
64

Combined Attentions Head Size:
768

(the two values above are printed once for each of the 12 layers)
Created Tokens Positions IDs:
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8]])

Tokens IDs:
torch.Size([2, 9])

Tokens Type IDs:
torch.Size([2, 9])

Word Embeddings:
torch.Size([2, 9, 768])

Position Embeddings:
torch.Size([1, 9, 768])

Token Types Embeddings:
torch.Size([2, 9, 768])

Sum Up All Embeddings:
torch.Size([2, 9, 768])

Embeddings Layer Normalization:
torch.Size([2, 9, 768])

Embeddings Dropout Layer:
torch.Size([2, 9, 768])

----------------- BERT LAYER 1 -----------------

...

----------------- BERT LAYER 12 -----------------

...

Hidden States:
torch.Size([2, 9, 768])

First Token [CLS]:
torch.Size([2, 768])

First Token [CLS] Linear Layer:
torch.Size([2, 768])

First Token [CLS] Tanh Activation Function:
torch.Size([2, 768])

Complete Diagram

Final Note

Contact 🎣

PhD Computer Science 👨‍💻 | Working 🏋️ with love ❤️ on Deep Learning 🤖 & Natural Language Processing 🗣️.