How to start with Transformers?

First install the Transformers from Hugging Face.

!pip install -q git+

Then I would assume you will be using either TensorFlow or PyTorch. Make sure you have one of these installed and check the version you have.

import tensorflow as tf


import torch

Major objects in use

There are three kind of major classes/objects in Transformers:

  • configuration class
  • model class and
  • tokenizer class

Using the pre-trained bins for GPT2

All three: the model, the configuration and the tokenizer you can load using the from_pretrained method. Also they can be saved using the save_pretrained() method.

_Example: _

from transformers import GPT2Model, GPT2Tokenizer
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2-large")# auto loads the config
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")

If on TensorFlow use TFGPT2Model instead of GPT2Model.

You may soon note the tokenizer class is the same for TensorFlow and PyTorch but the TensorFlow model has the TF prefix (TFBertModel). This is because at first the library was called PyTorch Transformers and it was originally created in PyTorch. Later on they added TF prefix for all model class names to be used in TensorFlow.

Picking the right model

There are several GPT2 models to peak:

# "gpt2": ""
# "gpt2-medium": ""
# "gpt2-large": ""
# "gpt2-xl": ""
# "distilgpt2": ""

All you need to do if you would like to check the distilled GPT-2 is to write:

gpt2_model = GPT2LMHeadModel.from_pretrained("distilgpt2")

Generating text using GPT-2

Let’s use the GTP-2 large model.

from transformers import GPT2LMHeadModel, GPT2Tokenizer
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2-large")
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")

You can get the number of parameters for the model like this:

sum(p.numel() for p in gpt2_model.parameters() if p.requires_grad)



This is a very big model with almost a billion parameters.

The gpt2-xl model should have about 1.5B parameters.

Here is how you can make it complete your thought.

import random
import numpy as np
seed = random.randint(0, 13)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text = """All of this is right here, ready to be used
in your favorite pizza recipes."""
input_ids = torch.tensor(gpt2_tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0) # bs=1
outputs = gpt2_model.generate(, 
print(gpt2_tokenizer.decode(outputs[0], skip_special_tokens=True))
outputs.shape,outputs[0].shape # (torch.Size([1, 500]), torch.Size([500]))


All of this is right here, ready to be used
in your favorite pizza recipes.
And if you want to try the recipe as written, you can use the "pizza dough" from the  recipe.
And since you probably want to make this pizza in a large casserole dish, here are the measurements and ingredients:
2 cups (1 lb) grated cheese
1 cup (1 stick) butter
1 cup (1 stick) flour
1 1/2 cups (5 1/2 oz) grated mozzarella cheese
1/4 cup (1/2 stick) sugar
1/4 cup (1/2 stick) cornstarch
1 1/2 cups (5 oz) water
1. Preheat the oven to 350 degrees F.
2. In a large bowl, mix the cheese, butter, flour and cornstarch.
3. In a small bowl, whisk together the water and 1/2 cup of the cheese mixture.
4. Pour the mixture into the casserole dish and bake for 30 minutes or until the cheese is melted.
5. Remove from the oven and let cool.
6. While the cheese is cooling, melt the remaining 2 cups of the cheese mixture in a large heavy bottomed pot.
7. Add the milk and mix until the mixture is smooth.
8. Add the flour and mix to combine.
9. Add the remaining 1 cup of cheese mixture, cornstarch and water and mix until the mixture is smooth.
10. Pour the cheese mixture into the prepared casserole dish and bake for 40 minutes or until the cheese is well melted and the topping is bubbly.
11. Serve with your favorite toppings and enjoy!
Note: If you are not using a large casserole dish, you may need to use 1 cup of the cheese mixture for each cup of filling.
If you are making this recipe, please do let us know what you think by posting a comment below.
If you enjoyed this recipe, please share with your friends by clicking on any of the social media icons below, or by clicking the "Share This Recipe" button below. Thank you!

Why did we use the GPT2LMHeadModel?

You may note we used GPT2LMHeadModel class with the extra linear layer at the end (head)

(lm_head): Linear(in_features=768, out_features=50257, bias=False)

This extra layer will convert the hidden states (exactly 768 hidden states) into 50257 out_features. Why is the 50257 important?

This is the vocab size of the GPT-2 transformer. Based on that at the end we will get the probabilities for each word from the vocab.

If we would not have the last layer we would not have the means to interpret the hidden states, or in other words our output would be the hidden states.


import torch
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # bs=1
outputs = model(input_ids)
outputs_batch_0 = outputs[0] # 0 -> first batch
input_ids.shape, outputs_batch_0.shape


(torch.Size([1, 6]), torch.Size([1, 6, 768]))

Note how the out_hidden_states corresponds to the input_ids, word each input word we add the extra hidden dimension.

The LMHead part is there to extract the information from the hidden states and to convert them to output tokens.