Multilayer Perceptron Character-Level Language Model
Motivation
This is a simple implementation of a multilayer perceptron (MLP) character-level language model. The code and documentation directly follow Andrej Karpathy’s excellent YouTube video on the same topic. You can find the video here: https://www.youtube.com/watch?v=TCH_1BHY58I
Andrej uses a dataset of people’s names and builds a model to generate new baby names. I will use a dataset of heavy metal band names, and my model will generate new band names :) You can download the dataset from here - https://github.com/OpenJarbas/metal_dataset/blob/master/heavy_metal_bands.txt
I’m a huge org-mode user, and love the idea of literate programming. This is an attempt to write code in org-mode, and document it as I go along. I much prefer org-mode over Jupyter notebooks, and I think it’s a great way to document code, and also have access to my notes in various other documents. Yeah, I don’t use the mouse much! :)
This blog post is divided into sections so you can follow along with the video. Although I mainly used org-mode to follow along with Andrej’s video, I thought I’d also turn it into a blog post.
Initial setup
Working dir
Python venv
Create a Python virtual environment and install the necessary libraries.
Activate it in org-mode.
Import libraries.
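A minimal sketch of the imports the rest of the post relies on (PyTorch for the tensors and loss, random for shuffling the dataset later):

#+begin_src python
import random                     # used later to shuffle the dataset for the split
import torch
import torch.nn.functional as F   # cross_entropy loss and softmax
#+end_src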
Read the dataset
Let’s explore the dataset. Here we check which unique characters appear (including any punctuation) and determine the vocabulary size.
Based on my exploration, I choose ^ as the start and end token, since it doesn’t appear anywhere in the dataset. This is important because we need to know where a name starts and ends, and when to stop predicting the next character.
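A sketch of this step, assuming the dataset was saved locally as heavy_metal_bands.txt (the file name is simply whatever you downloaded it as):

#+begin_src python
# one band name per line
words = open('heavy_metal_bands.txt', 'r').read().splitlines()

# unique characters in the dataset, plus '^' as the start/end token at index 0
chars = sorted(set(''.join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['^'] = 0
itos = {i: s for s, i in stoi.items()}
vocab_size = len(stoi)

print(len(words), vocab_size)   # vocab_size comes out to 79 for this dataset
#+end_src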
Build the dataset
We loop over each word in the dataset and create a context of 3 characters. These 3 characters are used to predict the next one. Why 3? We earlier experimented with a bigram model, which predicts the next character based on the probability distribution of a character following the previous character. Now we use 3! We could expand this to 4, or even 5. This is the advantage of neural networks over count-based n-gram models, where the table of counts grows exponentially with the context length.
For each word, we then slide a rolling window over it to collect all the inputs, each of size 3, which go into X (the inputs), with the next character going into Y (the labels).
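Roughly, the dataset construction looks like this, with a block size of 3 and the ^ token (index 0) padding the start of every name:

#+begin_src python
block_size = 3   # how many characters of context we use to predict the next one

X, Y = [], []
for w in words:
    context = [0] * block_size          # start with '^^^'
    for ch in w + '^':                  # '^' also marks the end of the name
        ix = stoi[ch]
        X.append(context)               # the 3-character context is the input
        Y.append(ix)                    # the next character is the label
        context = context[1:] + [ix]    # slide the rolling window forward

X = torch.tensor(X)
Y = torch.tensor(Y)
print(X.shape, Y.shape)   # 165525 examples of 3 characters each
#+end_src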
So we have 165525 samples, each with 3 characters, and the corresponding label.
Embedding layer
What is an embedding? It’s a way to represent a character as a vector. We could use one-hot encoding, but that’s not very efficient. Instead we use embeddings. Encoding a character as a dense vector allows us to learn relationships between characters.
For a vocabulary of size 79, we use embeddings of size 2 per character. So each character is represented as a 2-dimensional vector.
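A sketch of the lookup table C; the seed is arbitrary and only there so runs are reproducible:

#+begin_src python
g = torch.Generator().manual_seed(42)          # arbitrary seed for reproducibility
C = torch.randn((vocab_size, 2), generator=g)  # 79 x 2 lookup table

emb = C[X]          # embed every character of every 3-character context
print(emb.shape)    # (number of examples, 3, 2)
#+end_src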
Hidden layer
Since we have 3 inputs, and each input is a 2-dimensional embedding, the input to the hidden layer is 6-dimensional.
We choose the size of the hidden layer to be 100. While optimizing the model, we can experiment with different sizes of the hidden layer, and see how it affects the model.
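Something like the following; the three 2-dimensional embeddings are flattened into one 6-dimensional input before the tanh non-linearity:

#+begin_src python
W1 = torch.randn((block_size * 2, 100), generator=g)   # 6 inputs -> 100 hidden units
b1 = torch.randn(100, generator=g)

h = torch.tanh(emb.view(-1, block_size * 2) @ W1 + b1)
print(h.shape)   # (number of examples, 100)
#+end_src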
Output layer
The output has to be a character from the vocabulary. So we have a linear layer with 79 outputs, which is the size of the vocabulary.
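A sketch of the output layer and the loss; F.cross_entropy takes the raw logits and gives us the average negative log-likelihood of the correct next character:

#+begin_src python
W2 = torch.randn((100, vocab_size), generator=g)   # 100 hidden units -> 79 logits
b2 = torch.randn(vocab_size, generator=g)

logits = h @ W2 + b2
loss = F.cross_entropy(logits, Y)
print(loss.item())
#+end_src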
We need to get this loss to be as low as possible. We do this by updating the weights and biases (backpropagation). But first, let’s organize our code so it’s easier to run multiple invocations of the forward pass, inspect the loss, and later on adjust the hyperparameters.
Organizing the code (Parameters)
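Collecting the parameter creation in one place might look like this (same shapes as above, one shared generator):

#+begin_src python
g = torch.Generator().manual_seed(42)                   # arbitrary seed
C  = torch.randn((vocab_size, 2), generator=g)          # 79 * 2   = 158
W1 = torch.randn((block_size * 2, 100), generator=g)    # 6 * 100  = 600
b1 = torch.randn(100, generator=g)                      #            100
W2 = torch.randn((100, vocab_size), generator=g)        # 100 * 79 = 7900
b2 = torch.randn(vocab_size, generator=g)               #            79
parameters = [C, W1, b1, W2, b2]

print(sum(p.nelement() for p in parameters))   # 8837
#+end_src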
So our model has 8837 parameters. Later on, as we optimize our model, we may want to adjust the size of the hidden layer, the number of embedding dimensions, etc. This will increase the number of parameters and possibly improve the model. It will also increase your computation costs, so optimize wisely.
Requires grad
We need to update the weights and biases, which we do by backpropagation. We have to tell PyTorch which parameters we want to update, and we do this by setting their requires_grad attribute to True. Behind the scenes, PyTorch will then keep track of the gradients for us.
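In code this is just a single loop over the parameter list:

#+begin_src python
for p in parameters:
    p.requires_grad = True
#+end_src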
Train/validation/test split
If we train our model on our entire dataset, we may overfit. What this means is that our model will be very good at predicting the characters in our dataset, but not so good at predicting names it has never seen before. To avoid this, we split our dataset into a training, validation, and test set.
The training set is used to optimize the parameters of the model, which is done during our epoch loop. The validation set is used to optimize the hyperparameters of the model, such as the learning rate, the size of the hidden layer, etc. The test set is used to evaluate the model.
Usually we divide the dataset into 80% training, 10% validation, and 10% test.
Build datasets
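A sketch of the split; the names are shuffled first so all three sets see a similar mix of characters (the seed is arbitrary):

#+begin_src python
random.seed(42)         # arbitrary seed
random.shuffle(words)

def build_dataset(words):
    X, Y = [], []
    for w in words:
        context = [0] * block_size
        for ch in w + '^':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]
    return torch.tensor(X), torch.tensor(Y)

n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))
Xtr,  Ytr  = build_dataset(words[:n1])      # 80% training
Xdev, Ydev = build_dataset(words[n1:n2])    # 10% validation
Xte,  Yte  = build_dataset(words[n2:])      # 10% test
print(Xtr.shape, Xdev.shape, Xte.shape)
#+end_src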
Training loop
Learning rate.
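Something as simple as this, adjusted by hand between runs (0.1 is just a common starting point):

#+begin_src python
lr = 0.1   # drop this (e.g. to 0.01) once the loss stops improving
#+end_src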
We now run the code block below multiple times to see how the loss decreases, and then execute the training loss block to check the loss on the full training set. Since we are not decaying the learning rate automatically, we need to adjust it manually between runs.
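A sketch of the training block; the batch size of 32 and the number of steps per run are my choices here, tune them as needed:

#+begin_src python
for _ in range(10000):
    # sample a minibatch of 32 random examples from the training set
    ix = torch.randint(0, Xtr.shape[0], (32,))

    # forward pass
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(-1, block_size * 2) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr[ix])

    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # update
    for p in parameters:
        p.data += -lr * p.grad

print(loss.item())   # loss on the last minibatch only
#+end_src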
Training loss
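This is the same forward pass, just over the entire training split:

#+begin_src python
emb = C[Xtr]
h = torch.tanh(emb.view(-1, block_size * 2) @ W1 + b1)
logits = h @ W2 + b2
print(F.cross_entropy(logits, Ytr).item())
#+end_src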
Validation loss
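And the same again over the validation split:

#+begin_src python
emb = C[Xdev]
h = torch.tanh(emb.view(-1, block_size * 2) @ W1 + b1)
logits = h @ W2 + b2
print(F.cross_entropy(logits, Ydev).item())
#+end_src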
Sample from our model
Model inference. We sample from our model to see what it has learned. This is basically a forward pass, but we don’t update the weights.
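A sketch of the sampling loop: start from a context of ^ tokens, sample the next character from the softmax of the logits, and stop as soon as the model emits ^ again (the seed is arbitrary):

#+begin_src python
g = torch.Generator().manual_seed(2147483647)   # arbitrary seed

for _ in range(10):                             # generate 10 band names
    out = []
    context = [0] * block_size                  # start with '^^^'
    while True:
        emb = C[torch.tensor([context])]        # (1, 3, 2)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]
        if ix == 0:                             # '^' marks the end of the name
            break
        out.append(itos[ix])
    print(''.join(out))
#+end_src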
We can see that the model is generating some words, but they are not very good. We need to optimize the model, and experiment with different hyperparameters to get better results.
This is a good exercise in PyTorch, as well as in learning how backpropagation and gradient descent work.