Multilayer Perceptron Character-Level Language Model

Motivation

This is a simple implementation of a multilayer perceptron (MLP) character-level language model. The code and documentation closely follow Andrej Karpathy’s excellent YouTube video on the same topic. You can find the video here: https://www.youtube.com/watch?v=TCH_1BHY58I

Andrej uses a dataset of people’s names, and then builds a model to generate new baby names. I will use a dataset of heavy metal band names, and my model will generate new band names :) You can download the dataset from here - https://github.com/OpenJarbas/metal_dataset/blob/master/heavy_metal_bands.txt

I’m a huge org-mode user, and love the idea of literate programming. This is an attempt to write code in org-mode, and document it as I go along. I much prefer org-mode over Jupyter notebooks, and I think it’s a great way to document code, and also have access to my notes in various other documents. Yeah, I don’t use the mouse much! :)

This blog post is divided into sections, and you can follow along just like the video. Although I have mainly used org-mode to follow along with Andrej’s video, I thought I’d also create a blog post out of it.

Initial setup

Working dir

    mkdir makemore
    cd makemore

Python venv

Create a Python virtual environment and install the necessary libraries.

    python -m venv .venv
    source .venv/bin/activate
    pip install torch matplotlib numpy

Point org-babel at the virtual environment’s Python interpreter.

    (setq org-babel-python-command "/home/rohan/work/makemore/.venv/bin/python")

Import libraries.

    from pprint import pprint
    import torch
    import matplotlib.pyplot as plt
    import numpy as np
    import torch.nn.functional as F

Read the dataset

Let’s explore the dataset. Here we check which unique characters appear, including any punctuation, and determine the vocabulary size.

Based on my exploration, I choose ^ as the start and end token, since it doesn’t appear in the dataset. This is important because the model needs to know where a word starts (so it can predict the first characters from an empty context) and where it ends (so it knows when to stop predicting).

    # open file and get each line as a list
    words = open("./heavy_metal_bands.txt").read().splitlines()
    print(f"Number of words: {len(words)}")
    
    # just print the first few words
    pprint(words[:5])
    
    # vocabulary of characters and mappings to/from integers
    start_end_token = "^"
    chars = sorted(list(set("".join(words))))
    stoi = {ch: i+1 for i, ch in enumerate(chars)}
    stoi[start_end_token] = 0
    itos = {i: ch for ch, i in stoi.items()}
    
    vocab_size = len(stoi)
    
    # f-strings, print var and val :)
    print(f"{vocab_size=}")
    
    # pprint(chars)
    # pprint(stoi)
    # pprint(itos)
Output
    Number of words: 15294
    ['$uicide $olution',
     '10-67 P.D.O.A.',
     '107 Pasos',
     '12 Gauge Outrage',
     '12th Ode']
    vocab_size=79
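
As a quick check on the claim above, we can confirm that ^ really never appears in the raw data. A minimal sketch, using the words list we just loaded:

    # sanity check: the chosen start/end token must not occur in any band name
    assert not any("^" in w for w in words), "'^' appears in the dataset!"
    print("'^' is safe to use as the start/end token")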

Build the dataset

We loop over each word in the dataset and create a context of 3 characters; these 3 characters are used to predict the next one. Why 3? Earlier in the series we experimented with a bigram model, which predicts the next character from the probability distribution of the character following a single previous character. Now we use a context of 3, and we could expand this to 4 or even 5. This is an advantage of neural networks over count-based n-gram models, whose probability tables grow exponentially with the context length.

For each word, we then slide a rolling window over it to collect all the inputs, each of size 3; the inputs go into X, and the corresponding next character (the label) goes into Y.

    block_size = 3 # context length: how many characters we take to predict the next one
    X, Y = [], [] # input data and labels
    
    for w in words:
        # print(w)
        context = [0] * block_size
    
        for ch in w + start_end_token:
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
    
            # print(''.join([itos[i] for i in context]), '->', ch)
    
            context = context[1:] + [ix] # crop and add new character
    
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    # print(X)
    print("Shape of X:", X.shape)
    # print(Y)
    print("Shape of Y:", Y.shape)
Output
    Shape of X: torch.Size([165525, 3])
    Shape of Y: torch.Size([165525])

So we have 165525 samples, each with 3 characters, plus the corresponding label.
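
To make the rolling window concrete, here is a small aside that prints the (context -> target) pairs for a single word from the dataset, using the stoi/itos mappings built earlier:

    # illustrative sketch: the (context -> target) pairs for one word
    example = words[4]              # "12th Ode", printed above
    context = [0] * block_size      # start with all start tokens ("^")
    for ch in example + start_end_token:
        print("".join(itos[i] for i in context), "->", ch)
        context = context[1:] + [stoi[ch]]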

Embedding layer

What is an embedding? It’s a way to represent a character as a vector. We could use one-hot encoding, but that’s not very efficient. Instead we use learned embeddings: encoding each character as a dense vector lets the model learn relationships between characters.

For our vocabulary of size 79, we use an embedding of size 2 per character, so each character is represented as a 2-dimensional vector.

    num_embeddings_per_char = 2
    C = torch.randn(vocab_size, num_embeddings_per_char)
    # print(C)
    
    # pytorch indexing magic
    emb = C[X]
    print(emb.shape)
Output
    torch.Size([165525, 3, 2])
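
As an aside, the "indexing magic" C[X] is equivalent to multiplying a one-hot encoding of each index by C; indexing is just the faster way to pluck out the corresponding rows. A minimal sketch to convince yourself (the index 5 is arbitrary):

    # sketch: C[ix] is the same row you get from one_hot(ix) @ C
    ix = 5  # arbitrary character index
    one_hot = F.one_hot(torch.tensor(ix), num_classes=vocab_size).float()
    print(torch.allclose(one_hot @ C, C[ix]))  # expect: True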

Hidden layer

Since we have 3 inputs, and each input is a 2-dimensional embedding, the input to the hidden layer is 6-dimensional.

We choose the size of the hidden layer to be 100. While optimizing the model, we can experiment with different sizes of the hidden layer, and see how it affects the model.

    hidden_size = 100
    
    # weights and biases
    W1 = torch.randn(block_size*num_embeddings_per_char, hidden_size)
    b1 = torch.randn(hidden_size)
    
    # To multiply the embeddings with the weights, we need to flatten the
    # embeddings first (view(-1, 6)) and then multiply with the weights.
    h = torch.tanh(emb.view(-1, block_size*num_embeddings_per_char) @ W1 + b1)
    print("Hidden layer shape", h.shape)
Output
    Hidden layer shape torch.Size([165525, 100])
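
For reference, flattening with view(-1, 6) gives exactly the same result as concatenating the three context embeddings along the last dimension, but without allocating a new tensor. A small sketch of this check (my own aside):

    # sketch: .view() flattens the (N, 3, 2) embeddings into (N, 6),
    # which matches concatenating the three (N, 2) slices side by side
    alt = torch.cat(torch.unbind(emb, dim=1), dim=1)
    print(torch.equal(alt, emb.view(-1, block_size*num_embeddings_per_char)))  # expect: True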

Output layer

The output has to be a character from the vocabulary. So we have a linear layer with 79 outputs, which is the size of the vocabulary.

    # forward pass
    W2 = torch.randn(hidden_size, vocab_size)
    b2 = torch.randn(vocab_size)
    
    # the logits (unnormalized log counts)
    logits = h @ W2 + b2
    print(logits.shape)

    # exponentiate the logits and normalize: this is the softmax
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)

    # first shot at the loss: the average negative log likelihood
    # of the correct next character
    loss = -probs[torch.arange(len(Y)), Y].log().mean()
    print(loss.item())
Output
    torch.Size([165525, 79])
    20.494251251220703

We need to get this loss as low as possible, which we do by updating the weights and biases via backpropagation. But first, let’s organize our code so it’s easier to run the forward pass multiple times, inspect the loss, and later adjust the hyperparameters.
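
From here on, the code uses F.cross_entropy instead of writing out the softmax and negative log likelihood by hand. It computes the same loss, but more efficiently and in a numerically stable way (it never exponentiates large logits into inf). A quick sketch to check it against the manual loss above:

    # sketch: cross_entropy on the raw logits matches the manual
    # softmax + negative-log-likelihood loss computed above
    print(F.cross_entropy(logits, Y).item())  # expect the same ~20.49 as above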

Organizing the code (Parameters)

    # embeddings
    C = torch.randn((vocab_size, num_embeddings_per_char))
    
    # weights and biases for the hidden layer
    W1 = torch.randn((block_size*num_embeddings_per_char, hidden_size))
    b1 = torch.randn(hidden_size)
    
    # weights and biases for the output layer
    W2 = torch.randn((hidden_size, vocab_size))
    b2 = torch.randn(vocab_size)
    
    parameters = [C, W1, b1, W2, b2]
    
    print("Number of parameters:", sum(p.numel() for p in parameters))
Output
    Number of parameters: 8837

So our model has 8837 parameters. Later on, as we optimize the model, we may want to adjust the size of the hidden layer, the embedding dimension, and so on. This will increase the number of parameters and possibly improve the model, but it will also increase your computation costs! So optimize wisely.
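
For reference, here is how the 8837 parameters break down across the layers (just arithmetic over the shapes defined above):

    # parameter count breakdown (vocab_size=79, embedding=2, context=3, hidden=100):
    #   C : 79*2 = 158,  W1 : 6*100 = 600,  b1 : 100,  W2 : 100*79 = 7900,  b2 : 79
    print(79*2 + 6*100 + 100 + 100*79 + 79)  # 8837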

Requires grad

We need to update the weights and biases, which we do via backpropagation. To do so, we have to tell PyTorch which parameters we want to update, by setting their requires_grad attribute to True. Behind the scenes, PyTorch will then keep track of the gradients for us.

    for p in parameters:
        p.requires_grad = True

Train/validation/test split

If we train our model on the entire dataset, we may overfit. This means our model will be very good at predicting the next character on the data it was trained on, but not so good on data it has never seen before. To avoid this, we split our dataset into a training, validation, and test set.

The training set is used to optimize the parameters of the model, which is done in the training loop. The validation set is used to tune the hyperparameters of the model, such as the learning rate, the size of the hidden layer, etc. The test set is used to evaluate the final model.

Usually we divide the dataset into 80% training, 10% validation, and 10% test.

Build datasets

    N = len(X)
    idx = torch.randperm(N)
    
    train_idx = idx[:int(0.8*N)]
    val_idx = idx[int(0.8*N):int(0.9*N)]
    test_idx = idx[int(0.9*N):]
    
    X_train, Y_train = X[train_idx], Y[train_idx]
    X_val, Y_val = X[val_idx], Y[val_idx]
    X_test, Y_test = X[test_idx], Y[test_idx]
    
    print(X_train.shape, Y_train.shape)
    print(X_val.shape, Y_val.shape)
    print(X_test.shape, Y_test.shape)
Output
    torch.Size([132420, 3]) torch.Size([132420])
    torch.Size([16552, 3]) torch.Size([16552])
    torch.Size([16553, 3]) torch.Size([16553])

Training loop

Learning rate.

    lre = torch.linspace(-3, 0, 1000) # learning rate exponents
    lrs = 10**lre # learning rates
    # print(lrs)

We now run the code below multiple times and watch the loss decrease, then execute the training loss block to evaluate the loss on the full training set. Since we are not decaying the learning rate automatically (the lr = lrs[i] line is commented out), we adjust it manually between runs; the sketch after the loop shows how the lrs sweep defined above can help pick a starting rate.

    for i in range(10000):
    
      # minibatch
      ix = torch.randint(0, X_train.shape[0], (32,))
    
      # forward pass
      emb = C[X_train[ix]]
      h = torch.tanh(emb.view(-1, block_size*num_embeddings_per_char) @ W1 + b1)
      logits = h @ W2 + b2
      loss = F.cross_entropy(logits, Y_train[ix])
      # print(loss.item())
    
      # backward pass
      for p in parameters:
        p.grad = None
    
      loss.backward()
    
      # update the net
    
      # Later on we will optimize the learning rate decay, but for now we
      # just use a fixed learning rate, and manually adjust it.
      # lr = lrs[i]
    
      # we start with 0.1, then 0.05, 0.01 
      lr = 0.1
    
      for p in parameters:
        p.data += -lr * p.grad
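
As an aside, the lrs sweep defined above can be used to pick a sensible learning rate before committing to one: reinitialize the parameters, run a short loop with lr = lrs[i], record the exponent and the loss at each step, and plot them. This is roughly the approach from the video, sketched here rather than run above; the lri/lossi names are just for this sketch.

    # optional sketch: sweep the learning rate over 1000 steps and plot
    # the loss against the learning-rate exponent
    # (run this on freshly initialized parameters)
    lri, lossi = [], []
    for i in range(1000):
        ix = torch.randint(0, X_train.shape[0], (32,))
        emb = C[X_train[ix]]
        h = torch.tanh(emb.view(-1, block_size*num_embeddings_per_char) @ W1 + b1)
        logits = h @ W2 + b2
        loss = F.cross_entropy(logits, Y_train[ix])
        for p in parameters:
            p.grad = None
        loss.backward()
        lr = lrs[i]
        for p in parameters:
            p.data += -lr * p.grad
        lri.append(lre[i].item())
        lossi.append(loss.item())

    plt.plot(lri, lossi)  # the flat region before the loss blows up suggests a good lr
    plt.show()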

Training loss

    emb = C[X_train]
    h = torch.tanh(emb.view(-1, block_size*num_embeddings_per_char) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y_train)
    print(loss)
Output
    tensor(2.8321, grad_fn=<NllLossBackward0>)

Validation loss

    emb = C[X_val]
    h = torch.tanh(emb.view(-1, block_size*num_embeddings_per_char) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y_val)
    print(loss)
Output
    tensor(2.8218, grad_fn=<NllLossBackward0>)
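
Since each character embedding is only 2-dimensional, we can also plot the learned embeddings directly to see which characters the model has pushed close together. This is an optional inspection step, not required for sampling; a small sketch using matplotlib (already imported above):

    # sketch: visualize the learned 2-D character embeddings
    plt.figure(figsize=(8, 8))
    plt.scatter(C[:, 0].data, C[:, 1].data, s=200)
    for i in range(vocab_size):
        plt.text(C[i, 0].item(), C[i, 1].item(), itos[i],
                 ha="center", va="center", color="white")
    plt.grid(True)
    plt.show()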

Sample from our model

Model inference. We sample from our model to see what it has learned. This is basically a forward pass, but we don’t update the weights.

    for _ in range(10):
    
      out = []
      context = [0] * block_size # initialize with all start tokens (index 0, i.e. "^")
    
      while True:
        emb = C[torch.tensor([context])] # (1,block_size,d)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1).item()
        context = context[1:] + [ix]
        if ix == 0:
          break
        out.append(ix)
    
      print(''.join(itos[i] for i in out))
Output
    Polizv
    Lortu Wivi
    Ceip Bow Geeaslae
    Haehh
    Syoniol Nersore
    Tklinysli
    Ennterm Diidi
    Iled
    M
    Vamone

We can see that the model is generating some words, but they are not very good. We need to optimize the model, and experiment with different hyperparameters to get better results.

This is a good exercise in PyTorch, as well as in learning how backpropagation and gradient descent work.