Remember our journey so far? We started with simple Markov chains showing how statistical word prediction works, then dove into the core concepts of word embeddings, self-attention, and next word prediction. Now, it’s time for the grand finale: if you want to build your own working transformer language model in R, read on!
You will say: no way!? But yes: following the mantra that you have only truly understood what you have built yourself from scratch, we will create a mini-ChatGPT that learns to write in the style of “Alice in Wonderland” and “The Wizard of Oz”!
What we’ve learned so far:
- Markov chains: predicting the next word from simple statistics over the previous few words
- Word embeddings: representing words as numeric vectors
- Self-attention: weighing the relevance of every word for every other word in the context
- Next-word prediction: the training objective that drives the whole model
A transformer combines ALL of these concepts into one powerful architecture. Think of it as a sophisticated Markov chain that doesn’t just look at the previous few words, but can attend to any word in the entire context, understanding relationships and patterns across the whole text!
Let’s build a complete transformer step by step, using the same alice_oz.txt file from our Markov chain example:
library(torch) # install from CRAN

# Create word-level tokenizer
create_tokenizer <- function(text) {
  text <- tolower(text)
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[words != ""]
  unique_words <- sort(unique(words))
  vocab <- c("<start>", "<end>", unique_words)
  word_to_idx <- setNames(seq_along(vocab), vocab)
  idx_to_word <- setNames(vocab, seq_along(vocab))
  list(word_to_idx = word_to_idx,
       idx_to_word = idx_to_word,
       vocab_size = length(vocab))
}
Unlike our Markov chain, which worked with fixed N-grams, this tokenizer turns the text into word indices so that our transformer can process entire sequences.
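To make this concrete, here is a tiny usage sketch with a made-up sentence (just an illustration, not part of the training text):

tok <- create_tokenizer("Alice was beginning to get very tired of sitting")
tok$vocab_size
# 11: nine unique words plus the special <start> and <end> tokens
tok$word_to_idx[["alice"]]
# 3: "alice" comes right after <start> and <end> in the sorted vocabulary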
transformer_layer <- nn_module(
  initialize = function(d_model, n_heads) {
    self$d_model <- d_model
    self$n_heads <- n_heads
    self$d_k <- d_model %/% n_heads

    # The Q, K, V matrices for the attention mechanism
    self$w_q <- nn_linear(d_model, d_model, bias = FALSE)
    self$w_k <- nn_linear(d_model, d_model, bias = FALSE)
    self$w_v <- nn_linear(d_model, d_model, bias = FALSE)
    self$w_o <- nn_linear(d_model, d_model)

    # Feed-forward neural network
    self$ff <- nn_sequential(
      nn_linear(d_model, d_model * 4),
      nn_relu(),
      nn_linear(d_model * 4, d_model)
    )

    self$ln1 <- nn_layer_norm(d_model)
    self$ln2 <- nn_layer_norm(d_model)
    self$dropout <- nn_dropout(0.1)
  },

  forward = function(x, mask = NULL) {
    # Multi-head self-attention (exactly like our simple example, but multi-headed!)
    batch_size <- x$size(1)
    seq_len <- x$size(2)

    q <- self$w_q(x)$view(c(batch_size, seq_len, self$n_heads, self$d_k))$transpose(2, 3)
    k <- self$w_k(x)$view(c(batch_size, seq_len, self$n_heads, self$d_k))$transpose(2, 3)
    v <- self$w_v(x)$view(c(batch_size, seq_len, self$n_heads, self$d_k))$transpose(2, 3)

    # Scaled dot-product attention
    scores <- torch_matmul(q, k$transpose(-2, -1)) / sqrt(self$d_k)
    if (!is.null(mask)) {
      scores <- scores + mask$unsqueeze(1)$unsqueeze(1)
    }
    attn_weights <- nnf_softmax(scores, dim = -1)
    attn_output <- torch_matmul(attn_weights, v)

    # Combine heads and apply output projection
    attn_output <- attn_output$transpose(2, 3)$contiguous()$view(c(batch_size, seq_len, self$d_model))
    attn_output <- self$w_o(attn_output)

    # Residual connection and layer norm
    x <- self$ln1(x + self$dropout(attn_output))

    # Feed-forward
    ff_output <- self$ff(x)
    x <- self$ln2(x + self$dropout(ff_output))

    x
  }
)
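Before wiring this layer into a full model, a quick sanity check with a random dummy batch (the dimensions are arbitrary illustration values) confirms that the output has the same shape as the input:

layer <- transformer_layer(d_model = 64, n_heads = 4)
x <- torch_randn(2, 10, 64)   # batch of 2 sequences, 10 words each, 64 dimensions
out <- layer(x)
out$shape
# 2 10 64, i.e. the same shape as the input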
This layer is our self-attention mechanism in action: just like in our simple 3×3 example, but now working on entire sequences and with multiple attention heads.
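One building block the model below relies on is a create_positional_encoding() helper, which is not shown in this section. A minimal sinusoidal version with a matching signature could look like the following sketch (the actual helper may differ in its details):

# Sinusoidal positional encoding: a (max_len x d_model) matrix where each row
# encodes one position with sine/cosine waves of different frequencies
create_positional_encoding <- function(max_len, d_model, device = "cpu") {
  pe <- matrix(0, nrow = max_len, ncol = d_model)
  position <- 0:(max_len - 1)
  for (i in seq(0, d_model - 2, by = 2)) {
    freq <- 1 / 10000^(i / d_model)
    pe[, i + 1] <- sin(position * freq)   # even dimensions: sine
    pe[, i + 2] <- cos(position * freq)   # odd dimensions: cosine
  }
  torch_tensor(pe, dtype = torch_float(), device = device)
}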
toy_llm <- nn_module(
  initialize = function(vocab_size, d_model = 256, n_heads = 8, n_layers = 4) {
    self$d_model <- d_model  # needed in forward() for the embedding scaling

    # Word embeddings (remember our love/is/wonderful example?)
    self$token_embedding <- nn_embedding(vocab_size, d_model)
    self$pos_encoding <- create_positional_encoding(512, d_model, "cpu")

    # Stack of transformer layers
    self$transformer_layer_1 <- transformer_layer(d_model, n_heads)
    if (n_layers >= 2) self$transformer_layer_2 <- transformer_layer(d_model, n_heads)
    if (n_layers >= 3) self$transformer_layer_3 <- transformer_layer(d_model, n_heads)
    if (n_layers >= 4) self$transformer_layer_4 <- transformer_layer(d_model, n_heads)
    self$n_layers <- n_layers

    # Output projection (back to vocabulary)
    self$ln_f <- nn_layer_norm(d_model)
    self$lm_head <- nn_linear(d_model, vocab_size)
    self$dropout <- nn_dropout(0.1)
  },

  forward = function(x) {
    seq_len <- x$size(2)

    # Causal mask (no peeking at future words!)
    mask <- torch_triu(torch_ones(seq_len, seq_len, device = x$device), diagonal = 1)
    mask <- mask$masked_fill(mask == 1, -Inf)

    # Token embeddings + positional encoding
    x <- self$token_embedding(x) * sqrt(self$d_model)
    pos_enc <- self$pos_encoding[1:seq_len, ]$to(device = x$device)
    x <- x + pos_enc
    x <- self$dropout(x)

    # Pass through transformer layers
    x <- self$transformer_layer_1(x, mask)
    if (self$n_layers >= 2) x <- self$transformer_layer_2(x, mask)
    if (self$n_layers >= 3) x <- self$transformer_layer_3(x, mask)
    if (self$n_layers >= 4) x <- self$transformer_layer_4(x, mask)

    # Final layer norm and projection to vocabulary
    x <- self$ln_f(x)
    logits <- self$lm_head(x)

    logits
  }
)
This is the core of the LLM: the transformer. This neural network architecture makes use of all of the concepts above: embeddings, attention, and next-word prediction!
Now comes the magic – training our transformer on Alice in Wonderland and the Wizard of Oz:
# Load the same text from our Markov chain example
txt <- readLines(url("http://paulo-jorente.de/text/alice_oz.txt"), warn = FALSE)
training_text <- paste(txt, collapse = " ")
training_text <- gsub("[^a-zA-Z0-9 .,!?;:-]", "", training_text)
training_text <- tolower(training_text)

# Create tokenizer and model
tokenizer <- create_tokenizer(training_text)
model <- toy_llm(vocab_size = tokenizer$vocab_size, d_model = 256, n_heads = 8, n_layers = 4)

# Train the model (this is where the magic happens!)
train_model(model, training_text, tokenizer, epochs = 1500, seq_len = 32, batch_size = 4)
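The train_model() function is the last piece, and it is not shown here in full. As a rough idea of what it has to do, here is a minimal sketch (the Adam optimizer, random mini-batches, and the hyperparameter defaults are assumptions, not necessarily the exact implementation):

train_model <- function(model, text, tokenizer, epochs = 100, seq_len = 32,
                        batch_size = 4, lr = 3e-4) {
  # Encode the whole corpus as one long vector of token indices
  words <- unlist(strsplit(tolower(text), "\\s+"))
  words <- words[words != ""]
  ids <- unname(tokenizer$word_to_idx[words])

  optimizer <- optim_adam(model$parameters, lr = lr)
  model$train()

  for (epoch in 1:epochs) {
    # Sample random starting positions and build input/target batches
    # (targets are the inputs shifted by one word: next-word prediction)
    starts <- sample(1:(length(ids) - seq_len - 1), batch_size)
    x <- torch_tensor(t(sapply(starts, function(s) ids[s:(s + seq_len - 1)])),
                      dtype = torch_long())
    y <- torch_tensor(t(sapply(starts, function(s) ids[(s + 1):(s + seq_len)])),
                      dtype = torch_long())

    optimizer$zero_grad()
    logits <- model(x)   # shape: (batch_size, seq_len, vocab_size)
    loss <- nnf_cross_entropy(logits$reshape(c(-1, tokenizer$vocab_size)),
                              y$reshape(c(-1)))
    loss$backward()
    optimizer$step()

    if (epoch %% 100 == 0) cat("Epoch", epoch, "loss:", loss$item(), "\n")
  }
  invisible(model)
}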
After training, our mini-transformer produces text like this:
Prompt ‘alice’: alice looked down at them, and considered a little before she was going to shrink in the time and round the
Prompt ‘the queen’: the queen said to the executioner: fetch her here. and the executioner went off like an arrow. the cats head began fading
Prompt ‘down the’: down the chimney, and she said to herself now i can do no more, whatever happens. what will become of me? luckily
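These samples were produced with a small generation loop that repeatedly feeds the model its own output. Such a function is not shown in this excerpt, but a minimal temperature-sampling sketch (the name generate_text and its parameters are assumptions) could look like this:

generate_text <- function(model, tokenizer, prompt, max_new_words = 25, temperature = 0.8) {
  model$eval()
  ids <- unname(tokenizer$word_to_idx[unlist(strsplit(tolower(prompt), "\\s+"))])

  for (i in 1:max_new_words) {
    x <- torch_tensor(matrix(ids, nrow = 1), dtype = torch_long())
    logits <- with_no_grad(model(x))
    # Distribution over the vocabulary for the position after the last word
    probs <- nnf_softmax(logits[1, length(ids), ] / temperature, dim = 1)
    next_id <- sample(1:tokenizer$vocab_size, 1, prob = as.numeric(as_array(probs)))
    ids <- c(ids, next_id)
  }
  paste(tokenizer$idx_to_word[as.character(ids)], collapse = " ")
}

generate_text(model, tokenizer, "alice")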
Compare these samples to the output of our original Markov chain:
anxious returned the Scarecrow It is such an uncomfortable feeling to know one is a crow or a man After the crows had gone I thought this over and decided
Even after this modest amount of training, the transformer has learned to produce grammatical sentences, sensible punctuation, and characters and phrases from both books. And unlike our Markov chain, which only looked at the previous 2-3 words, our transformer can attend to the full 32-word context window when predicting the next word, picking up relationships that span whole sentences.
What we built is essentially a miniature version of ChatGPT! The same principles scale up: production models simply use many more layers and attention heads, much larger embedding dimensions and vocabularies, and vastly more training data and compute.
But the core architecture? Exactly the same!
What’s truly remarkable is that this simple architecture – predicting the next word using self-attention – gives rise to seemingly intelligent behavior. Our tiny model learned grammar, vocabulary, and the narrative voice of its two training books.
All from the simple task of “predict the next word”!
Isn’t it fascinating that so much apparently intelligent behavior emerges from statistical text prediction? As we saw in our Markov chain post, “many tasks that demand human-level intelligence can obviously be reduced to some form of (statistical) text prediction with a sufficiently performant model!”
To give you an intuition for why using a neural network architecture for this is so powerful: we have already seen that neural networks build a representation of their world, a world model (see: Understanding the Magic of Neural Networks). In this case, imagine a detective story that ends with “And now it was clear, the murderer was…”: to sensibly predict the next (and last) word, the neural network really must have understood the story in some sense!
You’ve now built your own language model using the same principles as ChatGPT! Next, you could experiment with training on other texts, tweaking the hyperparameters (more layers, more attention heads, a larger embedding dimension, longer training), or trying different sampling strategies for text generation.
Remember: we’ve just implemented the core technology behind the AI revolution. From Markov chains to attention mechanisms to transformers – you’ve mastered the journey from simple statistics to artificial intelligence!
The next time someone asks you “How does ChatGPT work?”, you can confidently say: “Let me show you…” and build one from scratch (or just show them this post)!