
How Transformers and LLMs Actually Work - A Developer's Guide with Code


AI dominates the headlines. LLMs occupy developer minds. It’s scary and exciting at the same time.

I’ve been tinkering with LLMs for a while, shipping projects that use them in production and delivering workshops focused on how they work. But I wanted to go deeper. I wanted to understand how these models are actually built.

So I built and trained my own. A tiny one, granted (I don’t have a data centre in the garden), but structurally identical to the real thing. To make the ideas tangible, I built a minimal encoder-decoder model that learns country-capital relationships.

Large Language Models

Let’s strip this back. Any large language model, whether it’s Gemini or ChatGPT, is fundamentally a highly sophisticated next-token prediction engine. Given any input, the model predicts the most probable next token. Then the next one. Then the next one. That’s it.

What is a token? A small piece of text, like a word, part of a word, or sometimes even just a character, that LLMs read and process one at a time. Think of tokens as units of meaning.
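As a rough sketch, here's a toy word-level tokenizer. (Real LLMs use subword tokenizers such as BPE, which can split a rare word into several smaller tokens; this whitespace version is purely illustrative.)

```javascript
// A toy word-level tokenizer. Real LLMs use subword schemes (e.g. BPE),
// which can split "unbelievable" into pieces like "un", "believ", "able".
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z\s]/g, '') // strip punctuation for simplicity
    .split(/\s+/)
    .filter(Boolean);
}

console.log(tokenize('The bank approved my loan.'));
// → [ 'the', 'bank', 'approved', 'my', 'loan' ]
```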

The model never actually sees text. It sees numbers: numerical representations of tokens. It performs mathematical calculations on those numbers, building on patterns from its vast training data to predict what comes next, one token at a time.

Token generation happens sequentially. The model looks at the previous tokens, predicts the most likely next one, and repeats this step until it finishes the output.
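The loop itself is simple. In this sketch, `predictNext` is a hypothetical stand-in for the model: it just replays a canned continuation so the autoregressive flow is visible.

```javascript
// `predictNext` is a stand-in for the real model: it replays a canned
// continuation one token at a time, ending with a stop token.
const canned = ['the', 'capital', 'of', 'france', 'is', 'paris', '</s>'];
const predictNext = (tokens) => canned[tokens.length - 1] ?? '</s>';

function generate(prompt, maxTokens = 10) {
  const tokens = [prompt];
  while (tokens.length < maxTokens) {
    const next = predictNext(tokens); // predict the most likely next token
    if (next === '</s>') break;       // stop token ends generation
    tokens.push(next);                // append and repeat
  }
  return tokens.join(' ');
}

console.log(generate('<s>'));
// → "<s> the capital of france is paris"
```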

How do models understand language?

Word embeddings are the key. They’re high-dimensional vectors (arrays of numbers) where each token is represented as a point in multi-dimensional space. These spaces can span 1,000+ dimensions, which is nearly impossible for us to visualise.

Why does this matter? Consider the simplest approach to encoding two sentences: assigning an ID to each word.

- The bank approved my loan.
- We sat by the bank of the river.

| Token | ID |
| --- | --- |
| the | 1 |
| bank | 2 |
| approved | 3 |
| my | 4 |
| loan | 5 |
| we | 6 |
| sat | 7 |
| by | 8 |
| of | 9 |
| river | 10 |

Sentence 1 then becomes: "The bank approved my loan." → `[1, 2, 3, 4, 5]`
Sentence 2 becomes: "We sat by the bank of the river." → `[6, 7, 8, 1, 2, 9, 1, 10]`
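The same encoding in code, using the ID table above:

```javascript
// Flat word-to-ID encoding: every occurrence of a word gets the same ID,
// regardless of meaning.
const ids = {
  the: 1, bank: 2, approved: 3, my: 4, loan: 5,
  we: 6, sat: 7, by: 8, of: 9, river: 10,
};

const encode = (sentence) =>
  sentence.toLowerCase().replace(/\./g, '').split(' ').map((w) => ids[w]);

console.log(encode('The bank approved my loan.'));       // [1, 2, 3, 4, 5]
console.log(encode('We sat by the bank of the river.')); // [6, 7, 8, 1, 2, 9, 1, 10]
```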

The problem: word ID 2 is just bank. But “bank” (financial institution) and “bank” (riverbank) have completely different semantic meanings. A flat ID captures none of that.

Embeddings and Vectors

Representing terms in higher dimensions lets us capture semantic meaning. Here’s a (theoretical) three-dimensional vector assignment:

| Token | Context | Embedding (x, y, z) |
| --- | --- | --- |
| the | | [0.30, 0.40, 0.20] |
| bank | financial | [0.81, 0.15, 0.72] |
| bank | river | [0.12, 0.93, 0.34] |
| approved | | [0.78, 0.20, 0.68] |
| my | | [0.40, 0.35, 0.30] |
| loan | | [0.79, 0.10, 0.70] |
| we | | [0.38, 0.55, 0.22] |
| sat | | [0.36, 0.60, 0.24] |
| by | | [0.32, 0.42, 0.21] |
| of | | [0.31, 0.41, 0.19] |
| river | | [0.14, 0.91, 0.30] |

We can use Euclidean Distance to calculate how close these terms sit in vector space:

distance = sqrt( (x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 )

bank (financial) vs loan
bank (financial) = [0.81, 0.15, 0.72]
loan             = [0.79, 0.10, 0.70]

distance = sqrt( (0.81 - 0.79)^2 + (0.15 - 0.10)^2 + (0.72 - 0.70)^2 )
         = sqrt( 0.0004 + 0.0025 + 0.0004 )
         = sqrt( 0.0033 )
         ≈ 0.057
bank (river) = [0.12, 0.93, 0.34]
river        = [0.14, 0.91, 0.30]

distance = sqrt( (0.12 - 0.14)^2 + (0.93 - 0.91)^2 + (0.34 - 0.30)^2 )
         = sqrt( 0.0004 + 0.0004 + 0.0016 )
         = sqrt( 0.0024 )
         ≈ 0.049
bank (financial) = [0.81, 0.15, 0.72]
river            = [0.14, 0.91, 0.30]

distance = sqrt( (0.81 - 0.14)^2 + (0.15 - 0.91)^2 + (0.72 - 0.30)^2 )
         = sqrt( 0.4489 + 0.5776 + 0.1764 )
         = sqrt( 1.2029 )
         ≈ 1.096

Closer points in this space mean more similar meanings. bank (financial) <-> loan at 0.057 tells us they’re semantically related (finance). bank (river) <-> river at 0.049 tells us they both relate to nature. And bank (financial) <-> river at 1.096 confirms they’re semantically distant.
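The same calculation as a reusable function (it works for vectors of any dimension, not just three):

```javascript
// Euclidean distance between two equal-length embedding vectors.
const distance = (a, b) =>
  Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));

const bankFinancial = [0.81, 0.15, 0.72];
const loan = [0.79, 0.1, 0.7];
const river = [0.14, 0.91, 0.3];

console.log(distance(bankFinancial, loan));  // ≈ 0.057 — semantically close
console.log(distance(bankFinancial, river)); // ≈ 1.1 — semantically distant
```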

Modern LLMs use contextual embeddings, which are dynamic and influenced by the entire sentence. So “bank” won’t have one fixed embedding; it gets a different value depending on the surrounding context.

Models use these word embeddings to figure out the probabilities of next available tokens. These relationships aren’t hard-coded. They’re learned during training.

But there’s more to the story. The whole thing works because of the neural architecture behind these models: the transformer architecture. It’s well suited for sequential data like text, and it processes information in parallel, making training and inference far faster than traditional models like RNNs (Recurrent Neural Networks).

Transformer Architecture (encoder-decoder)

The encoder-decoder architecture works in two stages. The encoder understands the input. The decoder generates the output, one token at a time, based on the encoder’s understanding. We’ve already covered how input is handled: tokenised and vectorised.

The encoder (a stack of layers) looks at all tokens simultaneously. This is called self-attention. It lets the model see which words matter to each other, dynamically, for every token. Self-attention computes a weighted sum of all other tokens, giving higher weight to the most relevant words.

Here’s a conceptual walkthrough using the sentence the cat sat on the mat.

Every token is transformed into three vectors: Query (Q) (what this token is looking for), Key (K) (what this token offers to others), and Value (V) (the information this token shares). For every token, its Query is compared against every other token’s Key (including itself), producing a relevance score. Softmax turns these scores into probabilities (attention weights), and these weights compute a weighted sum of Value vectors. The result: a context-aware representation of the current word, re-encoded to summarise not just itself but the context from the whole sentence.

Softmax turns a list of raw numbers into probabilities that are all positive and add up to one. In self-attention, we compute relevance scores for each word. Softmax normalises them into attention weights so the model can take a weighted average.
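A minimal softmax in code (subtracting the maximum first is a standard numerical-stability trick; it doesn't change the result):

```javascript
// Softmax: exponentiate each score, then normalise so the outputs are
// positive and sum to exactly 1.
function softmax(scores) {
  const max = Math.max(...scores); // subtract max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

console.log(softmax([2.0, 1.0, 0.1]).map((p) => p.toFixed(2)));
// → [ '0.66', '0.24', '0.10' ]
```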

Let’s look at the word sat from our sample sentence.

| Token | Query (“sat”) ⋅ Key | Raw Score | Softmax Weight (Attention) | Value Vector Contribution |
| --- | --- | --- | --- | --- |
| the | sat · the | 1.2 | 0.05 | 0.05 × V(the) |
| cat | sat · cat | 3.5 | 0.52 | 0.52 × V(cat) |
| sat | sat · sat | 2.9 | 0.28 | 0.28 × V(sat) |
| on | sat · on | 1.1 | 0.05 | 0.05 × V(on) |
| the (2nd) | sat · the | 1.0 | 0.04 | 0.04 × V(the) |
| mat | sat · mat | 1.3 | 0.06 | 0.06 × V(mat) |

(The weights are the softmax of the raw scores, so they sum to one.)

The word sat “remembers” that cat is likely the subject doing the sitting. Its new vector encodes that relationship without changing the actual tokens.
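The final step, the weighted sum of Value vectors, looks like this. The weights and two-dimensional Value vectors below are made up for illustration:

```javascript
// Attention output for one token: sum every token's Value vector,
// scaled by that token's attention weight.
function attend(weights, values) {
  const dim = values[0].length;
  const out = new Array(dim).fill(0);
  weights.forEach((w, i) => {
    for (let d = 0; d < dim; d++) out[d] += w * values[i][d];
  });
  return out;
}

const weights = [0.25, 0.5, 0.25];        // attention over 3 tokens, sums to 1
const values = [[1, 0], [0, 1], [1, 1]];  // toy 2-d Value vectors
console.log(attend(weights, values));     // [ 0.5, 0.75 ]
```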

All of this means the model can predict next words more logically and understand sentence structure better.

One more piece: transformers don’t inherently understand token order. So models bolt on positional embeddings, vectors that encode where each word appears in the input sequence. These positional vectors are added to the word embeddings before processing, helping the model infer sentence structure. (Grammar and structure are learned via attention across these positions.)

Training the model

Models are trained on vast amounts of data (books, articles, websites) using self-supervised learning: the model learns to predict the next word without manually labelled data. It starts with random weights and improves by minimising the difference between its prediction and the actual next word. Training uses backpropagation and stochastic gradient descent, updating billions of parameters. For scale: GPT-3 has 175 billion parameters and takes months to train.

When we say GPT-3 has 175 billion parameters, we mean it has that many individual numbers (weights and biases) learned during training. These are the knobs the model tunes to get better at predicting the next word.

Weights and biases

Think of a neural network as a giant machine that takes in numbers (like word embeddings) and transforms them step by step to make a prediction. Two key ingredients do most of the work: weights and biases.

Weights answer the question “How important is this input?” Imagine mixing cake ingredients: flour, sugar, eggs. You don’t add them equally. That’s exactly what weights do. They control how much influence each input has.

So if a model sees “capital of France is…”, it may assign a higher weight to “France” than to “of”, because “France” is more useful for predicting what comes next.

Technically, weights are numbers the model multiplies inputs by. Higher weight means “pay more attention.” A weight of zero means “ignore it.”

Biases adjust the result before deciding. Like compensating for an oven that always runs 10 degrees cool. In a neural network, the bias shifts the result up or down, giving the network flexibility to make predictions even when all input weights are low. Without biases, the model would be too rigid.

At each neuron (a tiny calculator), the model multiplies each input by its weight, adds them up, adds the bias, then passes the result through an activation function (e.g. ReLU):

output = activation(w1×x1 + w2×x2 + ... + wn×xn + bias)

Where:

x1, x2,... are inputs (like word embeddings)
w1, w2,... are the weights (importance)
bias is the extra tweak
activation shapes the output (e.g. ReLU = turn negatives into 0)
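Here's that formula as runnable code; the weights and inputs are invented to mirror the "France vs of" example above:

```javascript
// One artificial neuron: weighted sum of inputs, plus bias, through ReLU.
const relu = (x) => Math.max(0, x);

function neuron(inputs, weights, bias) {
  const sum = inputs.reduce((acc, x, i) => acc + x * weights[i], 0);
  return relu(sum + bias);
}

// High weight on the first input ("France"), tiny weight on the second ("of").
console.log(neuron([0.9, 0.1], [2.0, 0.1], -0.5)); // ≈ 1.31 — the neuron fires
console.log(neuron([0.9, 0.1], [-2.0, 0.1], 0));   // 0 — ReLU clips the negative sum
```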

The model doesn’t learn facts like “Paris is the capital of France” in plain text. It tweaks millions (or billions) of weights and biases so that, mathematically, when it sees “capital of France”, the neuron responsible for “Paris” lights up. Weights and biases are just numbers, but they’re what create understanding in the model.

Code Walkthrough

Now that we understand tokenisation, embeddings, attention, and transformer mechanics, let’s put it into practice. We’ll build a small-scale Transformer model using TensorFlow.js. The goal: given partial input like “The capital of France is…” or “Berlin is…”, the model should predict the most likely next words, just like a miniature LLM.

Training the model

We start by defining our vocabulary: all the words the model will recognise. This includes special tokens like <pad> (fills empty space), <unk> (unknown words), and <s> / </s> (sentence boundaries), alongside capital cities, countries, and connecting words. Each word gets a unique number via a wordToIndex map, because the model sees numbers, not words.

import * as tf from '@tensorflow/tfjs-node';

const vocab = [
  '<pad>',
  '<unk>',
  '<s>',
  '</s>',
  'the',
  'capital',
  'of',
  'is',
  'in',
  'berlin',
  'germany',
  'paris',
  'france',
  'rome',
  'italy',
  'tokyo',
  'japan',
  'madrid',
  'spain',
  'london',
  'uk',
  'athens',
  'greece',
  'vienna',
  'austria',
  'oslo',
  'norway',
  'stockholm',
  'sweden',
  'cairo',
  'egypt',
  'lisbon',
  'portugal',
];

const wordToIndex = Object.fromEntries(vocab.map((w, i) => [w, i]));
const vocabSize = vocab.length;
const maxSeqLen = 6;
const embedDim = 32;
const ffDim = 64;

We also define hyperparameters: settings that control how the model behaves. embedDim sets the size of our word vectors (32 dimensions). maxSeqLen caps sentences at 6 words. ffDim controls the hidden layer size.

Hyperparameters are settings you choose before training, like vector size (embedDim) or sequence length (maxSeqLen). The model doesn’t learn them; they shape how it learns.

We then create two embedding layers: one for the tokens themselves, another for their positions. Transformers process input all at once, so positional embeddings help the model tell whether a word is at the beginning, middle, or end of a sentence.

const createEmbedding = () =>
  tf.layers.embedding({ inputDim: vocabSize, outputDim: embedDim });

const createPositionalEmbedding = () =>
  tf.layers.embedding({ inputDim: maxSeqLen, outputDim: embedDim });

To prepare text for training, we use a padSequence function that ensures all input sequences are exactly six tokens long. We also define a oneHot function that turns word indices into one-hot vectors (arrays of zeroes with a single one at the correct word position). These tell the model which word it should have predicted.

function padSequence(seq) {
  const padded = seq.map((w) => wordToIndex[w] ?? wordToIndex['<unk>']);
  while (padded.length < maxSeqLen) padded.push(wordToIndex['<pad>']);
  return padded;
}

function oneHot(index, size) {
  return Array.from({ length: size }, (_, i) => (i === index ? 1 : 0));
}

Next: the core of the transformer. We implement a basic dot-product attention block that takes queries, keys, and values, then calculates which words should pay attention to which others. (The canonical "scaled" version also divides the scores by √d before the softmax to keep them in a stable range; we skip that here to keep the layers API simple.) Think of the model asking: “For the word I’m looking at, which other words are relevant, and how much should I care about them?” Softmax turns raw scores into probabilities that sum to one.

function attentionBlock(q, k, v) {
  // Dot-product relevance scores between every query and every key.
  const scores = tf.layers.dot({ axes: -1 }).apply([q, k]);
  // Softmax turns the scores into attention weights that sum to 1.
  const weights = tf.layers.activation({ activation: 'softmax' }).apply(scores);
  // Weighted sum of the value vectors.
  return tf.layers.dot({ axes: [2, 1] }).apply([weights, v]);
}

The encoderBlock transforms inputs into queries, keys, and values using dense layers, then runs them through the attention mechanism. After attention, we add the original input back (a residual connection) and normalise it (layer normalisation) to keep training stable. A two-layer feedforward network follows: one layer with ReLU activation (zeroes out negatives), then another dense layer to restore the original dimensions.

A feedforward layer is a regular neural network layer: inputs are multiplied by weights, a bias is added, and an activation function is applied. In Transformers, these layers process attention outputs further.

ReLU (Rectified Linear Unit) is an activation function that replaces negative numbers with zero. It helps the model focus on useful patterns and makes training faster and more stable.

function encoderBlock(input) {
  const q = tf.layers.dense({ units: embedDim }).apply(input);
  const k = tf.layers.dense({ units: embedDim }).apply(input);
  const v = tf.layers.dense({ units: embedDim }).apply(input);
  const attended = attentionBlock(q, k, v);
  const add1 = tf.layers.add().apply([input, attended]);
  const norm1 = tf.layers.layerNormalization().apply(add1);
  const ff = tf.layers.dense({ units: ffDim, activation: 'relu' }).apply(norm1);
  const ffOut = tf.layers.dense({ units: embedDim }).apply(ff);
  const add2 = tf.layers.add().apply([norm1, ffOut]);
  return tf.layers.layerNormalization().apply(add2);
}

The decoderBlock handles output generation. It has two attention stages: self-attention (the decoder attends to its own previously generated tokens) and cross-attention (it looks back at the encoder output to guide predictions). Like the encoder, it includes feedforward layers and residual connections.

function decoderBlock(input, encOutput) {
  const selfQ = tf.layers.dense({ units: embedDim }).apply(input);
  const selfK = tf.layers.dense({ units: embedDim }).apply(input);
  const selfV = tf.layers.dense({ units: embedDim }).apply(input);
  const selfAtt = attentionBlock(selfQ, selfK, selfV);
  const selfAdd = tf.layers.add().apply([input, selfAtt]);
  const selfNorm = tf.layers.layerNormalization().apply(selfAdd);

  const crossQ = tf.layers.dense({ units: embedDim }).apply(selfNorm);
  const crossK = tf.layers.dense({ units: embedDim }).apply(encOutput);
  const crossV = tf.layers.dense({ units: embedDim }).apply(encOutput);
  const crossAtt = attentionBlock(crossQ, crossK, crossV);
  const crossAdd = tf.layers.add().apply([selfNorm, crossAtt]);
  const crossNorm = tf.layers.layerNormalization().apply(crossAdd);

  const ff = tf.layers
    .dense({ units: ffDim, activation: 'relu' })
    .apply(crossNorm);
  const ffOut = tf.layers.dense({ units: embedDim }).apply(ff);
  const addOut = tf.layers.add().apply([crossNorm, ffOut]);
  return tf.layers.layerNormalization().apply(addOut);
}

For training data, we define pairs of country-capital relationships: “berlin is” should produce “the capital of germany”, and “the capital of france” should return “paris”. Each example is converted into number sequences, padded, and processed into tensors (the raw data format TensorFlow expects). Position embeddings are prepared alongside.

A tensor is a multidimensional array. Think of it as a container holding a numeric version of our sentences.

const pairs = [
  ['berlin', 'germany'],
  ['paris', 'france'],
  ['rome', 'italy'],
  ['tokyo', 'japan'],
  ['madrid', 'spain'],
  ['london', 'uk'],
  ['athens', 'greece'],
  ['vienna', 'austria'],
  ['oslo', 'norway'],
  ['stockholm', 'sweden'],
  ['cairo', 'egypt'],
  ['lisbon', 'portugal'],
];

const examples = [];

for (const [capital, country] of pairs) {
  examples.push({
    input: [capital, 'is'],
    output: ['<s>', 'the', 'capital', 'of', country, '</s>'],
  });
  examples.push({
    input: ['the', 'capital', 'of', country],
    output: ['<s>', 'is', capital, '</s>'],
  });
  examples.push({
    input: [capital],
    output: ['<s>', 'is', 'the', 'capital', 'of', country, '</s>'].slice(0, 6),
  }); // short input
}

const encoderInputs = [];
const decoderInputs = [];
const decoderLabels = [];

for (const ex of examples) {
  const enc = padSequence(ex.input);
  const decIn = padSequence(ex.output.slice(0, -1));
  const decOut = padSequence(ex.output.slice(1));
  encoderInputs.push(enc);
  decoderInputs.push(decIn);
  decoderLabels.push(decOut.map((i) => oneHot(i, vocabSize)));
}

const xEnc = tf.tensor2d(
  encoderInputs,
  [encoderInputs.length, maxSeqLen],
  'int32'
);
const xDec = tf.tensor2d(
  decoderInputs,
  [decoderInputs.length, maxSeqLen],
  'int32'
);
const y = tf.tensor3d(decoderLabels, [
  decoderLabels.length,
  maxSeqLen,
  vocabSize,
]);

const positionSeq = [...Array(maxSeqLen).keys()];
const xEncPos = tf.tensor2d(
  Array(encoderInputs.length).fill(positionSeq),
  [encoderInputs.length, maxSeqLen],
  'int32'
);
const xDecPos = tf.tensor2d(
  Array(decoderInputs.length).fill(positionSeq),
  [decoderInputs.length, maxSeqLen],
  'int32'
);

Now that our data is numerically encoded, we feed it into the model. Both sides (encoder and decoder) need to know what the tokens are (via token IDs) and where they are (via position IDs). These inputs are embedded into vector space, then processed to generate context-aware outputs.

const encInput = tf.input({ shape: [maxSeqLen], dtype: 'int32' });
const encPosInput = tf.input({ shape: [maxSeqLen], dtype: 'int32' });
const decInput = tf.input({ shape: [maxSeqLen], dtype: 'int32' });
const decPosInput = tf.input({ shape: [maxSeqLen], dtype: 'int32' });

const tokEmb = createEmbedding();
const posEmb = createPositionalEmbedding();

const encTokens = tokEmb.apply(encInput);
const encPositions = posEmb.apply(encPosInput);
const encCombined = tf.layers.add().apply([encTokens, encPositions]);
const encOutput = encoderBlock(encCombined);

const decTokens = tokEmb.apply(decInput);
const decPositions = posEmb.apply(decPosInput);
const decCombined = tf.layers.add().apply([decTokens, decPositions]);
const decOutput = decoderBlock(decCombined, encOutput);

In the code above we take a sentence pair like “berlin is” and “the capital of germany”. The input (“berlin is”) becomes token IDs ([9, 7] in our vocabulary, padded to length six) with position IDs [0, 1, 2, 3, 4, 5]. The target is prepended with <s> and appended with </s>, so “<s> the capital of germany </s>” becomes [2, 4, 5, 6, 10, 3], again with positions [0, 1, 2, 3, 4, 5]. All four inputs pass through embedding layers, letting the encoder process the input and the decoder generate output step by step.

At the end of the decoder, a dense layer with softmax turns output into probabilities over every word in the vocabulary. The model picks the word with the highest probability at each position.

const logits = tf.layers
  .dense({ units: vocabSize, activation: 'softmax' })
  .apply(decOutput);
const model = tf.model({
  inputs: [encInput, encPosInput, decInput, decPosInput],
  outputs: logits,
});

Once the decoder has processed the target sequence and attended to the encoder’s representation, it produces context-aware vectors for each output position. These aren’t words yet; they’re high-dimensional summaries of what the model thinks should come next. The final dense layer with softmax transforms each vector into a probability distribution over the entire vocabulary. The highest-probability word wins at each position. This is where numerical computations turn back into human language.
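Picking the winner at a single position is just an argmax over that position's probability row; the vocabulary and probabilities here are a toy subset:

```javascript
// Map a probability distribution over the vocabulary back to a word
// by picking the index with the highest probability.
const indexToWord = ['<pad>', 'paris', 'berlin', 'rome'];

const argmax = (probs) =>
  probs.reduce((best, p, i) => (p > probs[best] ? i : best), 0);

const probs = [0.01, 0.91, 0.05, 0.03]; // softmax output for one position
console.log(indexToWord[argmax(probs)]); // "paris"
```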

We compile the model using the Adam optimiser and categorical cross-entropy as the loss function (standard for multi-class classification like picking the correct next word). We train for 300 epochs, meaning the model processes the entire dataset 300 times. With our small dataset, that’s enough for it to memorise the relationships.

model.compile({
  optimizer: 'adam',
  loss: 'categoricalCrossentropy',
  metrics: ['accuracy'],
});

console.log('🚀 Training...');
await model.fit([xEnc, xEncPos, xDec, xDecPos], y, {
  epochs: 300,
  verbose: 0,
  callbacks: {
    onEpochEnd: (epoch, logs) => {
      if ((epoch + 1) % 100 === 0) {
        console.log(
          `📊 Epoch ${epoch + 1}: Loss = ${logs.loss.toFixed(
            4
          )} | Accuracy = ${(logs.acc * 100).toFixed(2)}%`
        );
      }
    },
  },
});

After training, we save the model to disk so we can load it later for inference.

await model.save('file://./saved-transformer-model');
console.log('💾 Model saved to ./saved-transformer-model/');

A note on parameters

GPT-3 has 175 billion parameters. Our Transformer has around 20,000 trainable parameters. Here’s where they come from:

The token embedding layer turns each of the 33 vocabulary entries into a 32-dimensional vector: 33 × 32 = 1,056 parameters. The positional embedding layer adds another 6 × 32 = 192. The encoder block (Q/K/V projections, a feedforward network, and two layer norms) contributes roughly 7,500. The decoder block (heavier, because it handles both self-attention and cross-attention) brings in around 10,700. The final dense output layer adds 32 × 33 + 33 = 1,089 more.

So: ~1.1k (embeddings) + 192 (positions) + ~7.5k (encoder) + ~10.7k (decoder) + ~1.1k (final output) ≈ 20,500 total parameters.

Small compared to GPT-3 or Gemini, but built from the same components. We just scaled it down to something trainable on a laptop in seconds rather than months.

Using the model

With the model trained, we can put it to work.

We redefine the vocabulary and mappings (the model only understands numbers, so we need to speak its language at inference time too). Without these mappings, the model would produce IDs we can’t interpret and receive inputs it wasn’t trained to handle.

import * as tf from '@tensorflow/tfjs-node';

const vocab = [
  // same vocab as in the file used for training
];

The input (something like “berlin” or “paris is”) is converted into token IDs using the vocabulary. Shorter inputs get padded with <pad> to reach the expected length of six words. Position indices are assigned to help the model understand word order.

const wordToIndex = Object.fromEntries(vocab.map((w, i) => [w, i]));
const indexToWord = Object.fromEntries(vocab.map((w, i) => [i, w]));

const maxSeqLen = 6;
const topK = 3;

function padSequence(seq) {
  const padded = seq.map((w) => wordToIndex[w] ?? wordToIndex['<unk>']);
  while (padded.length < maxSeqLen) padded.push(wordToIndex['<pad>']);
  return padded;
}

function positionSequence() {
  return [...Array(maxSeqLen).keys()];
}

We load the trained model from disk. The decoder generates output starting from the <s> (start) token. At every step, it looks at what it’s generated so far, attends to the encoder’s output, and predicts the next word.

Rather than always picking the single most likely word, we use top-k sampling: take the top three most probable words and pick one based on their relative likelihoods. This prevents repetitive, rigid output and adds controlled creativity.

function topKSample(probArray, k = 3) {
  const indexed = probArray.map((p, i) => ({ i, p }));
  const top = indexed.sort((a, b) => b.p - a.p).slice(0, k);
  const total = top.reduce((sum, item) => sum + item.p, 0);
  const rnd = Math.random() * total;

  let cumulative = 0;
  for (const { i, p } of top) {
    cumulative += p;
    if (rnd < cumulative) return i;
  }
  return top[0].i;
}

const model = await tf.loadLayersModel(
  'file://./saved-transformer-model/model.json'
);
console.log('✅ Model loaded.');

At each decoding step, we log the top five candidate words and their probabilities so we can see the model’s reasoning. The process repeats until the model produces a stop token (</s>) or hits the maximum length.

async function generate(inputWords) {
  const encIds = padSequence(inputWords);
  const encPos = positionSequence();

  const encInput = tf.tensor2d([encIds], [1, maxSeqLen], 'int32');
  const encPosTensor = tf.tensor2d([encPos], [1, maxSeqLen], 'int32');

  let outputWords = ['<s>'];

  for (let i = 0; i < maxSeqLen - 1; i++) {
    const decIds = padSequence(outputWords);
    const decPos = positionSequence();

    const decInput = tf.tensor2d([decIds], [1, maxSeqLen], 'int32');
    const decPosTensor = tf.tensor2d([decPos], [1, maxSeqLen], 'int32');

    const prediction = model.predict([
      encInput,
      encPosTensor,
      decInput,
      decPosTensor,
    ]);
    const probs = await prediction.array();

    const topProbs = probs[0][i]
      .map((p, idx) => ({ word: indexToWord[idx], prob: p }))
      .sort((a, b) => b.prob - a.prob)
      .slice(0, 5);

    const contextText = outputWords
      .map((tok) => (tok === '<s>' ? `<s = ${inputWords.join(' ')}>` : tok))
      .join(' ');

    console.log(`\n🔢 Step ${i + 1} - context: "${contextText}"`);

    console.table(
      topProbs.map((t) => ({
        Token: t.word,
        Probability: (t.prob * 100).toFixed(2) + '%',
      }))
    );

    const nextIdx = topKSample(probs[0][i], topK);
    const nextWord = indexToWord[nextIdx];

    if (nextWord === '</s>' || nextWord === '<pad>') break;
    outputWords.push(nextWord);
  }

  console.log(`🧪 Input: "${inputWords.join(' ')}"`);
  console.log(`💬 Output: ${outputWords.slice(1).join(' ')}`);
}

We print both the input and generated output, showing how the model constructs full responses from minimal input:

await generate(['paris', 'is']);
await generate(['the', 'capital', 'of', 'sweden']);
await generate(['berlin']);
await generate(['oslo', 'is']);
await generate(['the', 'capital', 'of', 'portugal']);


Conclusion

There you have it: a hands-on walkthrough of how a transformer model works, built from scratch with TensorFlow.js. Our model only maps country names to capitals using a few dozen training examples. But the core mechanics (tokenisation, embeddings, attention, encoder-decoder architecture) are the same ones powering the massive LLMs behind today’s AI.

Now scale it up. Instead of a dozen sentences, imagine training on billions of text samples from books, articles, websites, and forums. Instead of a few tens of thousands of weights, picture hundreds of billions of parameters, all fine-tuned through endless iterations. That’s the world of GPT and Gemini.

What we built is a miniature replica of that machinery. The difference is size, data, and scale, not principle. GPT (Generative Pretrained Transformer) is a decoder-only architecture: it takes input tokens and predicts what comes next, one token at a time. (During training it predicts the entire output sequence in parallel, but during generation it’s autoregressive.) There’s no separate encoder; it models language purely through self-attention over previously generated tokens. This makes it excellent for continuation tasks like chat, writing, or code completion.

Full encoder-decoder models, like the one we built here, dedicate a separate stack to understanding the input before any generation happens. That design is a natural fit for sequence-to-sequence tasks with strong comprehension demands, like summarisation and translation.

In both cases, the foundational ideas are remarkably similar to what we’ve explored here. Transformers don’t understand language the way we do. They calculate relationships between tokens, capture patterns through training, and optimise billions of small weight updates to get better at guessing what comes next. Yet through this brute-force statistical pattern recognition, they can generate poetry, explain quantum mechanics, and write working code. Hopefully this journey into building a tiny Transformer has peeled back the curtain a little, showing that behind the seemingly magical capabilities of LLMs lies a deeply logical and elegant structure.