
How Transformers and LLMs Actually Work - A Developer's Guide with Code


AI dominates the headlines. LLMs occupy developer minds. It’s scary and exciting at the same time.

I’ve been tinkering with LLMs for a while, shipping projects that use them in production and delivering workshops focused on how they work. But I wanted to go deeper. I wanted to understand how these models are actually built.

So I built and trained my own. A tiny one, granted (I don’t have a data centre in the garden), but structurally identical to the real thing. To make the ideas tangible, I built a minimal encoder-decoder model that learns country-capital relationships.

Large Language Models

Let’s strip this back. Any large language model, whether it’s Gemini or ChatGPT, is fundamentally a highly sophisticated next-token prediction engine. Given any input, the model predicts the most probable next token. Then the next one. Then the next one. That’s it.

What is a token? A small piece of text, like a word, part of a word, or sometimes even just a character, that LLMs read and process one at a time. Think of tokens as units of meaning.
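As a rough sketch, here's a toy word-level tokenizer. (Real LLMs use subword tokenizers such as BPE, which can split a rare word into several smaller tokens; this whitespace version is purely illustrative.)

```javascript
// A toy word-level tokenizer. Real LLMs use subword schemes (e.g. BPE),
// which can split "unbelievable" into pieces like "un", "believ", "able".
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z\s]/g, '') // strip punctuation for simplicity
    .split(/\s+/)
    .filter(Boolean);
}

console.log(tokenize('The bank approved my loan.'));
// → [ 'the', 'bank', 'approved', 'my', 'loan' ]
```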

The model never actually sees text. It sees numbers: numerical representations of tokens. It performs mathematical calculations on those numbers, building on patterns from its vast training data to predict what comes next, one token at a time.

Token generation happens sequentially. The model looks at the previous tokens, predicts the most likely next one, and repeats this step until it finishes the output.
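The loop itself is simple. In this sketch, `predictNext` is a hypothetical stand-in for the model: it just replays a canned continuation so the autoregressive flow is visible.

```javascript
// `predictNext` is a stand-in for the real model: it replays a canned
// continuation one token at a time, ending with a stop token.
const canned = ['the', 'capital', 'of', 'france', 'is', 'paris', '</s>'];
const predictNext = (tokens) => canned[tokens.length - 1] ?? '</s>';

function generate(prompt, maxTokens = 10) {
  const tokens = [prompt];
  while (tokens.length < maxTokens) {
    const next = predictNext(tokens); // predict the most likely next token
    if (next === '</s>') break;       // stop token ends generation
    tokens.push(next);                // append and repeat
  }
  return tokens.join(' ');
}

console.log(generate('<s>'));
// → "<s> the capital of france is paris"
```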

How do models understand language?

Word embeddings are the key. They’re high-dimensional vectors (arrays of numbers) where each token is represented as a point in multi-dimensional space. These spaces can span 1,000+ dimensions, which is nearly impossible for us to visualise.

Why does this matter? Consider the simplest approach to encoding two sentences: assigning an ID to each word.

- The bank approved my loan.
- We sat by the bank of the river.

| Token | ID |
| --- | --- |
| the | 1 |
| bank | 2 |
| approved | 3 |
| my | 4 |
| loan | 5 |
| we | 6 |
| sat | 7 |
| by | 8 |
| of | 9 |
| river | 10 |

Sentence 1 then becomes: "The bank approved my loan." → `[1, 2, 3, 4, 5]`
Sentence 2 becomes: "We sat by the bank of the river." → `[6, 7, 8, 1, 2, 9, 1, 10]`
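The same encoding in code, using the ID table above:

```javascript
// Flat word-to-ID encoding: every occurrence of a word gets the same ID,
// regardless of meaning.
const ids = {
  the: 1, bank: 2, approved: 3, my: 4, loan: 5,
  we: 6, sat: 7, by: 8, of: 9, river: 10,
};

const encode = (sentence) =>
  sentence.toLowerCase().replace(/\./g, '').split(' ').map((w) => ids[w]);

console.log(encode('The bank approved my loan.'));       // [1, 2, 3, 4, 5]
console.log(encode('We sat by the bank of the river.')); // [6, 7, 8, 1, 2, 9, 1, 10]
```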

The problem: word ID 2 is just bank. But “bank” (financial institution) and “bank” (riverbank) have completely different semantic meanings. A flat ID captures none of that.

Embeddings and Vectors

Representing terms in higher dimensions lets us capture semantic meaning. Here’s a (theoretical) three-dimensional vector assignment:

| Token | Context | Embedding (x, y, z) |
| --- | --- | --- |
| the | | [0.30, 0.40, 0.20] |
| bank | financial | [0.81, 0.15, 0.72] |
| bank | river | [0.12, 0.93, 0.34] |
| approved | | [0.78, 0.20, 0.68] |
| my | | [0.40, 0.35, 0.30] |
| loan | | [0.79, 0.10, 0.70] |
| we | | [0.38, 0.55, 0.22] |
| sat | | [0.36, 0.60, 0.24] |
| by | | [0.32, 0.42, 0.21] |
| of | | [0.31, 0.41, 0.19] |
| river | | [0.14, 0.91, 0.30] |

We can use Euclidean Distance to calculate how close these terms sit in vector space:

distance = sqrt( (x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 )

bank (financial) vs loan
bank (financial) = [0.81, 0.15, 0.72]
loan             = [0.79, 0.10, 0.70]

distance = sqrt( (0.81 - 0.79)^2 + (0.15 - 0.10)^2 + (0.72 - 0.70)^2 )
         = sqrt( 0.0004 + 0.0025 + 0.0004 )
         = sqrt( 0.0033 )
         ≈ 0.057
bank (river) = [0.12, 0.93, 0.34]
river        = [0.14, 0.91, 0.30]

distance = sqrt( (0.12 - 0.14)^2 + (0.93 - 0.91)^2 + (0.34 - 0.30)^2 )
         = sqrt( 0.0004 + 0.0004 + 0.0016 )
         = sqrt( 0.0024 )
         ≈ 0.049
bank (financial) = [0.81, 0.15, 0.72]
river            = [0.14, 0.91, 0.30]

distance = sqrt( (0.81 - 0.14)^2 + (0.15 - 0.91)^2 + (0.72 - 0.30)^2 )
         = sqrt( 0.4489 + 0.5776 + 0.1764 )
         = sqrt( 1.2029 )
         ≈ 1.096

Closer points in this space mean more similar meanings. bank (financial) <-> loan at 0.057 tells us they’re semantically related (finance). bank (river) <-> river at 0.049 tells us they both relate to nature. And bank (financial) <-> river at 1.096 confirms they’re semantically distant.
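The same calculation as a reusable function (it works for vectors of any dimension, not just three):

```javascript
// Euclidean distance between two equal-length embedding vectors.
const distance = (a, b) =>
  Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));

const bankFinancial = [0.81, 0.15, 0.72];
const loan = [0.79, 0.1, 0.7];
const river = [0.14, 0.91, 0.3];

console.log(distance(bankFinancial, loan));  // ≈ 0.057 — semantically close
console.log(distance(bankFinancial, river)); // ≈ 1.1 — semantically distant
```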

Modern LLMs use contextual embeddings, which are dynamic and influenced by the entire sentence. So “bank” won’t have one fixed embedding; it gets a different value depending on the surrounding context.

Models use these word embeddings to figure out the probabilities of next available tokens. These relationships aren’t hard-coded. They’re learned during training.

But there’s more to the story. The whole thing works because of the neural architecture behind these models: the transformer architecture. It’s well suited for sequential data like text, and it processes information in parallel, making training and inference far faster than traditional models like RNNs (Recurrent Neural Networks).

Transformer Architecture (encoder-decoder)

The encoder-decoder architecture works in two stages. The encoder understands the input. The decoder generates the output, one token at a time, based on the encoder’s understanding. We’ve already covered how input is handled: tokenised and vectorised.

The encoder (a stack of layers) looks at all tokens simultaneously. This is called self-attention. It lets the model see which words matter to each other, dynamically, for every token. Self-attention computes a weighted sum of all other tokens, giving higher weight to the most relevant words.

Here’s a conceptual walkthrough using the sentence the cat sat on the mat.

Every token is transformed into three vectors: Query (Q) (what this token is looking for), Key (K) (what this token offers to others), and Value (V) (the information this token shares). For every token, its Query is compared against every other token’s Key (including itself), producing a relevance score. Softmax turns these scores into probabilities (attention weights), and these weights compute a weighted sum of Value vectors. The result: a context-aware representation of the current word, re-encoded to summarise not just itself but the context from the whole sentence.

Softmax turns a list of raw numbers into probabilities that are all positive and add up to one. In self-attention, we compute relevance scores for each word. Softmax normalises them into attention weights so the model can take a weighted average.
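A minimal softmax in code (subtracting the maximum first is a standard numerical-stability trick; it doesn't change the result):

```javascript
// Softmax: exponentiate each score, then normalise so the outputs are
// positive and sum to exactly 1.
function softmax(scores) {
  const max = Math.max(...scores); // subtract max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

console.log(softmax([2.0, 1.0, 0.1]).map((p) => p.toFixed(2)));
// → [ '0.66', '0.24', '0.10' ]
```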

Let’s look at the word sat from our sample sentence.

| Token | Query (“sat”) ⋅ Key | Raw Score | Softmax Weight (Attention) | Value Vector Contribution |
| --- | --- | --- | --- | --- |
| the | sat · the | 1.2 | 0.05 | 0.05 × V(the) |
| cat | sat · cat | 3.5 | 0.52 | 0.52 × V(cat) |
| sat | sat · sat | 2.9 | 0.28 | 0.28 × V(sat) |
| on | sat · on | 1.1 | 0.05 | 0.05 × V(on) |
| the (2nd) | sat · the | 1.0 | 0.04 | 0.04 × V(the) |
| mat | sat · mat | 1.3 | 0.06 | 0.06 × V(mat) |

(The weights are the softmax of the raw scores, so they sum to one.)

The word sat “remembers” that cat is likely the subject doing the sitting. Its new vector encodes that relationship without changing the actual tokens.
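The final step, the weighted sum of Value vectors, looks like this. The weights and two-dimensional Value vectors below are made up for illustration:

```javascript
// Attention output for one token: sum every token's Value vector,
// scaled by that token's attention weight.
function attend(weights, values) {
  const dim = values[0].length;
  const out = new Array(dim).fill(0);
  weights.forEach((w, i) => {
    for (let d = 0; d < dim; d++) out[d] += w * values[i][d];
  });
  return out;
}

const weights = [0.25, 0.5, 0.25];        // attention over 3 tokens, sums to 1
const values = [[1, 0], [0, 1], [1, 1]];  // toy 2-d Value vectors
console.log(attend(weights, values));     // [ 0.5, 0.75 ]
```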

All of this means the model can predict next words more logically and understand sentence structure better.

One more piece: transformers don’t inherently understand token order. So models bolt on positional embeddings, vectors that encode where each word appears in the input sequence. These positional vectors are added to the word embeddings before processing, helping the model infer sentence structure. (Grammar and structure are learned via attention across these positions.)

Training the model

Models are trained on vast amounts of data (books, articles, websites) using self-supervised learning: the model learns to predict the next word without manually labelled data. It starts with random weights and improves by minimising the difference between its prediction and the actual next word. Training uses backpropagation and stochastic gradient descent, updating billions of parameters. For scale: GPT-3 has 175 billion parameters and takes months to train.

When we say GPT-3 has 175 billion parameters, we mean it has that many individual numbers (weights and biases) learned during training. These are the knobs the model tunes to get better at predicting the next word.

Weights and biases

Think of a neural network as a giant machine that takes in numbers (like word embeddings) and transforms them step by step to make a prediction. Two key ingredients do most of the work: weights and biases.

Weights answer the question “How important is this input?” Imagine mixing cake ingredients: flour, sugar, eggs. You don’t add them equally. That’s exactly what weights do. They control how much influence each input has.

So if a model sees “capital of France is…”, it may assign a higher weight to “France” than to “of”, because “France” is more useful for predicting what comes next.

Technically, weights are numbers the model multiplies inputs by. Higher weight means “pay more attention.” A weight of zero means “ignore it.”

Biases adjust the result before deciding. Like compensating for an oven that always runs 10 degrees cool. In a neural network, the bias shifts the result up or down, giving the network flexibility to make predictions even when all input weights are low. Without biases, the model would be too rigid.

At each neuron (a tiny calculator), the model multiplies each input by its weight, adds them up, adds the bias, then passes the result through an activation function (e.g. ReLU):

output = activation(w1×x1 + w2×x2 + ... + wn×xn + bias)

Where:

x1, x2,... are inputs (like word embeddings)
w1, w2,... are the weights (importance)
bias is the extra tweak
activation shapes the output (e.g. ReLU = turn negatives into 0)
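Here's that formula as runnable code; the weights and inputs are invented to mirror the "France vs of" example above:

```javascript
// One artificial neuron: weighted sum of inputs, plus bias, through ReLU.
const relu = (x) => Math.max(0, x);

function neuron(inputs, weights, bias) {
  const sum = inputs.reduce((acc, x, i) => acc + x * weights[i], 0);
  return relu(sum + bias);
}

// High weight on the first input ("France"), tiny weight on the second ("of").
console.log(neuron([0.9, 0.1], [2.0, 0.1], -0.5)); // ≈ 1.31 — the neuron fires
console.log(neuron([0.9, 0.1], [-2.0, 0.1], 0));   // 0 — ReLU clips the negative sum
```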

The model doesn’t learn facts like “Paris is the capital of France” in plain text. It tweaks millions (or billions) of weights and biases so that, mathematically, when it sees “capital of France”, the neuron responsible for “Paris” lights up. Weights and biases are just numbers, but they’re what create understanding in the model.

Code Walkthrough

Now that we understand tokenisation, embeddings, attention, and transformer mechanics, let’s put it into practice. We’ll build a small-scale Transformer model using TensorFlow.js. The goal: given partial input like “The capital of France is…” or “Berlin is…”, the model should predict the most likely next words, just like a miniature LLM.

Training the model

We start by defining our vocabulary: all the words the model will recognise. This includes special tokens like <pad> (fills empty space), <unk> (unknown words), and <s> / </s> (sentence boundaries), alongside capital cities, countries, and connecting words. Each word gets a unique number via a wordToIndex map, because the model sees numbers, not words.

import * as tf from '@tensorflow/tfjs-node';

const vocab = [
  '<pad>',
  '<unk>',
  '<s>',
  '</s>',
  'the',
  'capital',
  'of',
  'is',
  'in',
  'berlin',
  'germany',
  'paris',
  'france',
  'rome',
  'italy',
  'tokyo',
  'japan',
  'madrid',
  'spain',
  'london',
  'uk',
  'athens',
  'greece',
  'vienna',
  'austria',
  'oslo',
  'norway',
  'stockholm',
  'sweden',
  'cairo',
  'egypt',
  'lisbon',
  'portugal',
];

const wordToIndex = Object.fromEntries(vocab.map((w, i) => [w, i]));
const vocabSize = vocab.length;
const maxSeqLen = 6;
const embedDim = 32;
const ffDim = 64;

We also define hyperparameters: settings that control how the model behaves. embedDim sets the size of our word vectors (32 dimensions). maxSeqLen caps sentences at 6 words. ffDim controls the hidden layer size.

Hyperparameters are settings you choose before training, like vector size (embedDim) or sequence length (maxSeqLen). The model doesn’t learn them; they shape how it learns.

We then create two embedding layers: one for the tokens themselves, another for their positions. Transformers process input all at once, so positional embeddings help the model tell whether a word is at the beginning, middle, or end of a sentence.

const createEmbedding = () =>
  tf.layers.embedding({ inputDim: vocabSize, outputDim: embedDim });

const createPositionalEmbedding = () =>
  tf.layers.embedding({ inputDim: maxSeqLen, outputDim: embedDim });

To prepare text for training, we use a padSequence function that ensures all input sequences are exactly six tokens long. We also define a oneHot function that turns word indices into one-hot vectors (arrays of zeroes with a single one at the correct word position). These tell the model which word it should have predicted.

function padSequence(seq) {
  const padded = seq.map((w) => wordToIndex[w] ?? wordToIndex['<unk>']);
  while (padded.length < maxSeqLen) padded.push(wordToIndex['<pad>']);
  return padded;
}

function oneHot(index, size) {
  return Array.from({ length: size }, (_, i) => (i === index ? 1 : 0));
}

Next: the core of the transformer. We implement a basic dot-product attention block that takes queries, keys, and values, then calculates which words should pay attention to which others. (The canonical "scaled" version also divides the scores by √d before the softmax to keep them in a stable range; we skip that here to keep the layers API simple.) Think of the model asking: “For the word I’m looking at, which other words are relevant, and how much should I care about them?” Softmax turns raw scores into probabilities that sum to one.

function attentionBlock(q, k, v) {
  // Dot-product relevance scores between every query and every key.
  const scores = tf.layers.dot({ axes: -1 }).apply([q, k]);
  // Softmax turns the scores into attention weights that sum to 1.
  const weights = tf.layers.activation({ activation: 'softmax' }).apply(scores);
  // Weighted sum of the value vectors.
  return tf.layers.dot({ axes: [2, 1] }).apply([weights, v]);
}

The encoderBlock transforms inputs into queries, keys, and values using dense layers, then runs them through the attention mechanism. After attention, we add the original input back (a residual connection) and normalise it (layer normalisation) to keep training stable. A two-layer feedforward network follows: one layer with ReLU activation (zeroes out negatives), then another dense layer to restore the original dimensions.

A feedforward layer is a regular neural network layer: inputs are multiplied by weights, a bias is added, and an activation function is applied. In Transformers, these layers process attention outputs further.

ReLU (Rectified Linear Unit) is an activation function that replaces negative numbers with zero. It helps the model focus on useful patterns and makes training faster and more stable.

function encoderBlock(input) {
  const q = tf.layers.dense({ units: embedDim }).apply(input);
  const k = tf.layers.dense({ units: embedDim }).apply(input);
  const v = tf.layers.dense({ units: embedDim }).apply(input);
  const attended = attentionBlock(q, k, v);
  const add1 = tf.layers.add().apply([input, attended]);
  const norm1 = tf.layers.layerNormalization().apply(add1);
  const ff = tf.layers.dense({ units: ffDim, activation: 'relu' }).apply(norm1);
  const ffOut = tf.layers.dense({ units: embedDim }).apply(ff);
  const add2 = tf.layers.add().apply([norm1, ffOut]);
  return tf.layers.layerNormalization().apply(add2);
}

The decoderBlock handles output generation. It has two attention stages: self-attention (the decoder attends to its own previously generated tokens) and cross-attention (it looks back at the encoder output to guide predictions). Like the encoder, it includes feedforward layers and residual connections.

function decoderBlock(input, encOutput) {
  const selfQ = tf.layers.dense({ units: embedDim }).apply(input);
  const selfK = tf.layers.dense({ units: embedDim }).apply(input);
  const selfV = tf.layers.dense({ units: embedDim }).apply(input);
  const selfAtt = attentionBlock(selfQ, selfK, selfV);
  const selfAdd = tf.layers.add().apply([input, selfAtt]);
  const selfNorm = tf.layers.layerNormalization().apply(selfAdd);

  const crossQ = tf.layers.dense({ units: embedDim }).apply(selfNorm);
  const crossK = tf.layers.dense({ units: embedDim }).apply(encOutput);
  const crossV = tf.layers.dense({ units: embedDim }).apply(encOutput);
  const crossAtt = attentionBlock(crossQ, crossK, crossV);
  const crossAdd = tf.layers.add().apply([selfNorm, crossAtt]);
  const crossNorm = tf.layers.layerNormalization().apply(crossAdd);

  const ff = tf.layers
    .dense({ units: ffDim, activation: 'relu' })
    .apply(crossNorm);
  const ffOut = tf.layers.dense({ units: embedDim }).apply(ff);
  const addOut = tf.layers.add().apply([crossNorm, ffOut]);
  return tf.layers.layerNormalization().apply(addOut);
}

For training data, we define pairs of country-capital relationships: “berlin is” should produce “the capital of germany”, and “the capital of france” should return “paris”. Each example is converted into number sequences, padded, and processed into tensors (the raw data format TensorFlow expects). Position embeddings are prepared alongside.

A tensor is a multidimensional array. Think of it as a container holding a numeric version of our sentences.

const pairs = [
  ['berlin', 'germany'],
  ['paris', 'france'],
  ['rome', 'italy'],
  ['tokyo', 'japan'],
  ['madrid', 'spain'],
  ['london', 'uk'],
  ['athens', 'greece'],
  ['vienna', 'austria'],
  ['oslo', 'norway'],
  ['stockholm', 'sweden'],
  ['cairo', 'egypt'],
  ['lisbon', 'portugal'],
];

const examples = [];

for (const [capital, country] of pairs) {
  examples.push({
    input: [capital, 'is'],
    output: ['<s>', 'the', 'capital', 'of', country, '</s>'],
  });
  examples.push({
    input: ['the', 'capital', 'of', country],
    output: ['<s>', 'is', capital, '</s>'],
  });
  examples.push({
    input: [capital],
    output: ['<s>', 'is', 'the', 'capital', 'of', country, '</s>'].slice(0, 6),
  }); // short input
}

const encoderInputs = [];
const decoderInputs = [];
const decoderLabels = [];

for (const ex of examples) {
  const enc = padSequence(ex.input);
  const decIn = padSequence(ex.output.slice(0, -1));
  const decOut = padSequence(ex.output.slice(1));
  encoderInputs.push(enc);
  decoderInputs.push(decIn);
  decoderLabels.push(decOut.map((i) => oneHot(i, vocabSize)));
}

const xEnc = tf.tensor2d(
  encoderInputs,
  [encoderInputs.length, maxSeqLen],
  'int32'
);
const xDec = tf.tensor2d(
  decoderInputs,
  [decoderInputs.length, maxSeqLen],
  'int32'
);
const y = tf.tensor3d(decoderLabels, [
  decoderLabels.length,
  maxSeqLen,
  vocabSize,
]);

const positionSeq = [...Array(maxSeqLen).keys()];
const xEncPos = tf.tensor2d(
  Array(encoderInputs.length).fill(positionSeq),
  [encoderInputs.length, maxSeqLen],
  'int32'
);
const xDecPos = tf.tensor2d(
  Array(decoderInputs.length).fill(positionSeq),
  [decoderInputs.length, maxSeqLen],
  'int32'
);

Now that our data is numerically encoded, we feed it into the model. Both sides (encoder and decoder) need to know what the tokens are (via token IDs) and where they are (via position IDs). These inputs are embedded into vector space, then processed to generate context-aware outputs.

const encInput = tf.input({ shape: [maxSeqLen], dtype: 'int32' });
const encPosInput = tf.input({ shape: [maxSeqLen], dtype: 'int32' });
const decInput = tf.input({ shape: [maxSeqLen], dtype: 'int32' });
const decPosInput = tf.input({ shape: [maxSeqLen], dtype: 'int32' });

const tokEmb = createEmbedding();
const posEmb = createPositionalEmbedding();

const encTokens = tokEmb.apply(encInput);
const encPositions = posEmb.apply(encPosInput);
const encCombined = tf.layers.add().apply([encTokens, encPositions]);
const encOutput = encoderBlock(encCombined);

const decTokens = tokEmb.apply(decInput);
const decPositions = posEmb.apply(decPosInput);
const decCombined = tf.layers.add().apply([decTokens, decPositions]);
const decOutput = decoderBlock(decCombined, encOutput);

In the code above we take a sentence pair like “berlin is” and “the capital of germany”. The input (“berlin is”) becomes token IDs ([9, 7] in our vocabulary, padded to length six) with position IDs [0, 1, 2, 3, 4, 5]. The target is prepended with <s> and appended with </s>, so “<s> the capital of germany </s>” becomes [2, 4, 5, 6, 10, 3], again with positions [0, 1, 2, 3, 4, 5]. All four inputs pass through embedding layers, letting the encoder process the input and the decoder generate output step by step.

At the end of the decoder, a dense layer with softmax turns output into probabilities over every word in the vocabulary. The model picks the word with the highest probability at each position.

const logits = tf.layers
  .dense({ units: vocabSize, activation: 'softmax' })
  .apply(decOutput);
const model = tf.model({
  inputs: [encInput, encPosInput, decInput, decPosInput],
  outputs: logits,
});

Once the decoder has processed the target sequence and attended to the encoder’s representation, it produces context-aware vectors for each output position. These aren’t words yet; they’re high-dimensional summaries of what the model thinks should come next. The final dense layer with softmax transforms each vector into a probability distribution over the entire vocabulary. The highest-probability word wins at each position. This is where numerical computations turn back into human language.
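Picking the winner at a single position is just an argmax over that position's probability row; the vocabulary and probabilities here are a toy subset:

```javascript
// Map a probability distribution over the vocabulary back to a word
// by picking the index with the highest probability.
const indexToWord = ['<pad>', 'paris', 'berlin', 'rome'];

const argmax = (probs) =>
  probs.reduce((best, p, i) => (p > probs[best] ? i : best), 0);

const probs = [0.01, 0.91, 0.05, 0.03]; // softmax output for one position
console.log(indexToWord[argmax(probs)]); // "paris"
```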

We compile the model using the Adam optimiser and categorical cross-entropy as the loss function (standard for multi-class classification like picking the correct next word). We train for 300 epochs, meaning the model processes the entire dataset 300 times. With our small dataset, that’s enough for it to memorise the relationships.

model.compile({
  optimizer: 'adam',
  loss: 'categoricalCrossentropy',
  metrics: ['accuracy'],
});

console.log('🚀 Training...');
await model.fit([xEnc, xEncPos, xDec, xDecPos], y, {
  epochs: 300,
  verbose: 0,
  callbacks: {
    onEpochEnd: (epoch, logs) => {
      if ((epoch + 1) % 100 === 0) {
        console.log(
          `📊 Epoch ${epoch + 1}: Loss = ${logs.loss.toFixed(
            4
          )} | Accuracy = ${(logs.acc * 100).toFixed(2)}%`
        );
      }
    },
  },
});

After training, we save the model to disk so we can load it later for inference.

await model.save('file://./saved-transformer-model');
console.log('💾 Model saved to ./saved-transformer-model/');

A note on parameters

GPT-3 has 175 billion parameters. Our Transformer has around 20,000 trainable parameters. Here’s where they come from:

The token embedding layer turns each of the 33 vocabulary entries into a 32-dimensional vector: 33 × 32 = 1,056 parameters. The positional embedding layer adds another 6 × 32 = 192. The encoder block (Q/K/V projections, a feedforward network, and two layer norms) contributes roughly 7,500. The decoder block (heavier, because it handles both self-attention and cross-attention) brings in around 10,700. The final dense output layer adds 32 × 33 + 33 = 1,089 more.

So: ~1.1k (embeddings) + 192 (positions) + ~7.5k (encoder) + ~10.7k (decoder) + ~1.1k (final output) ≈ 20,500 total parameters.

Small compared to GPT-3 or Gemini, but built from the same components. We just scaled it down to something trainable on a laptop in seconds rather than months.

Using the model

With the model trained, we can put it to work.

We redefine the vocabulary and mappings (the model only understands numbers, so we need to speak its language at inference time too). Without these mappings, the model would produce IDs we can’t interpret and receive inputs it wasn’t trained to handle.

import * as tf from '@tensorflow/tfjs-node';

const vocab = [
  // same vocab as in the file used for training
];

The input (something like “berlin” or “paris is”) is converted into token IDs using the vocabulary. Shorter inputs get padded with <pad> to reach the expected length of six words. Position indices are assigned to help the model understand word order.

const wordToIndex = Object.fromEntries(vocab.map((w, i) => [w, i]));
const indexToWord = Object.fromEntries(vocab.map((w, i) => [i, w]));

const maxSeqLen = 6;
const topK = 3;

function padSequence(seq) {
  const padded = seq.map((w) => wordToIndex[w] ?? wordToIndex['<unk>']);
  while (padded.length < maxSeqLen) padded.push(wordToIndex['<pad>']);
  return padded;
}

function positionSequence() {
  return [...Array(maxSeqLen).keys()];
}

We load the trained model from disk. The decoder generates output starting from the <s> (start) token. At every step, it looks at what it’s generated so far, attends to the encoder’s output, and predicts the next word.

Rather than always picking the single most likely word, we use top-k sampling: take the top three most probable words and pick one based on their relative likelihoods. This prevents repetitive, rigid output and adds controlled creativity.

function topKSample(probArray, k = 3) {
  const indexed = probArray.map((p, i) => ({ i, p }));
  const top = indexed.sort((a, b) => b.p - a.p).slice(0, k);
  const total = top.reduce((sum, item) => sum + item.p, 0);
  const rnd = Math.random() * total;

  let cumulative = 0;
  for (const { i, p } of top) {
    cumulative += p;
    if (rnd < cumulative) return i;
  }
  return top[0].i;
}

const model = await tf.loadLayersModel(
  'file://./saved-transformer-model/model.json'
);
console.log('✅ Model loaded.');

At each decoding step, we log the top five candidate words and their probabilities so we can see the model’s reasoning. The process repeats until the model produces a stop token (</s>) or hits the maximum length.

async function generate(inputWords) {
  const encIds = padSequence(inputWords);
  const encPos = positionSequence();

  const encInput = tf.tensor2d([encIds], [1, maxSeqLen], 'int32');
  const encPosTensor = tf.tensor2d([encPos], [1, maxSeqLen], 'int32');

  let outputWords = ['<s>'];

  for (let i = 0; i < maxSeqLen - 1; i++) {
    const decIds = padSequence(outputWords);
    const decPos = positionSequence();

    const decInput = tf.tensor2d([decIds], [1, maxSeqLen], 'int32');
    const decPosTensor = tf.tensor2d([decPos], [1, maxSeqLen], 'int32');

    const prediction = model.predict([
      encInput,
      encPosTensor,
      decInput,
      decPosTensor,
    ]);
    const probs = await prediction.array();

    const topProbs = probs[0][i]
      .map((p, idx) => ({ word: indexToWord[idx], prob: p }))
      .sort((a, b) => b.prob - a.prob)
      .slice(0, 5);

    const contextText = outputWords
      .map((tok) => (tok === '<s>' ? `<s = ${inputWords.join(' ')}>` : tok))
      .join(' ');

    console.log(`\n🔢 Step ${i + 1} - context: "${contextText}"`);

    console.table(
      topProbs.map((t) => ({
        Token: t.word,
        Probability: (t.prob * 100).toFixed(2) + '%',
      }))
    );

    const nextIdx = topKSample(probs[0][i], topK);
    const nextWord = indexToWord[nextIdx];

    if (nextWord === '</s>' || nextWord === '<pad>') break;
    outputWords.push(nextWord);
  }

  console.log(`🧪 Input: "${inputWords.join(' ')}"`);
  console.log(`💬 Output: ${outputWords.slice(1).join(' ')}`);
}

We print both the input and generated output, showing how the model constructs full responses from minimal input:

await generate(['paris', 'is']);
await generate(['the', 'capital', 'of', 'sweden']);
await generate(['berlin']);
await generate(['oslo', 'is']);
await generate(['the', 'capital', 'of', 'portugal']);


Conclusion

There you have it: a hands-on walkthrough of how a transformer model works, built from scratch with TensorFlow.js. Our model only maps country names to capitals using a few dozen training examples. But the core mechanics (tokenisation, embeddings, attention, encoder-decoder architecture) are the same ones powering the massive LLMs behind today’s AI.

Now scale it up. Instead of a dozen sentences, imagine training on billions of text samples from books, articles, websites, and forums. Instead of a few tens of thousands of weights, picture hundreds of billions of parameters, all fine-tuned through endless iterations. That’s the world of GPT and Gemini.

What we built is a miniature replica of that machinery. The difference is size, data, and scale, not principle. GPT (Generative Pretrained Transformer) is a decoder-only architecture: it takes input tokens and predicts what comes next, one token at a time. (During training it predicts the entire output sequence in parallel, but during generation it’s autoregressive.) There’s no separate encoder; it models language purely through self-attention over previously generated tokens. This makes it excellent for continuation tasks like chat, writing, or code completion.

Full encoder-decoder models, like the one we built here, dedicate a separate stack to understanding the input before any generation happens. That design is a natural fit for sequence-to-sequence tasks with strong comprehension demands, like summarisation and translation.

In both cases, the foundational ideas are remarkably similar to what we’ve explored here. Transformers don’t understand language the way we do. They calculate relationships between tokens, capture patterns through training, and optimise billions of small weight updates to get better at guessing what comes next. Yet through this brute-force statistical pattern recognition, they can generate poetry, explain quantum mechanics, and write working code. Hopefully this journey into building a tiny Transformer has peeled back the curtain a little, showing that behind the seemingly magical capabilities of LLMs lies a deeply logical and elegant structure.