Filling in the Blanks: Teaching AI to Inpaint


I’ve been tinkering with model-building lately, and this time I wanted to tackle inpainting. The motivation came from Cloudinary’s inpainting feature, which I use regularly. I was curious about how the technique actually works under the hood.

What is inpainting?

Inpainting restores missing or corrupted parts of an image. The concept is dead simple. Consider a portrait-mode photo:

Trying to use this as a landscape poster (say, for promotional purposes) is going to hurt. The best you can do without AI is backfill the edges with the most dominant colour:

Note: the image above uses Cloudinary’s b_auto, which detects the most dominant colour and backfills automatically.

Not great. How about this instead?

Much better. There are a few tell-tale signs of AI inpainting, but they’re negligible. The final result is miles ahead of a single-colour fill.

The real question: how does AI pull this off? And can we build something similar with TensorFlow?

How Does AI Learn to Inpaint?

We feed the model images with a block masked out (blacked out), alongside the original (the ground truth). Over many examples, it learns to fill in the blanks. It does this through neural networks.

The easiest way to think about neural networks: function approximators. They learn patterns in data by adjusting weights over time. Here, we want to approximate the mapping from a masked image to the original. For that, we use a special type of neural network called an autoencoder.
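To make "function approximator" concrete, here's a toy sketch in plain JavaScript (no TensorFlow, and not part of the project code): a single weight nudged toward the mapping y = 2x using the same predict, measure, adjust loop our model will run at scale.

```javascript
// Toy function approximation: learn a weight w so that w * x ≈ 2 * x.
// The same predict -> measure loss -> adjust loop as the inpainting
// model, reduced to a single weight.
const data = [[1, 2], [2, 4], [3, 6]]; // pairs of (x, y) with y = 2x
let w = 0;                             // initial guess
const lr = 0.05;                       // learning rate

for (let epoch = 0; epoch < 200; epoch++) {
  for (const [x, y] of data) {
    const error = w * x - y; // how wrong was the prediction?
    w -= lr * error * x;     // gradient step for squared error
  }
}
console.log(w.toFixed(3)); // converges to ≈ 2.000
```

An autoencoder does exactly this, just with millions of weights and images instead of numbers.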

What is an autoencoder?

The classic autoencoder compresses input data (our images) into a smaller representation (encoding), then reconstructs the input from that smaller representation (decoding). Our goal: get the decoder to reconstruct the full image from its masked version.

That said, an autoencoder doesn’t necessarily compress. Its purpose is to transform an input signal into a latent representation (encode), then reconstruct the original signal from that representation (decode). The design of the latent space can serve different functions:

  • Using a smaller latent spatial size (smaller height and width) helps the model understand shapes and patterns over local pixel regions.
  • Designing a compressed latent space enables the model to act like a traditional compression algorithm.
  • Structuring the latent space with specific mathematical properties (such as a multivariate Gaussian) allows for sampling and generating new outputs from random noise (as in variational autoencoders).

We build our autoencoder from convolutional layers, which means we need a Convolutional Neural Network (CNN). CNNs are built to work with images. Instead of connecting every pixel to every other pixel (far too slow), they use filters to scan the image. CNNs are excellent at identifying spatial hierarchies: edges that form shapes, shapes that form objects, objects that form scenes. They work with a sliding window operator (the “convolution”) using a small convolutional kernel trained to detect patterns, the same way filters work in classic computer vision. Think of someone scanning an image through a magnifying glass, spotting patterns in small regions and learning them.
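To ground the magnifying-glass analogy, here's what one convolution filter does, sketched in plain JavaScript on raw arrays (illustrative only; TensorFlow.js performs this on tensors in bulk). A hand-written vertical-edge kernel slides over a tiny image and lights up where brightness changes left to right:

```javascript
// A minimal "sliding window" convolution over a 2D array.
function convolve2d(image, kernel) {
  const kh = kernel.length, kw = kernel[0].length;
  const out = [];
  for (let r = 0; r + kh <= image.length; r++) {
    const row = [];
    for (let c = 0; c + kw <= image[0].length; c++) {
      let sum = 0;
      for (let i = 0; i < kh; i++) {
        for (let j = 0; j < kw; j++) {
          sum += image[r + i][c + j] * kernel[i][j];
        }
      }
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

// Dark on the left, bright on the right: a vertical edge.
const image = [
  [0, 0, 1, 1],
  [0, 0, 1, 1],
  [0, 0, 1, 1],
  [0, 0, 1, 1],
];
const verticalEdge = [
  [-1, 0, 1],
  [-1, 0, 1],
  [-1, 0, 1],
];
console.log(convolve2d(image, verticalEdge)); // [[3, 3], [3, 3]]
```

In a CNN, the kernel values aren't hand-written like this; they're learned during training.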

Theory covered. Let’s see how it all comes together in code.

Code Walkthrough

Our model works with 32x32 grayscale images. We’ll use masks to black out parts of the image for the model to learn from. Instead of a sample dataset, we’ll generate our own synthetic data.

First, configure parameters: learning rate, epochs, and image dimensions.

const tf = require('@tensorflow/tfjs-node');

const IMAGE_HEIGHT = 32;
const IMAGE_WIDTH = 32;
const IMAGE_CHANNELS = 1;

const NUM_TRAINING_SAMPLES = 1000;
const EPOCHS = 150;
const BATCH_SIZE = 8;
const LEARNING_RATE = 0.0005;
const MODEL_SAVE_PATH = 'file://./my-inpainting-model';

We’re training on 1000 32x32-pixel grayscale images. The epoch count determines how many times we loop over the full training set. The learning rate controls how big a step the model takes when adjusting. As the model learns, it makes predictions, checks how wrong it was (the loss), and tweaks its internal weights. A smaller learning rate helps the model converge more smoothly and reduces the risk of overshooting.
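One way to sanity-check these numbers before training: with 1000 samples and a batch size of 8, each epoch performs 125 weight updates, so 150 epochs means 18,750 updates in total. A quick sketch of the arithmetic:

```javascript
// How many gradient updates does this configuration produce?
const NUM_TRAINING_SAMPLES = 1000;
const BATCH_SIZE = 8;
const EPOCHS = 150;

const updatesPerEpoch = Math.ceil(NUM_TRAINING_SAMPLES / BATCH_SIZE);
const totalUpdates = updatesPerEpoch * EPOCHS;

console.log(updatesPerEpoch); // 125 batches per epoch
console.log(totalUpdates);    // 18750 weight updates overall
```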

The best analogy I’ve come across: think of the learning rate as riding a rollercoaster. A slower descent (smaller learning rate) means you reach the bottom carefully. A faster one gets you there quicker, but you might overshoot the bottom entirely and bounce back up, missing the optimal point.

A learning rate that’s too small means the model learns painfully slowly (or gets stuck). Too large and it may miss the ideal training point entirely. You’re aiming for gradual, stable convergence. Expect to experiment; getting this right first time is rare.
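The overshoot behaviour is easy to demonstrate with plain gradient descent on a toy function (not part of the project code). Minimising f(w) = (w - 3)², a small learning rate settles at the minimum while a large one bounces ever further away:

```javascript
// Minimising f(w) = (w - 3)^2 by gradient descent; f'(w) = 2 * (w - 3).
function descend(lr, steps) {
  let w = 0; // start away from the minimum at w = 3
  for (let s = 0; s < steps; s++) {
    w -= lr * 2 * (w - 3);
  }
  return w;
}

console.log(descend(0.1, 50)); // careful descent: lands near 3
console.log(descend(1.1, 50)); // overshoots harder every step: blows up
```

The second run is the rollercoaster flying past the bottom and gaining height on every pass.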

Next, the inpainting model itself.

function createInpaintingModel(inputShape) {
  const model = tf.sequential();

  model.add(tf.layers.conv2d({
    inputShape: inputShape, filters: 32, kernelSize: 3, activation: 'relu', padding: 'same'
  }));
  model.add(tf.layers.maxPooling2d({ poolSize: 2, strides: 2 }));

  model.add(tf.layers.conv2d({
    filters: 64, kernelSize: 3, activation: 'relu', padding: 'same'
  }));
  model.add(tf.layers.maxPooling2d({ poolSize: 2, strides: 2 }));

  model.add(tf.layers.conv2d({
    filters: 128, kernelSize: 3, activation: 'relu', padding: 'same'
  }));
  model.add(tf.layers.maxPooling2d({ poolSize: 2, strides: 2 }));

  model.add(tf.layers.conv2dTranspose({
    filters: 128, kernelSize: 3, strides: 2, activation: 'relu', padding: 'same'
  }));

  model.add(tf.layers.conv2dTranspose({
    filters: 64, kernelSize: 3, strides: 2, activation: 'relu', padding: 'same'
  }));

  model.add(tf.layers.conv2dTranspose({
    filters: 32, kernelSize: 3, strides: 2, activation: 'relu', padding: 'same'
  }));

  model.add(tf.layers.conv2d({
    filters: inputShape[2], kernelSize: 3, activation: 'sigmoid', padding: 'same'
  }));

  model.compile({
    optimizer: tf.train.adam(LEARNING_RATE),
    loss: 'meanSquaredError'
  });
  model.summary();
  return model;
}

This is the convolutional autoencoder: an encoder paired with a decoder. conv2d detects features like edges or patterns. maxPooling2d shrinks the image while preserving the important bits. conv2dTranspose upsamples (grows) the image back to its original size. A final conv2d produces the output using a sigmoid activation to keep pixel values between 0 and 1.

conv2d uses a kernel size of 3, so each filter is a 3x3 “window” scanning the image. (padding: 'same' ensures the output size matches the input.)

Each maxPooling2d halves the image dimensions; after the three pooling steps, 32x32 is down to 4x4, while the number of feature maps grows from 32 to 128, capturing more abstract features in less space.

The decoder reverses all of this using conv2dTranspose (from 4x4 back to 32x32).
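Tracing the spatial dimensions through the network makes the encoder/decoder symmetry explicit. A quick sketch of the arithmetic (with padding: 'same', conv2d preserves height and width; each pooling step halves them; each stride-2 transpose doubles them):

```javascript
// Spatial-size trace through the autoencoder. With padding: 'same',
// conv2d keeps height/width; maxPooling2d (poolSize 2) halves them;
// conv2dTranspose (strides 2) doubles them.
let size = 32;
const trace = [size];

for (let i = 0; i < 3; i++) trace.push(size /= 2); // encoder poolings
for (let i = 0; i < 3; i++) trace.push(size *= 2); // decoder upsamples

console.log(trace.join(' -> ')); // 32 -> 16 -> 8 -> 4 -> 8 -> 16 -> 32
```

This is also why the input size matters: three halvings only work cleanly because 32 is divisible by 8.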

We use the Adam optimiser with meanSquaredError as the loss function, which compares each pixel in the reconstructed output against the original. Large differences between output and input are penalised more heavily.

Loss measures how well the model “scores on a test”: lower is better. Mean Squared Error (MSE) calculates the average of the squared differences between predicted and actual values, over every pixel. In a loop it looks like: Input image --> [Model] --> Predicted image --> [MSE Loss] --> Backpropagation --> Updated model. The model produces an output, gets told “your prediction was off by X amount”, and adjusts its internal weights slightly to do better next time. This repeats for all 150 epochs.
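Here's MSE itself, sketched in plain JavaScript over two tiny pixel arrays (illustrative, not the tfjs internals). Note how the wild guess is punished far more than the near miss, because differences are squared:

```javascript
// Mean Squared Error over two "images" flattened into pixel arrays.
function mse(predicted, actual) {
  let sum = 0;
  for (let i = 0; i < actual.length; i++) {
    const diff = predicted[i] - actual[i];
    sum += diff * diff; // squaring punishes big misses disproportionately
  }
  return sum / actual.length;
}

const actual     = [0.0, 0.5, 1.0, 1.0];
const closeGuess = [0.1, 0.5, 0.9, 1.0]; // off by 0.1 on two pixels
const wildGuess  = [1.0, 0.5, 0.0, 1.0]; // off by 1.0 on two pixels

console.log(mse(closeGuess, actual)); // ≈ 0.005
console.log(mse(wildGuess, actual));  // 0.5 — 100x the loss for 10x the error
```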

The next function generates synthetic training images: simple shapes created on the fly.

function generateSyntheticImage(height, width, channels) {
  return tf.tidy(() => {
    const canvas = tf.buffer([height, width, channels], 'float32');
    const backgroundColor = Math.random() * 0.3;
    for (let r = 0; r < height; r++) { for (let c = 0; c < width; c++) { canvas.set(backgroundColor, r, c, 0); } }

    const numShapes = Math.floor(Math.random() * 3) + 2;

    for (let i = 0; i < numShapes; i++) {
      const color = Math.random() * 0.7 + 0.3;

      const sh = Math.floor(Math.random() * (height / 1.8)) + 4;
      const sw = Math.floor(Math.random() * (width / 1.8)) + 4;
      const sr = Math.floor(Math.random() * (height - sh));
      const sc = Math.floor(Math.random() * (width - sw));
      for (let r = sr; r < sr + sh; r++) { for (let c = sc; c < sc + sw; c++) { if (r < height && c < width) { canvas.set(color, r, c, 0); } } }
    }
    return canvas.toTensor();
  });
}

Next, the masking function. This creates the blacked-out regions that simulate damage or missing parts.

function applyMask(originalTensor) {
  return tf.tidy(() => {
    const [height, width, channels] = originalTensor.shape;

    const minMaskDimRatio = 0.375;
    const maxMaskDimRatio = 0.75;

    const minH = Math.floor(height * minMaskDimRatio);
    const maxH = Math.floor(height * maxMaskDimRatio);
    const actualMaskHeight = Math.floor(Math.random() * (maxH - minH + 1)) + minH;

    const minW = Math.floor(width * minMaskDimRatio);
    const maxW = Math.floor(width * maxMaskDimRatio);
    const actualMaskWidth = Math.floor(Math.random() * (maxW - minW + 1)) + minW;

    const maskBuffer = tf.buffer([height, width, channels], 'float32');
    for (let r = 0; r < height; r++) { for (let c = 0; c < width; c++) { for (let ch = 0; ch < channels; ch++) { maskBuffer.set(1.0, r, c, ch); } } }

    const maskY = Math.floor(Math.random() * (height - actualMaskHeight));
    const maskX = Math.floor(Math.random() * (width - actualMaskWidth));

    for (let r = maskY; r < maskY + actualMaskHeight; r++) {
      for (let c = maskX; c < maskX + actualMaskWidth; c++) {
        for (let ch = 0; ch < channels; ch++) {
          maskBuffer.set(0.0, r, c, ch);
        }
      }
    }
    const maskTensor = maskBuffer.toTensor();
    return originalTensor.mul(maskTensor);
  });
}

Here’s what the original and masked images look like side by side (actual samples from the training set):

Sample Type | Image
Original    | Original Sample
Masked      | Masked Sample

Why do this? The model needs paired examples to learn from. By masking out regions, we train it to predict what those missing parts should look like based on the surrounding context.
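The masking itself is nothing more exotic than element-wise multiplication: a grid of 1s (keep) and 0s (black out) applied pixel by pixel, which is exactly what originalTensor.mul(maskTensor) does. In miniature, on plain arrays:

```javascript
// Masking = element-wise multiplication with a binary mask:
// 1 keeps the pixel, 0 blacks it out.
const original = [
  [0.2, 0.8, 0.8],
  [0.2, 0.8, 0.8],
  [0.2, 0.2, 0.2],
];
const mask = [
  [1, 1, 1],
  [1, 0, 0], // a small "hole" in the middle row
  [1, 1, 1],
];
const masked = original.map((row, r) => row.map((px, c) => px * mask[r][c]));
console.log(masked); // [[0.2, 0.8, 0.8], [0.2, 0, 0], [0.2, 0.2, 0.2]]
```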

Next, the dataset generator creates all masked images and their original pairs.

function generateDataset(numSamples, height, width, channels) {
  const originalImagesArray = [];
  const maskedImagesArray = [];
  console.log('Generating synthetic training data...');
  for (let i = 0; i < numSamples; i++) {
    const originalTensor = generateSyntheticImage(height, width, channels);
    const maskedTensor = applyMask(originalTensor);
    originalImagesArray.push(originalTensor);
    maskedImagesArray.push(maskedTensor);
  }
  const originalImagesTensor = tf.stack(originalImagesArray);
  const maskedImagesTensor = tf.stack(maskedImagesArray);
  originalImagesArray.forEach(t => t.dispose());
  maskedImagesArray.forEach(t => t.dispose());
  return { originalImagesTensor, maskedImagesTensor };
}

This is what enables supervised learning: we provide inputs and the expected outputs.

Finally, the training and save function.

async function trainAndSaveModel(model, maskedImages, originalImages, epochs, batchSize, savePath) {
  console.log('\nStarting training...');
  await model.fit(maskedImages, originalImages, {
    epochs: epochs,
    batchSize: batchSize,
    shuffle: true,
    callbacks: {
      onEpochEnd: (epoch, logs) => {
        console.log(`Epoch ${epoch + 1}/${epochs} - Loss: ${logs.loss.toFixed(5)}`);
      }
    }
  });
  console.log('Training complete.');
  await model.save(savePath);
  console.log(`Model saved to ${savePath}`);
}

The .fit() method trains the model using [masked -> original] image pairs. We’re logging epochs and loss at each step. Once training completes, the model is saved to disk.

Here’s what the training output looks like:

[terminal recording: per-epoch loss log]

All that’s left is to wire it up.

async function mainTrain() {
  const inputShape = [IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS];
  const model = createInpaintingModel(inputShape);

  const { originalImagesTensor, maskedImagesTensor } = generateDataset(
    NUM_TRAINING_SAMPLES, IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS
  );

  await trainAndSaveModel(model, maskedImagesTensor, originalImagesTensor, EPOCHS, BATCH_SIZE, MODEL_SAVE_PATH);

  originalImagesTensor.dispose();
  maskedImagesTensor.dispose();
  model.dispose();
  console.log(`TensorFlow.js backend: ${tf.getBackend()}, Tensors in memory after training: ${tf.memory().numTensors}`);
}

mainTrain().catch(console.error);

Depending on your machine’s resources, training takes anywhere from a few minutes up to about 20 minutes. Once it’s done, time to put the model to work.

Using the trained model

We need the same configuration, image generation, and masking functions from training (in practice, extract them to a shared file).

const tf = require('@tensorflow/tfjs-node');
const fs = require('fs');

const IMAGE_HEIGHT = 32;
const IMAGE_WIDTH = 32;
const IMAGE_CHANNELS = 1;
const MODEL_LOAD_PATH = 'file://./my-inpainting-model/model.json';

function generateSyntheticImage(height, width, channels) {
  return tf.tidy(() => {
    const canvas = tf.buffer([height, width, channels], 'float32');
    const backgroundColor = Math.random() * 0.3;
    for (let r = 0; r < height; r++) { for (let c = 0; c < width; c++) { canvas.set(backgroundColor, r, c, 0); } }
    const numShapes = Math.floor(Math.random() * 3) + 2;
    for (let i = 0; i < numShapes; i++) {
      const color = Math.random() * 0.7 + 0.3;
      const sh = Math.floor(Math.random() * (height / 1.8)) + 4;
      const sw = Math.floor(Math.random() * (width / 1.8)) + 4;
      const sr = Math.floor(Math.random() * (height - sh));
      const sc = Math.floor(Math.random() * (width - sw));
      for (let r = sr; r < sr + sh; r++) { for (let c = sc; c < sc + sw; c++) { if (r < height && c < width) { canvas.set(color, r, c, 0); } } }
    }
    return canvas.toTensor();
  });
}

function applyMask(originalTensor) {
  return tf.tidy(() => {
    const [height, width, channels] = originalTensor.shape;
    const minMaskDimRatio = 0.375;
    const maxMaskDimRatio = 0.75;
    const minH = Math.floor(height * minMaskDimRatio);
    const maxH = Math.floor(height * maxMaskDimRatio);
    const actualMaskHeight = Math.floor(Math.random() * (maxH - minH + 1)) + minH;
    const minW = Math.floor(width * minMaskDimRatio);
    const maxW = Math.floor(width * maxMaskDimRatio);
    const actualMaskWidth = Math.floor(Math.random() * (maxW - minW + 1)) + minW;
    const maskBuffer = tf.buffer([height, width, channels], 'float32');
    for (let r = 0; r < height; r++) { for (let c = 0; c < width; c++) { for (let ch = 0; ch < channels; ch++) { maskBuffer.set(1.0, r, c, ch); } } }
    const maskY = Math.floor(Math.random() * (height - actualMaskHeight));
    const maskX = Math.floor(Math.random() * (width - actualMaskWidth));
    for (let r = maskY; r < maskY + actualMaskHeight; r++) {
      for (let c = maskX; c < maskX + actualMaskWidth; c++) {
        for (let ch = 0; ch < channels; ch++) {
          maskBuffer.set(0.0, r, c, ch);
        }
      }
    }
    const maskTensor = maskBuffer.toTensor();
    return originalTensor.mul(maskTensor);
  });
}

A function to convert tensors back to images.

async function tensorToImage(tensor, outputPath) {
  let imageTensor = tensor;
  if (tensor.rank === 4 && tensor.shape[0] === 1) {
    imageTensor = tensor.squeeze([0]);
  } else if (tensor.rank !== 3) {
    console.error(`tensorToImage expects a 3D tensor or a 4D tensor with batch size 1, but got rank ${tensor.rank}`);
    return;
  }
  const intTensor = tf.tidy(() => imageTensor.mul(255).asType('int32'));
  try {
    const pngData = await tf.node.encodePng(intTensor);
    fs.writeFileSync(outputPath, pngData);
    console.log(`Image saved to ${outputPath}`);
  } catch (error) {
    console.error(`Failed to save image to ${outputPath}:`, error);
  } finally {
    intTensor.dispose();
    if (tensor !== imageTensor && !imageTensor.isDisposed) imageTensor.dispose();
  }
}

Tensors are how TensorFlow represents data, but you can’t open them like regular image files. We multiply by 255 to scale pixel values from the [0, 1] range back to standard 8-bit brightness levels, then save as a PNG.

Now the fun part: seeing whether the model actually does what it’s supposed to.

async function mainInference() {
  console.log(`Attempting to load model from: ${MODEL_LOAD_PATH}`);
  let model;
  try {
    model = await tf.loadLayersModel(MODEL_LOAD_PATH);
    console.log('Model loaded successfully.');
    model.summary();
  } catch (error) {
    console.error('Failed to load the model:', error);
    console.error(`Please ensure the model exists at the specified path: ${MODEL_LOAD_PATH}`);
    console.error('You might need to run the updated train_inpainter.js script first (and delete any old model directory).');
    return;
  }

  console.log('\nPerforming inference on a new test sample...');
  const testOriginalTensor = generateSyntheticImage(IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS);
  const testMaskedTensor = applyMask(testOriginalTensor);

  const testMaskedBatched = testMaskedTensor.expandDims(0);
  const inpaintedTensorBatched = tf.tidy(() => model.predict(testMaskedBatched));

  await tensorToImage(testOriginalTensor, 'inference_original_v2.png');
  await tensorToImage(testMaskedTensor, 'inference_masked_input_v2.png');
  await tensorToImage(inpaintedTensorBatched, 'inference_inpainted_output_v2.png');

  console.log("\nInference complete. Check 'inference_*_v2.png' images.");

  testOriginalTensor.dispose();
  testMaskedTensor.dispose();
  testMaskedBatched.dispose();
  inpaintedTensorBatched.dispose();
  model.dispose();

  console.log(`TensorFlow.js backend: ${tf.getBackend()}, Tensors in memory after inference: ${tf.memory().numTensors}`);
}

We load the trained model, then create two tensors. testOriginalTensor is the correct input image. testMaskedTensor is the “damaged” version with the mask. If all goes well, the model should inpaint it.

A note on batching: the model was trained on batches, so even for a single image it expects a 4D tensor of shape [BATCH_SIZE, 32, 32, 1], where 32x32 is the image size and 1 is the single grayscale channel. expandDims handles this, expanding the tensor shape from [32, 32, 1] to [1, 32, 32, 1]; the leading 1 means one image in the batch.
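If the batch dimension feels abstract, it's just one more level of array nesting. A plain-array sketch of what expandDims(0) amounts to (illustrative; expandDims0 is a made-up helper name):

```javascript
// expandDims(0) in plain-array terms: wrap the data in one more array,
// turning shape [H, W, C] into [1, H, W, C].
function expandDims0(arr) {
  return [arr];
}

const image = [[[0.5], [0.1]], [[0.9], [0.3]]]; // shape [2, 2, 1]
const batched = expandDims0(image);             // shape [1, 2, 2, 1]

console.log(batched.length);    // 1: one image in the batch
console.log(batched[0].length); // 2: the height dimension
```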

The magic happens inside model.predict(), where the model receives the masked image and attempts to reconstruct it. All images are saved as png files.

Here’s the inference output:

[terminal recording: model summary and inference log]

Let’s see how well the model performed:

Type             | Image
Original Image   | Original
Masked Input     | Masked
Inpainted Output | Output

Not bad for a model we built and trained in minutes on a local machine. The example from the beginning (where the image was backfilled with flora and fauna) uses a similar architecture trained on a vastly larger dataset to recognise all sorts of patterns. What we’ve built here is a stripped-back version of the same concept.

Conclusion

This project started with a simple goal: teach a small AI model to fill in masked-out parts of an image.

As web developers, we often treat AI as an intimidating black box, especially for visual tasks like image generation or restoration. But by building a small inpainting model with TensorFlow.js, we punched through that barrier. No giant dataset. No GPU cluster. No PhD. Just JavaScript, a handful of synthetic images, and a willingness to experiment.

Along the way, we covered:

  • Convolutional Neural Networks (CNNs) and why they’re suited to image work
  • Autoencoders, and how they encode and decode (reconstruct) data
  • The role of loss functions and optimisers in model learning
  • How to prepare data, handle tensors, and train a model in TensorFlow.js
  • Why input shape matching and batching matter, even when predicting a single image

We built a system to visualise our model’s performance, saving images that show exactly how well (or how poorly) it handled image repair.

Whether you’re curious about generative AI, looking to build smarter tools, or just want to experiment, this kind of hands-on tinkering is the right first step. Start small. Stay visual. Every pixel you generate, mask, or restore teaches you something fundamental about how machines learn.