Consuming Streamed LLM Responses on the Frontend: A Deep Dive into SSE and Fetch

10 min read

LLMs generate responses token by token. You can either wait for the whole thing to finish (showing a spinner while the user stares at nothing) or stream those tokens to the frontend as they land. The second option wins every time.

This article covers two ways to consume streamed LLM responses from a frontend application: Server-Sent Events (SSE) and the fetch API with Readable Streams.

The Power of Streaming

When a user sends a prompt to an LLM, the model generates the response token by token. Instead of waiting for the entire response (which can take several seconds), we stream those tokens to the frontend as they become available.

The benefits:

  • Reduced Perceived Latency: Users start seeing the response almost instantly, which makes the application feel far snappier.
  • Better User Experience: The real-time, typewriter-like effect is a more engaging and natural way to interact with an AI.
  • Efficient Resource Utilisation: By processing the response as a stream, we avoid holding large chunks of data in memory on both server and client.

Method 1: Server-Sent Events (SSE)

Server-Sent Events is a simple, efficient technology for pushing real-time data from a server to a client over a single, long-lived HTTP connection. It’s a natural fit for streaming LLM responses because it’s one-way: server to client.

How it Works

  1. The client establishes a connection to a server endpoint configured to send SSE.
  2. The server keeps the connection open and sends data as “events.”
  3. Each event is a simple text-based message with a specific format.

The events need to be formatted as data: <your_data>\n\n. Yes, you literally need the data keyword followed by a colon and a space before your data. That’s a requirement of the SSE protocol. Each event must also end with a blank line, i.e. two newline characters (\n\n), not just one.
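To make the framing concrete, here’s a tiny (hypothetical) helper that wraps one chunk of text as a single SSE event:

```javascript
// Frames one chunk of text as a single SSE event.
// The "data: " prefix and the terminating blank line are both required.
function sseEvent(text) {
  return `data: ${text}\n\n`;
}

sseEvent('Hello'); // → 'data: Hello\n\n'
```

Note that this naive version assumes the text contains no newlines; a payload that spans multiple lines needs one data: line per line.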

Method 2: The fetch API with Readable Streams

The fetch API also provides a way to work with streaming responses. The response body is exposed as a ReadableStream, so you can consume it incrementally. When the server sends the response with a Transfer-Encoding: chunked header, those chunks arrive as the server writes them rather than all at once.

How it Works

  1. The client makes a fetch request to a server endpoint.
  2. The server sends the response body in chunks.
  3. The client reads these chunks as they arrive using a ReadableStream and a TextDecoder.

Example server implementation

This Node.js code creates a simple HTTP server that uses the Google Generative AI SDK to stream responses to a frontend application. It demonstrates both methods described above.

Core Components and Setup

The server starts by importing necessary modules and setting up the connection to the Google AI service.

  • import { createServer } from 'node:http';: The fundamental http module from Node.js for creating an HTTP server.
  • import { GoogleGenAI } from '@google/genai';: The official Google AI SDK for Node.js, providing an interface to interact with the Gemini family of models.
  • import url from 'node:url';: A utility module for parsing URL strings, so the server can read the requested path and query parameters.

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const model = 'gemini-2.0-flash';

  • AI Client Initialisation: An instance of GoogleGenAI is created using an API key stored in environment variables (process.env.GEMINI_API_KEY).
  • Model Selection: The variable model is set to 'gemini-2.0-flash', a fast and efficient model suited to chat and real-time generation tasks.

Server and Request Handling

The core logic lives inside the createServer callback, which fires for every incoming request.

const server = createServer(async (req, res) => {
  // Set CORS Headers
  res.setHeader('Access-Control-Allow-Origin', 'http://localhost:8080');
  res.setHeader('Access-Control-Allow-Methods', 'GET');
  res.setHeader('Access-Control-Allow-Headers', 'Content-Type');

  // Parse URL and get the prompt
  const { pathname, query } = url.parse(req.url ?? '', true);
  const prompt = query.prompt || 'What is Star Wars?';

  • CORS Headers: The res.setHeader calls enable Cross-Origin Resource Sharing. They explicitly permit a frontend running on http://localhost:8080 to access this server on a different port (3000).
  • URL Parsing & Prompt Extraction: The server parses the request URL to determine the endpoint (pathname) and pulls the user’s prompt from the query string. If no prompt is provided, it falls back to a default value.

Interacting with the Gemini API

The server uses generateContentStream to get a real-time stream from the AI.

  const response = await ai.models.generateContentStream({
    model,
    contents: prompt,
    config: {
      systemInstruction: 'Please keep your response short and concise. Maximum 200 words.'
    }
  });

This is the key interaction with the Gemini API. Instead of waiting for the full response, generateContentStream returns an asynchronous iterable. The server can loop through response chunks as the model generates them. A systemInstruction is bolted on to guide the AI’s tone and length.
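The same for await...of pattern works on any async iterable, which makes the streaming loop easy to reason about (and test) in isolation. A sketch with a mocked stream; the { text } chunk shape mirrors the SDK’s, but the generator here is a stand-in:

```javascript
// Stand-in for the SDK's stream: an async generator yielding
// chunk objects with a .text property.
async function* mockStream() {
  yield { text: 'Hello' };
  yield { text: ', ' };
  yield { text: 'world' };
}

// Accumulates chunks the same way the server's loop does.
async function collect(stream) {
  let out = '';
  for await (const chunk of stream) {
    out += chunk.text;
  }
  return out;
}
```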

The Streaming Endpoints

The server logic then branches based on the requested pathname.

1. The /sse Endpoint (Server-Sent Events)

This endpoint is built for clients using the EventSource API (more on this shortly).

if (pathname === '/sse') {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  });

  for await (const chunk of response) {
    // Payloads can span multiple lines; each line needs its own "data:" prefix.
    const lines = chunk.text.split('\n').map((line) => `data: ${line}`);
    res.write(lines.join('\n') + '\n\n');
  }

  res.write('event: done\ndata: [DONE]\n\n');
  res.end();
}

  • Headers: The res.writeHead method sends a 200 OK status with three critical headers:
    • Content-Type: text/event-stream: Tells the client to process the response as an event stream.
    • Cache-Control: no-cache: Ensures the client always gets a fresh response.
    • Connection: keep-alive: Keeps the HTTP connection open to push multiple events.
  • Event Formatting: Inside the for await...of loop, each chunk from the AI is formatted per the SSE protocol: a data: prefix before the payload, terminated by a blank line (\n\n). The data: prefix is mandatory; the blank line signals the end of a single event. If a payload spans multiple lines, each line gets its own data: prefix, and EventSource rejoins them on the client.
  • Custom ‘done’ Event: After the AI stream concludes, a final custom event (event: done) signals to the frontend that transmission is complete.
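On the receiving side, EventSource handles parsing for you, but a rough sketch of what it does (ignoring the event:, id:, and retry: fields) helps demystify the format:

```javascript
// Minimal SSE parser sketch: split the raw text on blank lines, then
// collect the "data:" payloads of each event. EventSource rejoins
// multiple data: lines of one event with '\n'.
function parseSSE(raw) {
  return raw
    .split('\n\n')
    .filter(Boolean)
    .map((block) =>
      block
        .split('\n')
        .filter((line) => line.startsWith('data: '))
        .map((line) => line.slice('data: '.length))
        .join('\n')
    );
}

parseSSE('data: Hello\n\ndata: world\n\n'); // → ['Hello', 'world']
```

The real parser in browsers is more forgiving (it also accepts data: with no space, and buffers partial events across network chunks), but the shape is the same.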

2. The /fetch Endpoint (Chunked Response)

This endpoint provides a raw text stream, suited for consumption with the fetch API and ReadableStream.

else if (pathname === '/fetch') {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Transfer-Encoding': 'chunked',
    'Cache-Control': 'no-cache',
  });

  for await (const chunk of response) {
    res.write(chunk.text);
  }

  res.end();
}

  • Headers: Different from SSE:
    • Content-Type: text/plain: The data is plain text.
    • Transfer-Encoding: chunked: The key header telling the client the response body arrives in a series of chunks rather than all at once.
  • Data Transmission: The loop iterates through the AI’s response, writing chunk.text directly to the response stream with no special formatting.
  • End of Stream: res.end() closes the connection and signals the end of the chunked response.

Starting the Server

The server.listen method starts the server.

server.listen(3000, () => {
  console.log('Server running at http://localhost:3000');
  console.log('Try /sse or /fetch');
});

From your CLI: node --experimental-strip-types --watch --env-file=.env server.ts. The flags: --experimental-strip-types lets Node run the TypeScript file directly, --watch restarts the server on file changes, and --env-file=.env loads GEMINI_API_KEY from a local .env file.

Frontend Implementation

Now the frontend. It provides a user interface to interact with the streaming server, letting users enter a prompt and choose one of two methods (SSE or fetch()) to stream the response from the Gemini API. The code uses marked.js to render incoming Markdown as formatted HTML in real time.

Core Logic and Setup

The script initialises a few key variables and helper functions to manage state and display.

  • markdownBuffer: A string that accumulates text chunks from the server.
  • output: A reference to the <div id="output"></div> element where the response gets rendered.
  • updateOutput(): A central function that takes the current markdownBuffer, parses it using marked.parse(), and injects the resulting HTML into the output element. Called repeatedly as new data arrives, creating the real-time rendering effect.

let markdownBuffer = '';
const output = document.getElementById('output');

function showStreamingText() {
  output.innerHTML = `<pre>${markdownBuffer}</pre>`;
}

function showFinalMarkdown() {
  output.innerHTML = marked.parse(markdownBuffer);
}

function updateOutput() {
  output.innerHTML = marked.parse(markdownBuffer);
}

function runSSE() {
  // 1. Prepare for a new request
  const prompt = encodeURIComponent(document.getElementById('prompt').value);
  markdownBuffer = '';
  updateOutput();

  // 2. Create an EventSource instance
  const eventSource = new EventSource(`http://localhost:3000/sse?prompt=${prompt}`);

  // 3. Handle incoming messages
  eventSource.onmessage = (e) => {
    markdownBuffer += e.data;
    updateOutput();
  };

  // 4. Listen for the custom 'done' event
  eventSource.addEventListener('done', () => {
    eventSource.close();
  });

  // 5. Handle errors
  eventSource.onerror = (err) => {
    console.error('SSE error:', err);
    eventSource.close();
  };
}

  1. Preparation: It retrieves the user’s prompt (encoded for safe inclusion in a URL), then clears the markdownBuffer and re-renders the empty output div.
  2. Connection: A new EventSource object is created, pointing at the /sse endpoint with the prompt passed as a query parameter. This automatically establishes a persistent connection.
  3. Message Handling: The onmessage listener fires every time the server sends a data: field. The text from e.data gets appended to the markdownBuffer, and updateOutput() re-renders the HTML.
  4. Completion: It listens for the custom done event the server sends when the stream finishes. On receipt, it closes the connection via eventSource.close().
  5. Error Handling: If any connection error occurs, the onerror handler logs it and closes the connection.

Streaming with the fetch() API

The runFetch() function handles a stream using the more general-purpose fetch API. More manual, but also more versatile.

async function runFetch() {
  // 1. Prepare for a new request
  const prompt = encodeURIComponent(document.getElementById('prompt').value);
  markdownBuffer = '';
  updateOutput();

  // 2. Make the fetch request and get the reader
  const res = await fetch(`http://localhost:3000/fetch?prompt=${prompt}`);
  const reader = res.body?.getReader();
  const decoder = new TextDecoder();

  if (!reader) return;

  // 3. Read the stream in a loop
  while (true) {
    const { value, done } = await reader.read();
    if (done) break; // Exit loop when stream is finished
    if (value) {
      markdownBuffer += decoder.decode(value, { stream: true });
      updateOutput();
    }
  }
}

  1. Preparation: Same as the SSE function; reset the buffer and output.
  2. Request and Reader: An await-ed fetch call to the /fetch endpoint. The key step is grabbing the ReadableStream from res.body and creating a getReader() instance to process it. A TextDecoder converts the raw Uint8Array data chunks into strings.
  3. Processing Loop: The while (true) loop continuously calls await reader.read().
    • It returns an object with value (the data chunk) and done (a boolean indicating whether the stream has ended).
    • If done is true, the loop breaks.
    • If a value exists, it’s decoded into a string, appended to the markdownBuffer, and updateOutput() renders the changes.
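The { stream: true } flag deserves a note: a multi-byte UTF-8 character can be split across two network chunks, and the flag tells TextDecoder to hold back incomplete bytes until the next call:

```javascript
const decoder = new TextDecoder();
const bytes = new TextEncoder().encode('héllo'); // 'é' is two bytes in UTF-8

// Simulate a chunk boundary that falls mid-character.
const first = decoder.decode(bytes.slice(0, 2), { stream: true });  // 'h'
const second = decoder.decode(bytes.slice(2), { stream: true });    // 'éllo'
```

Without { stream: true }, the half-finished 'é' in the first chunk would be emitted as the Unicode replacement character (�) instead of being carried over.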

SSE vs. fetch with Readable Streams

  • Simplicity: SSE is easier to implement, especially on the frontend with the EventSource API; fetch is more manual and requires handling the stream and decoding yourself.
  • Directionality: SSE is one-way (server to client); fetch can be used for both sending and receiving data (e.g., in a POST request).
  • Error Handling: The EventSource API has built-in error handling and automatic reconnection; with fetch you implement error handling and reconnection logic yourself.
  • Browser Support: SSE is widely supported in modern browsers, though some older browsers may need a polyfill; fetch with Readable Streams is supported in all modern browsers.
  • Protocol: SSE is built on top of standard HTTP; fetch is a lower-level API that gives you more control over the request and response.
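One practical consequence of the directionality point: EventSource can only issue GET requests, while fetch can stream the response to a POST. A sketch, assuming a hypothetical /chat endpoint that accepts a JSON body (the server in this article only exposes GET endpoints):

```javascript
// Read any streamed Response body chunk by chunk.
async function readBody(res, onChunk) {
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    onChunk(decoder.decode(value, { stream: true }));
  }
}

// Usage sketch against the hypothetical POST endpoint:
// const res = await fetch('http://localhost:3000/chat', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify({ prompt: 'What is Star Wars?' }),
// });
// await readBody(res, (chunk) => console.log(chunk));
```

This lets you move long prompts out of the query string and into the request body, something EventSource simply cannot do.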

Conclusion

Both Server-Sent Events and the fetch API with Readable Streams work well for consuming streamed LLM responses.

  • SSE is the quicker path if you want real-time updates with minimal wiring.
  • The fetch API gives you more flexibility and control, at the cost of more manual plumbing.

The right choice depends on your specific needs. Now you know the trade-offs, so pick the one that fits.