Consuming Streamed LLM Responses on the Frontend: A Deep Dive into SSE and Fetch

In the rapidly evolving landscape of Large Language Models (LLMs), providing a seamless, real-time user experience is paramount. One of the most effective ways to achieve this is by streaming the LLM's response to the frontend as it's being generated. This not only reduces perceived latency but also creates a more engaging and interactive application: instead of showing a loading indicator and waiting for the entire response to be generated, we can display the response as it arrives.

This article will provide a comprehensive guide on how to consume streamed LLM responses from a frontend application. We will explore two popular methods: Server-Sent Events (SSE) and the fetch API with Readable Streams.

The Power of Streaming

When a user sends a prompt to an LLM, the model generates the response token by token. Instead of waiting for the entire response to be generated, which can take several seconds, we can stream these tokens to the frontend as they become available.

The benefits of this approach are manifold:

  • Reduced Perceived Latency: Users start seeing the response almost instantly, which makes the application feel much more responsive.
  • Improved User Experience: The real-time, typewriter-like effect of the text appearing on the screen is a more engaging and natural way to interact with an AI.
  • Efficient Resource Utilization: By processing the response as a stream, we can avoid holding large amounts of data in memory on both the server and the client.

Method 1: Server-Sent Events (SSE)

Server-Sent Events (SSE) is a simple and efficient technology for pushing real-time data from a server to a client over a single, long-lived HTTP connection. It's a perfect fit for streaming LLM responses because it's a one-way communication channel from the server to the client.

How it Works

  1. The client establishes a connection to a server endpoint that is configured to send SSE.
  2. The server keeps the connection open and sends data to the client in the form of "events."
  3. Each event is a simple text-based message with a specific format.

It's important to note that each event needs to be formatted as data: <your_data>\n\n. Yes, you literally need the data field name followed by a colon (and, conventionally, a space) before your data. This is a requirement of the SSE protocol. It's equally important to terminate each event with a blank line, i.e. the double newline (\n\n), which is what marks the end of the event.
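
As a quick illustration (the text chunks here are made up), a stream that sends two pieces of text followed by a custom done event looks like this on the wire, with a blank line terminating each event:

data: Once upon a time,

data: there was a galaxy far, far away.

event: done
data: [DONE]

Events without an event: line are delivered as default messages; adding event: done names the final event so the client can listen for it specifically, which is exactly what the server example later in this article does.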

Method 2: The fetch API with Readable Streams

The fetch API, a modern and powerful tool for making HTTP requests, also provides a way to work with streaming responses. When a server sends its body incrementally (for example with a Transfer-Encoding: chunked header), the fetch API lets you read the response body as a ReadableStream and process each chunk as it arrives.

How it Works

  1. The client makes a fetch request to a server endpoint.
  2. The server sends the response body in chunks.
  3. The client can read these chunks as they arrive using a ReadableStream and a TextDecoder.

Example server implementation

This Node.js code creates a simple HTTP server designed to act as a backend that streams responses using the Google Generative AI SDK to a frontend application. It showcases the two distinct methods discussed earlier for delivering this real-time data.

Core Components and Setup

The server begins by importing necessary modules and setting up the connection to the Google AI service.

  • import { createServer } from 'node:http';: This imports the createServer function from Node.js's built-in http module, which is used to create the HTTP server.
  • import { GoogleGenAI } from '@google/genai';: This imports the official Google AI SDK for Node.js, providing a straightforward interface to interact with the Gemini family of models (installing the package is covered just after this list).
  • import url from 'node:url';: This utility module is used for parsing URL strings, which allows the server to easily read the requested path and any query parameters.

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const model = 'gemini-2.0-flash';

  • AI Client Initialisation: An instance of GoogleGenAI is created using an API key stored in the system's environment variables (process.env.GEMINI_API_KEY).
  • Model Selection: The variable model is set to 'gemini-2.0-flash', a fast and efficient model ideal for chat and real-time generation tasks.
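
If you're following along, the only third-party dependency is the SDK itself; assuming you manage packages with npm, installing it is a single command:

npm install @google/genai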

Server and Request Handling

The core logic resides within the createServer callback, which executes for every incoming request.

const server = createServer(async (req, res) => {
  // Set CORS Headers
  res.setHeader('Access-Control-Allow-Origin', 'http://localhost:8080');
  res.setHeader('Access-Control-Allow-Methods', 'GET');
  res.setHeader('Access-Control-Allow-Headers', 'Content-Type');

  // Parse URL and get the prompt
  const { pathname, query } = url.parse(req.url ?? '', true);
  const prompt = query.prompt || 'What is Star Wars?';
  • CORS Headers: The res.setHeader calls are crucial for enabling Cross-Origin Resource Sharing (CORS). They explicitly permit a frontend running on http://localhost:8080 to access this server, which is running on a different port (3000).
  • URL Parsing & Prompt Extraction: The server parses the request URL to determine the endpoint (pathname) and extracts the user's prompt from the query string. If no prompt is provided, it uses a default value.

Interacting with the Gemini API

The server uses the generateContentStream method to get a real-time stream from the AI.

const response = await ai.models.generateContentStream({
  model,
  contents: prompt,
  config: {
    systemInstruction: 'Please keep your response short and concise. Maximum 200 words.'
  }
});

This is the key interaction with the Gemini API. Instead of waiting for the full response, generateContentStream returns an asynchronous iterable. This allows the server to loop through the response chunks as they are generated by the model, enabling the streaming functionality. A systemInstruction is also included to guide the AI's tone and length.

The Streaming Endpoints

The server logic then branches based on the requested pathname.

1. The /sse Endpoint (Server-Sent Events)

This endpoint is designed for clients that use the EventSource API (more on this later).

if (pathname === '/sse') {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  });

  for await (const chunk of response) {
    res.write(`data: ${chunk.text}\n\n`);
  }

  res.write('event: done\ndata: [DONE]\n\n');
  res.end();
}
  • Headers: The res.writeHead method sends a 200 OK status and sets three critical headers:
    • Content-Type: text/event-stream: Informs the client to process the response as an event stream.
    • Cache-Control: no-cache: Ensures the client always gets a fresh response.
    • Connection: keep-alive: Keeps the HTTP connection open to push multiple events.
  • Event Formatting: Inside the for await...of loop, each chunk received from the AI is formatted according to the SSE protocol: data: <text_chunk>\n\n. The data: prefix is mandatory, and the double newline \n\n signals the end of a single event.
  • Custom 'done' Event: After the stream from the AI concludes, a final, custom event (event: done) is sent to explicitly signal to the frontend that the transmission is complete.

2. The /fetch Endpoint (Chunked Response)

This endpoint provides a more generic stream of raw text, suitable for consumption with the fetch API and ReadableStream.

else if (pathname === '/fetch') {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Transfer-Encoding': 'chunked',
    'Cache-Control': 'no-cache',
  });

  for await (const chunk of response) {
    res.write(chunk.text);
  }

  res.end();
}
  • Headers: The headers here are different from SSE:
    • Content-Type: text/plain: The data is just plain text.
    • Transfer-Encoding: chunked: This is the key header that informs the client that the response body will arrive in a series of chunks rather than all at once.
  • Data Transmission: The loop iterates through the AI's response, but this time it writes the chunk.text directly to the response stream without any special formatting.
  • End of Stream: res.end() is called after the loop finishes, which closes the connection and signals the end of the chunked response.

Starting the Server

Finally, the server.listen method starts the server and makes it ready to accept connections.

server.listen(3000, () => {
  console.log('Server running at http://localhost:3000');
  console.log('Try /sse or /fetch');
});

From your CLI, you can run node --experimental-strip-types --watch --env-file=.env server.ts to start the server.
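
The --env-file=.env flag makes Node load environment variables from a local .env file, which is where process.env.GEMINI_API_KEY comes from. A minimal .env for this setup is a single line (the value shown is a placeholder for your own key):

GEMINI_API_KEY=your-api-key-here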

Frontend Implementation

Now let's take a look at the frontend, which provides a user interface to interact with the streaming server. It allows a user to enter a prompt and then choose one of two methods, Server-Sent Events (SSE) or fetch(), to stream the response from the Gemini API. The code also uses the marked.js library to render the incoming Markdown response as formatted HTML in real time.

Core Logic and Setup

The script initialises a few key variables and helper functions to manage the state and display of the output; a minimal sketch of this setup follows the list below.

  • markdownBuffer: A string variable that accumulates the text chunks received from the server.
  • output: A reference to the <div id="output"></div> element where the response will be rendered.
  • updateOutput(): A central function that takes the current content of markdownBuffer, parses it using marked.parse(), and then injects the resulting HTML into the output element. This function is called repeatedly as new data arrives, creating a real-time rendering effect.
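
The page's HTML isn't reproduced here, but based on the description above the setup boils down to a few lines. Here is a minimal sketch, assuming the page loads the marked library globally (for example from a CDN) and contains a text input with id "prompt", two buttons wired to runSSE() and runFetch(), and the output div:

// Buffer that accumulates the Markdown text streamed from the server
let markdownBuffer = '';

// The element the rendered response is injected into
const output = document.getElementById('output');

// Parse the buffer as Markdown and re-render it as HTML
function updateOutput() {
  output.innerHTML = marked.parse(markdownBuffer);
}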

Streaming with Server-Sent Events (SSE)

The runSSE() function handles the connection using the browser's built-in EventSource API, which is designed specifically for this type of one-way data stream from a server.

function runSSE() {
  // 1. Prepare for a new request
  const prompt = encodeURIComponent(document.getElementById('prompt').value);
  markdownBuffer = '';
  updateOutput();

  // 2. Create an EventSource instance
  const eventSource = new EventSource(`http://localhost:3000/sse?prompt=${prompt}`);

  // 3. Handle incoming messages
  eventSource.onmessage = (e) => {
    markdownBuffer += e.data;
    updateOutput();
  };

  // 4. Listen for the custom 'done' event
  eventSource.addEventListener('done', () => {
    eventSource.close();
  });

  // 5. Handle errors
  eventSource.onerror = (err) => {
    console.error('SSE error:', err);
    eventSource.close();
  };
}
  1. Preparation: Before starting, it clears the markdownBuffer and the output div and retrieves the user's prompt, encoding it for safe inclusion in a URL.
  2. Connection: It creates a new EventSource object, pointing it to the /sse endpoint on the server, with the prompt passed as a query parameter. This automatically establishes a persistent connection.
  3. Message Handling: The onmessage event listener is the primary handler. It fires for every default (unnamed) message the server sends, i.e. each data: payload that isn't given a custom event: name. The text from e.data is appended to the markdownBuffer, and updateOutput() is called to re-render the HTML.
  4. Completion: It listens for the custom done event that the server sends when the stream is finished. Upon receiving this event, it closes the connection using eventSource.close().
  5. Error Handling: If any connection error occurs, the onerror handler logs the error and closes the connection to prevent further issues.

Streaming with the fetch() API

The runFetch() function demonstrates how to handle a stream using the more general-purpose fetch API. This approach is more manual but also more versatile.

async function runFetch() {
  // 1. Prepare for a new request
  const prompt = encodeURIComponent(document.getElementById('prompt').value);
  markdownBuffer = '';
  updateOutput();

  // 2. Make the fetch request and get the reader
  const res = await fetch(`http://localhost:3000/fetch?prompt=${prompt}`);
  const reader = res.body?.getReader();
  const decoder = new TextDecoder();

  if (!reader) return;

  // 3. Read the stream in a loop
  while (true) {
    const { value, done } = await reader.read();
    if (done) break; // Exit loop when stream is finished
    if (value) {
      markdownBuffer += decoder.decode(value, { stream: true });
      updateOutput();
    }
  }
}
  1. Preparation: Similar to the SSE function, it first resets the buffer and output.
  2. Request and Reader: It makes an await-ed fetch call to the /fetch endpoint. The key step here is getting the ReadableStream from res.body and creating a getReader() instance to process it. A TextDecoder is also initialised to convert the raw Uint8Array data chunks into strings.
  3. Processing Loop: The while (true) loop continuously calls await reader.read().
    • This call returns an object with two properties: value (the chunk of data) and done (a boolean indicating if the stream has ended).
    • If done is true, the loop breaks.
    • If a value exists, it's decoded into a string, appended to the markdownBuffer, and the updateOutput() function is called to render the changes (see the small refinement after this list).
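
One small refinement, not shown in the code above: decoding with { stream: true } buffers an incomplete multi-byte character until more bytes arrive, so it's good practice to make one final decode() call once the loop exits to flush anything still held by the decoder:

// After the while loop: flush any bytes still buffered in the decoder
markdownBuffer += decoder.decode();
updateOutput();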

SSE vs. fetch with Readable Streams

  • Simplicity: SSE is easier to implement, especially on the frontend with the EventSource API; fetch with Readable Streams is more complex and requires manual handling of the stream and decoding.
  • Directionality: SSE is one-way (server to client); fetch can be used for both sending and receiving data (e.g., in a POST request).
  • Error Handling: The EventSource API has built-in error handling and automatic reconnection; with fetch you have to implement error handling and reconnection logic manually.
  • Browser Support: SSE is widely supported in modern browsers, though some older browsers may require a polyfill; fetch with Readable Streams is supported in all modern browsers.
  • Protocol: SSE is built on top of standard HTTP; fetch is a lower-level API that gives you more control over the request and response.

Conclusion

Both Server-Sent Events and the fetch API with Readable Streams are excellent choices for consuming streamed LLM responses from a frontend.

  • SSE is a great option if you're looking for a simple and straightforward way to implement real-time updates.
  • The fetch API provides more flexibility and control, but it also comes with a steeper learning curve.

The best choice for your application will depend on your specific needs and requirements. By understanding the pros and cons of each approach, you can make an informed decision and build a more responsive and engaging user experience for your LLM-powered application.