Consuming Streamed LLM Responses on the Frontend: A Deep Dive into SSE and Fetch
LLMs generate responses token by token. You can either wait for the whole thing to finish (showing a spinner while the user stares at nothing) or stream those tokens to the frontend as they land. The second option wins every time.
This article covers two ways to consume streamed LLM responses from a frontend application: Server-Sent Events (SSE) and the fetch API with Readable Streams.
The Power of Streaming
When a user sends a prompt to an LLM, the model generates the response token by token. Instead of waiting for the entire response (which can take several seconds), we stream those tokens to the frontend as they become available.
The benefits:
- Reduced Perceived Latency: Users start seeing the response almost instantly, which makes the application feel far snappier.
- Better User Experience: The real-time, typewriter-like effect is a more engaging and natural way to interact with an AI.
- Efficient Resource Utilisation: By processing the response as a stream, we avoid holding large chunks of data in memory on both server and client.
Method 1: Server-Sent Events (SSE)
Server-Sent Events is a simple, efficient technology for pushing real-time data from a server to a client over a single, long-lived HTTP connection. It’s a natural fit for streaming LLM responses because it’s one-way: server to client.
How it Works
- The client establishes a connection to a server endpoint configured to send SSE.
- The server keeps the connection open and sends data as “events.”
- Each event is a simple text-based message with a specific format.
Each event needs to be formatted as `data: <your_data>\n\n`. Yes, you literally need the `data` field name followed by a colon before your payload. That's a requirement of the SSE protocol. You also need a blank line, i.e. two newline characters (`\n\n`), at the end of each event to mark where it ends.
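To make the format concrete, here is a hypothetical helper (not part of the server code later in this article) that wraps an arbitrary string as a single SSE event. Note that a multi-line payload needs one `data:` field per line, because the blank line is what terminates the event:

```javascript
// Hypothetical helper: format a string as a single SSE event.
// Multi-line payloads get one "data:" field per line; the trailing
// blank line (\n\n) marks the end of the event.
function formatSSE(payload) {
  return String(payload)
    .split('\n')
    .map((line) => `data: ${line}`)
    .join('\n') + '\n\n';
}

console.log(JSON.stringify(formatSSE('Hello')));
// "data: Hello\n\n"
console.log(JSON.stringify(formatSSE('line 1\nline 2')));
// "data: line 1\ndata: line 2\n\n"
```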
Method 2: The fetch API with Readable Streams
The fetch API also provides a way to work with streaming responses: it exposes the response body as a `ReadableStream`, so you can process chunks as they arrive. Servers typically enable this by sending the response with a `Transfer-Encoding: chunked` header.
How it Works
- The client makes a `fetch` request to a server endpoint.
- The server sends the response body in chunks.
- The client reads these chunks as they arrive using a `ReadableStream` and a `TextDecoder`.
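The read loop can be sketched against any `Response` object. Here a locally constructed `Response` stands in for a real `fetch()` result so the snippet is self-contained; `readAll` is a hypothetical helper, not a built-in:

```javascript
// Sketch: consume a Response body as a stream of decoded chunks.
// A locally constructed Response stands in for a fetch() result here.
async function readAll(res) {
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let out = '';
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    out += decoder.decode(value, { stream: true });
  }
  return out + decoder.decode(); // flush any buffered bytes
}

readAll(new Response('streamed text')).then((text) => console.log(text));
// logs 'streamed text'
```

In a real application, the `reader.read()` calls resolve as each network chunk lands, which is what makes incremental rendering possible.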
Example server implementation
This Node.js code creates a simple HTTP server that streams responses using the Google Generative AI SDK to a frontend application. It demonstrates both methods described above.
Core Components and Setup
The server starts by importing necessary modules and setting up the connection to the Google AI service.
- `import { createServer } from 'node:http';`: the fundamental `http` module from Node.js for creating an HTTP server.
- `import { GoogleGenAI } from '@google/genai';`: the official Google AI SDK for Node.js, providing an interface to interact with the Gemini family of models.
- `import url from 'node:url';`: a utility module for parsing URL strings, so the server can read the requested path and query parameters.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const model = 'gemini-2.0-flash';
- AI Client Initialisation: An instance of `GoogleGenAI` is created using an API key stored in environment variables (`process.env.GEMINI_API_KEY`).
- Model Selection: The variable `model` is set to `'gemini-2.0-flash'`, a fast and efficient model suited to chat and real-time generation tasks.
Server and Request Handling
The core logic lives inside the createServer callback, which fires for every incoming request.
const server = createServer(async (req, res) => {
// Set CORS Headers
res.setHeader('Access-Control-Allow-Origin', 'http://localhost:8080');
res.setHeader('Access-Control-Allow-Methods', 'GET');
res.setHeader('Access-Control-Allow-Headers', 'Content-Type');
// Parse URL and get the prompt
const { pathname, query } = url.parse(req.url ?? '', true);
const prompt = query.prompt || 'What is Star Wars?';
- CORS Headers: The `res.setHeader` calls enable Cross-Origin Resource Sharing. They explicitly permit a frontend running on `http://localhost:8080` to access this server on a different port (3000).
- URL Parsing & Prompt Extraction: The server parses the request URL to determine the endpoint (`pathname`) and pulls the user's `prompt` from the query string. If no prompt is provided, it falls back to a default value.
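As a rough sketch of what this parsing yields, using Node's legacy `url.parse` as the server does (new code could use the WHATWG `new URL()` API instead):

```javascript
import url from 'node:url';

// What url.parse produces for a request like GET /sse?prompt=hello.
// The second argument (true) parses the query string into an object.
const { pathname, query } = url.parse('/sse?prompt=hello', true);

console.log(pathname);     // '/sse'
console.log(query.prompt); // 'hello'
```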
Interacting with the Gemini API
The server uses generateContentStream to get a real-time stream from the AI.
const response = await ai.models.generateContentStream({
model,
contents: prompt,
config: {
systemInstruction: 'Please keep your response short and concise. Maximum 200 words.'
}
});
This is the key interaction with the Gemini API. Instead of waiting for the full response, generateContentStream returns an asynchronous iterable. The server can loop through response chunks as the model generates them. A systemInstruction is bolted on to guide the AI’s tone and length.
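Since the returned value is just an async iterable, the consumption pattern can be sketched with a mock stream; `mockContentStream` and `collect` are hypothetical stand-ins, not part of the SDK:

```javascript
// Mock of the SDK's streamed response: any async generator is an async
// iterable, so it can stand in for what generateContentStream returns.
async function* mockContentStream() {
  for (const text of ['Hello', ', ', 'world', '!']) {
    yield { text }; // each chunk exposes a .text property, as in the SDK
  }
}

// Consume the stream with for await...of, as the endpoints do
async function collect(stream) {
  let full = '';
  for await (const chunk of stream) {
    full += chunk.text;
  }
  return full;
}

collect(mockContentStream()).then((full) => console.log(full));
// logs 'Hello, world!'
```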
The Streaming Endpoints
The server logic then branches based on the requested pathname.
1. The /sse Endpoint (Server-Sent Events)
This endpoint is built for clients using the EventSource API (more on this shortly).
if (pathname === '/sse') {
res.writeHead(200, {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
});
for await (const chunk of response) {
res.write(`data: ${chunk.text}\n\n`);
}
res.write('event: done\ndata: [DONE]\n\n');
res.end();
}
- Headers: The `res.writeHead` method sends a `200 OK` status with three critical headers:
  - `Content-Type: text/event-stream`: tells the client to process the response as an event stream.
  - `Cache-Control: no-cache`: ensures the client always gets a fresh response.
  - `Connection: keep-alive`: keeps the HTTP connection open to push multiple events.
- Event Formatting: Inside the `for await...of` loop, each chunk from the AI is formatted per the SSE protocol: `data: <text_chunk>\n\n`. The `data:` prefix is mandatory; the double newline `\n\n` signals the end of a single event.
- Custom 'done' Event: After the AI stream concludes, a final custom event (`event: done`) signals to the frontend that transmission is complete.
2. The /fetch Endpoint (Chunked Response)
This endpoint provides a raw text stream, suited for consumption with the fetch API and ReadableStream.
else if (pathname === '/fetch') {
res.writeHead(200, {
'Content-Type': 'text/plain',
'Transfer-Encoding': 'chunked',
'Cache-Control': 'no-cache',
});
for await (const chunk of response) {
res.write(chunk.text);
}
res.end();
}
- Headers: Different from SSE:
  - `Content-Type: text/plain`: the data is plain text.
  - `Transfer-Encoding: chunked`: the key header telling the client the response body arrives in a series of chunks rather than all at once.
- Data Transmission: The loop iterates through the AI's response, writing `chunk.text` directly to the response stream with no special formatting.
- End of Stream: `res.end()` closes the connection and signals the end of the chunked response.
Starting the Server
The server.listen method starts the server.
server.listen(3000, () => {
console.log('Server running at http://localhost:3000');
console.log('Try /sse or /fetch');
});
From your CLI: `node --experimental-strip-types --watch --env-file=.env server.ts`
Frontend Implementation
Now the frontend. It provides a user interface to interact with the streaming server, letting users enter a prompt and choose one of two methods (SSE or fetch()) to stream the response from the Gemini API. The code uses marked.js to render incoming Markdown as formatted HTML in real time.
Core Logic and Setup
The script initialises a few key variables and helper functions to manage state and display.
- `markdownBuffer`: a string that accumulates text chunks from the server.
- `output`: a reference to the `<div id="output"></div>` element where the response gets rendered.
- `updateOutput()`: a central function that takes the current `markdownBuffer`, parses it using `marked.parse()`, and injects the resulting HTML into the `output` element. Called repeatedly as new data arrives, creating the real-time rendering effect.
let markdownBuffer = '';
const output = document.getElementById('output');
function showStreamingText() {
output.innerHTML = `<pre>${markdownBuffer}</pre>`;
}
function showFinalMarkdown() {
output.innerHTML = marked.parse(markdownBuffer);
}
function updateOutput() {
output.innerHTML = marked.parse(markdownBuffer);
}
function runSSE() {
// 1. Prepare for a new request
const prompt = encodeURIComponent(document.getElementById('prompt').value);
markdownBuffer = '';
updateOutput();
// 2. Create an EventSource instance
const eventSource = new EventSource(`http://localhost:3000/sse?prompt=${prompt}`);
// 3. Handle incoming messages
eventSource.onmessage = (e) => {
markdownBuffer += e.data;
updateOutput();
};
// 4. Listen for the custom 'done' event
eventSource.addEventListener('done', () => {
eventSource.close();
});
// 5. Handle errors
eventSource.onerror = (err) => {
console.error('SSE error:', err);
eventSource.close();
};
}
- Preparation: Before starting, it clears the `markdownBuffer` and the `output` div, then retrieves the user's prompt (encoded for safe inclusion in a URL).
- Connection: A new `EventSource` object is created, pointing at the `/sse` endpoint with the prompt passed as a query parameter. This automatically establishes a persistent connection.
- Message Handling: The `onmessage` listener fires every time the server sends a `data:` field. The text from `e.data` gets appended to the `markdownBuffer`, and `updateOutput()` re-renders the HTML.
- Completion: It listens for the custom `done` event the server sends when the stream finishes. On receipt, it closes the connection via `eventSource.close()`.
- Error Handling: If any connection error occurs, the `onerror` handler logs it and closes the connection.
Streaming with the fetch() API
The runFetch() function handles a stream using the more general-purpose fetch API. More manual, but also more versatile.
async function runFetch() {
// 1. Prepare for a new request
const prompt = encodeURIComponent(document.getElementById('prompt').value);
markdownBuffer = '';
updateOutput();
// 2. Make the fetch request and get the reader
const res = await fetch(`http://localhost:3000/fetch?prompt=${prompt}`);
const reader = res.body?.getReader();
const decoder = new TextDecoder();
if (!reader) return;
// 3. Read the stream in a loop
while (true) {
const { value, done } = await reader.read();
if (done) break; // Exit loop when stream is finished
if (value) {
markdownBuffer += decoder.decode(value, { stream: true });
updateOutput();
}
}
// 4. Flush any partial multi-byte character the decoder is still buffering
markdownBuffer += decoder.decode();
updateOutput();
}
- Preparation: Same as the SSE function; reset the buffer and output.
- Request and Reader: An `await`-ed `fetch` call to the `/fetch` endpoint. The key step is grabbing the `ReadableStream` from `res.body` and calling `getReader()` to process it. A `TextDecoder` converts the raw `Uint8Array` data chunks into strings.
- Processing Loop: The `while (true)` loop continuously calls `await reader.read()`:
  - It returns an object with `value` (the data chunk) and `done` (a boolean indicating whether the stream has ended).
  - If `done` is `true`, the loop breaks.
  - If a `value` exists, it's decoded into a string, appended to the `markdownBuffer`, and `updateOutput()` renders the changes.
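One subtlety deserves a standalone sketch: the `{ stream: true }` option matters when a multi-byte UTF-8 character straddles two network chunks, because the decoder must hold the dangling bytes back until the rest arrives:

```javascript
// "é" is two bytes in UTF-8, so slicing at byte 4 splits it across chunks.
const bytes = new TextEncoder().encode('café');
const chunk1 = bytes.slice(0, 4); // ends mid-character
const chunk2 = bytes.slice(4);

const decoder = new TextDecoder();
let text = '';
text += decoder.decode(chunk1, { stream: true }); // buffers the dangling byte
text += decoder.decode(chunk2, { stream: true });
text += decoder.decode(); // final call flushes any remaining state

console.log(text); // 'café'
```

Without `{ stream: true }`, each chunk would be decoded in isolation and the split character would come out as a replacement character (`�`).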
SSE vs. fetch with Readable Streams
| Feature | Server-Sent Events (SSE) | fetch with Readable Streams |
|---|---|---|
| Simplicity | Easier to implement, especially on the frontend with the EventSource API. | More manual; requires handling the stream and decoding yourself. |
| Directionality | One-way (server to client). | Can be used for both sending and receiving data (e.g., in a POST request). |
| Error Handling | The EventSource API has built-in error handling and automatic reconnection. | Requires manual implementation of error handling and reconnection logic. |
| Browser Support | Widely supported in modern browsers, though some older browsers may need a polyfill. | Supported in all modern browsers. |
| Protocol | Built on top of standard HTTP. | A lower-level API that gives you more control over the request and response. |
Conclusion
Both Server-Sent Events and the fetch API with Readable Streams work well for consuming streamed LLM responses.
- SSE is the quicker path if you want real-time updates with minimal wiring.
- The `fetch` API gives you more flexibility and control, at the cost of more manual plumbing.
The right choice depends on your specific needs. Now you know the trade-offs, so pick the one that fits.