# Consuming Streamed LLM Responses on the Frontend: A Deep Dive into SSE and Fetch

Source: https://tpiros.dev/blog/streaming-llm-responses-a-deep-dive

LLMs generate responses token by token. You can either wait for the whole thing to finish (showing a spinner while the user stares at nothing) or stream those tokens to the frontend as they land. The second option wins every time.

This article covers two ways to consume streamed LLM responses from a frontend application: **Server-Sent Events (SSE)** and the **`fetch` API with Readable Streams**.

### The Power of Streaming

A full LLM response can take several seconds to complete. Because the model produces it incrementally, the server can forward each piece to the frontend as soon as it is available instead of holding everything back until the end.

The benefits:

* **Reduced Perceived Latency:** Users start seeing the response almost instantly, which makes the application feel far snappier.
* **Better User Experience:** The real-time, typewriter-like effect is a more engaging and natural way to interact with an AI.
* **Efficient Resource Utilisation:** The server forwards each chunk as soon as it arrives, so it never has to buffer the complete response before sending it.

### Method 1: Server-Sent Events (SSE)

Server-Sent Events is a simple, efficient technology for pushing real-time data from a server to a client over a single, long-lived HTTP connection. It's a natural fit for streaming LLM responses because it's one-way: server to client.

#### How it Works

1.  The client establishes a connection to a server endpoint configured to send SSE.
2.  The server keeps the connection open and sends data as "events."
3.  Each event is a simple text-based message with a specific format.

Each event is formatted as `data: <your_data>\n\n`. The `data` field name and the colon are required by the SSE format; the single space after the colon is conventional and is stripped by the client. The blank line at the end (two consecutive `\n` characters) is what terminates the event, so a single newline is not enough.
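
As a minimal sketch of that wire format, the hypothetical helper `formatSseEvent` below (its name and signature are assumptions, not part of the server shown later) serialises a string payload into one SSE event. It emits one `data:` line per line of text, because a raw newline inside a payload would otherwise be read as the end of the `data:` line.

```javascript
// Hypothetical helper for illustration; the name and signature are assumptions.
// Serialises a string payload as a single Server-Sent Event.
function formatSseEvent(data, eventName) {
  const lines = [];
  // An optional `event:` line names the event; without it the browser
  // dispatches the event as a plain "message".
  if (eventName) lines.push(`event: ${eventName}`);
  // Each line of the payload needs its own `data:` prefix.
  for (const line of String(data).split(/\r\n|\r|\n/)) {
    lines.push(`data: ${line}`);
  }
  // A blank line (two consecutive newlines) terminates the event.
  return lines.join('\n') + '\n\n';
}

// JSON.stringify makes the newline characters visible in the console.
console.log(JSON.stringify(formatSseEvent('Hello')));
console.log(JSON.stringify(formatSseEvent('line 1\nline 2')));
console.log(JSON.stringify(formatSseEvent('[DONE]', 'done')));
```

On the client, `EventSource` joins multiple `data:` lines of one event with `\n`, so the original line breaks survive the round trip.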

### Method 2: The `fetch` API with Readable Streams

The `fetch` API can also consume streaming responses. It exposes the response body as a `ReadableStream`, so when the server sends the body incrementally (over HTTP/1.1, typically with `Transfer-Encoding: chunked`), the client can read each chunk as it arrives instead of waiting for the whole body.

#### How it Works

1.  The client makes a `fetch` request to a server endpoint.
2.  The server sends the response body in chunks.
3.  The client reads these chunks as they arrive using a `ReadableStream` and a `TextDecoder`.

## Example Server Implementation

This Node.js code creates a simple HTTP server that uses the Google Gen AI SDK to stream Gemini responses to a frontend application. It demonstrates both methods described above.

### Core Components and Setup

The server starts by importing necessary modules and setting up the connection to the Google AI service.

* **`import { createServer } from 'node:http';`**: The fundamental `http` module from Node.js for creating an HTTP server.
* **`import { GoogleGenAI } from '@google/genai';`**: The official Google AI SDK for Node.js, providing an interface to interact with the Gemini family of models.
* **`import url from 'node:url';`**: A utility module for parsing URL strings, so the server can read the requested path and query parameters.

```javascript
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const model = 'gemini-2.0-flash';
```

* **AI Client Initialisation**: An instance of `GoogleGenAI` is created using an API key stored in environment variables (`process.env.GEMINI_API_KEY`).
* **Model Selection**: The variable `model` is set to `'gemini-2.0-flash'`, a fast and efficient model suited to chat and real-time generation tasks.

### Server and Request Handling

The core logic lives inside the `createServer` callback, which fires for every incoming request.

```javascript
const server = createServer(async (req, res) => {
  // Set CORS Headers
  res.setHeader('Access-Control-Allow-Origin', 'http://localhost:8080');
  res.setHeader('Access-Control-Allow-Methods', 'GET');
  res.setHeader('Access-Control-Allow-Headers', 'Content-Type');

  // Parse URL and get the prompt
  const { pathname, query } = url.parse(req.url ?? '', true);
  const prompt = query.prompt || 'What is Star Wars?';
```

* **CORS Headers**: The `res.setHeader` calls enable Cross-Origin Resource Sharing. They explicitly permit a frontend running on `http://localhost:8080` to access this server on a different port (`3000`).
* **URL Parsing & Prompt Extraction**: The server parses the request URL to determine the endpoint (`pathname`) and pulls the user's `prompt` from the query string. If no prompt is provided, it falls back to a default value.
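
`url.parse` is marked as a legacy API in the Node.js documentation, which points to the WHATWG `URL` class instead. A minimal sketch of the same extraction with `URL` follows. The helper name `parseRequest` and the placeholder base `http://localhost` are assumptions made for illustration; a base is needed because `req.url` holds only the path and query string.

```javascript
// Hypothetical helper; not part of the original server.
function parseRequest(reqUrl, fallbackPrompt = 'What is Star Wars?') {
  // req.url is relative (e.g. "/sse?prompt=hi"), so URL needs a base to resolve against.
  const { pathname, searchParams } = new URL(reqUrl ?? '/', 'http://localhost');
  // searchParams.get() decodes percent-escapes and returns null when the key is missing.
  const prompt = searchParams.get('prompt') || fallbackPrompt;
  return { pathname, prompt };
}

console.log(parseRequest('/sse?prompt=Who%20is%20Yoda%3F'));
console.log(parseRequest('/fetch'));
```

Inside the request handler this would be called as `parseRequest(req.url)`.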

### Interacting with the Gemini API

The server uses `generateContentStream` to get a real-time stream from the AI.

```javascript
  const response = await ai.models.generateContentStream({
    model,
    contents: prompt,
    config: {
      systemInstruction: 'Please keep your response short and concise. Maximum 200 words.'
    }
  });
```

This is the key interaction with the Gemini API. Instead of waiting for the full response, `generateContentStream` returns an *asynchronous iterable*. The server can loop through response chunks as the model generates them. A `systemInstruction` is bolted on to guide the AI's tone and length.
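
To make the asynchronous-iterable idea concrete without calling the real API, the sketch below swaps in a hypothetical async generator, `fakeModelStream`, as a stand-in for the object returned by `generateContentStream`. The only assumption carried over from the server code is that each chunk exposes a `text` property; `collect` is likewise an illustrative name.

```javascript
// Stand-in for the Gemini stream: yields chunks shaped like { text }.
// Hypothetical; the real SDK response is not reproduced here.
async function* fakeModelStream() {
  for (const text of ['Star ', 'Wars ', 'is a space opera.']) {
    // Simulate the delay between generated chunks.
    await new Promise((resolve) => setTimeout(resolve, 10));
    yield { text };
  }
}

// Consume chunks as they arrive, handing each one to a callback.
async function collect(stream, onChunk) {
  let full = '';
  for await (const chunk of stream) {
    const text = chunk.text ?? ''; // guard against chunks without text
    full += text;
    onChunk(text);
  }
  return full;
}

collect(fakeModelStream(), (text) => console.log('chunk:', JSON.stringify(text)))
  .then((full) => console.log('full:', full));
```

The callback runs once per chunk, which is the point at which the server writes to the HTTP response in the endpoints that follow.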

### The Streaming Endpoints

The server logic then branches based on the requested `pathname`.

#### 1. The `/sse` Endpoint (Server-Sent Events)

This endpoint is built for clients using the `EventSource` API (more on this shortly).

```javascript
if (pathname === '/sse') {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  });

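  for await (const chunk of response) {
    // A chunk may contain newlines (Markdown often does), so emit one
    // `data:` line per line of text; otherwise the event framing breaks.
    const lines = (chunk.text ?? '').split('\n');
    res.write(lines.map((line) => `data: ${line}`).join('\n') + '\n\n');
  }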

  res.write('event: done\ndata: [DONE]\n\n');
  res.end();
}
```

* **Headers**: The `res.writeHead` method sends a `200 OK` status with three critical headers:
    * `Content-Type: text/event-stream`: Tells the client to process the response as an event stream.
    * `Cache-Control: no-cache`: Ensures the client always gets a fresh response.
    * `Connection: keep-alive`: Keeps the HTTP connection open to push multiple events.
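* **Event Formatting**: Inside the `for await...of` loop, each chunk from the AI is formatted per the SSE protocol: `data: <text_chunk>\n\n`. The `data:` prefix is mandatory, and the blank line produced by `\n\n` signals the end of a single event. Any newline inside the text itself has to travel as a separate `data:` line, or the client will cut the event short.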
* **Custom 'done' Event**: After the AI stream concludes, a final custom event (`event: done`) signals to the frontend that transmission is complete.

#### 2. The `/fetch` Endpoint (Chunked Response)

This endpoint provides a raw text stream, suited for consumption with the `fetch` API and `ReadableStream`.

```javascript
else if (pathname === '/fetch') {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Transfer-Encoding': 'chunked',
    'Cache-Control': 'no-cache',
  });

  for await (const chunk of response) {
    res.write(chunk.text);
  }

  res.end();
}
```

* **Headers**: Different from SSE:
    * `Content-Type: text/plain`: The data is plain text.
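    * `Transfer-Encoding: chunked`: The key header telling the client the response body arrives in a series of chunks rather than all at once. Node.js applies chunked encoding on its own when no `Content-Length` is set, so spelling it out here mainly documents the intent.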
* **Data Transmission**: The loop iterates through the AI's response, writing `chunk.text` *directly* to the response stream with no special formatting.
* **End of Stream**: `res.end()` finishes the response by sending the terminating zero-length chunk, which is how the client learns that the stream is complete.

### Starting the Server

The `server.listen` method starts the server.

```javascript
server.listen(3000, () => {
  console.log('Server running at http://localhost:3000');
  console.log('Try /sse or /fetch');
});
```

Start it from your CLI (the `--experimental-strip-types` flag needs Node.js 22.6 or later): `node --experimental-strip-types --watch --env-file=.env server.ts`

## Frontend Implementation

Now the frontend. It provides a user interface to interact with the streaming server, letting users enter a prompt and choose one of two methods (**SSE** or **`fetch()`**) to stream the response from the Gemini API. The code uses `marked.js` to render incoming Markdown as formatted HTML in real time.

### Core Logic and Setup

The script initialises a few key variables and helper functions to manage state and display.

* `markdownBuffer`: A string that accumulates text chunks from the server.
* `output`: A reference to the `<div id="output"></div>` element where the response gets rendered.
* `updateOutput()`: A central function that takes the current `markdownBuffer`, parses it using `marked.parse()`, and injects the resulting HTML into the `output` element. Called repeatedly as new data arrives, creating the real-time rendering effect.

```javascript
let markdownBuffer = '';
const output = document.getElementById('output');

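// Optional renderer (not called by the snippets shown here): display the raw
// text while streaming. Note that the buffer is inserted as HTML without escaping.
function showStreamingText() {
  output.innerHTML = `<pre>${markdownBuffer}</pre>`;
}

// Optional renderer (not called by the snippets shown here): parse the
// Markdown once, after the stream has finished.
function showFinalMarkdown() {
  output.innerHTML = marked.parse(markdownBuffer);
}

// Re-render the whole buffer as HTML; called after every chunk.
function updateOutput() {
  const html = marked.parse(markdownBuffer);
  document.getElementById('output').innerHTML = html;
}
```

### Streaming with Server-Sent Events (SSE)

The `runSSE()` function consumes the `/sse` endpoint through the browser's built-in `EventSource` API.

```javascript
function runSSE() {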
  // 1. Prepare for a new request
  const prompt = encodeURIComponent(document.getElementById('prompt').value);
  markdownBuffer = '';
  updateOutput();

  // 2. Create an EventSource instance
  const eventSource = new EventSource(`http://localhost:3000/sse?prompt=${prompt}`);

  // 3. Handle incoming messages
  eventSource.onmessage = (e) => {
    markdownBuffer += e.data;
    updateOutput();
  };

  // 4. Listen for the custom 'done' event
  eventSource.addEventListener('done', () => {
    eventSource.close();
  });

  // 5. Handle errors
  eventSource.onerror = (err) => {
    console.error('SSE error:', err);
    eventSource.close();
  };
}
```

1.  **Preparation**: `runSSE()` first reads the user's prompt (encoded for safe inclusion in a URL), then clears the `markdownBuffer` and the `output` div.
2.  **Connection**: A new `EventSource` object is created, pointing at the `/sse` endpoint with the prompt passed as a query parameter. This automatically establishes a persistent connection.
3.  **Message Handling**: The `onmessage` listener fires for every event that carries no `event:` name, which is the case for each text chunk. The text from `e.data` gets appended to the `markdownBuffer`, and `updateOutput()` re-renders the HTML.
4.  **Completion**: It listens for the custom `done` event the server sends when the stream finishes. On receipt, it closes the connection via `eventSource.close()`. Without this step, `EventSource` would treat the server ending the response as a dropped connection and reconnect, sending the prompt again.
5.  **Error Handling**: If any connection error occurs, the `onerror` handler logs it and closes the connection, which also switches off the automatic reconnection.
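
The dispatch rules behind steps 3 and 4 can be illustrated with a deliberately simplified parser. `parseSseEvents` is a hypothetical function written for this illustration, not the browser's implementation: it covers only the `event:` and `data:` fields and assumes `\n` line endings, but it shows why `onmessage` receives the unnamed events while the `done` listener receives the named one.

```javascript
// Simplified sketch of SSE parsing; ignores `id:`, `retry:`, comments and \r\n.
function parseSseEvents(raw) {
  const events = [];
  // Events are separated by a blank line.
  for (const block of raw.split('\n\n')) {
    if (block === '') continue;
    let type = 'message'; // default when no `event:` line is present
    const data = [];
    for (const line of block.split('\n')) {
      if (line.startsWith('event:')) {
        // trim() is a shortcut; the format only drops one leading space.
        type = line.slice(6).trim();
      } else if (line.startsWith('data:')) {
        // One optional space after the colon is dropped.
        const value = line.slice(5);
        data.push(value.startsWith(' ') ? value.slice(1) : value);
      }
    }
    // Multiple data lines are joined with newlines, as EventSource does.
    if (data.length > 0) events.push({ type, data: data.join('\n') });
  }
  return events;
}

const wire = 'data: Hello\n\ndata: line 1\ndata: line 2\n\nevent: done\ndata: [DONE]\n\n';
console.log(parseSseEvents(wire));
```

Events of type `message` correspond to what `onmessage` sees; the one of type `done` is delivered only to listeners registered with `addEventListener('done', ...)`.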

### Streaming with the `fetch()` API

The `runFetch()` function handles a stream using the more general-purpose `fetch` API. More manual, but also more versatile.

```javascript
async function runFetch() {
  // 1. Prepare for a new request
  const prompt = encodeURIComponent(document.getElementById('prompt').value);
  markdownBuffer = '';
  updateOutput();

  // 2. Make the fetch request and get the reader
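  const res = await fetch(`http://localhost:3000/fetch?prompt=${prompt}`);
  if (!res.ok || !res.body) {
    // Don't render an error page as if it were model output
    console.error('Fetch failed with status', res.status);
    return;
  }
  const reader = res.body.getReader();
  const decoder = new TextDecoder();

  // 3. Read the stream in a loop
  while (true) {
    const { value, done } = await reader.read();
    if (done) break; // Exit loop when stream is finished
    if (value) {
      markdownBuffer += decoder.decode(value, { stream: true });
      updateOutput();
    }
  }

  // Flush any bytes the decoder is still holding
  markdownBuffer += decoder.decode();
  updateOutput();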
}
```

1.  **Preparation**: Same as the SSE function; reset the buffer and output.
2.  **Request and Reader**: An `await`-ed `fetch` call to the `/fetch` endpoint. The key step is taking the `ReadableStream` from `res.body` and calling `getReader()` on it to obtain a reader. A `TextDecoder` converts the raw `Uint8Array` data chunks into strings.
3.  **Processing Loop**: The `while (true)` loop continuously calls `await reader.read()`.
    * It returns an object with `value` (the data chunk) and `done` (a boolean indicating whether the stream has ended).
    * If `done` is `true`, the loop breaks.
    * If a `value` exists, it's decoded into a string, appended to the `markdownBuffer`, and `updateOutput()` renders the changes.
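
The `{ stream: true }` option matters because network chunks are split at arbitrary byte boundaries, so a multi-byte UTF-8 character can straddle two chunks. The sketch below illustrates this with hand-made byte chunks rather than a real response; the helper name `decodeChunks` is hypothetical.

```javascript
// Decode a sequence of Uint8Array chunks the way the read loop does.
function decodeChunks(chunks, streaming) {
  const decoder = new TextDecoder();
  let text = '';
  for (const chunk of chunks) {
    // With stream: true the decoder keeps incomplete byte sequences
    // and finishes them when the next chunk arrives.
    text += decoder.decode(chunk, { stream: streaming });
  }
  // A final call with no arguments flushes anything still buffered.
  return text + decoder.decode();
}

// "é" takes two bytes in UTF-8 (0xC3 0xA9); split it across two chunks.
const bytes = new TextEncoder().encode('café');
const chunks = [bytes.slice(0, 4), bytes.slice(4)];

console.log(decodeChunks(chunks, true));  // café
console.log(decodeChunks(chunks, false)); // "caf" followed by two U+FFFD replacement characters
```

Without streaming mode each half of the character is decoded on its own and replaced, which would show up as garbled text in the rendered output.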

### SSE vs. `fetch` with Readable Streams

| Feature | Server-Sent Events (SSE) | `fetch` with Readable Streams |
| :--- | :--- | :--- |
| **Simplicity** | Easier to implement, especially on the frontend with the `EventSource` API. | More manual; requires handling the stream and decoding yourself. |
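| **Directionality** | One-way (server to client); `EventSource` only issues `GET` requests, so the prompt has to travel in the URL. | The request can carry a body (e.g., in a `POST` request) while the response is streamed back. |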
| **Error Handling** | The `EventSource` API has built-in error handling and automatic reconnection. | Requires manual implementation of error handling and reconnection logic. |
| **Browser Support** | Widely supported in modern browsers, though some older browsers may need a polyfill. | Supported in all modern browsers. |
| **Protocol** | Standard HTTP with a fixed wire format (`text/event-stream`). | Standard HTTP with no prescribed framing; you decide how to interpret the bytes, which gives you more control over the request and response. |

### Conclusion

Both Server-Sent Events and the `fetch` API with Readable Streams work well for consuming streamed LLM responses.

* **SSE** is the quicker path if you want real-time updates with minimal wiring.
* **The `fetch` API** gives you more flexibility and control, at the cost of more manual plumbing.

The right choice depends on your specific needs. Now you know the trade-offs, so pick the one that fits.
