In the rapidly evolving landscape of Large Language Models (LLMs), providing a seamless and real-time user experience is paramount. One of the most effective ways to achieve this is by streaming the LLM's response to the frontend as it's being generated. This not only reduces perceived latency but also creates a more engaging and interactive application: instead of showing a loading indicator while the entire response is generated, we can display the response as it arrives.
This article will provide a comprehensive guide on how to consume streamed LLM responses from a frontend application. We will explore two popular methods: Server-Sent Events (SSE) and the fetch API with Readable Streams.
When a user sends a prompt to an LLM, the model generates the response token by token. Instead of waiting for the entire response to be generated, which can take several seconds, we can stream these tokens to the frontend as they become available.
The benefits of this approach are manifold:

- Lower perceived latency: the first words appear almost immediately instead of after several seconds.
- A more engaging, interactive experience: the response renders progressively rather than sitting behind a loading indicator.
- Users can start reading (and reacting) while the rest of the response is still being generated.
Server-Sent Events (SSE) is a simple and efficient technology for pushing real-time data from a server to a client over a single, long-lived HTTP connection. It's a perfect fit for streaming LLM responses because it's a one-way communication channel from the server to the client.
It's important to note that each event needs to be formatted as data: <your_data>\n\n. Yes, you literally need the data keyword followed by a colon and a space before your data; this is a requirement of the SSE protocol. It's equally important to end each event with a blank line - the double newline \n\n is what tells the client the event is complete.
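For example, a stream that delivers the words "Hello" and "world" as two separate events, followed by the custom completion event used later in this article, would look like this on the wire:

data: Hello

data: world

event: done
data: [DONE]

Each blank line marks the end of one event; the client receives "Hello" and "world" as two separate messages.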
fetch API with Readable Streams

The fetch API, a modern and powerful tool for making HTTP requests, also provides a way to work with streaming responses. When a server sends a response with a Transfer-Encoding: chunked header, the fetch API allows you to read the response body as a ReadableStream.

The general flow is:

- Make a fetch request to a server endpoint.
- Get a reader for the response body's ReadableStream and create a TextDecoder.
- Read the stream chunk by chunk, decode each chunk, and append it to the output.
The following Node.js code creates a simple HTTP server that acts as a backend: it generates responses with the Google Generative AI SDK and streams them to a frontend application. It showcases the two distinct methods discussed above for delivering this real-time data.
The server begins by importing necessary modules and setting up the connection to the Google AI service.
- import { createServer } from 'node:http';: This line imports the fundamental http module from Node.js, which is essential for creating an HTTP server.
- import { GoogleGenAI } from '@google/genai';: This imports the official Google AI SDK for Node.js, providing a straightforward interface to interact with the Gemini family of models.
- import url from 'node:url';: This utility module is used for parsing URL strings, which allows the server to easily read the requested path and any query parameters.

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const model = 'gemini-2.0-flash';

- An instance of GoogleGenAI is created using an API key stored in the system's environment variables (process.env.GEMINI_API_KEY).
- The model is set to 'gemini-2.0-flash', a fast and efficient model ideal for chat and real-time generation tasks.

The core logic resides within the createServer callback, which executes for every incoming request.
const server = createServer(async (req, res) => {
// Set CORS Headers
res.setHeader('Access-Control-Allow-Origin', 'http://localhost:8080');
res.setHeader('Access-Control-Allow-Methods', 'GET');
res.setHeader('Access-Control-Allow-Headers', 'Content-Type');
// Parse URL and get the prompt
const { pathname, query } = url.parse(req.url ?? '', true);
const prompt = query.prompt || 'What is Star Wars?';
- The res.setHeader calls are crucial for enabling Cross-Origin Resource Sharing (CORS). They explicitly permit a frontend running on http://localhost:8080 to access this server, which is running on a different port (3000).
- url.parse reads the requested path (pathname) and extracts the user's prompt from the query string. If no prompt is provided, it uses a default value.

The server then uses the generateContentStream method to get a real-time stream from the AI.
const response = await ai.models.generateContentStream({
model,
contents: prompt,
config: {
systemInstruction: 'Please keep your response short and concise. Maximum 200 words.'
}
});
This is the key interaction with the Gemini API. Instead of waiting for the full response, generateContentStream
returns an asynchronous iterable. This allows the server to loop through the response chunks as they are generated by the model, enabling the streaming functionality. A systemInstruction
is also included to guide the AI's tone and length.
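The sample keeps error handling out of the picture for brevity. In a real server you would probably want to guard this call, since a failed API request would otherwise throw inside the request handler. A minimal sketch (my addition, using the same variables as the sample) could look like this:

let response;
try {
  // Same call as above, just wrapped so a failed API request
  // doesn't crash the request handler.
  response = await ai.models.generateContentStream({
    model,
    contents: prompt,
    config: {
      systemInstruction: 'Please keep your response short and concise. Maximum 200 words.'
    }
  });
} catch (err) {
  console.error('Gemini request failed:', err);
  res.writeHead(500, { 'Content-Type': 'text/plain' });
  res.end('The model request failed.');
  return;
}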
The server logic then branches based on the requested pathname.
The /sse Endpoint (Server-Sent Events)

This endpoint is designed for clients that use the EventSource API (more on this later).
if (pathname === '/sse') {
res.writeHead(200, {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
});
for await (const chunk of response) {
res.write(`data: ${chunk.text}\n\n`);
}
res.write('event: done\ndata: [DONE]\n\n');
res.end();
}
The res.writeHead method sends a 200 OK status and sets three critical headers:

- Content-Type: text/event-stream: Informs the client to process the response as an event stream.
- Cache-Control: no-cache: Ensures the client always gets a fresh response.
- Connection: keep-alive: Keeps the HTTP connection open to push multiple events.

Inside the for await...of loop, each chunk received from the AI is formatted according to the SSE protocol: data: <text_chunk>\n\n. The data: prefix is mandatory, and the double newline \n\n signals the end of a single event. A final custom event (event: done) is sent to explicitly signal to the frontend that the transmission is complete.
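One subtlety worth flagging as an aside (the sample above doesn't deal with it): the SSE format is line-based, so if chunk.text itself contains newlines - which Markdown responses often do - every line of the chunk needs its own data: prefix, otherwise the extra lines are dropped by the parser. A more defensive write could look roughly like this:

// Prefix every line of the chunk with "data: " so multi-line chunks
// survive the SSE framing; EventSource rejoins the lines with newlines
// on the client side.
const lines = chunk.text.split('\n');
res.write(lines.map((line) => `data: ${line}`).join('\n') + '\n\n');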
The /fetch Endpoint (Chunked Response)

This endpoint provides a more generic stream of raw text, suitable for consumption with the fetch API and a ReadableStream.
else if (pathname === '/fetch') {
res.writeHead(200, {
'Content-Type': 'text/plain',
'Transfer-Encoding': 'chunked',
'Cache-Control': 'no-cache',
});
for await (const chunk of response) {
res.write(chunk.text);
}
res.end();
}
- Content-Type: text/plain: The data is just plain text.
- Transfer-Encoding: chunked: This is the key header that informs the client that the response body will arrive in a series of chunks rather than all at once.
- The loop writes chunk.text directly to the response stream without any special formatting.
- res.end() is called after the loop finishes, which closes the connection and signals the end of the chunked response.

Finally, the server.listen method starts the server and makes it ready to accept connections.
server.listen(3000, () => {
console.log('Server running at http://localhost:3000');
console.log('Try /sse or /fetch');
});
From your CLI you can run node --experimental-strip-types --watch --env-file=.env server.ts, which will start the server.
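The --env-file=.env flag assumes a .env file next to server.ts that holds your key, for example a single line like GEMINI_API_KEY=your-api-key-here. You can also sanity-check both endpoints straight from the terminal, e.g. with curl -N "http://localhost:3000/sse?prompt=Hello" (the -N flag disables curl's output buffering so the events appear as they arrive).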
Now let's take a look at the frontend, which provides a user interface to interact with the streaming server. It allows a user to enter a prompt and then choose one of two methods - Server-Sent Events (SSE) or fetch() - to stream the response from the Gemini API. The code also uses the marked.js library to render the incoming Markdown response as formatted HTML in real time.
The script initialises a few key variables and helper functions to manage the state and display of the output.
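The full page markup isn't reproduced in this article, but the shared setup might look roughly like the sketch below (assuming the marked library is loaded globally and the page contains an <input id="prompt"> and a <div id="output">):

// Accumulates the Markdown text received so far.
let markdownBuffer = '';

// The element where the rendered response is displayed.
const output = document.getElementById('output');

// Re-render the accumulated Markdown as HTML whenever new text arrives.
function updateOutput() {
  output.innerHTML = marked.parse(markdownBuffer);
}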
- markdownBuffer: A string variable that accumulates the text chunks received from the server.
- output: A reference to the <div id="output"></div> element where the response will be rendered.
- updateOutput(): A central function that takes the current content of markdownBuffer, parses it using marked.parse(), and then injects the resulting HTML into the output element. This function is called repeatedly as new data arrives, creating a real-time rendering effect.

The runSSE() function handles the connection using the browser's built-in EventSource API, which is designed specifically for this type of one-way data stream from a server.
function runSSE() {
// 1. Prepare for a new request
const prompt = encodeURIComponent(document.getElementById('prompt').value);
markdownBuffer = '';
updateOutput();
// 2. Create an EventSource instance
const eventSource = new EventSource(`http://localhost:3000/sse?prompt=${prompt}`);
// 3. Handle incoming messages
eventSource.onmessage = (e) => {
markdownBuffer += e.data;
updateOutput();
};
// 4. Listen for the custom 'done' event
eventSource.addEventListener('done', () => {
eventSource.close();
});
// 5. Handle errors
eventSource.onerror = (err) => {
console.error('SSE error:', err);
eventSource.close();
};
}
1. It clears the markdownBuffer and the output div, and retrieves the user's prompt, encoding it for safe inclusion in a URL.
2. It creates a new EventSource object, pointing it to the /sse endpoint on the server, with the prompt passed as a query parameter. This automatically establishes a persistent connection.
3. The onmessage event listener is the primary handler. It's triggered every time the server sends a data: field. The text from e.data is appended to the markdownBuffer, and updateOutput() is called to re-render the HTML.
4. It listens for the custom done event that the server sends when the stream is finished. Upon receiving this event, it closes the connection using eventSource.close().
5. The onerror handler logs the error and closes the connection to prevent further issues.

The fetch() API

The runFetch() function demonstrates how to handle a stream using the more general-purpose fetch API. This approach is more manual but also more versatile.
async function runFetch() {
// 1. Prepare for a new request
const prompt = encodeURIComponent(document.getElementById('prompt').value);
markdownBuffer = '';
updateOutput();
// 2. Make the fetch request and get the reader
const res = await fetch(`http://localhost:3000/fetch?prompt=${prompt}`);
const reader = res.body?.getReader();
const decoder = new TextDecoder();
if (!reader) return;
// 3. Read the stream in a loop
while (true) {
const { value, done } = await reader.read();
if (done) break; // Exit loop when stream is finished
if (value) {
markdownBuffer += decoder.decode(value, { stream: true });
updateOutput();
}
}
}
- The function begins with an await-ed fetch call to the /fetch endpoint. The key step here is getting the ReadableStream from res.body and creating a reader with getReader() to process it. A TextDecoder is also initialised to convert the raw Uint8Array data chunks into strings.
- The while (true) loop continuously calls await reader.read().
- Each call resolves with value (the chunk of data) and done (a boolean indicating if the stream has ended).
- When done is true, the loop breaks.
- If value exists, it's decoded into a string, appended to the markdownBuffer, and the updateOutput() function is called to render the changes.

SSE vs. fetch with Readable Streams

| Feature | Server-Sent Events (SSE) | fetch with Readable Streams |
|---|---|---|
| Simplicity | Easier to implement, especially on the frontend with the EventSource API. | More complex, requires manual handling of the stream and decoding. |
| Directionality | One-way (server to client). | Can be used for both sending and receiving data (e.g., in a POST request). |
| Error Handling | The EventSource API has built-in error handling and automatic reconnection. | Requires manual implementation of error handling and reconnection logic. |
| Browser Support | Widely supported in modern browsers, but some older browsers may require a polyfill. | Supported in all modern browsers. |
| Protocol | Built on top of standard HTTP. | A lower-level API that gives you more control over the request and response. |
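The directionality point deserves a quick illustration: unlike EventSource, which only issues GET requests, the fetch approach also works when the prompt travels in a request body. The following is only a sketch - the sample server above handles GET requests and would need its URL parsing and CORS headers adapted to accept POST:

async function runFetchPost() {
  // Hypothetical variant of runFetch() that sends the prompt in a POST body
  // instead of the query string (assumes a server adapted to read the body).
  const res = await fetch('http://localhost:3000/fetch', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: document.getElementById('prompt').value }),
  });
  const reader = res.body?.getReader();
  const decoder = new TextDecoder();
  if (!reader) return;
  // From here on, reading and decoding works exactly as in runFetch().
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    markdownBuffer += decoder.decode(value, { stream: true });
    updateOutput();
  }
}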
Both Server-Sent Events and the fetch API with Readable Streams are excellent choices for consuming streamed LLM responses from a frontend.

- SSE, via the EventSource API, is the simpler option, with built-in error handling and automatic reconnection.
- The fetch API provides more flexibility and control, but it also comes with a steeper learning curve.

The best choice for your application will depend on your specific needs and requirements. By understanding the pros and cons of each approach, you can make an informed decision and build a more responsive and engaging user experience for your LLM-powered application.