
Five tools, a working agent loop — and yet, on every turn, the user waited silently for 3, 5, sometimes 15 seconds before the answer arrived in one block. This post fixes that: SSE streaming. Tokens appear on stdout as the model produces them. It's what flips Hypercode from a demo into something you actually want to use.
Code stays on github.com/alexisbchz/hypercode.
The obvious argument — "it's more responsive" — isn't the only one. Streaming changes the perception of time. Without it, the user can't tell if the model has stalled, if we're waiting on tools, if the network is dead. With streaming, every token is a sign of life. A 10-second block response is unbearable; the same response as a flow is comfortable.
On the agent side, it's also a finish_reason signal. When we see data: [DONE], we know this turn is over. We can move on — execute tools, prompt the user — without guessing.
OpenRouter, like OpenAI, sends a Server-Sent Events stream when you pass stream: true. Each event is a line prefixed with "data: ":
data: {"id":"...","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"...","choices":[{"delta":{"content":" world"},"finish_reason":null}]}
data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Three changes from non-streaming:
1. delta (not message): each chunk contains only the increment.
2. Events are separated by a blank line (\n\n).
3. The stream ends with a literal data: [DONE].
For tool calls, it's twistier. Instead of one chunk with the whole call, the model sends fragments, each tagged with an index:
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"id":"call_abc","function":{"name":"read","arguments":""}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"path"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\"src/main.zig\"}"}}]}}]}
data: {"choices":[{"finish_reason":"tool_calls"}]}
data: [DONE]
The id, name, and arguments arrive in pieces. We have to reassemble them by index before we can execute the tool.
Fortunately, OpenAI/OpenRouter doesn't interleave in practice: a turn is either streamed text or reassembled tool_calls. Not both.
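Since a turn ends up as either text or tool calls, the last chunk's finish_reason is enough to pick the next step. A minimal sketch of that decision, assuming a hypothetical NextStep enum (the real code returns a Result union instead):

```zig
const std = @import("std");

/// Hypothetical enum for this sketch; not part of the Hypercode source.
const NextStep = enum { keep_reading, prompt_user, run_tools };

/// Map a chunk's finish_reason to what the agent loop does next.
fn nextStep(finish_reason: ?[]const u8) NextStep {
    // null means the model is still producing chunks.
    const reason = finish_reason orelse return .keep_reading;
    if (std.mem.eql(u8, reason, "tool_calls")) return .run_tools;
    // "stop" and any other reason: the text turn is over.
    return .prompt_user;
}

test "a null finish_reason keeps the loop reading" {
    try std.testing.expectEqual(NextStep.keep_reading, nextStep(null));
}
```

Run it with zig test. Treating every unknown reason as "prompt the user" is a simplification chosen for this sketch.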
One line in the body builder:
try s.objectField("stream");
try s.write(true);
And the request changes shape. From now on, the response is an SSE stream instead of one JSON.
client.fetch to client.request
client.fetch is convenient, but it waits for the whole response before filling our buffer. To stream, we drop down a level and talk to the response reader directly:
var req = client.request(.POST, uri, .{
.headers = .{ .accept_encoding = .omit },
.extra_headers = &.{
.{ .name = "authorization", .value = auth },
.{ .name = "content-type", .value = "application/json" },
.{ .name = "accept", .value = "text/event-stream" },
},
}) catch return .network_error;
defer req.deinit();
req.sendBodyComplete(payload) catch return .network_error;
var redirect_buf: [constants.stream_redirect_buffer_bytes]u8 = undefined;
var response = req.receiveHead(&redirect_buf) catch return .network_error;
if (response.head.status != .ok) return .{ .http_status = @intFromEnum(response.head.status) };
var transfer_buf: [constants.stream_transfer_buffer_bytes]u8 = undefined;
const reader = response.reader(&transfer_buf);
| Element | Why |
|---|---|
| accept_encoding = .omit | Without it, Zig negotiates gzip by default. OpenRouter compresses, we read garbage. Disabling forces raw text. |
| accept: text/event-stream | Signals SSE to the server. Not strictly required (the body's stream: true suffices), but clean. |
| Static 16 KiB transfer_buf | The reader's intermediate buffer. One allocation for the whole stream. |
while (true) {
const raw = reader.takeDelimiterInclusive('\n') catch |err| switch (err) {
error.EndOfStream => break,
else => return .network_error,
};
const line = std.mem.trim(u8, raw, "\r\n");
if (line.len == 0) continue;
if (!std.mem.startsWith(u8, line, "data: ")) continue;
const data = line[6..];
if (std.mem.eql(u8, data, "[DONE]")) break;
const parsed = std.json.parseFromSlice(Delta, gpa, data, .{ .ignore_unknown_fields = true }) catch continue;
defer parsed.deinit();
if (parsed.value.choices.len == 0) continue;
const delta = parsed.value.choices[0].delta;
if (delta.content) |c| if (c.len > 0) {
try out_writer.writeAll(c);
try out_writer.flush();
try text_acc.writer.writeAll(c);
};
if (delta.tool_calls) |calls| {
for (calls) |c| try absorb_chunk(&accumulators, c);
}
}
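The loop parses each payload into a Delta type that this post does not show. The sketch below is one plausible shape, assuming only the fields the loop reads; the struct names other than Delta and ChunkToolCall are hypothetical, and contentOf is a helper invented for the example. Every field has a default so that chunks with missing keys (such as the final finish_reason chunk, which has no delta) still parse.

```zig
const std = @import("std");

const ChunkFunction = struct {
    name: ?[]const u8 = null,
    arguments: ?[]const u8 = null,
};
const ChunkToolCall = struct {
    index: usize = 0,
    id: ?[]const u8 = null,
    function: ?ChunkFunction = null,
};
const DeltaBody = struct {
    content: ?[]const u8 = null,
    tool_calls: ?[]const ChunkToolCall = null,
};
const Choice = struct {
    delta: DeltaBody = .{},
    finish_reason: ?[]const u8 = null,
};
const Delta = struct {
    choices: []const Choice = &.{},
};

/// Parse one data: payload and copy its text content (if any) into out.
/// Returns the number of bytes copied.
fn contentOf(gpa: std.mem.Allocator, payload: []const u8, out: []u8) !usize {
    const parsed = try std.json.parseFromSlice(Delta, gpa, payload, .{ .ignore_unknown_fields = true });
    defer parsed.deinit(); // parsed strings die here, hence the copy below
    if (parsed.value.choices.len == 0) return 0;
    const text = parsed.value.choices[0].delta.content orelse return 0;
    if (text.len > out.len) return error.NoSpaceLeft;
    @memcpy(out[0..text.len], text);
    return text.len;
}

test "a text chunk yields its content" {
    const payload =
        \\{"id":"...","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}
    ;
    var buf: [32]u8 = undefined;
    const n = try contentOf(std.testing.allocator, payload, &buf);
    try std.testing.expectEqualStrings("Hello", buf[0..n]);
}
```

The copy into out matters because parsed.deinit() frees everything the parser allocated, which is also why the loop above writes content to its accumulator before the deferred deinit runs.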
Six teaching points:
1. takeDelimiterInclusive('\n') instead of takeDelimiterExclusive.
In Post 04 (the conversation post), we discovered that takeDelimiterExclusive returns the content without the delimiter, but doesn't consume the \n either. On an empty line, it returns "" on every call without advancing. Infinite loop. The Inclusive variant consumes the \n properly; we strip it afterwards with std.mem.trim.
2. if (line.len == 0) continue;
Empty lines are SSE event separators. Skip.
3. if (!std.mem.startsWith(u8, line, "data: ")) continue;
SSE also allows comments (:), event:, and id: lines we don't use. Filter.
4. [DONE] is a literal string, not JSON.
It's an OpenAI-specific end-of-stream marker. Exit the loop.
5. parseFromSlice ... catch continue;
A malformed chunk doesn't blow up the whole turn. Skip and continue. Rare in practice, but OpenRouter's public surface can serve anything.
6. out_writer.flush() after every chunk.
Without flush, the stdout buffer (std.Io.File.Writer) holds the bytes. With flush, they leave for the terminal immediately. That's what makes the "tokens appearing" effect visible.
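Points 2 to 4 amount to a small line classifier. Here it is pulled out as a pure function so it can be tested without a network connection; LineKind and classify are illustrative names, not part of the Hypercode source.

```zig
const std = @import("std");

/// What one SSE line means to the read loop.
const LineKind = union(enum) {
    skip, // blank separator, comment, or a field we ignore (event:, id:)
    done, // the literal [DONE] marker
    payload: []const u8, // JSON text after "data: "
};

fn classify(raw: []const u8) LineKind {
    const line = std.mem.trim(u8, raw, "\r\n");
    if (line.len == 0) return .skip;
    if (!std.mem.startsWith(u8, line, "data: ")) return .skip;
    const data = line["data: ".len..];
    if (std.mem.eql(u8, data, "[DONE]")) return .done;
    return .{ .payload = data };
}

test "the end marker is recognised" {
    try std.testing.expect(classify("data: [DONE]\n") == .done);
}
```

One caveat: the SSE format makes the space after the colon optional, so a server could legally send data:{...}. This sketch, like the loop above, only accepts the "data: " form that appears in the sample streams.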
A function to absorb a fragment:
const ToolAccumulator = struct {
used: bool = false,
id: std.Io.Writer.Allocating,
name: std.Io.Writer.Allocating,
arguments: std.Io.Writer.Allocating,
};
fn absorb_chunk(accumulators: *[constants.tool_calls_per_response_max]ToolAccumulator, c: ChunkToolCall) !void {
if (c.index >= accumulators.len) return;
const acc = &accumulators[c.index];
acc.used = true;
if (c.id) |id| try acc.id.writer.writeAll(id);
if (c.function) |f| {
if (f.name) |n| try acc.name.writer.writeAll(n);
if (f.arguments) |a| try acc.arguments.writer.writeAll(a);
}
}
It's an array of 16 accumulators (tool_calls_per_response_max). Each chunk has an index; we use it to route the fragment to the right accumulator. Three Allocating writers per accumulator: id, name, arguments. Each grows as fragments arrive.
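To see the routing work end to end, the sketch below replays the three fragments from the SSE example earlier. It makes two simplifications that are assumptions of this example, not the post's code: fixed-capacity buffers stand in for the Allocating writers, and a hypothetical flattened Fragment type replaces ChunkToolCall.

```zig
const std = @import("std");

const max_calls = 16; // mirrors tool_calls_per_response_max

/// Hypothetical flattened fragment (the real chunk nests name/arguments under function).
const Fragment = struct {
    index: usize,
    id: ?[]const u8 = null,
    name: ?[]const u8 = null,
    arguments: ?[]const u8 = null,
};

/// Fixed-capacity text buffer standing in for an Allocating writer.
fn Buf(comptime cap: usize) type {
    return struct {
        bytes: [cap]u8 = undefined,
        len: usize = 0,

        fn append(self: *@This(), piece: []const u8) !void {
            if (piece.len > cap - self.len) return error.NoSpaceLeft;
            @memcpy(self.bytes[self.len..][0..piece.len], piece);
            self.len += piece.len;
        }

        fn slice(self: *const @This()) []const u8 {
            return self.bytes[0..self.len];
        }
    };
}

const Acc = struct {
    used: bool = false,
    id: Buf(64) = .{},
    name: Buf(64) = .{},
    arguments: Buf(1024) = .{},
};

fn absorb(accs: *[max_calls]Acc, f: Fragment) !void {
    if (f.index >= accs.len) return; // out-of-range index: ignored
    const acc = &accs[f.index];
    acc.used = true;
    if (f.id) |id| try acc.id.append(id);
    if (f.name) |n| try acc.name.append(n);
    if (f.arguments) |a| try acc.arguments.append(a);
}

test "three fragments rebuild one call" {
    var accs = [_]Acc{.{}} ** max_calls;
    try absorb(&accs, .{ .index = 0, .id = "call_abc", .name = "read", .arguments = "" });
    try absorb(&accs, .{ .index = 0, .arguments = "{\"path" });
    try absorb(&accs, .{ .index = 0, .arguments = "\":\"src/main.zig\"}" });
    try std.testing.expectEqualStrings("{\"path\":\"src/main.zig\"}", accs[0].arguments.slice());
}
```

The arguments string is only valid JSON once the last fragment has landed, which is why nothing is parsed or executed until the stream ends.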
When the stream ends, collect those that saw at least one fragment:
fn collect_tool_calls(gpa: std.mem.Allocator, accumulators: *[constants.tool_calls_per_response_max]ToolAccumulator) ![]const ToolCall {
var count: usize = 0;
for (accumulators.*) |acc| if (acc.used) { count += 1; };
if (count == 0) return &.{};
const owned = try gpa.alloc(ToolCall, count);
var i: usize = 0;
for (accumulators) |*acc| {
if (!acc.used) continue;
owned[i] = .{
.id = try acc.id.toOwnedSlice(),
.name = try acc.name.toOwnedSlice(),
.arguments_json = try acc.arguments.toOwnedSlice(),
};
i += 1;
}
return owned;
}
toOwnedSlice transfers ownership from the accumulator to the returned slice — no copy. The accumulators become empty; deinit on return is free.
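The hand-off is easy to check in isolation. The sketch below assumes a Zig version that ships std.Io.Writer.Allocating (the same type the accumulators use); joinFragments is a hypothetical helper written for this example.

```zig
const std = @import("std");

/// Concatenate fragments and hand the resulting bytes to the caller.
fn joinFragments(gpa: std.mem.Allocator, pieces: []const []const u8) ![]u8 {
    var acc = std.Io.Writer.Allocating.init(gpa);
    // After a successful toOwnedSlice the accumulator is empty, so this frees nothing.
    // On an error path it releases whatever was buffered so far.
    defer acc.deinit();
    for (pieces) |p| try acc.writer.writeAll(p);
    return acc.toOwnedSlice(); // the caller owns the result and must free it
}

test "the caller owns the joined bytes" {
    const gpa = std.testing.allocator;
    const id = try joinFragments(gpa, &.{ "call_", "abc" });
    defer gpa.free(id);
    try std.testing.expectEqualStrings("call_abc", id);
}
```

std.testing.allocator fails the test on a leak or a double free, so a passing run is a reasonable sign that the bytes have exactly one owner at each point.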
call changes
Before, call returned a Result.text: []const u8 at the end. Now the text is written, as it arrives, to a writer we pass in. At the end, call still returns the complete text so we can append it to the session.
pub fn call(
gpa: std.mem.Allocator,
io: std.Io,
api_key: []const u8,
model: []const u8,
messages: []const session.Message,
out_writer: *std.Io.Writer, // ← new
) !Result {
The caller (main) passes stdout. While call runs, tokens appear in real time. When the function returns with .text, we have the complete text — used by session.append_assistant_text for multi-turn memory.
const result = try openrouter.call(gpa, io, cfg.api_key, cfg.model, session.messages.items, stdout);
switch (result) {
.text => |text| {
try stdout.writeAll("\n");
try stdout.flush();
try session.append_assistant_text(text);
return;
},
.tool_calls => |calls| {
try session.append_assistant_tool_calls(calls);
for (calls) |c| try run_one_tool(gpa, io, session, stderr, c);
},
// ...
}
The trailing \n is necessary — the model doesn't terminate its responses with a newline, so the next prompt would stick to the last line.
The alternative — pass stdout and recover the bytes written to it — would need a tee writer (a writer that emits to two destinations). Doable, but one more vtable to implement. Simpler: openrouter.call writes to out_writer and to its own buffer, doubling the in-memory I/O (negligible), keeping the code linear.
if (delta.content) |c| if (c.len > 0) {
try out_writer.writeAll(c); // visible to the user
try out_writer.flush();
try text_acc.writer.writeAll(c); // for the return
};
That's the cost of simplicity: three clear lines instead of a custom Writer.
./zig-out/bin/hypercode "Say hi in 3 short words."
The text appears word by word in the terminal. No waiting.
Hi there friend!
With a tool:
./zig-out/bin/hypercode "Read src/main.zig and tell me what the answer function does, in 2 sentences."
→ read({"path": "src/main.zig"})
The `answer` function runs an agent loop that calls the OpenRouter API to send
the conversation history to an LLM and receive a response. If the LLM returns
tool calls, it executes them via `run_one_tool` and continues the loop; if it
returns text, it outputs the answer and returns.
The tool call line appears in one piece (it is printed once its fragments are reassembled, not streamed), then the answer text streams in. Comfortable.
969473f feat(openrouter): stream chat completions; accumulate tool_calls by index
One commit for the whole streaming change. The previous refactor — Message as a tagged union, ownership transfer — made the transition clean: we touch only openrouter.zig and six lines of main.zig. The rest of the code doesn't even know the response arrives as a stream.
Hypercode is now fluid. The model talks, we read. If the model wants a tool, we run it, and the cycle continues. That's the UX a coding assistant should have.
With five tools (read, write, edit, bash, grep) plus streaming, the base of a working agent is there. Two big pieces remain in the series:
1. Persistence.
2. Sandboxing for bash. Probably bwrap on Linux, sandbox-exec on macOS.
In the next post, we tackle persistence. That's what turns a stateless agent into something that resembles a real work session.
Stuck, or want to share notes? Join the Discord server.