Building a Coding Agent in Zig: SSE streaming


Five tools, a working agent loop — and yet, on every turn, the user waited silently for 3, 5, sometimes 15 seconds before the answer arrived in one block. This post fixes that: SSE streaming. Tokens appear on stdout as the model produces them. It's what flips Hypercode from a demo into something you actually want to use.

Code stays on github.com/alexisbchz/hypercode.

Why streaming, really

The obvious argument — "it's more responsive" — isn't the only one. Streaming changes the perception of time. Without it, the user can't tell if the model has stalled, if we're waiting on tools, if the network is dead. With streaming, every token is a sign of life. A 10-second block response is unbearable; the same response as a flow is comfortable.

On the agent side, the stream also carries an explicit end-of-turn signal. The last chunk sets finish_reason, and data: [DONE] follows it: we know this turn is over. We can move on (execute tools, prompt the user) without guessing.

The shape of an SSE stream

OpenRouter, like OpenAI, sends a Server-Sent Events stream when you pass stream: true. Each event is a line prefixed with `data: ` (the word data, a colon, and a space):

data: {"id":"...","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"...","choices":[{"delta":{"content":" world"},"finish_reason":null}]}

data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Three changes from non-streaming:

The field is delta (not message) — each chunk contains the increment.
Events are separated by a blank line (\n\n).
The end of the stream is marked by data: [DONE].

For tool calls, it's twistier. Instead of one chunk with the whole call, the model sends fragments, each tagged with an index:

data: {"choices":[{"delta":{"tool_calls":[{"index":0,"id":"call_abc","function":{"name":"read","arguments":""}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"path"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\"src/main.zig\"}"}}]}}]}
data: {"choices":[{"finish_reason":"tool_calls"}]}
data: [DONE]

The id, name, and arguments arrive in pieces. We have to reassemble them by index before we can execute the tool.
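To make the chunk shape concrete, here is a minimal sketch of Zig types that could decode one `data: ` payload. The type names `Chunk`, `ChunkChoice`, `ChunkDelta`, and `ChunkFunction` and the helper `parseChunk` are assumptions made for illustration; only `ChunkToolCall` and its `index`, `id`, and `function` fields are implied by the code later in this post. Every field that a fragment may omit is optional and defaults to null.

```zig
const std = @import("std");

// Hypothetical shapes for one streamed chunk. Fields that may be missing
// from a fragment are optional, so a partial chunk still parses.
const ChunkFunction = struct {
    name: ?[]const u8 = null,
    arguments: ?[]const u8 = null,
};

const ChunkToolCall = struct {
    index: usize,
    id: ?[]const u8 = null,
    function: ?ChunkFunction = null,
};

const ChunkDelta = struct {
    content: ?[]const u8 = null,
    tool_calls: ?[]const ChunkToolCall = null,
};

const ChunkChoice = struct {
    // The final chunk of a tool-call turn may carry no delta at all.
    delta: ChunkDelta = .{},
    finish_reason: ?[]const u8 = null,
};

const Chunk = struct {
    choices: []const ChunkChoice = &.{},
};

/// Parses the JSON payload of one `data: ` line (prefix already removed).
/// The caller must call deinit() on the result.
fn parseChunk(gpa: std.mem.Allocator, data: []const u8) !std.json.Parsed(Chunk) {
    return std.json.parseFromSlice(Chunk, gpa, data, .{ .ignore_unknown_fields = true });
}
```

With these shapes, a text chunk shows up as a non-null `delta.content`, and a tool-call fragment as a non-null `delta.tool_calls` whose entries carry whichever of `id`, `name`, and `arguments` that fragment happened to include.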

Fortunately, OpenAI/OpenRouter doesn't interleave in practice: a turn is either streamed text or reassembled tool_calls. Not both.

Enable streaming

One line in the body builder:

src/openrouter.zig

try s.objectField("stream");
try s.write(true);

And the response changes shape. From now on, it is an SSE stream instead of a single JSON document.

From `client.fetch` to `client.request`

client.fetch is convenient but it waits for the end before filling our buffer. To stream, we drop down a level and talk to the response reader directly:

var req = client.request(.POST, uri, .{
    .headers = .{ .accept_encoding = .omit },
    .extra_headers = &.{
        .{ .name = "authorization", .value = auth },
        .{ .name = "content-type", .value = "application/json" },
        .{ .name = "accept", .value = "text/event-stream" },
    },
}) catch return .network_error;
defer req.deinit();

req.sendBodyComplete(payload) catch return .network_error;

var redirect_buf: [constants.stream_redirect_buffer_bytes]u8 = undefined;
var response = req.receiveHead(&redirect_buf) catch return .network_error;
if (response.head.status != .ok) return .{ .http_status = @intFromEnum(response.head.status) };

var transfer_buf: [constants.stream_transfer_buffer_bytes]u8 = undefined;
const reader = response.reader(&transfer_buf);

| Element | Why |
| --- | --- |
| `accept_encoding = .omit` | Without it, Zig negotiates gzip by default. OpenRouter compresses, we read garbage. Disabling forces raw text. |
| `accept: text/event-stream` | Signals SSE to the server. Not strictly required (the body's `stream: true` suffices), but clean. |
| Static 16 KiB `transfer_buf` | The reader's intermediate buffer. One allocation for the whole stream. |

Line-by-line read loop

while (true) {
    const raw = reader.takeDelimiterInclusive('\n') catch |err| switch (err) {
        error.EndOfStream => break,
        else => return .network_error,
    };
    const line = std.mem.trim(u8, raw, "\r\n");
    if (line.len == 0) continue;
    if (!std.mem.startsWith(u8, line, "data: ")) continue;
    const data = line[6..];
    if (std.mem.eql(u8, data, "[DONE]")) break;

    const parsed = std.json.parseFromSlice(Delta, gpa, data, .{ .ignore_unknown_fields = true }) catch continue;
    defer parsed.deinit();
    if (parsed.value.choices.len == 0) continue;
    const delta = parsed.value.choices[0].delta;

    if (delta.content) |c| if (c.len > 0) {
        try out_writer.writeAll(c);
        try out_writer.flush();
        try text_acc.writer.writeAll(c);
    };

    if (delta.tool_calls) |calls| {
        for (calls) |c| try absorb_chunk(&accumulators, c);
    }
}

Six teaching points:

1. takeDelimiterInclusive('\n') instead of takeDelimiterExclusive.

In Post 04 (the conversation post), we discovered that takeDelimiterExclusive returns the content without the delimiter but does not consume the \n either. On an empty line, it returns "" on every call without advancing: an infinite loop. The Inclusive variant consumes the \n properly; we strip it afterwards with std.mem.trim.

2. if (line.len == 0) continue;

Empty lines are SSE event separators. Skip.

3. if (!std.mem.startsWith(u8, line, "data: ")) continue;

SSE also allows comments (:), event:, and id: lines we don't use. Filter.

4. [DONE] is a literal string, not JSON.

It's a convention of the OpenAI API, not part of the SSE format itself, so we compare it as plain text before any JSON parsing. Exit the loop.

5. parseFromSlice ... catch continue;

A malformed chunk doesn't blow up the whole turn. Skip and continue. Rare in practice, but OpenRouter fronts many providers, so we don't assume every chunk is well-formed.

6. out_writer.flush() after every chunk.

Without flush, the stdout buffer (std.Io.File.Writer) holds the bytes. With flush, they leave for the terminal immediately. That's what makes the "tokens appearing" effect visible.
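Points 2 to 4 can be folded into one small pure function, which makes the filtering easy to test without a network connection. This is a sketch, not the code in the repository: `SseLine` and `classifyLine` are hypothetical names. It also differs slightly from the loop above in that it accepts `data:` without the space, since the SSE format treats the single space after the colon as optional.

```zig
const std = @import("std");

// Hypothetical helper: what one raw line of the stream means to the loop.
const SseLine = union(enum) {
    skip, // blank separator, comment, or a field we do not use
    done, // the literal [DONE] marker
    data: []const u8, // JSON payload with the prefix removed
};

fn classifyLine(raw: []const u8) SseLine {
    // Strip the line terminator left in place by the Inclusive read.
    const line = std.mem.trim(u8, raw, "\r\n");
    if (line.len == 0) return .skip;
    if (!std.mem.startsWith(u8, line, "data:")) return .skip;

    var payload = line["data:".len..];
    // At most one leading space belongs to the field separator.
    if (payload.len > 0 and payload[0] == ' ') payload = payload[1..];

    if (std.mem.eql(u8, payload, "[DONE]")) return .done;
    return .{ .data = payload };
}
```

The read loop would then switch on the result: `.skip` continues, `.done` breaks, and `.data` goes to the JSON parser. The returned slice points into the reader's buffer, so it has to be consumed before the next read.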

Reassembling tool calls

A function to absorb a fragment:

const ToolAccumulator = struct {
    used: bool = false,
    id: std.Io.Writer.Allocating,
    name: std.Io.Writer.Allocating,
    arguments: std.Io.Writer.Allocating,
};

fn absorb_chunk(accumulators: *[constants.tool_calls_per_response_max]ToolAccumulator, c: ChunkToolCall) !void {
    if (c.index >= accumulators.len) return;
    const acc = &accumulators[c.index];
    acc.used = true;
    if (c.id) |id| try acc.id.writer.writeAll(id);
    if (c.function) |f| {
        if (f.name) |n| try acc.name.writer.writeAll(n);
        if (f.arguments) |a| try acc.arguments.writer.writeAll(a);
    }
}

It's an array of 16 accumulators (tool_calls_per_response_max). Each chunk has an index; we use it to route the fragment to the right accumulator. Three Allocating writers per accumulator: id, name, arguments. Each grows as fragments arrive.
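To see the routing in isolation, here is a self-contained sketch that replays the three fragments from the stream example earlier in this post. It swaps the Allocating writers for fixed 256-byte buffers so that it needs no allocator; `Field`, `Slot`, `Fragment`, and `absorb` are hypothetical names, and the fragment type is flattened compared to the real chunk shape.

```zig
const std = @import("std");

// Hypothetical fixed-capacity stand-in for an Allocating writer.
const Field = struct {
    buf: [256]u8 = undefined,
    len: usize = 0,

    fn append(self: *Field, bytes: []const u8) !void {
        if (bytes.len > self.buf.len - self.len) return error.NoSpaceLeft;
        @memcpy(self.buf[self.len..][0..bytes.len], bytes);
        self.len += bytes.len;
    }

    fn slice(self: *const Field) []const u8 {
        return self.buf[0..self.len];
    }
};

const Slot = struct {
    used: bool = false,
    name: Field = .{},
    arguments: Field = .{},
};

// Flattened fragment: the real chunk nests name and arguments under `function`.
const Fragment = struct {
    index: usize,
    name: ?[]const u8 = null,
    arguments: ?[]const u8 = null,
};

fn absorb(slots: []Slot, f: Fragment) !void {
    // Out-of-range indexes are dropped, as in absorb_chunk above.
    if (f.index >= slots.len) return;
    const slot = &slots[f.index];
    slot.used = true;
    if (f.name) |n| try slot.name.append(n);
    if (f.arguments) |a| try slot.arguments.append(a);
}
```

Feeding it `{ index 0, name "read", arguments "" }`, then the two argument fragments, leaves slot 0 holding the name read and the complete JSON arguments, while the other slots stay unused. The arguments only become valid JSON once the last fragment has arrived, which is why nothing is executed before the stream ends.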

When the stream ends, collect those that saw at least one fragment:

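fn collect_tool_calls(
    gpa: std.mem.Allocator,
    accumulators: *[constants.tool_calls_per_response_max]ToolAccumulator,
) ![]const ToolCall {
    // First pass: count the slots that received at least one fragment.
    var count: usize = 0;
    for (accumulators) |*acc| {
        if (acc.used) count += 1;
    }
    if (count == 0) return &.{};

    const owned = try gpa.alloc(ToolCall, count);
    // Frees the array itself if a later toOwnedSlice fails.
    errdefer gpa.free(owned);

    // Second pass: move each accumulated buffer into the result.
    var i: usize = 0;
    for (accumulators) |*acc| {
        if (!acc.used) continue;
        owned[i] = .{
            .id = try acc.id.toOwnedSlice(),
            .name = try acc.name.toOwnedSlice(),
            .arguments_json = try acc.arguments.toOwnedSlice(),
        };
        i += 1;
    }
    return owned;
}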

toOwnedSlice transfers ownership of the bytes from the accumulator to the returned slice, so the caller becomes responsible for freeing them. The accumulators are left empty, which makes the deinit that follows cheap.
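The ownership hand-off can be shown on a single accumulator. This is a minimal sketch that assumes Zig 0.15 or later, where std.Io.Writer.Allocating is available; `buildArguments` is a hypothetical function written only for this illustration.

```zig
const std = @import("std");

// Assumes Zig 0.15 or later (std.Io.Writer.Allocating).
fn buildArguments(gpa: std.mem.Allocator) ![]u8 {
    var acc: std.Io.Writer.Allocating = .init(gpa);
    // Runs after the return expression is evaluated. By then the
    // accumulator has been emptied by toOwnedSlice.
    defer acc.deinit();

    // Two fragments, as they would arrive from the stream.
    try acc.writer.writeAll("{\"path");
    try acc.writer.writeAll("\":\"src/main.zig\"}");

    // From here on the caller owns the bytes and frees them with `gpa`.
    return acc.toOwnedSlice();
}
```

A caller pairs the call with `defer gpa.free(bytes)`. Run under std.testing.allocator, that pairing is what shows whether the bytes were freed exactly once.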

The contract of `call` changes

Before, call returned the whole answer at the end, as Result.text: []const u8. Now the text is written, as it arrives, to a writer we pass in. At the end, call still returns the complete text, so we can append it to the session.

pub fn call(
    gpa: std.mem.Allocator,
    io: std.Io,
    api_key: []const u8,
    model: []const u8,
    messages: []const session.Message,
    out_writer: *std.Io.Writer,  // ← new
) !Result {

The caller (main) passes stdout. While call runs, tokens appear in real time. When the function returns with .text, we have the complete text — used by session.append_assistant_text for multi-turn memory.
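For reference, the return type could look like the sketch below. It is a reconstruction inferred from the variants that appear in this post (.text, .tool_calls, .network_error, .http_status) and from the fields filled in by collect_tool_calls; the payload type of http_status and the helper `needsAnotherTurn` are assumptions.

```zig
const std = @import("std");

const ToolCall = struct {
    id: []const u8,
    name: []const u8,
    arguments_json: []const u8,
};

// Hypothetical reconstruction of the return type of `call`.
const Result = union(enum) {
    text: []const u8, // complete answer, already streamed to out_writer
    tool_calls: []const ToolCall, // reassembled once the stream has ended
    network_error,
    http_status: u16, // assumed width; holds the non-200 status code
};

/// True when the agent loop should run tools and call the model again.
fn needsAnotherTurn(result: Result) bool {
    return switch (result) {
        .tool_calls => true,
        .text, .network_error, .http_status => false,
    };
}
```

Keeping the failure cases as variants rather than Zig errors lets the caller handle them in the same switch as the successful outcomes.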

src/main.zig

const result = try openrouter.call(gpa, io, cfg.api_key, cfg.model, session.messages.items, stdout);
switch (result) {
    .text => |text| {
        try stdout.writeAll("\n");
        try stdout.flush();
        try session.append_assistant_text(text);
        return;
    },
    .tool_calls => |calls| {
        try session.append_assistant_tool_calls(calls);
        for (calls) |c| try run_one_tool(gpa, io, session, stderr, c);
    },
    // ...
}

The trailing \n is necessary — the model doesn't terminate its responses with a newline, so the next prompt would stick to the last line.

Why accumulate the text too

The alternative would be to pass stdout and recover the bytes written to it, which needs a tee writer (a writer that emits to two destinations). Doable, but one more vtable to implement. Simpler: openrouter.call writes each chunk both to out_writer and to its own buffer. The duplicate in-memory write is negligible, and the code stays linear.

if (delta.content) |c| if (c.len > 0) {
    try out_writer.writeAll(c);    // visible to the user
    try out_writer.flush();
    try text_acc.writer.writeAll(c);  // for the return
};

That's the cost of simplicity: three clear lines instead of a custom Writer.

Live test

./zig-out/bin/hypercode "Say hi in 3 short words."

The text appears word by word in the terminal. No waiting.

Hi there friend!

With a tool:

./zig-out/bin/hypercode "Read src/main.zig and tell me what the answer function does, in 2 sentences."

→ read({"path": "src/main.zig"})
The `answer` function runs an agent loop that calls the OpenRouter API to send
the conversation history to an LLM and receive a response. If the LLM returns
tool calls, it executes them via `run_one_tool` and continues the loop; if it
returns text, it outputs the answer and returns.

The tool call line is printed in one piece, once its fragments have been reassembled (it is not echoed token by token). Then the answer text streams in. Comfortable.

The commits

969473f feat(openrouter): stream chat completions; accumulate tool_calls by index

One commit for the whole streaming change. The previous refactor — Message as a tagged union, ownership transfer — made the transition clean: we touch only openrouter.zig and six lines of main.zig. The rest of the code doesn't even know the response arrives as a stream.

Conclusion

Hypercode is now fluid. The model talks, we read. If the model wants a tool, we run it, and the cycle continues. That's the UX a coding assistant should have.

With five tools (read, write, edit, bash, grep) plus streaming, the base of a working agent is there. Two big pieces remain in the series:

Persistence: save the conversation history so it survives a crash or restart. Append-only journal, replay at startup.
Sandboxing: protect the user from a misbehaving bash command. Probably bwrap on Linux, sandbox-exec on macOS.

In the next post, we tackle persistence. That's what turns a stateless agent into something that resembles a real work session.

Stuck, or want to share notes? Join the Discord server.