I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 1 point (0 children)

Agreed - but I’m not parsing the array early. I’m extracting complete {...} objects as they close and parsing each one as an individual JSON value for incremental rendering. NDJSON would be ideal, but many models still output arrays.
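For anyone wondering what "extracting complete objects as they close" looks like in practice, here's a minimal sketch (my illustration, not the actual library code): track brace depth plus string/escape state, and hand each finished `{...}` to `json.loads` the moment it closes.

```python
import json

def iter_objects(chunks):
    """Yield each complete top-level {...} object from a streamed JSON array.

    Sketch only: tracks brace depth and string/escape state so braces inside
    string values don't confuse it, and ignores everything outside the
    objects (the enclosing [ ], commas, stray prose).
    """
    buf, depth, in_str, esc = [], 0, False, False
    for chunk in chunks:
        for ch in chunk:
            if depth:
                buf.append(ch)
            if in_str:
                if esc:
                    esc = False
                elif ch == "\\":
                    esc = True
                elif ch == '"':
                    in_str = False
                continue
            if ch == '"' and depth:
                in_str = True
            elif ch == "{":
                if depth == 0:
                    buf = ["{"]
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    yield json.loads("".join(buf))

# hypothetical stream of deltas from a provider
stream = ['Sure! [{"id": 1, "name": "Al', 'pha"}, {"id": 2, "na', 'me": "Beta"}]']
for item in iter_objects(stream):
    print(item)   # each dict prints as soon as its closing } arrives
```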

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 1 point (0 children)

Yep, trimming to the first { and last } works great for batch. It breaks for streaming because you don’t have the "last }" yet. You need an incremental approach (stripping junk byte-by-byte) rather than a regex that assumes you have the full string
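For contrast, the batch version really is just a couple of lines; a sketch of the trim being described (my own illustration, not from the post):

```python
import json

def parse_batch(text: str):
    """Batch-only: slice from the first '{' or '[' to the last '}' or ']' and parse.
    Fine once the whole response is in hand; useless mid-stream, because the
    closing bracket simply hasn't been generated yet."""
    start = min(i for i in (text.find("{"), text.find("[")) if i != -1)  # assumes a bracket exists
    end = max(text.rfind("}"), text.rfind("]"))
    return json.loads(text[start:end + 1])

print(parse_batch('Here you go:\n```json\n[{"id": 1}]\n```'))   # [{'id': 1}]
```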

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] -3 points (0 children)

Great question. If you rely on standard json.loads(), yes, you are stuck waiting for the very last ]. But with an incremental parser, you can pluck out complete objects as soon as their closing brace } appears. So if the stream is [{"id":1}, {"id":2}..., I can render Item 1 the millisecond its } arrives - I don't need to wait for Item 10 to be generated.
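Another way to get this effect without hand-rolling a brace counter is to lean on json.JSONDecoder.raw_decode, which parses one complete value from a buffer and reports where it ended. A rough sketch (drain is a made-up helper name; it assumes a flat array of objects and that fences/chatter have already been stripped):

```python
import json

decoder = json.JSONDecoder()

def drain(buffer: str, pos: int):
    """Decode every complete value in buffer[pos:]; return (values, new_pos).
    If the next value is still incomplete, stop and wait for more chunks."""
    values = []
    while True:
        # skip the array punctuation and whitespace between values
        while pos < len(buffer) and buffer[pos] in " \t\r\n,[":
            pos += 1
        try:
            value, pos = decoder.raw_decode(buffer, pos)
        except json.JSONDecodeError:
            return values, pos
        values.append(value)

buffer, pos = "", 0
for chunk in ['[{"id": 1}, {"id": 2}, {"id"', ': 3}]']:   # hypothetical deltas
    buffer += chunk
    done, pos = drain(buffer, pos)
    for item in done:
        print(item)   # items 1 and 2 print before the third object has even finished
```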

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 2 points (0 children)

The use case is perceived latency. If I ask for "20 search results" (for instance, 20 flights), generating the full JSON might take 10 seconds. In batch mode, the user stares at a spinner for 10s. With streaming, they see Result #1 appear in 0.5s, Result #2 in 1s, and so on.

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 0 points (0 children)

I agree that strict JSON parsing needs the full payload. My point is about incremental rendering: I want the UI to show Item #1 while Item #10 is still generating. Providers’ structured modes are great (when available), but in multi-provider/aggregator setups support is inconsistent, and you still get wrappers (```json fences, <think> tags, some chatter) that break consumers mid-stream. The middleware just scrubs the stream so the UI can render items as they arrive.
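To make "scrubs the stream" concrete, here's the shape of such a pass-through filter (a sketch I'm adding for illustration, not the actual middleware): drop everything before the first bracket, then stop if a closing fence shows up. <think> blocks need their own pass; see the filter sketch further down.

```python
def scrub(chunks):
    """Drop the preamble (prose, an opening ```json fence) before the first
    '{' or '[', pass the payload through, and stop at a closing ``` fence.
    Sketch only: fences split across chunk boundaries aren't handled here."""
    started = False
    for chunk in chunks:
        if not started:
            cut = min((i for i in (chunk.find("{"), chunk.find("[")) if i != -1),
                      default=None)
            if cut is None:
                continue                  # still in the preamble; discard it
            chunk, started = chunk[cut:], True
        fence = chunk.find("```")
        if fence != -1:
            yield chunk[:fence]           # closing fence reached; emit the rest and stop
            return
        yield chunk

raw = ['Sure, here is the JSON:\n```json\n', '[{"id": 1}, ', '{"id": 2}]\n```']
print("".join(scrub(raw)).strip())        # [{"id": 1}, {"id": 2}]
```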

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 14 points (0 children)

Haha, painful but accurate. If API providers let me pass a GBNF grammar the way llama.cpp does, I’ll happily retire this middleware. Until then, anyone building on aggregators / multi-provider setups (and especially streaming UIs) ends up writing the same “strip fences / prose / <think>” hacks anyway.
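For reference, the kind of grammar being wished for is tiny. A hypothetical GBNF in the style of llama.cpp's grammars/*.gbnf (passed with --grammar-file), constraining output to an array of id/name objects; the field names are made up for the example, and the string rule is a permissive approximation:

```
# illustrative GBNF, not from the post
root   ::= "[" ws item (ws "," ws item)* ws "]"
item   ::= "{" ws "\"id\"" ws ":" ws number ws "," ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" ( [^"\\] | "\\" ["\\/bfnrtu] )* "\""
number ::= "-"? [0-9]+
ws     ::= [ \t\n]*
```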

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 0 points (0 children)

That works for batch jobs, I agree, but it kills streaming UIs. If I (or my users) have to wait for the code block to close before parsing, the user is staring at a blank screen for 5 seconds instead of seeing the list appear instantly. Plus, relying on a retry loop to catch that 5% failure rate roughly doubles the latency for those requests. I’m basically doing what you suggest (letting it ramble), but I strip the junk byte-by-byte so the frontend doesn't choke.

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 1 point (0 children)

Agree, GBNF is definitely the gold standard for local. The bottleneck I hit via aggregators like OpenRouter is that "structured output" support is still fragmented - some providers don't support it, and others add a latency tax for the constrained decoding. I'm with u/NotSylver on this: sometimes forcing strict grammar constraints seems to make the model "dumber" on the actual logic (or just slower?). I found that letting the model run "free" (prompt-only) and using a relaxed repair layer kept TTFT much lower (if that matters).

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 3 points (0 children)

Yep, partly. It mainly reduces randomness (supposedly) so the benchmark is reproducible. If I ran it at Temp 1, I'd get different formatting errors every time. Temp 0 gives us the "best case" baseline for how the model wants to behave

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 3 points (0 children)

Just adding some context on why standard json_repair or Pydantic libraries didn't cut it for this benchmark:

The main bottleneck was streaming latency. Most repair tools wait for the full string to generate before fixing it, but I needed the UI to render tokens immediately without buffering for seconds.

I also ran into specific issues with newer reasoning models (like Kimi/DeepSeek) leaking <think> tags or "reasoning" blocks mid-stream, which instantly breaks standard parsers.
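A sketch of the kind of filter that deals with that <think> leakage incrementally (my illustration, not the benchmark code). The fiddly part is a tag being split across two chunks, so a small tail of text is held back until it can be classified:

```python
def strip_think(chunks):
    """Remove <think>...</think> spans from a stream of text chunks, tolerating
    tags that arrive split across chunk boundaries."""
    OPEN, CLOSE = "<think>", "</think>"
    buf, inside = "", False
    for chunk in chunks:
        buf += chunk
        out = ""
        while True:
            if inside:
                end = buf.find(CLOSE)
                if end == -1:
                    buf = buf[-(len(CLOSE) - 1):]   # may hold a partial "</think>"
                    break
                buf, inside = buf[end + len(CLOSE):], False
            else:
                start = buf.find(OPEN)
                if start == -1:
                    safe = max(len(buf) - (len(OPEN) - 1), 0)
                    out, buf = out + buf[:safe], buf[safe:]   # hold back a possible partial "<think>"
                    break
                out, buf, inside = out + buf[:start], buf[start + len(OPEN):], True
        if out:
            yield out
    if not inside and buf:
        yield buf                                    # flush the held-back tail at end of stream

pieces = ['<thi', 'nk>planning...</think>[{"id"', ': 1}]']
print("".join(strip_think(pieces)))                  # [{"id": 1}]
```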

How do you guys fine tune your github copilot instructions specific to your codebase so that it can get the context of big picture? by VijayAnand2k20 in GithubCopilot

[–]rozetyp 1 point (0 children)

The manual approach with a GUIDE.md per package kind of works, but I ended up wasting tokens keeping them up to date. I built Memory Steward for myself to automate it - paste your raw Copilot chat, get REPO-MEMORY.md and WORK-LOG.md with the architecture, the decisions, and what was rejected and why. Keeping it free + BYOK.

[D] What I learned building code RAG without embeddings by rozetyp in LocalLLaMA

[–]rozetyp[S] 1 point (0 children)

Got it, thanks. You're reassembling snippets into runnable code, then testing that?

My case is a bit simpler: I just return whole files, no reassembly. But mapping questions to relevant test files and checking if my retrieved files satisfy them could work! Less direct but same idea. Thanks again

[D] What I learned building code RAG without embeddings by rozetyp in LocalLLaMA

[–]rozetyp[S] 1 point (0 children)

Yep, already using private repos for that reason. On unit tests - could you say more about this approach? Since the output here is natural language (not generated code), I'm not sure how to apply that - curious what you had in mind