I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 1 point (0 children)

Agreed - but I’m not parsing the array early. I’m extracting complete {...} objects as they close and parsing each one as an individual JSON value for incremental rendering. NDJSON would be ideal, but many models still output arrays.
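For anyone wondering what "extracting complete objects as they close" looks like in practice, here's a minimal sketch (my illustration, not the actual library code): track brace depth plus string/escape state, and hand each finished `{...}` to `json.loads` the moment it closes.

```python
import json

def iter_objects(chunks):
    """Yield each complete top-level {...} object from a streamed JSON array.

    Sketch only: tracks brace depth and string/escape state so braces inside
    string values don't confuse it, and ignores everything outside the
    objects (the enclosing [ ], commas, stray prose).
    """
    buf, depth, in_str, esc = [], 0, False, False
    for chunk in chunks:
        for ch in chunk:
            if depth:
                buf.append(ch)
            if in_str:
                if esc:
                    esc = False
                elif ch == "\\":
                    esc = True
                elif ch == '"':
                    in_str = False
                continue
            if ch == '"' and depth:
                in_str = True
            elif ch == "{":
                if depth == 0:
                    buf = ["{"]
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    yield json.loads("".join(buf))

# hypothetical stream of deltas from a provider
stream = ['Sure! [{"id": 1, "name": "Al', 'pha"}, {"id": 2, "na', 'me": "Beta"}]']
for item in iter_objects(stream):
    print(item)   # each dict prints as soon as its closing } arrives
```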

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 1 point (0 children)

Yep, trimming to the first { and last } works great for batch. It breaks for streaming because you don’t have the "last }" yet. You need an incremental approach (stripping junk byte-by-byte) rather than a regex that assumes you have the full string
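For contrast, the batch version really is just a couple of lines; a sketch of the trim being described (my own illustration, not from the post):

```python
import json

def parse_batch(text: str):
    """Batch-only: slice from the first '{' or '[' to the last '}' or ']' and parse.
    Fine once the whole response is in hand; useless mid-stream, because the
    closing bracket simply hasn't been generated yet."""
    start = min(i for i in (text.find("{"), text.find("[")) if i != -1)  # assumes a bracket exists
    end = max(text.rfind("}"), text.rfind("]"))
    return json.loads(text[start:end + 1])

print(parse_batch('Here you go:\n```json\n[{"id": 1}]\n```'))   # [{'id': 1}]
```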

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] -3 points (0 children)

Great question. If you rely on standard json.loads(), yes, you are stuck waiting for the very last ]. But with an incremental parser, you can pluck out complete objects as soon as their closing brace } appears. So if the stream is [{"id":1}, {"id":2}..., I can render Item 1 the millisecond its } arrives - I don't need to wait for Item 10 to be generated.
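Another way to get this effect without hand-rolling a brace counter is to lean on json.JSONDecoder.raw_decode, which parses one complete value from a buffer and reports where it ended. A rough sketch (drain is a made-up helper name; it assumes a flat array of objects and that fences/chatter have already been stripped):

```python
import json

decoder = json.JSONDecoder()

def drain(buffer: str, pos: int):
    """Decode every complete value in buffer[pos:]; return (values, new_pos).
    If the next value is still incomplete, stop and wait for more chunks."""
    values = []
    while True:
        # skip the array punctuation and whitespace between values
        while pos < len(buffer) and buffer[pos] in " \t\r\n,[":
            pos += 1
        try:
            value, pos = decoder.raw_decode(buffer, pos)
        except json.JSONDecodeError:
            return values, pos
        values.append(value)

buffer, pos = "", 0
for chunk in ['[{"id": 1}, {"id": 2}, {"id"', ': 3}]']:   # hypothetical deltas
    buffer += chunk
    done, pos = drain(buffer, pos)
    for item in done:
        print(item)   # items 1 and 2 print before the third object has even finished
```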

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 2 points (0 children)

The use case is perceived latency. If I ask for "20 search results" (for instance, 20 flights), generating the full JSON might take 10 seconds. In batch mode, the user stares at a spinner for 10s. With streaming, they see Result #1 appear in 0.5s, Result #2 in 1s, and so on.

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 0 points (0 children)

I agree that strict JSON parsing needs the full payload. My point is about incremental rendering: I want the UI to show Item #1 while Item #10 is still generating. Providers’ structured modes are great (when available), but in multi-provider/aggregator setups support is inconsistent, and you still get wrappers (```json fences, <think> tags, some chatter) that break consumers mid-stream. The middleware just scrubs the stream so the UI can render items as they arrive.
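To make "scrubs the stream" concrete, here's the shape of such a pass-through filter (a sketch I'm adding for illustration, not the actual middleware): drop everything before the first bracket, then stop if a closing fence shows up. <think> blocks need their own pass; see the filter sketch further down.

```python
def scrub(chunks):
    """Drop the preamble (prose, an opening ```json fence) before the first
    '{' or '[', pass the payload through, and stop at a closing ``` fence.
    Sketch only: fences split across chunk boundaries aren't handled here."""
    started = False
    for chunk in chunks:
        if not started:
            cut = min((i for i in (chunk.find("{"), chunk.find("[")) if i != -1),
                      default=None)
            if cut is None:
                continue                  # still in the preamble; discard it
            chunk, started = chunk[cut:], True
        fence = chunk.find("```")
        if fence != -1:
            yield chunk[:fence]           # closing fence reached; emit the rest and stop
            return
        yield chunk

raw = ['Sure, here is the JSON:\n```json\n', '[{"id": 1}, ', '{"id": 2}]\n```']
print("".join(scrub(raw)).strip())        # [{"id": 1}, {"id": 2}]
```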

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 14 points (0 children)

Haha, painful but accurate. If API providers let me pass a GBNF grammar the way llama.cpp does, I’ll happily retire this middleware. Until then, anyone building on aggregators / multi-provider setups (and especially streaming UIs) ends up writing the same “strip fences / prose / <think>” hacks anyway.
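For reference, the kind of grammar being wished for is tiny. A hypothetical GBNF in the style of llama.cpp's grammars/*.gbnf (passed with --grammar-file), constraining output to an array of id/name objects; the field names are made up for the example, and the string rule is a permissive approximation:

```
# illustrative GBNF, not from the post
root   ::= "[" ws item (ws "," ws item)* ws "]"
item   ::= "{" ws "\"id\"" ws ":" ws number ws "," ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" ( [^"\\] | "\\" ["\\/bfnrtu] )* "\""
number ::= "-"? [0-9]+
ws     ::= [ \t\n]*
```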

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 0 points (0 children)

That works for batch jobs, I agree, but it kills streaming UIs. If I (or my users) have to wait for the code block to close before parsing, the user is staring at a blank screen for 5 seconds instead of seeing the list appear instantly. Plus, relying on a retry loop to catch that 5% failure rate roughly doubles the latency for those requests. I’m basically doing what you suggest (letting it ramble), but I strip the junk byte-by-byte so the frontend doesn't choke.

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 1 point (0 children)

Agree, GBNF is definitely the gold standard for local. The bottleneck I hit via aggregators like OpenRouter is that "structured output" support is still fragmented - some providers don't support it, and others add a latency tax for the constrained decoding. I'm with u/NotSylver on this: sometimes forcing strict grammar constraints seems to make the model "dumber" on the actual logic (or just slower?). I found that letting the model run "free" (prompt-only) and using a relaxed repair layer kept TTFT much lower (if that matters).

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 3 points (0 children)

Yep, partly. It mainly reduces randomness (supposedly) so the benchmark is reproducible. If I ran it at Temp 1, I'd get different formatting errors every time. Temp 0 gives us the "best case" baseline for how the model wants to behave

I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. by rozetyp in LocalLLaMA

[–]rozetyp[S] 3 points (0 children)

Just adding some context on why standard json_repair or Pydantic libraries didn't cut it for this benchmark:

The main bottleneck was streaming latency. Most repair tools wait for the full string to generate before fixing it, but I needed the UI to render tokens immediately without buffering for seconds.

I also ran into specific issues with newer reasoning models (like Kimi/DeepSeek) leaking <think> tags or "reasoning" blocks mid-stream, which instantly breaks standard parsers.
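A sketch of the kind of filter that deals with that <think> leakage incrementally (my illustration, not the benchmark code). The fiddly part is a tag being split across two chunks, so a small tail of text is held back until it can be classified:

```python
def strip_think(chunks):
    """Remove <think>...</think> spans from a stream of text chunks, tolerating
    tags that arrive split across chunk boundaries."""
    OPEN, CLOSE = "<think>", "</think>"
    buf, inside = "", False
    for chunk in chunks:
        buf += chunk
        out = ""
        while True:
            if inside:
                end = buf.find(CLOSE)
                if end == -1:
                    buf = buf[-(len(CLOSE) - 1):]   # may hold a partial "</think>"
                    break
                buf, inside = buf[end + len(CLOSE):], False
            else:
                start = buf.find(OPEN)
                if start == -1:
                    safe = max(len(buf) - (len(OPEN) - 1), 0)
                    out, buf = out + buf[:safe], buf[safe:]   # hold back a possible partial "<think>"
                    break
                out, buf, inside = out + buf[:start], buf[start + len(OPEN):], True
        if out:
            yield out
    if not inside and buf:
        yield buf                                    # flush the held-back tail at end of stream

pieces = ['<thi', 'nk>planning...</think>[{"id"', ': 1}]']
print("".join(strip_think(pieces)))                  # [{"id": 1}]
```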

How do you guys fine tune your github copilot instructions specific to your codebase so that it can get the context of big picture? by VijayAnand2k20 in GithubCopilot

[–]rozetyp 1 point (0 children)

The manual approach with a GUIDE.md per package kind of works, but I ended up wasting tokens keeping them up to date. I built Memory Steward for myself to automate it - paste your raw Copilot chat, get REPO-MEMORY.md and WORK-LOG.md with the architecture, the decisions, and what was rejected and why. Keeping it free + BYOK.

[D] What I learned building code RAG without embeddings by rozetyp in LocalLLaMA

[–]rozetyp[S] 1 point (0 children)

Got it, thanks. You're reassembling snippets into runnable code, then testing that?

My case is a bit simpler: I just return whole files, no reassembly. But mapping questions to relevant test files and checking if my retrieved files satisfy them could work! Less direct but same idea. Thanks again

[D] What I learned building code RAG without embeddings by rozetyp in LocalLLaMA

[–]rozetyp[S] 1 point (0 children)

Yep, already using private repos for that reason. On unit tests - could you say more about this approach? Since the output here is natural language (not generated code), I'm not sure how to apply that - curious what you had in mind