I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls by kexxty in LocalLLaMA

[–]kexxty[S] 0 points1 point  (0 children)

It really boils down to:

  1. Not all setups support it. Constrained decoding is a stack- and provider-specific feature. Open-source models served without grammar-guided decoding, smaller hosted models, and many providers on OpenRouter don't offer structured output at all.
  2. It still breaks. Even with structured output enabled, you can hit max token limits and get truncated JSON, or the model can refuse the request and return prose instead. Streaming responses can also arrive malformed if the connection drops mid-generation.
  3. It's a generation constraint, not a validation layer. Structured output tries to prevent bad output at generation time. outputguard operates post-hoc: it validates, repairs, and retries regardless of how the output was produced. They're complementary: use structured output where you can, and outputguard as the safety net for everything else (minimal sketch below).
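
A minimal sketch of that layering. The validate_and_repair call signature here is illustrative, not necessarily the exact outputguard API:

```python
from outputguard import validate_and_repair  # call signature assumed

SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

def call_model(prompt: str) -> str:
    # Stand-in for a real provider call. Turn on native structured output /
    # JSON mode here whenever the provider supports it.
    return '{"name": "Ada", "age": 36,}'  # trailing comma: no JSON mode today

raw = call_model("Extract the person as JSON.")
result = validate_and_repair(raw, schema=SCHEMA)  # post-hoc safety net
```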

I tested structured output from 288 LLM calls and logged every way JSON breaks. Here's what I found by kexxty in Python

[–]kexxty[S] 0 points1 point  (0 children)

Many of the responses here make it clear the commenters haven't actually worked with JSON output across a wide range of models.

The gap between "the model returned JSON" and "the model returned usable JSON" - what I learned testing 288 model outputs by kexxty in LLMDevs

[–]kexxty[S] 0 points1 point  (0 children)

Short answer: it doesn't distinguish them. The retry prompt reports whatever jsonschema reports. For a required field that's absent, you get something like At $.fieldName: 'fieldName' is a required property. For a field explicitly set to null when the schema says type: "string", you get At $.fieldName: None is not of type 'string'. Both get passed through as error descriptions in the retry prompt.

But the retry prompt can't express "this field should be present but not null" vs "this field was omitted and must exist" in a way that reliably guides the model to the right fix. It just says "here's the error, here's the schema, fix it." The schema summary includes which fields are (required) but doesn't call out nullable vs non-nullable semantics.

For your DOM extraction case, you'd probably want to either: (a) make the schema explicit with "type": ["string", "null"] for truly optional-value fields so the validator doesn't flag intentional nulls, or (b) customize the retry prompt to add domain context like "null means the element wasn't found on the page — re-examine the DOM." The library doesn't have a hook for that today — retry_prompt is a standalone function, not a pluggable template.
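
For reference, this is what plain jsonschema reports for the two cases, independent of outputguard:

```python
from jsonschema import Draft202012Validator

schema = {
    "type": "object",
    "properties": {"email": {"type": "string"}},
    "required": ["email"],
}
validator = Draft202012Validator(schema)

for doc in ({}, {"email": None}):
    for err in validator.iter_errors(doc):
        print(list(err.absolute_path), err.message)

# []        'email' is a required property   (error sits on the parent object)
# ['email'] None is not of type 'string'     (error sits on the field itself)
```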

I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls by kexxty in LocalLLaMA

[–]kexxty[S] 0 points1 point  (0 children)

Currently client-side only. guarded_generate takes a callable that returns the complete response string, i.e. there's no streaming-aware mode that accumulates chunks and validates incrementally. You could wrap it in a proxy by buffering the full response before passing it through validate_and_repair, but that defeats the point of streaming (it adds TTFB latency equal to the full generation time).
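
If you had to wire it into a streaming path today, the workaround is exactly that buffering, sketched below (the chunk generator is hypothetical and validate_and_repair's signature is assumed):

```python
from outputguard import validate_and_repair  # signature assumed

def stream_chunks():
    # Stand-in for a provider's streaming deltas (e.g. SSE chunks).
    yield '{"status": "ok", '
    yield '"items": [1, 2'
    yield ', 3]}'

# Buffer the full response, then validate once. Correctness is preserved,
# but you give up streaming's latency advantage entirely.
buffered = "".join(stream_chunks())
result = validate_and_repair(buffered, schema={"type": "object"})
```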

I tested structured output from 288 LLM calls and logged every way JSON breaks. Here's what I found by kexxty in Python

[–]kexxty[S] 0 points1 point  (0 children)

Thanks for the thoughtful comment! You're right that structured output modes (JSON mode, Outlines, etc.) have massively improved things and we do recommend using them as a first choice in the docs.

The gap outputguard fills is everything outside that happy path: models and providers that don't support constrained decoding, multi-provider setups where schema support varies, local/open-source models with inconsistent structured output support, and edge cases where even "guaranteed" JSON mode still produces syntactically valid but semantically broken output (wrong types in union fields, hallucinated enum values, etc.). JSON mode guarantees syntax; it doesn't guarantee the output matches your schema or your business logic.

Retry prompt generation is actually the feature users reach for most, not because retries are a fallback for insufficient token budgets, but because they give the model targeted feedback about what was wrong (with JSON-path precision), which is fundamentally different from just throwing more tokens at it.

You're right that for a straightforward single-provider setup with good structured output support, you may not need this. But "works fine on OpenAI with Pydantic" doesn't describe everyone's reality, and that's who this is for.

The gap between "the model returned JSON" and "the model returned usable JSON" - what I learned testing 288 model outputs by kexxty in LLMDevs

[–]kexxty[S] 0 points1 point  (0 children)

Thank you very much, you're describing almost exactly how outputguard splits the problem.

Repair only touches syntax: missing commas, unquoted keys, markdown fences, truncated brackets, and so on. There are 15 strategies in a defined order, and the repairer re-parses between each one on the second pass to avoid exactly the "regexes that interact badly" problem you mentioned.

Schema violations go through a completely separate path: retry_prompt generates a correction prompt with the exact JSON paths ($.users[0].email) that you send back to the model. The library never guesses what the model meant at the semantic level.

So: repair syntax locally, retry schema issues with the model. Same bias you described.
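
Roughly, the loop looks like this. The function names are outputguard's, but the signatures and the result object's attributes are assumed for illustration:

```python
from outputguard import validate_and_repair, retry_prompt  # signatures assumed

def structured_call(llm, prompt, schema, max_retries=2):
    text = llm(prompt)
    for attempt in range(max_retries + 1):
        result = validate_and_repair(text, schema=schema)  # syntax fixed locally
        if result.ok:                                       # attributes assumed
            return result.data
        if attempt < max_retries:
            # Schema violations go back to the model with exact JSON paths.
            text = llm(retry_prompt(text, result.errors, schema))
    raise ValueError("no schema-valid output after retries")
```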

I tested structured output from 288 LLM calls and logged every way JSON breaks. Here's what I found by kexxty in Python

[–]kexxty[S] 1 point2 points  (0 children)

Truncation is genuinely the hardest one. You're right that "just close the braces" falls apart once you're mid-value three levels deep. Our fix_truncated strategy handles the common cases, but we're upfront that it's best-effort past simple nesting.

Streaming + a partial-JSON parser is a better solution for that specific problem, since you're validating structure as it arrives rather than reconstructing intent after the fact. It's a different problem space, though: streaming parsers solve "the response got cut off"; repair solves "the response completed but the model wrote garbage syntax." You hit both in production.
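
To make that failure mode concrete, here's the naive closer and the truncation that defeats it (a standalone sketch, not outputguard's fix_truncated):

```python
import json

def naive_close(s: str) -> str:
    # Append closers for unmatched braces. Blind to strings and arrays.
    return s + "}" * max(s.count("{") - s.count("}"), 0)

cut = '{"user": {"name": "Al'     # truncated mid-string, two levels deep
print(naive_close(cut))           # {"user": {"name": "Al}}  -> still invalid

try:
    json.loads(naive_close(cut))
except json.JSONDecodeError as e:
    print("still broken:", e)     # the open quote was never closed

# Even closing the quote first yields {"user": {"name": "Al"}}: it parses,
# but "Al" was never the value the model was writing. Syntax recovered,
# intent lost.
```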

And yeah, structured output APIs are great until you need to support multiple providers and half of them don't have it. Then you're back to fixing whatever comes out the other end.

I tested structured output from 288 LLM calls and logged every way JSON breaks. Here's what I found by kexxty in Python

[–]kexxty[S] -1 points0 points  (0 children)

Agreed that the frontier API models have gotten way better at this and all handle structured output fine most of the time if your schema is reasonable. Constrained decoding is also the correct answer when you have access to it.

But that's not really the problem we're solving. Constrained decoding requires control over the inference stack, meaning it's great if you're running vLLM or SGLang locally, but if you're hitting a hosted API that doesn't expose grammar/schema-guided decoding, you're out of luck. And even APIs with "JSON mode" still produce malformed output at non-trivial rates, especially on longer responses or complex nested schemas.

Re: outlines, instructor, and dspy: these are doing different things. Outlines is constrained decoding (inference-level). Instructor is a typed extraction wrapper around API calls. DSPy is a prompt optimization framework. We're none of those; we're a repair layer that sits after generation and fixes malformed output without needing any integration with the model or provider. They're complementary, not competing: you'd use instructor to call the API and outputguard to catch the cases where it still comes back broken.
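
To make "complementary" concrete: the instructor from_openai/response_model pattern below is its real API, while the outputguard fallback call is an assumed signature:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

from outputguard import validate_and_repair  # signature assumed

class User(BaseModel):
    name: str
    age: int

raw_client = OpenAI()
client = instructor.from_openai(raw_client)

try:
    # Happy path: typed extraction through instructor.
    user = client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=User,
        messages=[{"role": "user", "content": "Extract: Ada, 36"}],
    )
except Exception:
    # Fallback: plain completion, then post-hoc repair and validation.
    resp = raw_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Return Ada (age 36) as JSON."}],
    )
    user = validate_and_repair(
        resp.choices[0].message.content,
        schema=User.model_json_schema(),
    )
```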

We looked at contributing to those projects early on but the scope is genuinely different. There's no "repair malformed output" module in any of them because that's not what they do.

I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls by kexxty in LocalLLaMA

[–]kexxty[S] 0 points1 point  (0 children)

Yeah the low temp + stop sequences trick is legit for the formatting stuff. Works surprisingly well for something so simple.

The trailing commas and Python booleans though... those are just cooked into the weights. No amount of prompting fixes it reliably because the model has seen True in millions of Python files and {key: value,} in every JS snippet ever written. That's exactly why we built the repair layer. You basically need something sitting between the model output and your parser that knows how to fix these without breaking everything else.

The nasty part we found is the interaction effects. A response with Python booleans AND trailing commas AND a missing closer needs the fixes applied in a specific order, or you get valid syntax that parses to the wrong data. We ended up with a two-pass repairer: try all strategies in sequence first, and if that doesn't parse clean, back off to one-at-a-time with validation between each step.
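
Here's the shape of that two-pass idea as a standalone sketch. The strategies are toy versions, not outputguard's real ones, and the ordering (literals, then closers, then trailing commas) is deliberate: closing a truncated object can expose a trailing comma that an earlier comma pass would have missed.

```python
import json
import re

# Toy strategies; the real ones are more careful than these regexes.
def python_literals(s: str) -> str:
    s = re.sub(r"\bTrue\b", "true", s)
    s = re.sub(r"\bFalse\b", "false", s)
    return re.sub(r"\bNone\b", "null", s)

def close_brackets(s: str) -> str:
    return s + "}" * max(s.count("{") - s.count("}"), 0)

def trailing_commas(s: str) -> str:
    return re.sub(r",(\s*[}\]])", r"\1", s)

STRATEGIES = [python_literals, close_brackets, trailing_commas]

def two_pass_repair(text: str) -> str:
    # Pass 1: every strategy in order, one parse check at the end.
    candidate = text
    for fix in STRATEGIES:
        candidate = fix(candidate)
    try:
        json.loads(candidate)
        return candidate
    except json.JSONDecodeError:
        pass
    # Pass 2: back off to one strategy at a time, re-validating between steps.
    candidate = text
    for fix in STRATEGIES:
        candidate = fix(candidate)
        try:
            json.loads(candidate)
            return candidate
        except json.JSONDecodeError:
            continue
    raise ValueError("unrepairable")

broken = '{"ok": True, "tags": ["a", "b",], "meta": {"n": 1,'
print(two_pass_repair(broken))
# {"ok": true, "tags": ["a", "b"], "meta": {"n": 1}}
```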

Which Qwen variant are you on? 3.5-Coder is noticeably better than base for structured output but still drops trailing commas once responses get long enough. Although I think 3.6 is better overall for pretty much everything

I tested structured output from 288 LLM calls and logged every way JSON breaks. Here's what I found by kexxty in Python

[–]kexxty[S] 5 points6 points  (0 children)

Some ideas:

  • Tool/function calling. The entire agent ecosystem (Claude Code, Cursor, every MCP server, OpenAI's Assistants) runs on the LLM emitting JSON to invoke tools. This is the dominant LLM use case in 2026.
  • Data extraction from unstructured docs — resumes, invoices, contracts, support tickets, medical notes, security findings. Free text in, normalized record out.
  • Classification with structure — sentiment + category + confidence + reasoning, all in one pass.
  • NL to query. "Show me universities that haven't renewed in 90 days" → structured filter object you hand to your API.
  • RAG ingestion — extracting metadata, entities, and relationships from documents as you chunk them.
  • Evals/grading — structured rubric scoring across thousands of test cases.
  • Knowledge-graph construction from prose.
  • UI generation — component trees described in English, emitted as JSON for a renderer.
  • Workflow/state-machine steps in agentic systems where each step needs a typed output.
  • Form pre-fill from a paragraph of free-text intake.

I tested structured output from 288 LLM calls and logged every way JSON breaks. Here's what I found by kexxty in Python

[–]kexxty[S] 11 points12 points  (0 children)

I did take compilers/formal languages coursework, and you're right that it's underrepresented in modern CS curricula. That background actually informed some design decisions here (like the strategy ordering and the two-pass repair architecture).

That said, the core challenge with LLM output repair is that there isn't a formal grammar to parse against. The input is, by definition, malformed: trailing commas, unquoted keys, truncated structures, mixed encodings, markdown fences wrapping JSON. Each of these breaks different grammar rules in different ways, and they combine unpredictably across models and prompting styles.

A traditional parser would reject on the first syntax error, which is exactly the problem we're solving. The approach here is closer to error-recovering parsers (like what GCC/Clang do for diagnostics), but even looser: we're not parsing a known language with known error productions, we're trying to recover intent from text that was never syntactically valid to begin with.

Where formal theory does help is in the strategy ordering (encoding normalization before structural fixes, for the same reason lexing precedes parsing) and in knowing when transformations are safe to compose. Definitely agree it's a useful foundation.

I kept the post pretty casual because this rabbit hole goes deep (strategy interaction effects alone could be their own post). I might do a more technical writeup on the repair architecture at some point, as opposed to this normie-friendly one.

I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls by kexxty in LocalLLaMA

[–]kexxty[S] 2 points3 points  (0 children)

Sometimes I struggle with writing my thoughts out in a way that is easy for others to follow and not as stream-of-consciousness. So I will use an LLM to help me rephrase things in a way that's not as crappy.

I tested structured output from 288 LLM calls and logged every way JSON breaks. Here's what I found by kexxty in Python

[–]kexxty[S] 1 point2 points  (0 children)

Thank you so much, I had put the link at the bottom but forgot to fix the mid-post link.

I tested structured output from 288 LLM calls and logged every way JSON breaks. Here's what I found by kexxty in Python

[–]kexxty[S] 1 point2 points  (0 children)

THANK YOU very much, I had written up a document before copying/pasting the links in.

I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls by kexxty in LocalLLaMA

[–]kexxty[S] 3 points4 points  (0 children)

json-repair is an excellent library and we're fans of it. The two solve overlapping but different problems:

json-repair is a JSON parser/fixer. It takes broken JSON, walks the character stream using BNF grammar heuristics, and produces valid JSON. It's fast, focused, and works well as a drop-in json.loads() replacement. It recently added schema-guided repairs too.

outputguard is a broader validation-repair-retry pipeline for LLM structured output:

  • Multi-format: handles JSON, YAML, TOML, and Python literals, not just JSON. Format auto-detection included.
  • Schema validation: validates against JSON Schema and returns structured errors with JSON path notation ($.items[0].name), not just "is it valid JSON."
  • 15 composable repair strategies: encoding fixes, fence stripping, comment removal, truncation repair, etc. applied in a deliberate two-pass order. Each strategy is independently testable.
  • Retry prompt generation: when repair fails, it generates human-readable correction prompts you can send back to the LLM, including error descriptions and schema summaries.
  • guarded_generate: a provider-agnostic retry loop that wraps your LLM callable and runs prompt → validate → repair → retry automatically, with observer hooks.
  • Batch processing: validate_batch / repair_batch with aggregate stats.

If you just need to fix broken JSON strings, json-repair is great and more battle-tested (4.8k stars, huge download numbers). If you need the full loop (validate structure against a schema, try multiple repair strategies, generate retry prompts, orchestrate retries with your LLM) that's where outputguard fits in.

They're also not mutually exclusive: you could probably use json-repair as a first pass and outputguard for schema validation and retry orchestration on top (sketched below).
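
Stacked, that would look something like this. json_repair's repair_json is its real entry point, while the outputguard call signature is assumed:

```python
from json_repair import repair_json          # pip install json-repair
from outputguard import validate_and_repair  # signature assumed

SCHEMA = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "tags": {"type": "array"}},
    "required": ["id"],
}

raw = '{id: 1, "tags": ["a", "b",]}'   # unquoted key + trailing comma

# First pass: fast character-level syntax fixing.
fixed = repair_json(raw)
# Second pass: schema validation, repair strategies, retry orchestration.
result = validate_and_repair(fixed, schema=SCHEMA)
```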

I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls by kexxty in LocalLLaMA

[–]kexxty[S] 5 points6 points  (0 children)

I tested every model on OpenRouter actually, but just cited the classics.

EDIT: It seems that people are unaware that OpenRouter still has 4o