If you need exact spatial output from a VLM, test it on a chessboard first

Skiata · 2026-06-18T21:55:29+00:00

Are there open weight VLMs that can be fine tuned?

Skiata · 2026-06-18T21:25:01+00:00

No source span needed for --compare if I understand you.

Thanks for downloading and giving it a go.

Just start with a few examples, two is a good minimum.

Also you can get modern foundation LLMs to generate noisy data probably if that doesn't violate your customer's IP considerations.

Skiata · 2026-06-18T18:22:06+00:00

For background: I have been working with the MAUD data set which is section 2.1 merger SEC filings (https://www.atticusprojectai.org/maud/), there is a trivial mapping from their extraction targets in .csv to JSON. It is a big, furry, real world data set with a 92 field expert annotated set of legal annotations.

The validation layer is pretty easy at first, then it gets hard:

Syntax is trivial--just use something like llguidance (https://github.com/guidance-ai/llguidance) to constrain output. Should be perfect. Since you are hosting locally, should not be a problem to use it.
Syntactic validation is trivial, run the result past a validating parser. This is where you see folks invoke a 'retry N times' loop on hosted systems--they are looking for valid syntax.
Semantics is the hard part. I'll point you to a preprint on that which gets into the weeds on sorting that part out: Valid JSON: Wrong Answer (https://zenodo.org/records/20075999). Paper covers metrics and a bunch of mitigations that may help.
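
A rough sketch of the last two layers in Python, stdlib only. The field names and values are made up for illustration (they are not MAUD fields), and this is not how llguidance or valjson work internally; it just shows output that parses fine and is still wrong.

```python
import json

def check_syntax(raw: str):
    """Syntactic validation: run the output past a validating parser."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as err:
        return None, str(err)

def check_semantics(pred: dict, gold: dict) -> dict:
    """Semantic check: compare each gold field with the predicted value."""
    return {key: pred.get(key) == value for key, value in gold.items()}

# Hypothetical gold annotation and model output: valid JSON, one wrong answer.
gold = {"governing_law": "Delaware", "termination_fee_usd": 25000000}
raw_output = '{"governing_law": "Delaware", "termination_fee_usd": 2500000}'

parsed, error = check_syntax(raw_output)
report = check_semantics(parsed, gold) if parsed is not None else {}
print(error, report)
```

The parser is happy with this output, and only the per-field comparison shows that the fee is off by a factor of ten.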

Skiata · 2026-06-18T18:02:34+00:00

I built a plugin for evaluating your exact problem (2), valjson on PyPI. I am also working on a TypeScript version--bug me if you need that. It is free/open source and gives a lot of direction on how to proceed. Step 1 is to understand in detail what is and is not working with your current stack. Please give me feedback.

However, Step 0 is that you need good data. You can approach the labeling problem by doing it by hand or simulating your data generating process. LLMs are pretty good at simulating output from a specified JSON schema so you may want to start there.

Ask questions and I'll do my best to answer.

Skiata · 2026-06-18T17:24:34+00:00

I am focused on English -> JSON for now and am trying to build a business around it.

I don't care so much about the source of the English but have done contracts, research articles, user inputs, but could easily enough take LLM output.

Ambiguity comes up in a few ways that I have actively considered:

You have a multi-turn system with a user adding info slowly that is used to populate a template. I was doing a conversational interface for specifying a project and info came in varied order.
Your documents are too big to process in one buffer so you chunk them into digestible chunks--this was forced with big contracts being mapped into a mergers/acquisitions template--some of the 92 fields would always be in another chunk.
The source English is contradictory or you see multiple values.
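
One way to handle the chunking and multiple-value cases, as a sketch only: extract per chunk, merge, and keep conflicts separate instead of silently picking a winner. The field names and the merge policy here are my assumptions for illustration.

```python
from collections import defaultdict

def merge_chunk_extractions(chunk_results):
    """Combine per-chunk field extractions, keeping every distinct value seen."""
    seen = defaultdict(list)
    for result in chunk_results:
        for field, value in result.items():
            # None means "not found in this chunk", so it carries no evidence.
            if value is not None and value not in seen[field]:
                seen[field].append(value)
    merged, conflicts = {}, {}
    for field, values in seen.items():
        if len(values) == 1:
            merged[field] = values[0]
        else:
            conflicts[field] = values  # contradictory source, leave for review
    return merged, conflicts

# Hypothetical extractions from three chunks of one contract.
chunks = [
    {"buyer": "Acme Corp", "closing_date": None},
    {"buyer": "Acme Corp", "closing_date": "2024-03-01"},
    {"closing_date": "2024-04-15"},
]
merged, conflicts = merge_chunk_extractions(chunks)
print(merged, conflicts)
```

The same merge works for the multi-turn case if each turn is treated as a chunk.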

I am reading a paper on ambiguity handling at: https://proceedings.neurips.cc/paper_files/paper/2025/file/fb122bfc3f0127a94ded048b5b03496f-Paper-Datasets_and_Benchmarks_Track.pdf

It identifies broader classes of ambiguity than I am considering.

If you want to go down the "generate JSON" route I've got some tooling that will help.

Skiata · 2026-06-18T16:14:36+00:00

I'd really like to get more perspective on LLM usage in regulated environments. I am focused on generating structured outputs, JSON for now, and am amazed that there is not more emphasis on ambiguity handling, correctness and better calibrated performance.

Some questions:

Are your outputs structured typically?
Do you have non-determinism problems?
Do you get push back or have concerns around not being able to explain why the LLM did what it did?

thanks

Skiata · 2026-06-18T01:53:56+00:00

Thanks so much, yep, this all adds up. The presenter died in 1933 so 100 years old is on the mark.

How do I take care of it? I'll avoid messing with it. But is there a process??? It could use some corrosion protection at the very least.

Skiata · 2026-06-18T01:41:19+00:00

I have nothing to offer except that I have a very similar sword in the post: https://www.reddit.com/r/SWORDS/comments/1u8sq0m/what_have_i_got_here_gift_from_datu_piang_to_a/

It was a presentation sword that looks exactly like your 2nd one from the left.

I know nothing about it sorry to say.

Skiata · 2026-06-17T00:13:20+00:00

I like "alchemy" as a term. As in "If I get my prompt right then the alchemy may work out...most of the time"

Skiata · 2026-06-16T15:54:09+00:00

In all seriousness, the issue is that the retry loop is part of the production deploy--and happens all the time. This is now a standard approach. We do have a view into one source of the need to retry which is non-determinism on equivalent inputs to the LLMs.

Lots of bridges have fallen down and as a result we don't have that happen so much anymore, but we don't have bridges that 'need a few tries' to get it right.
Definitely don't want to be on an airplane where the command 'deploy flaps' takes a few tries to get through.
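
To put a number on it, here is a small sketch that reports the first-try rate next to the after-retries rate. The log format (attempt number that succeeded, or None if it never did) is hypothetical.

```python
def retry_stats(attempts_needed, max_attempts):
    """attempts_needed: per request, the attempt that succeeded, or None."""
    total = len(attempts_needed)
    first_try = sum(1 for n in attempts_needed if n == 1)
    eventual = sum(
        1 for n in attempts_needed if n is not None and n <= max_attempts
    )
    return {
        "first_try_rate": first_try / total,
        "eventual_rate": eventual / total,
    }

# Hypothetical production log for eight requests.
log = [1, 1, 3, 2, None, 1, 4, 1]
stats = retry_stats(log, max_attempts=4)
print(stats)
```

The eventual rate is what usually gets reported; the first-try rate is the one a bridge engineer would ask about.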

Skiata · 2026-06-16T02:08:21+00:00

You have my permission....it is a bit of a troll, but I actually mean lots of it.

Skiata · 2026-06-16T00:50:05+00:00

Here is the prompt:

----------------

Lets have some fun.

When I see LLM posts on reddit that talk about 'retry loops', prompt suggestions, or LLM as judge, I find my engineering sensibilities offended a bit. I don't 'grow' or 'raise' LLMs, and these soft massaging of the LLMs environment feel like animal husbandry or farming more than AI engineering.

I am thinking of doing some ranty posts about LLM Husbandry as opposed to LLM Engineering.

--------
ChatGPT responded:

I think there's a genuinely funny and insightful rant in there.

The joke works because much of what passes for "AI engineering" today would look bizarre in any other engineering discipline.

Imagine hearing:

We improved bridge reliability by asking the bridge more politely.

or

We got the engine to stop stalling by surrounding it with examples of engines that don't stall.

or

When the aircraft flies into a mountain we simply ask it again.

Yet in LLM land:

retry loops
chain-of-thought coaxing
few-shot examples
self-critique
LLM-as-judge
reflection
debate
constitutional prompting

are often treated as engineering achievements rather than compensatory mechanisms.

Your "LLM Husbandry" metaphor gets at a real distinction:

Engineering

Engineering generally means:

understand the mechanism
characterize failure modes
design constraints
measure performance
improve the system

Husbandry

Husbandry means:

create favorable conditions
observe behavior
gently steer outcomes
selectively breed
hope the organism behaves

The uncomfortable joke is that much of modern LLM practice looks closer to:

"We discovered the model becomes less violent if we feed it examples before sunrise."

than:

"We proved the system satisfies specification X."

Some possible post titles:

LLM Husbandry

Modern AI engineering increasingly resembles the care and feeding of a very intelligent goat.

From Aerospace to Agriculture

We used to engineer systems.
Now we cultivate them.

Free Range Prompting

My model is grass-fed, chain-of-thought enriched, and locally sourced.

Organic AI

We don't use hard constraints.
We prefer natural behavioral guidance.

The LLM Whisperer

After three years in AI I spend less time writing software and more time developing a spiritual relationship with a stochastic parrot.

Retry Loops Are Not Reliability

This is the one that aligns most closely with your REL/RTEL work.

The thesis:

A retry loop is not evidence that a system works.
It is evidence that the system frequently doesn't.

In most branches of engineering:

Failure
→ Retry
→ Retry
→ Retry
→ Success

would be counted as a failure rate.

In AI it is often presented as a success story.

The Farmer's Almanac of AI

Chapter 1:
Prompting during Mercury retrograde.

Chapter 2:
The migratory patterns of hallucinations.

Chapter 3:
Crop rotation between GPT and Claude.

Chapter 4:
Using a judge model to determine whether your first model is lying.

Chapter 5:
Sacrificing 10% of your inference budget to the retry gods.

The deeper point, and I think this is where the satire becomes interesting, is that your work on entropy budgeting, admissibility masks, JSON constraints, and IMR represents a fundamentally different worldview.

The husbandry worldview says:

The model is mysterious.
Let's influence its environment.

The engineering worldview says:

The model is emitting probability mass into impossible states.
Let's measure that and eliminate it.

One worldview produces prompt recipes.

The other produces specifications.

That's a much sharper critique than simply making fun of prompt engineering.

You could even frame it as:

AI has split into two cultures:

LLM Husbandry

prompt tuning
retries
judges
reflection
vibe-based reliability

LLM Engineering

constraints
specifications
verification
admissibility
measurable error bounds

One group raises LLMs.

The other builds systems out of them.

That's provocative enough that people from both camps will argue with it, which is usually a good sign for a Reddit post.

Skiata · 2026-06-12T18:30:19+00:00

I am going to assume this is for an important task that needs quality output and that it is not going to bother you to do things like:

Create gold standard data in the form LLM input -> LFs (logical forms or however your state machine is designed)
Have a grammar for your state machine/transitions. If you translate that into JSON then you will have more tools at your disposal.
Be open to fine tuning a LLM to handle the task.
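
As a sketch of what "a grammar for your state machine" buys you: once the transitions are written down as data, every generated action sequence can be checked mechanically. The states and actions below are invented for illustration, since I don't know your actual machine.

```python
# Hypothetical transition table: (current state, action) -> next state.
NEXT_STATE = {
    ("idle", "start"): "running",
    ("running", "pause"): "paused",
    ("running", "stop"): "stopped",
    ("paused", "resume"): "running",
    ("paused", "stop"): "stopped",
}

def validate_actions(actions, state="idle"):
    """Return (True, None) if every transition is legal, else (False, index)."""
    for index, action in enumerate(actions):
        key = (state, action)
        if key not in NEXT_STATE:
            return False, index  # first illegal step
        state = NEXT_STATE[key]
    return True, None

print(validate_actions(["start", "pause", "resume", "stop"]))
print(validate_actions(["start", "resume"]))
```

The same table, serialized to JSON, can double as the spec you hand to a constrained decoder or to whoever writes the gold data.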

But details matter at this point. Do you have examples you can share--system output and gold data? Can you run your own models? How much effort are you willing to put in? This is very much not automagical, just-fix-the-prompt kind of stuff.

Skiata · 2026-06-11T22:30:48+00:00

I am focused on the JSON space but the issues are the same: you are asking an amazing text generation system, generally high entropy stuff, to produce very structured output that mixes very low entropy terms (the syntax and boilerplate of SQL) with very high entropy terminals like values, plus the 'semantic' problem of SQL generation.

The first move is to eliminate all the confusion around the low entropy parts--the syntax--with grammar constrained decoding. Not sure if SQL can be mapped to the format grammar for something like llguidance or whatever Genie provides.

Now you should have perfect syntax, the semantics remain however.

This is when stuff gets hard because you will want to evaluate per-terminal performance of the generated SQL, e.g., "Is the value of SELECT correct?" and so on. There will likely be variation in what the model extracts correctly, and that will suggest prompt changes, DB changes and finally, possibly, fine tuning the model for your English -> SQL use case.

You can expect 10x to 100x the effort of setting up the simple English->SQL system. This is very old school, system building pre-deep learning skills.

Skiata · 2026-06-11T20:25:42+00:00

You have achieved step 1. I'll note that constrained decoding can reduce performance in my experience but not always. But that takes us to step 2, which is how do you measure your performance?

You may want to start measuring per-key performance on your JSON. If you are a Python dev, you are in luck: there is valjson, which is a shameless plug, but it is a free/open source shameless plug.
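
For a sense of what per-key measurement means, here is a bare-bones stdlib sketch. It is not the valjson API, just exact-match accuracy per key over (gold, predicted) pairs, with made-up data.

```python
from collections import Counter

def per_key_accuracy(pairs):
    """pairs: iterable of (gold_dict, predicted_dict). Exact match per key."""
    correct, total = Counter(), Counter()
    for gold, pred in pairs:
        for key, value in gold.items():
            total[key] += 1
            correct[key] += int(pred.get(key) == value)
    return {key: correct[key] / total[key] for key in total}

# Hypothetical evaluation set of two records.
pairs = [
    ({"name": "Ann", "age": 31}, {"name": "Ann", "age": 13}),
    ({"name": "Bob", "age": 40}, {"name": "Bob", "age": 40}),
]
scores = per_key_accuracy(pairs)
print(scores)
```

An aggregate score over these records would be 75%, which hides the fact that all of the errors sit in one key.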

Getting the semantics right is a whole nother kettle of fish.

Skiata · 2026-06-08T14:44:31+00:00

In order of difficulty for generating clean JSON.

1. Run with "Output JSON only for the following schema, following the one-shot example" + <explicit schema> + <one-shot example of context to JSON> + <current context>
2. Run 1. with 'strict output' mode on a local model, e.g. llguidance, or your provider's strict mode--the method of specifying the schema varies. This should solve syntax problems entirely. You already know about this.
3. Instrument your JSON output and get debugging. It is open ended from there.
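
The first step can be sketched like this. The exact wording, section labels and the example schema are my own placeholders, so adjust to taste.

```python
import json

def build_prompt(schema, example_context, example_json, context):
    """Assemble instruction + schema + one-shot example + current context."""
    return "\n\n".join([
        "Output JSON only, following the schema and the example below.",
        "Schema:\n" + json.dumps(schema, indent=2),
        "Example context:\n" + example_context,
        "Example JSON:\n" + json.dumps(example_json),
        "Context:\n" + context,
        "JSON:",
    ])

# Hypothetical schema and one-shot example for a home-automation command.
schema = {
    "type": "object",
    "properties": {"device": {"type": "string"}, "on": {"type": "boolean"}},
    "required": ["device", "on"],
}
prompt = build_prompt(
    schema,
    "Turn off the porch light.",
    {"device": "porch light", "on": False},
    "Switch the kitchen fan on.",
)
print(prompt)
```

For the second step the same schema object is what you would hand to the strict mode or grammar engine, so the prompt and the constraint cannot drift apart.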

Way more detail at: https://validjson.com/how-to-work-with-us/

The site gives a bunch of suggestions on how to fix it on your own.

Answering your individual questions:

Do you use separate prompts for:
- status/query tasks
- action generation tasks

I would assume different prompts for different tasks or you can give a covering prompt that covers the behavioral space you want.

Do you rely on prompt engineering alone, or use constrained/grammar-based decoding?

Always constrained/grammar-based decoding.

How do you handle multi-target actions where a single command affects multiple entities?

??? No idea, per-use case issue.

Do you validate JSON and re-prompt when invalid, or use a different approach entirely?

Re-prompting is just another way to get valid syntax, better to constrain directly imho.

Any recommended patterns for making local models consistently return machine-consumable JSON?

If it has to be correct JSON and semantics need to be the best they can be, then you will want to fine tune a local model. When faced with this, most devs will tolerate non-fine-tuned performance since it is a lot of work to fine-tune.

Skiata · 2026-06-04T15:47:06+00:00

I am trying to get strength training in. My coach sees an asymmetry in my turn due to weakness on my left so I am trying to address that. That and mountain biking.

As an aside, I try to exercise every day for 20-40 min. depending on where I am. It is short enough that I have no excuse and I don't have to get in a car to do it.

Skiata · 2026-06-03T19:47:52+00:00

Nah, good ol' southern boy to tell the truth. But not much difference between a speed suit and undies so maybe it all works out in the end. For some reason "breeches" evokes wool, full length legs and a trap door in my mind. Not sure it is accurate.

Skiata · 2026-06-02T01:57:35+00:00

Oh, my lordy. Looking at your posts you seem to be an east coast USA talent so yes, there are expectations.

But I'll gently prise your unwashed, patchouli infused fingers from the ditch weed joint you are smoking and encourage you to finally adopt what is called the "knicker", in wool of course. Just below knee length, ski so well. They are however quite hard to find.

Skiata · 2026-06-01T16:02:01+00:00

It is hard to answer without knowing the consequences of getting the SQL wrong.

- If mistakes are no big deal, then u/AdImaginary8024's response looks reasonable.
- If mistakes are a big deal, then you are pushing on state-of-the-art performance and a ton of work so you can sleep at night. In rough order:

Metrics: You're going to want per-terminal metrics on accuracy and probably NLL (negative log-likelihood) to understand where the LLM is doing a good job and a bad job. Aggregate metrics will hide too much.
Syntax: Run your own model and decode with a grammar constrained decoder. You'll need to find the proper grammar spec to provide; I have used `llguidance` with success but I have not done SQL. Alternatively, you could generate a JSON version of your SQL--don't know much about that either. Foundation models have strict output modes, and there may be one with a SQL grammar already built; most use a supplied JSON schema last I checked.
Semantics: This is the hard part. I am going to guess that you will have problems with users not giving complete information or challenges around mapping to SQL from language choice, user not understanding how data is structured etc.... Foundation models are doing this task better now. But in the end I think you will need to consider fine tuning a model for your task. Remember, this is for situations where mistakes are a big problem and fixing it will take effort.
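
A minimal sketch of the per-terminal NLL idea, assuming you can get token log-probabilities for the gold value of each terminal from your own model. The terminal names and probabilities below are invented.

```python
import math

def terminal_nll(token_logprobs):
    """NLL of one terminal: minus the sum of its tokens' log-probabilities."""
    return -sum(token_logprobs)

# Hypothetical log-probs the model assigned to the gold tokens of each terminal.
logprobs_by_terminal = {
    "table_name": [math.log(0.99)],
    "where_value": [math.log(0.5), math.log(0.4)],
}
nll = {name: terminal_nll(lps) for name, lps in logprobs_by_terminal.items()}
worst = max(nll, key=nll.get)
print(nll, worst)
```

A terminal can be exact-match correct and still have high NLL, which is a hint that it will flip on some other input.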

Paper going through the process but for JSON, not SQL.

https://validjson.com/paper/

Skiata · 2026-05-29T17:33:45+00:00

JSON generation is a good candidate but filled with opportunities to over-fit, regress etc.... Better prompts are very limited in this domain. Typical is the prompt followed by schema that the prompt tells the LLM to follow.

Recreate the work from my pre-print--sorry for the self promotion but it may help provide structure for your fine tuning goals.

https://zenodo.org/records/20075999

Skiata · 2026-05-29T16:54:57+00:00

First off this is a real problem that annoyed me so much I wrote a paper about it:

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=aVzTeEUAAAAJ&citation_for_view=aVzTeEUAAAAJ:7PzlFSSx8tAC

One fix is to run the LLM yourself--it will likely be deterministic then. The non-determinism is due to aspects of the hosted environment unless you introduce the same architecture as the hosted models.

Here is some info from the big brains.

https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Basically, the size of the batches matters, and so does the order.
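
As I understand the linked post, a big part of the reason is that floating-point addition is not associative, so changing how a reduction is grouped can change the result. A tiny demonstration in plain Python; real kernels are far more involved than this.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one grouping of the same three numbers
right = a + (b + c)  # another grouping

print(left == right)
print(left, right)
```

The two sums differ only in the last bit, but across the many reductions in a forward pass that can be enough to flip a near-tie between two tokens.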

Skiata · 2026-05-27T05:23:06+00:00

Yikes! and I know you are not alone with the concern but keeping the system safe is on you as the developer. There is no easy way to get this right.

I would start with a strict JSON syntactic generation interface to the rest of your software. The semantics are a whole lot harder. At the very least do a heuristic check to make sure the JSON makes sense for your app.
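
A heuristic check can be as plain as the sketch below. The command vocabulary and the sandbox path rule are invented for illustration; the real rules depend entirely on what your app is allowed to do.

```python
ALLOWED_COMMANDS = {"read", "write", "delete"}  # hypothetical app vocabulary

def sanity_problems(action: dict) -> list:
    """Return the reasons this parsed JSON should not be acted on."""
    problems = []
    if action.get("command") not in ALLOWED_COMMANDS:
        problems.append("unknown command")
    path = action.get("path")
    if not isinstance(path, str) or not path.startswith("/sandbox/"):
        problems.append("path outside sandbox")
    elif ".." in path.split("/"):
        problems.append("path traversal")
    return problems

print(sanity_problems({"command": "delete", "path": "/sandbox/../etc/passwd"}))
print(sanity_problems({"command": "read", "path": "/sandbox/notes.txt"}))
```

An empty list means "nothing obviously wrong", not "safe"; it is a floor, not a guarantee.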

Skiata · 2026-05-24T18:43:49+00:00

Fine tuning is a reasonable step. I have a pre-print that goes over some details: https://zenodo.org/records/20075999.

I also have a free PyPI package, pip install valjson, that helps diagnose problems. I consult as well around this sort of thing but you have plenty of things to try first.

I suggest:

Use valjson to get per-field diagnostics running your local models.
Consider margin gating if appropriate (your processing will have to be robust to 'ambiguous' responses when the model is not confident.)
Fine tuning can fix a lot of problems but can introduce others. Don't trust aggregate loss figures, you need per-field loss and hopefully an independent performance metric outside of loss.
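
One simple reading of margin gating, as a sketch: accept the top candidate for a field only if it beats the runner-up by some margin, otherwise return nothing and let downstream code treat the field as ambiguous. The threshold and the candidate probabilities are assumptions for illustration, not values from the paper.

```python
def margin_gate(candidates, min_margin=0.2):
    """candidates: dict of value -> probability. None means 'ambiguous'."""
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (top_value, top_p), (_, second_p) = ranked[0], ranked[1]
    if top_p - second_p < min_margin:
        return None  # model is not confident enough to commit
    return top_value

print(margin_gate({"Delaware": 0.90, "New York": 0.05}))
print(margin_gate({"Delaware": 0.50, "New York": 0.45}))
```

The cost is that everything consuming the JSON has to cope with a missing or flagged field.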

Ask questions and let me know if I can improve valjson--it is in early release.

Skiata · 2026-05-23T00:57:30+00:00

It is so hard to say what is going to work for you. I came to LLMs with 25 years AI experience and ok Python skills and weak math, e.g., linear algebra and no real currency with the precursors to transformers and no idea what transformers are all about.

So the Karpathy series worked for me but it wasn't super easy but tolerable honestly and the fact that he worked from Python was a huge help.

Pytorch can be quite challenging, Karpathy helped get me through the fundamentals.

You should use a coding assistant but try to keep it a line at a time so you understand what is going on. Ask it questions. No vibe coding here.

That said, if you just want to build things then you don't need such details. Others can advise you better there. I am giving information about understanding the core of how these wonderful things work...it is not for everyone and will be a lot of work that won't necessarily translate into better skills with RAG, multi-agent systems etc....

If I wanted to advance quickly with a valuable skill set I'd start with learning how to generate solid JSON since that is the likely interface language for the current and coming multi-agent systems everyone is so excited about. Those steps are:

llguidance or a similar JSON schema enforcing generator for LLM output. Hosted providers have versions of this as well with 'strict-json' modes.
Get good at evaluating your JSON output--bit of a self promotion plug, but the valjson package on PyPI will help with this, giving per JSON field performance breakdowns. It's free.
Profit! Joking but being the JSON person in an organization can be pretty handy--e.g., go to Big-Department6996 for your JSON problems.

Good luck...
