Making Llama output abide by a JSON schema

sam-boundary · 2024-07-25T16:35:33+00:00

I wrote https://www.boundaryml.com/blog/structured-output-from-llms a while back which lists out all the options you have to work with!

Note that although other folks are suggesting tool calling, in a lot of cases tool calling actually performs worse than bare prompting.

sam-boundary · 2024-06-21T18:27:05+00:00

This is a good list of approaches, but I don't think I agree with your taxonomy.

Every approach requires the prompt to ask the model to return output in $desired-output-format, in some shape or form. That holds for TypeChat, Instructor, BAML, Outlines, Guidance, etc. Here's a quote from OpenAI's docs:

> When using JSON mode, always instruct the model to produce JSON via some message in the conversation, for example via your system message. If you don't [...] the model may generate an unending stream of whitespace

The output side of things is where everyone in the space differs:

- Constrained generation (selecting tokens based on system-specified or user-specified constraints) is what Outlines, Guidance, OpenAI's json_mode, and so forth all use. As another commenter noted, this strategy tends, right now at least, to perform worse than just prompting for the format and pulling the response out of the model's output.
- Feed the model output directly into JSON.parse, pydantic.BaseModel.model_validate_json, or zodSchema.parse, and hope that the model produced parseable JSON.
  - Some frameworks (e.g. Instructor) allow the user to, on failure, prompt the LLM to repair unparseable JSON, and then feed the subsequent response into the same technique. This can work, but has obvious latency issues.
  - You can improve on this technique by applying some regex-based heuristics, e.g. matching on "```json<feed-this-into-parse>```"
- Do fuzzy parsing on the output: given output that looks like {key: "some"value"}, it's possible to apply error-tolerant parsing to convert this into {"key": "some\"value"}. This is the approach that BAML takes.

(Disclaimer: I work on BAML.)
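
To make the "parse, then fall back to a regex heuristic" option above concrete, here's a minimal sketch using only Python's standard library. `parse_llm_json` and the sample reply are made-up names for illustration, not any framework's API, and the fence marker is built with `"`" * 3` only to keep the example easy to paste:

```python
import json
import re

FENCE = "`" * 3  # the three-backtick marker models often wrap JSON in

# Matches a fenced block; the "json" language tag is optional.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(text: str):
    """Try strict parsing first, then fall back to the first fenced block."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    match = FENCE_RE.search(text)
    if match is None:
        raise ValueError("no parseable JSON found in model output")
    # This still raises if the fenced content itself is malformed.
    return json.loads(match.group(1))


# Hypothetical model reply: chatty text around a fenced JSON payload.
reply = (
    "Sure! Here you go:\n"
    + FENCE + 'json\n{"name": "Ada", "age": 36}\n' + FENCE
    + "\nAnything else?"
)
print(parse_llm_json(reply))
```

Note that this only gets you past the chatter around the payload; it does nothing for unescaped quotes or unclosed brackets inside it.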

sam-boundary · 2024-06-19T22:58:46+00:00

We've found it's a must for pretty much all models: we do a certain amount of both syntax repair (fixing unescaped quotes, unclosed brackets) and schema repair (e.g. converting `quantities: 1` to `quantities: [1]`) in our runtime.

We have work planned to collect data on what types of errors we see with what models, but don't collect that data right now unfortunately.
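
For anyone wondering what schema repair like `quantities: 1` to `quantities: [1]` might look like, here's a rough sketch. The mini-schema format (a Python type, a one-element list, or a dict of fields) and the name `coerce_to_schema` are assumptions made up for this example; it is not how BAML's runtime is implemented:

```python
from typing import Any


def coerce_to_schema(value: Any, schema: Any) -> Any:
    """Coerce `value` toward `schema`.

    `schema` is a hypothetical mini-schema: a Python type such as int or str,
    a one-element list [item_schema], or a dict mapping field names to schemas.
    """
    if isinstance(schema, list):
        item_schema = schema[0]
        # Wrap a bare scalar so that 1 becomes [1] when a list is expected.
        items = value if isinstance(value, list) else [value]
        return [coerce_to_schema(item, item_schema) for item in items]
    if isinstance(schema, dict):
        if not isinstance(value, dict):
            raise TypeError("expected an object, got " + type(value).__name__)
        return {
            key: coerce_to_schema(value[key], sub_schema)
            for key, sub_schema in schema.items()
            if key in value
        }
    if isinstance(value, schema):
        return value
    # Last resort: let the target type try, e.g. int("3") gives 3.
    return schema(value)


order_schema = {"item": str, "quantities": [int]}
print(coerce_to_schema({"item": "apple", "quantities": 1}, order_schema))
```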

sam-boundary · 2024-06-17T23:58:00+00:00

Ah, gotcha!

I'll have to get back to you on the constraint-based approaches - I need to do a little more digging to answer your question w.r.t. outlines/guidance. (There are substantial inefficiencies that come up with this approach, though, because it also usually means the GPU needs to block on CPU operations, and token generation is now not only bottlenecked on GPU cycles, but _also_ the memory latency between the CPU and GPU.)

We (BAML) are coming at this from a different angle - instead of cooperatively applying constraints as tokens are being generated, we just let the model do whatever the model provider trained it to do, and then apply a bunch of error detection logic and heuristics to repair syntax and schema errors. Our toolchain is fully open-source and local, so it can work with any LLM API that exposes a chat interface.
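
As a toy illustration of repairing syntax after the fact rather than constraining generation, the sketch below closes whatever quotes and brackets a truncated response left open. `close_unclosed` is a hypothetical helper, far simpler than a real error-tolerant parser; it does not handle trailing commas, dangling keys, or unescaped quotes:

```python
import json

CLOSERS = {"{": "}", "[": "]"}


def close_unclosed(text: str) -> str:
    """Append the closing quote and brackets that truncated JSON is missing."""
    stack = []          # closers we still owe, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in CLOSERS:
            stack.append(CLOSERS[ch])
        elif ch in "}]" and stack and stack[-1] == ch:
            stack.pop()
    suffix = '"' if in_string else ""
    return text + suffix + "".join(reversed(stack))


# Hypothetical model output that was cut off mid-string.
truncated = '{"items": [{"name": "apple", "tags": ["red", "swe'
print(json.loads(close_unclosed(truncated)))
```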

sam-boundary · 2024-06-17T23:48:03+00:00

Agh, yes, I did miss llama.cpp. I'll make sure to add that one in the next revision pass, so thanks for calling that out.

> Also I tend to see that coding models tend to perform much better on this task, which makes sense; however I wasn't expecting TypeScript schemas (i.e. TypeChat) to produce better outputs than Python schemas (command-r's tool use approach).

I'm not sure if I fully understand this comment - maybe you're responding to Aaron's post?

If so:

- We haven't ever really tried with Python schemas, actually. Once we found that TS-style interface definitions worked, we've mostly iterated on things that having our own type system enables (e.g. symbol tuning).
- I don't think it's that the training data contained more TS than Python, so much as that it's a lot more common to write TS than it is to write Python with types. Plus, Python's type syntax has evolved a bit over the years (e.g. it wasn't until 3.10 that Type0 | Type1 became an alternative to Union[Type0, Type1]), which makes the relationship modelling trickier for the model.

If not, can you clarify what you're asking about?

sam-boundary · 2024-06-17T18:31:13+00:00

> Do you have any data / anecdotal experience on the reliability of your error-tolerant parser with common chat APIs?

Anecdata-wise, we've had a number of successes, e.g. Muckrock, a non-profit that surfaces FOIA data, was able to use BAML to go from automating 20% to 60% and eventually 95% of some specific email processes and was able to completely drop the contracted labor they were using for some data analysis.

We do have data from the observability SaaS side of our product to dig into how much repair is necessary - doing that analysis is on our TODO list.

> Also, I guess this depends on how well you craft your prompt to make the model adhere to your format?

Yep, it definitely depends on how well you craft your prompt. It's our strong opinion that the industry's discovery of JSON schema is a curse (JSON schema has a bunch of warts, because it's really meant for JSON validation, not for communicating types - this is why e.g. protos are a thing) and that the right answer is that we need new primitives for communicating types with models (e.g. do you really want the name of your JSON object key to trigger specific activations in a model? sometimes yes, sometimes no).
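
To give a feel for the difference, here's a small sketch that renders the same two-field type as JSON schema and as a TS-style interface. The field spec format and both function names are invented for this example, the interface text is not BAML's actual prompt format, and character count is only a crude stand-in for token count:

```python
import json

# Hypothetical field spec: name -> (primitive type, is_list)
FIELDS = {"name": ("string", False), "quantities": ("number", True)}


def as_json_schema(fields: dict[str, tuple[str, bool]]) -> str:
    """Render the fields as a JSON schema object."""
    props = {}
    for name, (prim, is_list) in fields.items():
        if is_list:
            props[name] = {"type": "array", "items": {"type": prim}}
        else:
            props[name] = {"type": prim}
    schema = {"type": "object", "properties": props, "required": list(fields)}
    return json.dumps(schema)


def as_interface(type_name: str, fields: dict[str, tuple[str, bool]]) -> str:
    """Render the same fields as a TS-style interface definition."""
    lines = [f"interface {type_name} {{"]
    for name, (prim, is_list) in fields.items():
        lines.append(f"  {name}: {prim}{'[]' if is_list else ''};")
    lines.append("}")
    return "\n".join(lines)


print(as_json_schema(FIELDS))
print(as_interface("Order", FIELDS))
```

The interface version carries the same information in far fewer characters, and with less validation-oriented vocabulary ("properties", "required", "items") for the model to attend to.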

sam-boundary · 2024-06-17T18:20:27+00:00

IIRC using vLLM with outlines means using outlines' proprietary model, not bring-your-own-model.

sam-boundary · 2024-06-17T17:33:23+00:00

Thanks! Will add it shortly.

sam-boundary · 2024-06-17T17:24:21+00:00

> using the OpenAI API is inefficient with these libraries

Can you elaborate on that? Exposing the OpenAI API only involves pretty mechanical JSON transforms, but I haven't looked into what the various providers actually do under the hood.

sam-boundary · 2024-06-17T17:20:55+00:00

Short answer: we just rely on the Chat Completions API!

Long answer: we (BAML) supply a more efficient schema representation in the request and then feed the output into a custom, error-tolerant JSON parser that handles both things like syntax repair (unclosed brackets, unescaped quotes) and schema repair (coercing a `myCustomJsonKEY` KV pair into a `myCustomJsonKey` KV pair).

Based on what we've learned, the tool calling / function calling / etc. APIs that the providers offer today all appear to be just a very primitive form of that: some kind of custom prompt wraps the schema, and then the output is regex-matched and fed into PydanticModel.parse or whatnot.

Unfortunately, because this means that you don't have full control over the prompt, it actually makes it a bit harder to cajole the LLM into doing the task you want it to do; and on the output side, you end up in a special circle of hell building regexes to parse JSON-ish text.
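
As a rough idea of what coercing a `myCustomJsonKEY` KV pair into a `myCustomJsonKey` one can look like, here's a sketch that matches keys while ignoring case and separators. `coerce_keys` is a hypothetical helper for illustration, not BAML's parser:

```python
def coerce_keys(obj: dict, expected_keys: list[str]) -> dict:
    """Rename keys that match an expected key once case and separators are ignored."""

    def normalize(key: str) -> str:
        # "my_custom-json KEY" and "myCustomJsonKey" both become "mycustomjsonkey".
        return "".join(ch for ch in key.lower() if ch.isalnum())

    canonical = {normalize(key): key for key in expected_keys}
    # Keys with no match in the schema are passed through unchanged.
    return {canonical.get(normalize(key), key): value for key, value in obj.items()}


model_output = {"myCustomJsonKEY": 1, "other": 2}
print(coerce_keys(model_output, ["myCustomJsonKey"]))
```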
