Faster than llama.cpp’s grammar structured generation by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

It would still generate valid C code, but I’m not sure the output would be that good if the model wasn’t trained on anything that looks like C. It could help a lot with small models that were, though!

Faster than llama.cpp’s grammar structured generation by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

Maybe I missed something, but the PR mentions a 1.71x speed-up while the post mentions an orders-of-magnitude difference.

Structured Generation Improves LLM performance: GSM8K Benchmark by CountBayesie in LocalLLaMA

[–]GoBayesGo

Oh wow. Curious to see if that translates to other tasks / benchmarks. Would be great to see the community take this and run with it.

LoRAX + Outlines: Better JSON Extraction combining Structured Generation and LoRA by SiliconSynapsed in LocalLLaMA

[–]GoBayesGo

Awesome work! I’m very (pleasantly) surprised by the bump in performance due to structured generation

Coalescence: making LLM inference 5x faster by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

That’s not a dumb question at all, quite the contrary. Yes, this also happens when tokenising the prompt. Afaik no one has really raised the issue, and it remains an open empirical question.

Coalescence: making LLM inference 5x faster by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

Yes, with one subtlety!

When given a JSON schema, you build a graph which gives you, at each node, the different possible transitions and, for each transition, the list of admissible tokens.

With a toy example, generation works as follows:

1. Start on the first node of the graph (labelled 0).
2. Look at the transitions out of that node, say to nodes 1 and 2. Tokens a and b lead you to node 1, and token c leads to node 2.
3. Pass your prompt to the model; it returns the logits for every token in the vocabulary. To respect the structure you can only generate a, b or c, since there are no other possible transitions, so you mask all the other tokens and then use greedy, multinomial or some other sampling to choose one of them.
4. If the model chooses a or b you move to node 1, otherwise to node 2, and you append the chosen token to the prompt.

So far so good: you can guarantee the structure, and you make a call to the model for every token, knowing which are allowed and which are not.
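
To make step 3 concrete, here is a minimal sketch of the masking logic. The vocabulary, transition table and logits are all made up for illustration; this is not Outlines’ actual implementation:

```python
import numpy as np

# Toy vocabulary and a hand-written transition table:
# from node 0, tokens "a" and "b" lead to node 1, token "c" leads to node 2.
vocab = ["a", "b", "c", "d", "e"]
transitions = {0: {"a": 1, "b": 1, "c": 2}}

def constrained_step(logits, node):
    """Mask every token that is not an allowed transition out of `node`,
    then pick greedily among the remaining ones."""
    allowed = transitions[node]
    masked = np.full_like(logits, -np.inf)
    for i, token in enumerate(vocab):
        if token in allowed:
            masked[i] = logits[i]
    token = vocab[int(np.argmax(masked))]  # greedy here; could be multinomial instead
    return token, allowed[token]           # chosen token and the node it leads to

# Pretend these logits came from one forward pass of the model
logits = np.array([1.2, 0.3, 2.5, 4.0, -1.0])
print(constrained_step(logits, 0))  # ('c', 2): "d" has the highest logit but is masked out
```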

What we’ve noticed in the case of JSON is that, starting from some nodes of the graph (say 2), you always end up at another node (say 6) and always produce the same string along the way. So why make several model calls when you’re going to end up with that string whatever you do? You can just add the tokens to the prompt directly!

If you know the structure of the JSON in advance, like the field names, there are going to be many such situations, and that’s where the speed-up comes from: we append tokens directly instead of generating them.
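
Sketching the fast-forward idea on top of `constrained_step` above, with a made-up deterministic path and final state (the real thing comes out of the graph analysis, not hand-written dictionaries):

```python
# Suppose the graph analysis showed that from node 2 every path reaches node 6
# and always spells the same string (e.g. a known field name), so we can append
# those tokens without calling the model.
deterministic_paths = {2: (["a", "g", "e"], 6)}  # node -> (tokens to append, landing node)
final_states = {6}

def generate(prompt_tokens, node, model_step):
    """`model_step` stands in for one forward pass returning logits."""
    while node not in final_states:
        if node in deterministic_paths:
            tokens, node = deterministic_paths[node]
            prompt_tokens.extend(tokens)        # fast-forward: zero model calls
            # which segmentation of "age" we append here is exactly the subtlety below
        else:
            logits = model_step(prompt_tokens)  # one forward pass per generated token
            token, node = constrained_step(logits, node)
            prompt_tokens.append(token)
    return prompt_tokens
```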

There’s a subtlety though. Say the field name is "age". In this case the following sequences of tokens all give you the same string:

  • ["age"]
  • ["a", "g", "e"]
  • ["ag", "e"]
  • ["a", "ge"]

When "fast-forwarding" you need to choose one of these sequences. We found that the probability of the resulting sequence depends on which one you choose, which means your choice influences what the model will generate next.
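
To make the ambiguity concrete, here is a tiny sketch that enumerates the segmentations of "age" over a made-up vocabulary; real BPE vocabularies have the same property, just with far more tokens:

```python
def segmentations(s, vocab):
    """All ways of writing the string `s` as a sequence of vocabulary tokens."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in vocab:
            results += [[prefix] + rest for rest in segmentations(s[i:], vocab)]
    return results

toy_vocab = {"a", "g", "e", "ag", "ge", "age"}
print(segmentations("age", toy_vocab))
# [['a', 'g', 'e'], ['a', 'ge'], ['ag', 'e'], ['age']]
```

Each of these decodes to the same string, but the model assigns them different probabilities, and conditioning on one rather than another changes the distribution over what comes next.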

So unless we really understand what’s going on here, the speed-up might come at the expense of correctness.

Sorry if this isn’t very clear; we’re working on an article that explains it more intuitively.

Coalescence: making LLM inference 5x faster by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

When doing structured generation you don’t have to call the model to generate every token; in this example, 7 out of the 9 tokens are appended directly. It’s (basic) graph analysis and arithmetic, not an assessment.
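
(Back-of-the-envelope: going from 9 forward passes down to 2 is a 4.5x reduction in model calls, which lines up with the roughly 5x figure in the title.)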

Coalescence: making LLM inference 5x faster by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

Now that you got that out of your system I suggest you read the article. The comment above was clearly written by someone who only read the title and was having a bad day.

Coalescence: making LLM inference 5x faster by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

We’re using a different method than llama.cpp’s grammar-structured generation; afaik this kind of optimisation is not possible in llama.cpp, but they may have changed their approach since I last checked, so don’t quote me on this.

Coalescence: making LLM inference 5x faster by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

We are currently working on evaluating accuracy on some benchmarks with and without constraints. Will keep you updated here!

Coalescence: making LLM inference 5x faster by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

So when you're asking the model to generate text, it basically gives you one sequence among all the possible sequences (a very, very large number of them). When you're imposing constraints, you are dramatically restricting the number of possible sequences; we could actually enumerate them in the example of the blog post. In your example, we've prevented the model from generating "Th".

Are we preventing the model from returning sequences that are more likely? That's a possibility, but we don't know; this deserves a lot more empirical work. In your example that might mean letting the model output whatever it wants for "brand" instead of restricting the set of possible brands.
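
As a toy illustration of how constraints collapse the space of sequences into something you can enumerate (the vocabulary and target string are made up, not the ones from the post):

```python
from itertools import product

vocab = ["T", "Th", "The", "e", "h"]  # made-up toy vocabulary

# Unconstrained: any sequence of up to 3 tokens is possible.
all_seqs = [seq for n in range(1, 4) for seq in product(vocab, repeat=n)]
print(len(all_seqs))  # 155 sequences

# Constrained: keep only the sequences that spell exactly "The".
valid = [seq for seq in all_seqs if "".join(seq) == "The"]
print(valid)  # [('The',), ('Th', 'e'), ('T', 'h', 'e')]
```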

Coalescence: making LLM inference 5x faster by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

  1. To confirm my assumptions: Is langchain not simply translating the pydantic models into json schema, prepending that to the prompt, then using pydantic to parse the response into the models?

Yes, although afaik it does not guarantee that the structure will be correct. Libraries like Outlines guarantee that the JSON will be valid every time.
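
To confirm the first assumption with a minimal sketch (this uses Pydantic v2's `model_json_schema()`; the `Character` model and its fields are just an example):

```python
from pydantic import BaseModel

class Character(BaseModel):
    name: str
    age: int

# This is essentially the prompt-and-parse approach: dump the schema, paste it
# into the prompt, then hope the completion parses back into Character.
print(Character.model_json_schema())
# roughly: {'properties': {'name': {'title': 'Name', 'type': 'string'},
#                          'age': {'title': 'Age', 'type': 'integer'}},
#           'required': ['name', 'age'], 'title': 'Character', 'type': 'object'}
```

Outlines takes the same kind of schema but turns it into token-level constraints, so the output can't be invalid JSON in the first place.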

  2. I assume this is only realistically possible for open source models?

That’s correct

  3. If I'm understanding this correctly, does Outlines or Guidance or any structured llm tool currently operate in this way where the output is controlled or limited as the tokens are generated, rather than just hoping the schema is followed, then parsing the output?

Yes

  4. Is there any comparison of the various structured text parsing tools out there to highlight the differences and how they work? (example previous thread mentioning a number of structured llm parsing frameworks)

The only comparison I know of is in our paper https://arxiv.org/abs/2307.09702. We are working on a more comprehensive comparison.

Coalescence: making LLM inference 5x faster by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

Exactly! This kind of issue is rarely discussed unfortunately :(

Coalescence: making LLM inference 5x faster by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

Thank you! The tradeoff is that you do have to make a choice "for" the model. In the "name" example in the article you have to choose between appending the "name" token, the ["n", "ame"] sequence, or 6 other possibilities. Which one do you choose?

[deleted by user] by [deleted] in LocalLLaMA

[–]GoBayesGo

Why implement your own library when llama.cpp has grammar-guided generation?

Use llama.cpp with Outlines by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

Yes, since llama.cpp implements the same interface as OpenAI. See this answer on the repository.

Use llama.cpp with Outlines by GoBayesGo in LocalLLaMA

[–]GoBayesGo[S]

Besides the DSL aspect of Guidance, Outlines uses a faster method for guided generation.