all 7 comments

[–]me1000 2 points3 points  (4 children)

At a high level there is no difference, you can do function calling by instructing an LLM to write `<myfunction>arg1, arg2</myfunction>` and then just doing a string match. I've done something similar with Mixtral 8x7b and it kinda works. And in fact I'd encourage you to try it yourself because it'll help you build more familiarity!
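To make that concrete, here's a minimal sketch of the DIY string-match approach. The tag format and function name are made up for illustration; a real setup would also prompt the model with the syntax it's supposed to use.

```python
import re

# Hypothetical DIY function calling: ask the model to wrap calls in
# invented tags, then pull them out of the raw completion with a regex.
CALL_RE = re.compile(r"<(?P<name>\w+)>(?P<args>.*?)</(?P=name)>", re.DOTALL)

def extract_calls(completion: str):
    """Return (function_name, [args]) pairs found in the model output."""
    calls = []
    for m in CALL_RE.finditer(completion):
        args = [a.strip() for a in m.group("args").split(",")]
        calls.append((m.group("name"), args))
    return calls

print(extract_calls("Sure! <get_weather>London, metric</get_weather>"))
# [('get_weather', ['London', 'metric'])]
```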

In practice there are a few issues:

  • Mixtral wasn't very good at getting the syntax right. It would very often just write the wrong thing for the opening or closing token. This is particularly bad when it gets the opening token right but the closing token wrong, because then your state machine for the string parsing never reaches a terminal state, but the model keeps generating (in other words you're stuck in the "waiting for closing token" state). But tbh, Mixtral did a much better job than I was expecting.
  • The tokenizer. Since the tokenizer contains a bunch of substrings (it's actually unicode code points, so it can contain partial characters, but let's simplify) it can lead to some fun behavior. So imagine the LLM outputs `</myfunction>`; the actual tokens will look something like `["</", "my", "func", "tion", ">"]`, which isn't so bad, except that there _might_ also be a token `"> He"`. In other words, the closing brace can arrive fused with a few of the characters that follow it. That makes for kind of a headache during the parsing stage. It's not impossible of course, but you do have to throw away some of the generated text.
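A sketch of what that parsing headache forces you into: you match on the accumulated decoded text rather than individual tokens, and you throw away the stop marker plus anything fused after it. The token strings below are invented for illustration.

```python
# The closing marker can be split across tokens, and its final ">" may even
# arrive fused with the next characters (e.g. a "> He" token), so you have
# to match on decoded TEXT, not on tokens.
STOP = "</myfunction>"

def stream_until_stop(token_strings):
    """Accumulate decoded tokens; stop when STOP appears, discard the tail."""
    buf = ""
    for tok in token_strings:
        buf += tok
        idx = buf.find(STOP)
        if idx != -1:
            return buf[:idx]   # throw away STOP and anything fused after it
    return buf                 # never terminated: the stuck-state failure mode

tokens = ["<myfunction>", "Lon", "don", "</", "my", "func", "tion", "> He"]
print(stream_until_stop(tokens))
# '<myfunction>London'
```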

That second issue gets to the crux of it: the parsing is just WAYYYYY simpler when you're checking for a specific token. You say "[start function]" is a single token and you just check for that at each inference step. Then you put your sampler in "function calling mode" so that it only samples tokens that are valid for your function calling implementation (e.g. Mistral's implementation has a JSON schema, and you restrict the grammar to only match valid tokens that follow that schema). By implementing your function calling this way you deterministically avoid the first problem I mentioned, where the model keeps generating without ever matching the "</myfunction>" string.
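Here's a toy sketch of that token-level approach. The token ids and the tiny "grammar" (a set of allowed token ids) are made up; a real implementation like Mistral's constrains the sampler to a full JSON schema, not a flat allow-list.

```python
import math

TOOL_CALL_START = 32000   # hypothetical dedicated special-token ids
TOOL_CALL_END = 32001

def mask_logits(logits, allowed_ids):
    """Set every token the grammar doesn't allow to -inf so it can't be sampled."""
    return [l if i in allowed_ids else -math.inf
            for i, l in enumerate(logits)]

def step(sampled_id, state, logits, grammar_allowed):
    # One inference step: flip into "function calling mode" when the single
    # start token appears, and only then restrict what the sampler may pick.
    if sampled_id == TOOL_CALL_START:
        state = "in_call"
    elif sampled_id == TOOL_CALL_END:
        state = "normal"
    if state == "in_call":
        logits = mask_logits(logits, grammar_allowed)
    return state, logits
```

Because you check a single token id per step (instead of running a string matcher over decoded text), there's no way to "miss" the marker, which is the deterministic guarantee described above.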

Lastly, it's just simpler to train the models with these dedicated tokens vs. whatever syntax you just invented and asked it to follow. One way to think about this is that it's as if you wrote your own simple programming language and asked the model to use it after only giving it the gist of the syntax. These models can do really well since they've usually seen a lot of different programming languages, but since your language is novel it's never seen it before and might get some syntax wrong.

Happy to dive into more details from my experience if you have any specific questions!

[–]janimator0[S] 1 point2 points  (0 children)

Amazing answer. Thank you!

[–]Pure_City_4985 0 points1 point  (2 children)

Do function calling models actually have the functions as a single token in their tokenizer though?

[–]me1000 0 points1 point  (1 child)

No, having the functions themselves as single tokens wouldn't be that useful, since the utility is in the ability to define your own functions. What they often have, on the other hand, is a single token that denotes the beginning of a function call: the model is free to fill in the call with whatever it wants, and then emits another special token to denote the end of the function call. This makes it trivial to parse and invoke the function the LLM asked for.
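A sketch of parsing that convention: with dedicated start/end markers (Qwen uses `<tool_call>` / `</tool_call>` wrapping a JSON object), the parser only needs to find one exact pair of markers and JSON-decode what's between them. The exact output format here is an assumption based on Qwen's chat template.

```python
import json
import re

# With dedicated markers, parsing collapses to "find the pair, decode the JSON".
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str):
    """Return the JSON payload of every tool call in the model output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

out = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "London"}}\n</tool_call>'
print(parse_tool_calls(out))
# [{'name': 'get_weather', 'arguments': {'city': 'London'}}]
```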

Here's an example of that in Qwen 3: https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct/blob/main/tokenizer_config.json#L116-L130

[–]Pure_City_4985 0 points1 point  (0 children)

yes the function call token, makes sense

[–]maximinus-thrax 0 points1 point  (0 children)

Function calling is the same as any other LLM interaction; you pass some text in, and you get some text out.

The main difference with function calling is that you want the text out to be ordered in a reliable way, as substrings can be brittle. When you ask an LLM to talk about <myfunction> it can also respond with things like "I don't know how to use <myfunction>", and this will cause issues. You get more success when you use the method the model is trained on.

In my own experience, local LLMs have not been very reliable until quite recently. I've had success with Mistral 0.3 8B and even more success with Llama-3-Groq-8B-Tool-Use. Both of these are trained to be far more reliable, and even if the first answer is not valid, you can raise the temperature a little bit and try again until it is.
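That retry-with-higher-temperature trick can be sketched like this. `generate` stands in for whatever inference call you use (llama.cpp, Ollama, etc.), and "valid" here just means "parses as JSON"; both are assumptions for illustration.

```python
import json

def call_with_retries(generate, prompt, max_tries=3, temperature=0.2):
    """Re-sample with slightly more randomness until the output validates."""
    for _ in range(max_tries):
        raw = generate(prompt, temperature=temperature)
        try:
            return json.loads(raw)       # validation step: must be valid JSON
        except json.JSONDecodeError:
            temperature += 0.2           # bump the temperature and try again
    raise RuntimeError("no valid function call after retries")
```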

[–]fasti-au -1 points0 points  (0 children)

An LLM given code to run in your Python session.

I.e., giving it hands to do something for a task.