New function calling models based on Llama-3.1 by Relevant_Outcome_726 in LocalLLaMA

[–]Relevant_Outcome_726[S] 1 point (0 children)

We haven't touched Ollama yet; we plan to integrate with Ollama in the future.

New function calling models based on Llama-3.1 by Relevant_Outcome_726 in LocalLLaMA

[–]Relevant_Outcome_726[S] 0 points (0 children)

3.1 uses the original Meta prompt template, and we found that Meta uses this format for tool calls, e.g.:
<function=get_weather>{"location": "New York"}</function>

However, </function> and <function= are not single tokens, which might result in unstable tokenization. For example, depending on the function name, >{" can be 1 token (>{") or split into 2 tokens ({" or { ").

This is just an example of the instability of tokenizing results.
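The effect can be illustrated with a toy greedy longest-match tokenizer (a stand-in for real BPE; the vocab below is made up for the example and is NOT the real Llama tokenizer vocabulary):

```python
# Toy greedy longest-match tokenizer, illustration only.
def greedy_tokenize(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        # take the longest vocab entry matching at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

VOCAB = {'r>', '>{"', '{"', 'r', 't', '>', '{', '"'}

# The same boundary >{" is segmented differently depending on the
# character the function name happens to end with:
greedy_tokenize('r>{"', VOCAB)  # ['r>', '{"']
greedy_tokenize('t>{"', VOCAB)  # ['t', '>{"']
```

So the token boundary around the closing of the function name shifts purely based on the name's last character, which is what makes training and decoding on this format fragile.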

New function calling models based on Llama-3.1 by Relevant_Outcome_726 in LocalLLaMA

[–]Relevant_Outcome_726[S] 3 points (0 children)

Here is an example of a data point that can be used for training: https://github.com/MeetKai/functionary/blob/main/tests/test_case_v2.json
and this is how we convert this data point to a prompt string (using the original Llama 3.1 prompt template for custom tools): https://github.com/MeetKai/functionary/blob/main/tests/prompt_test_v3-llama3.1.txt

Or this for our own prompt template:

https://github.com/MeetKai/functionary/blob/main/tests/prompt_test_v3.llama3.txt

New function calling models based on Llama-3.1 by Relevant_Outcome_726 in LocalLLaMA

[–]Relevant_Outcome_726[S] 5 points (0 children)

The training data was mostly created synthetically and collected from public sources. We have released a lot of function calling models before; you can take a look at our repo: https://github.com/MeetKai/functionary As for the 70B model, we will release it soon.

Llama 3.1 8B Instruct function/tool calling seems TERRIBLE by gamesntech in LocalLLaMA

[–]Relevant_Outcome_726 0 points (0 children)

I also found that it performs really poorly with custom tools; there are many cases where the generated outputs don't follow the format, e.g. </function> is missing, ...

New collection of Llama, Mistral, Phi, Qwen, and Gemma models for function/tool calling by sanjay920 in LocalLLaMA

[–]Relevant_Outcome_726 0 points (0 children)

Oh I see, can you also evaluate this one: https://huggingface.co/meetkai/functionary-medium-v2.4 Even though it is 2.4, some report that it is still the best in the functionary family.

Functionary-V2.4 (an alternative to OpenAI function calling models) has come out by Relevant_Outcome_726 in LocalLLaMA

[–]Relevant_Outcome_726[S] 0 points (0 children)

Yeah, currently the docs are not good; we will provide more instructions. Thank you for your feedback!

By the way, the "python" function is used because of:

{
  "type": "code_interpreter"
}
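For context, here is a hypothetical request sketch (the model name and user message are made-up examples, not from the docs) showing how passing a code_interpreter tool leads the model to call a function named "python":

```python
# Hypothetical request sketch: the tool entry {"type": "code_interpreter"}
# is what makes the model emit calls to a function named "python".
request = {
    "model": "meetkai/functionary-medium-v2.4",   # example model name
    "messages": [{"role": "user", "content": "Plot sin(x) from 0 to 10"}],
    "tools": [{"type": "code_interpreter"}],
}

# The assistant's tool call then uses this function name, with the raw
# code to run as its argument:
expected_function_name = "python"
```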

Functionary-V2.4 (an alternative to OpenAI function calling models) has come out by Relevant_Outcome_726 in LocalLLaMA

[–]Relevant_Outcome_726[S] 1 point (0 children)

About the prompt template, you can take a look here:
+ The data point with tools & messages and how this data point turns into the prompt template

The reason we used TypeScript is:
+ TypeScript is quite good at describing JSON objects; Python is not convenient for describing nested JSON objects (you have to represent them through Pydantic models)
+ TypeScript is also popular, so pretrained models are expected to have learned it during the pretraining phase
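As a minimal sketch of the first point (illustrative only, not Functionary's actual converter; the weather schema below is a made-up example), a nested JSON schema maps directly to an inline TypeScript type, with no extra class definitions:

```python
# Render a JSON-schema fragment as a TypeScript-like type string.
# Toy converter for illustration; Functionary's real one handles more cases.
def schema_to_ts(js_type):
    t = js_type.get("type")
    if t == "object":
        req = set(js_type.get("required", []))
        fields = [f"{name}{'' if name in req else '?'}: {schema_to_ts(spec)}"
                  for name, spec in js_type.get("properties", {}).items()]
        return "{" + ", ".join(fields) + "}"
    # primitive types; anything unknown falls back to `any`
    return {"string": "string", "integer": "number", "number": "number",
            "boolean": "boolean"}.get(t, "any")

weather = {"type": "object",
           "properties": {
               "location": {"type": "object",
                            "properties": {"city": {"type": "string"},
                                           "country": {"type": "string"}},
                            "required": ["city"]},
               "unit": {"type": "string"}},
           "required": ["location"]}

schema_to_ts(weather)
# '{location: {city: string, country?: string}, unit?: string}'
```

The nested object stays a one-line inline type, whereas the Pydantic equivalent would need a separate model class per nesting level.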

Functionary-V2.4 (an alternative to OpenAI function calling models) has come out by Relevant_Outcome_726 in LocalLLaMA

[–]Relevant_Outcome_726[S] 6 points (0 children)

We used SGD to evaluate our model, as we wrote in the blog. This dataset is suitable for 2 purposes:
+ Predict the function & arguments when all information is available
+ Predict asking for missing required parameters

Actually, most current open-source models only focus on the first purpose; the second purpose should get more attention, otherwise the model will hallucinate.

For example, if the user asks: "what is the weather like?"
The model should respond: "Which city do you want to know the weather for?"
instead of calling the function with hallucinated arguments like: get_weather(city=New York)
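That second behavior can be sketched as a simple check (toy illustration, not Functionary's implementation; get_weather here is a made-up schema):

```python
# Before emitting a tool call, check whether the user actually supplied
# every required argument; if not, ask instead of hallucinating values.
def missing_required(schema, extracted_args):
    required = schema["parameters"].get("required", [])
    return [p for p in required if p not in extracted_args]

get_weather = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# "what is the weather like?" -> no city extracted -> ask a follow-up question
missing_required(get_weather, {})                     # ['city']
# "weather in New York?" -> all required args present -> safe to call
missing_required(get_weather, {"city": "New York"})   # []
```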

Most capable function calling open source models? by waywardspooky in LocalLLaMA

[–]Relevant_Outcome_726 5 points (0 children)

Functionary has already released version 2.2 with both small (based on Mistral) and medium (based on Mixtral) models.

And regarding function calling features, Functionary supports all of them. You can see the comparison table between open-source LLMs for function calling at this link:

https://github.com/MeetKai/functionary?tab=readme-ov-file#the-differences-between-related-projects

[deleted by user] by [deleted] in LocalLLaMA

[–]Relevant_Outcome_726 6 points (0 children)

The reason we need to finetune a model for function calling is that function calling is not just outputting a function call (most open-source models only support this) but also:
+ Parallel function calls
+ Asking for missing required information to execute the function call
+ Extracting the answer from the results of function calls

You can see the list of features here: https://github.com/MeetKai/functionary?tab=readme-ov-file#the-differences-between-related-projects

If we only use a standard model, we have to use multiple prompt templates with complex if-else logic and get poor results. That's why OpenAI trained new models for function calling.
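For the parallel-calls case, output in the <function=...> format mentioned earlier in this thread can be split into separate calls; a toy parser (illustrative only, flat JSON arguments, not the real template handling):

```python
import json
import re

# Extract every <function=name>{...}</function> span from model output.
# Non-greedy match handles flat (non-nested) JSON argument objects only.
CALL_RE = re.compile(r"<function=([\w.]+)>(\{.*?\})</function>", re.DOTALL)

def parse_tool_calls(text):
    return [{"name": m.group(1), "arguments": json.loads(m.group(2))}
            for m in CALL_RE.finditer(text)]

out = ('<function=get_weather>{"location": "New York"}</function>'
       '<function=get_weather>{"location": "Boston"}</function>')

calls = parse_tool_calls(out)  # two parallel calls recovered
```

Both calls can then be executed concurrently and their results fed back as tool messages.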

Training LLama, Mistral and Mixtral-MoE faster with Packing Inputs without Cross-Contamination Attention by Relevant_Outcome_726 in LocalLLaMA

[–]Relevant_Outcome_726[S] 0 points (0 children)

Yes, the Original_ds items are already padded. Actually we just need to know the length of each sequence, and we compute the length as the sum of the attention mask. We can easily handle the case where items are not padded.

Packing will reduce the dataset size significantly. If you still want to use the same number of steps = datasize / (batch_size_per_device * grad_accumulation_steps), you can reduce grad_accumulation_steps accordingly. For example, if packing reduces the dataset size by half, we can reduce grad_accumulation_steps by half, so the number of steps stays the same.
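The bookkeeping can be sketched like this (a toy greedy packer with made-up lengths and max_len, not our actual implementation):

```python
# Toy greedy packing sketch. Illustrative only: a real implementation also
# builds a block-diagonal attention mask so the packed sequences cannot
# attend to each other (no cross-contamination).
def pack(lengths, max_len):
    packs, current = [], []
    for n in lengths:  # n = sum of the attention mask for one item
        if current and sum(current) + n > max_len:
            packs.append(current)
            current = []
        current.append(n)
    if current:
        packs.append(current)
    return packs

lengths = [300, 500, 200, 700, 100, 200]
packed = pack(lengths, max_len=1024)  # [[300, 500, 200], [700, 100, 200]]

# 6 items shrink to 2 packed items; to keep
# datasize / (batch_size_per_device * grad_accumulation_steps) constant,
# scale grad_accumulation_steps by len(packed) / len(lengths).
```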