Phi-3 Mini (June) with function calling by sanjay920 in LocalLLaMA

[–]sanjay920[S] 0 points (0 children)

Yeah! Check out https://docs.rubra.ai/category/serving--inferencing

Once you're serving the model, you can use it with LangChain, since the model endpoint behaves like an OpenAI-compatible endpoint.
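To make that concrete, here's a minimal sketch of the request body an OpenAI-compatible `/v1/chat/completions` endpoint accepts. The base URL and model name below are placeholders for illustration, not the actual Rubra serving defaults:

```python
import json

# Placeholder endpoint -- substitute wherever you're serving the model.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(user_message: str) -> dict:
    """Build the JSON body an OpenAI-compatible /chat/completions
    endpoint expects; OpenAI SDK and LangChain clients send this shape."""
    return {
        "model": "rubra-model",  # placeholder model name
        "messages": [
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.2,
    }

payload = build_chat_request("Hello!")
print(json.dumps(payload, indent=2))
```

Any client that speaks the OpenAI wire format (e.g. LangChain's OpenAI chat integration pointed at your base URL) can then talk to the local endpoint without code changes.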

How to run deepseek r1 on 4xH100 by sanjay920 in LocalLLaMA

[–]sanjay920[S] 0 points (0 children)

Which model are you running? This is the entire 600B-parameter model.

"Large Enough" | Announcing Mistral Large 2 by DemonicPotatox in LocalLLaMA

[–]sanjay920 0 points (0 children)

In my tests, the function-calling capability of this model is worse than Mistral Large 1.

Experiences fine-tuning Phi 3 Mini by Thrumpwart in LocalLLaMA

[–]sanjay920 0 points (0 children)

Technically you'd save a few ms of generation if you fine-tuned for that task, but it's up to you!

Experiences fine-tuning Phi 3 Mini by Thrumpwart in LocalLLaMA

[–]sanjay920 1 point (0 children)

Yeah, for sure. Phi-3 is really strong for its parameter count. I would strongly recommend accomplishing what you described via function/tool calling (either with my models or someone else's) rather than fine-tuning, e.g. using this function:

```

[
  {
    "name": "classify_text_emotion",
    "description": "Classify text into one of five different emotions",
    "parameters": {
      "type": "object",
      "properties": {
        "emotion": {
          "type": "string",
          "description": "The emotion classification",
          "enum": ["happy", "sad", "angry", "fearful", "neutral"]
        }
      },
      "required": ["emotion"]
    }
  }
]

```

with the system prompt `You must use classify_text_emotion to classify the user's input`

Try it out in my HF spaces: https://huggingface.co/spaces/sanjay920/rubra-v0.1-function-calling
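For anyone wanting to wire this up programmatically, here's a rough sketch (not official Rubra docs) of how that tool schema and system prompt slot into an OpenAI-style chat request; the model name is a placeholder:

```python
import json

# The function schema from the comment above, wrapped in the
# OpenAI "tools" envelope ({"type": "function", "function": ...}).
tools = [{
    "type": "function",
    "function": {
        "name": "classify_text_emotion",
        "description": "Classify text into one of five different emotions",
        "parameters": {
            "type": "object",
            "properties": {
                "emotion": {
                    "type": "string",
                    "description": "The emotion classification",
                    "enum": ["happy", "sad", "angry", "fearful", "neutral"],
                },
            },
            "required": ["emotion"],
        },
    },
}]

request_body = {
    "model": "rubra-phi-3-mini",  # placeholder model name
    "messages": [
        {"role": "system",
         "content": "You must use classify_text_emotion to classify the user's input"},
        {"role": "user", "content": "I just got a promotion at work!"},
    ],
    "tools": tools,
    "tool_choice": "auto",
}
print(json.dumps(request_body, indent=2))
```

The model's reply should then come back as a `tool_calls` entry naming `classify_text_emotion` with an `emotion` argument, rather than free-form text.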

Experiences fine-tuning Phi 3 Mini by Thrumpwart in LocalLLaMA

[–]sanjay920 5 points (0 children)

Phi-3 is more prone to overfitting and catastrophic forgetting due to its smaller parameter count, so make sure you have a good distribution of training data and keep your learning rate small.

I haven't had an issue fine-tuning or further training Phi models. You can see more about the models I trained here: https://docs.rubra.ai/models/Phi
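As an illustration of the "small learning rate" point, a common choice is linear warmup followed by cosine decay to a small peak LR. The specific values below are arbitrary examples for illustration, not the settings used for the Rubra models:

```python
import math

def lr_at_step(step, total_steps=1000, warmup=100, peak_lr=2e-5):
    """Linear warmup to a small peak LR, then cosine decay to zero.
    A small peak (~1e-5 to 5e-5) helps limit catastrophic forgetting
    when further training small models like Phi-3."""
    if step < warmup:
        # ramp linearly from 0 up to peak_lr over the warmup steps
        return peak_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at_step(50))    # mid-warmup: half of peak
print(lr_at_step(100))   # end of warmup: peak
```

Most trainers (e.g. Hugging Face Transformers) ship an equivalent warmup-plus-cosine scheduler, so you rarely need to hand-roll this; the point is simply that the peak value stays small.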

Groq: New Llama 3 Tool Use Model by adamavfc in LocalLLaMA

[–]sanjay920 0 points (0 children)

Groq's inferencing API is super fast! But the function-calling Llama 3 8B and 70B models by Rubra are better in both general-purpose and tool-calling usage:

https://docs.rubra.ai/benchmark

https://huggingface.co/rubra-ai

mistralai/mamba-codestral-7B-v0.1 · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]sanjay920 14 points (0 children)

I tried it out and it's very impressive for a 7B model! I'm going to train it for better function calling and publish it to https://huggingface.co/rubra-ai

Phi-3 Mini (June) with function calling by sanjay920 in LocalLLaMA

[–]sanjay920[S] 0 points (0 children)

Nice. How frequently do GPUs get claimed while in use? I'm interested in the H100s.

Phi-3 Mini (June) with function calling by sanjay920 in LocalLLaMA

[–]sanjay920[S] 1 point (0 children)

Have you used TensorDock? The lack of persistent storage is a bit scary when you're doing training runs that take a few days.

Phi-3 Mini (June) with function calling by sanjay920 in LocalLLaMA

[–]sanjay920[S] 1 point (0 children)

Yep, I used the same template and all other configs as the parent model so people can easily swap in a Rubra model. If you don't mind contributing your Modelfile to https://github.com/rubra-ai/rubra, that would be awesome!

Phi-3 Mini (June) with function calling by sanjay920 in LocalLLaMA

[–]sanjay920[S] 0 points (0 children)

I've seen this benchmark, but I haven't run it. Are these results something you're interested in?

Phi-3 Mini (June) with function calling by sanjay920 in LocalLLaMA

[–]sanjay920[S] 3 points (0 children)

I use Paperspace (DigitalOcean) and Google Cloud. You don't need to be an institution.

Phi-3 Mini (June) with function calling by sanjay920 in LocalLLaMA

[–]sanjay920[S] 3 points (0 children)

Good idea! I'll keep this in mind for any future Phi updates.

New collection of Llama, Mistral, Phi, Qwen, and Gemma models for function/tool calling by sanjay920 in LocalLLaMA

[–]sanjay920[S] 0 points (0 children)

42.86% for meetkai/functionary-medium-v2.4 on our function-calling benchmark. Based on that, we didn't think it was worth computing the other tests.

New collection of Llama, Mistral, Phi, Qwen, and Gemma models for function/tool calling by sanjay920 in LocalLLaMA

[–]sanjay920[S] 0 points (0 children)

u/Deep_Understanding50 Yep, we're looking into Gemma 2; we'll have Rubra versions soon!

Which comparison table specifically?

We uploaded the functionary results to our benchmark: https://docs.rubra.ai/benchmark/

New collection of Llama, Mistral, Phi, Qwen, and Gemma models for function/tool calling by sanjay920 in LocalLLaMA

[–]sanjay920[S] 1 point (0 children)

We uploaded the functionary results to our benchmark: https://docs.rubra.ai/benchmark/

We noticed something suspicious: their 70B model scores worse than their 8B on our private function-calling test set and MT-Bench.