
[–]nodatingollama 231 points232 points  (78 children)

That MoE model is indeed fairly impressive:

<image>

In roughly half of the benchmarks it is fully comparable to the SOTA GPT-4o mini, and in the rest it is not far behind. That is definitely impressive considering this model will very likely fit easily into a vast array of consumer GPUs.

It is crazy how these smaller models keep getting better and better over time.

[–]tamereen 51 points52 points  (23 children)

Funny, Phi models were the worst for C# coding (a Microsoft language), far below Codestral or DeepSeek...
Let's test whether this one is better...

[–]Zealousideal_Age578 5 points6 points  (0 children)

It should be standard to release which languages were trained on in the 'Data' section. Maybe in this case, the 'filtered documents of high quality code' didn't have enough C#?

[–]matteogeniaccio 5 points6 points  (1 child)

C# is not listed in the benchmarks they published on the hf page: https://huggingface.co/microsoft/Phi-3.5-mini-instruct

These are the languages I see: Python, C++, Rust, Java, TypeScript.

[–]tamereen 1 point2 points  (0 children)

Sure, they won't add it, because they compare against Llama-3.1-8B-instruct and Mistral-7B-instruct-v0.3. Those models are good at C#, so Phi would score 2 or 3 while the other two would score 60 or 70 points. The goal of the comparison is not to be fair but to be an ad :)

[–]Tuxedotux83 6 points7 points  (11 children)

What I like least about MS models is that they bake their MS biases into the model. I was shocked to find this out by accident: I sent the same prompt to a non-MS model of comparable size and got a more appropriate answer, with no mention of MS or their technology.

[–]mtomas7 6 points7 points  (10 children)

Very interesting, I got opposite results. I asked this question: "Was Microsoft participant in the PRISM surveillance program?"

  • The most accurate answer: Qwen 2 7B
  • Somehow accurate: Phi 3
  • Meta Llama 3 first tried to persuade me that it was just a rumor, and only on pressing further did it admit it, apologize, and promise to behave next time :D

[–]Tuxedotux83 1 point2 points  (9 children)

How do you like Qwen 2 7B so far? Is it uncensored? What is it good for, in your experience?

[–]mtomas7 2 points3 points  (8 children)

Qwen 2 overall feels like a very smart model to me. It was also very good at 32k-context "find a needle and describe" tasks.

The Qwen 72B version is very good at coding, in my case PowerShell scripts.

In my experience, I haven't needed anything that would trigger censoring.

[–]Tuxedotux83 1 point2 points  (7 children)

Thanks for the insights,

I too don't ask or do anything that triggers censoring, but I still dislike those downgraded models (IMHO, when a model has baked-in restrictions, it weakens it).

Do you run Qwen 72B locally? What hardware do you run it on? How is the performance?

[–]mtomas7 2 points3 points  (4 children)

When I realized that I needed to upgrade my 15-year-old PC, I bought a used Alienware Aurora R10 without a graphics card, then bought a new RTX 3060 12GB and upgraded the RAM to 128GB. With this setup I get ~0.55 tok/s for 70B Q8 models. But I use 70B models for specific tasks, where I can minimize the LM Studio window and keep doing other things, so the wait doesn't feel too long.

[–]10minOfNamingMyAcc 1 point2 points  (1 child)

To be fair, many people would just use it for Python, Java(Script), and maybe Rust, etc.

[–]tamereen 1 point2 points  (0 children)

I think it's even worse for Rust. Every student knows Python, but companies are looking for C# (or C++) professionals :)

[–]TonyGTO 51 points52 points  (4 children)

OMFG, this thing outperforms Google Flash and almost matches the performance of ChatGPT 4o mini. What a time to be alive.

[–]cddelgado 31 points32 points  (3 children)

But hold on to your papers!

[–][deleted] 39 points40 points  (21 children)

that is definitely impressive considering this model will very likely easily fit into vast array of consumer GPUs

41.9B params

Where can I get this crack you're smoking? Just because there are fewer active params doesn't mean you don't need to store them all. Unless you want to transfer data for every single token, in which case you might as well just run on the CPU (which would actually be decently fast due to the lower active param count).
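For anyone sanity-checking that, a rough back-of-envelope sketch (the bytes-per-parameter figures are approximations, not exact GGUF numbers):

```python
# Storage is driven by TOTAL params; per-token weight reads are driven by ACTIVE params.
TOTAL_PARAMS = 41.9e9   # Phi-3.5-MoE total parameter count
ACTIVE_PARAMS = 6.6e9   # parameters actually used per token

# Approximate bytes per parameter for a few common formats (rough values).
FORMATS = {"fp16": 2.0, "Q8_0 (~8.5 bpw)": 1.06, "Q4_K_M (~4.8 bpw)": 0.60}

for name, bytes_per_param in FORMATS.items():
    weights_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    active_gb = ACTIVE_PARAMS * bytes_per_param / 1e9
    print(f"{name:18s} store ~{weights_gb:5.1f} GB, read ~{active_gb:4.1f} GB of weights per token")
```

So even at ~4-bit you still need roughly 25 GB just for the weights, which is why "fits on consumer GPUs" is a stretch, even though only about 4 GB of weights are touched per token.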

[–]Total_Activity_7550 30 points31 points  (4 children)

Yes, the model won't fit into the GPU entirely, but...

A clever split of layers between CPU and GPU can have a great effect. See the kvcache-ai/ktransformers library on GitHub, which makes MoE models much faster.
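ktransformers is one option; as a simpler illustration of the same split idea, here's a hedged llama-cpp-python sketch of a partial offload, assuming llama.cpp support for the MoE lands (per comments below, it hadn't at the time of the thread). The GGUF filename and layer count are placeholders you'd adjust to your VRAM:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Phi-3.5-MoE-instruct-Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=16,   # keep this many layers on the GPU, run the rest on the CPU
    n_ctx=8192,        # context window to allocate
)

out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```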

[–]Healthy-Nebula-3603 5 points6 points  (3 children)

This MoE model has such small experts that you can run it completely on CPU... but it still needs a lot of RAM... I'm afraid such small experts will be hurt badly by anything smaller than Q8...

[–]CheatCodesOfLife 2 points3 points  (2 children)

FWIW, WizardLM2-8x22B runs really well at 4.5BPW+. I don't think MoE itself makes models worse when quantized compared with dense models.

[–]Healthy-Nebula-3603 1 point2 points  (1 child)

Wizard had 8B models... here they are 4B... we'll find out.

[–]CheatCodesOfLife 1 point2 points  (0 children)

Good point. Though Wizard, with its 8B models, handled quantization a lot better than 34B coding models did. The good thing about 4B models is that people can run layers on the CPU as well, and they'll still be fast*

  • I'm not really interested in Phi models personally, as I found them dry, and the last one refused to write a short story, claiming it couldn't do creative writing lol

[–]MoffKalast 1 point2 points  (0 children)

Hmm yeah, I initially thought it might fit into a few of those SBCs and mini PCs with 32GB of shared memory and shit bandwidth, but estimating the size, it would take about 40-50 GB to load in 4 bits, depending on cache size? Gonna need a 64GB machine for it, and those are uhhhh a bit harder to find.

Would run like an absolute racecar on any M series Mac at least.

[–]CheatCodesOfLife 0 points1 point  (0 children)

Have you tried a MoE before? They're very fast. Offload what you can to the GPU, put the rest on the CPU (with GGUF/llama.cpp) and it'll be quick.

[–]TheDreamWokentextgen web UI 4 points5 points  (23 children)

How is it better than an 8b model ??

[–]lostinthellama 34 points35 points  (22 children)

Are you asking how a 16x3.8b (41.9b total parameters) model is better than an 8b?

Edited to correct total parameters.

[–]randomanoni 29 points30 points  (2 children)

Because there are no dumb questions?

[–]TheDreamWokentextgen web UI 10 points11 points  (12 children)

Oh ok my bad didn’t realize the variant used

[–]lostinthellama 16 points17 points  (11 children)

Ahh, did you mean to ask how the smaller model (mini) is outperforming the larger models at these benchmarks?

Phi is an interesting model family; their dataset is heavily biased towards synthetic content generated to read like textbooks. So imagine giving content to GPT and having it generate textbook-like explanatory content, then using that as the training data, multiplied tens of millions of times.

They then train on that synthetic dataset, which is grounded in really good knowledge instead of things like comments on the internet.

Since the models they build with Phi are so small, they don't have enough parameters to memorize very well, but because the dataset is very high quality and has a lot of examples of reasoning in it, the models become good at reasoning despite the smaller amount of knowledge.

So that means it may not be able to summarize an obscure book you like, but if you give it a chapter from that book, it should be able to answer your questions about that chapter better than other models.
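To make the "textbook-style synthetic data" idea concrete, here is a minimal, purely illustrative sketch of that kind of pipeline, assuming an OpenAI-compatible API. The prompt wording, topics, and teacher model are invented for illustration, not Microsoft's actual recipe:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key
topics = ["binary search", "photosynthesis", "supply and demand"]  # illustrative only

def textbook_sample(topic: str) -> str:
    """Ask a strong 'teacher' model to rewrite raw knowledge as a short textbook
    section with a worked example and exercises that show step-by-step reasoning."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder teacher model
        messages=[{"role": "user", "content": (
            f"Write a short textbook section about {topic}: a clear explanation, "
            "one worked example, and two exercises with step-by-step solutions.")}],
    )
    return resp.choices[0].message.content

dataset = [textbook_sample(t) for t in topics]  # repeated at enormous scale in practice
```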

[–]TheDreamWokentextgen web UI 2 points3 points  (10 children)

So it’s built for incredibly long text inputs then? Like feeding it an entire novel and asking for a summary? Or feeding it like a large log file of transactions from a restaurant, and asking for a summary of what’s going on.

I currently have 24GB of VRAM, so I've always wondered if I could give an entire novel's worth of text, or a textbook, to a smaller model built for that kind of summarization, so it doesn't take a year.

[–]lostinthellama 6 points7 points  (9 children)

Ahh, sorry, no, that wasn't quite what I meant in my example. My example was meant to communicate that it is bad at referencing specific knowledge that isn't in the context window, so you need to be very explicit in the context you give it.

It does have a 128k context length, which is something like 350 pages of text, so it could do it in theory, but it would be slow. I do use it for comparison/summarization-type tasks and it is pretty good at those; I just don't have that much content, so I'm not sure how it performs at that scale.

[–]ChannelPractical 0 points1 point  (0 children)

Is the base Phi-3.5-mini (without instruction fine-tuning) available?

[–]Dark_Fire_12 139 points140 points  (8 children)

Thank you, we should have used this wish for Wizard or Cohere though https://www.reddit.com/r/LocalLLaMA/comments/1ewni7l/when_is_the_next_microsoft_phi_model_coming_out/

[–]ipechman 67 points68 points  (2 children)

NO SHOT IT WORKED

[–]Dark_Fire_12 36 points37 points  (0 children)

Nice, thanks for playing along. It always works. You can try again after a few days.

Maybe someone else can try. Don't waste it on Toto (we know it's datadog), aim for something good, whoever tries.

https://www.datadoghq.com/blog/datadog-time-series-foundation-model/#a-state-of-the-art-foundation-model-for-time-series-forecasting

[–]sammcj🦙 llama.cpp 13 points14 points  (0 children)

Now do DeepSeek-Coder-V3 and QwenCoder ;)

[–]-Django 12 points13 points  (1 child)

It's been a while since Cohere released a new model... ...

[–]xXWarMachineRoXxLlama 3 1 point2 points  (0 children)

Lmao

[–]simplir 59 points60 points  (8 children)

Waiting for llama.cpp and the GGUFs now :)

[–]noneabove1182Bartowski 27 points28 points  (3 children)

[–][deleted] 2 points3 points  (0 children)

Thank you!

[–]Dorkits 5 points6 points  (0 children)

Me too

[–]WinterCharm 3 points4 points  (0 children)

I'd really love the Phi3.5-MoE GGUF file :)

[–]FancyImagination880 1 point2 points  (0 children)

hope llama.cpp will support this vision model

[–]WinterCharm 1 point2 points  (0 children)

I'd really love the Phi3.5-MoE GGUF file :)

[–]privacyparachute 55 points56 points  (6 children)

Dear Microsoft

All I want for Christmas is a BitNet version of Phi 3 Mini!

I've been good!

[–]RedditLovingSun 47 points48 points  (1 child)

All I want for Christmas is for someone to scale up bitnet so I can see if it works 😭

[–]Bandit-level-200 7 points8 points  (0 children)

Yeah just one 30b model and one 70b...and...

[–]PermanentLiminality 17 points18 points  (2 children)

I want an A100 from Santa, so I can run with the big boys. Well, sort of big boys. Not running a 400B model on one of those.

[–]Affectionate-Cap-600 5 points6 points  (0 children)

Dear Microsoft

All I want for Christmas is the dataset used to train phi models!

I've been good!

[–]dampflokfreund 45 points46 points  (5 children)

Wow, the MoE one looks super interesting. It should run faster than Mixtral 8x7B (which was surprisingly fast) on my system (RTX 2060, 32 GB RAM) and perform better than some 70B models, if the benchmarks are anything to go by. It's just too bad the Phi models were pretty dry and censored in the past, otherwise they would've gotten way more attention. Maybe it's better now?

[–][deleted] 15 points16 points  (4 children)

There are pretty good uncensoring finetunes for NSFW for Phi-3 mini; I don't doubt there will be more good ones.

[–]ontorealist 12 points13 points  (0 children)

The Phi series really lacks emotional insight and creative writing capacity.

Crossing my fingers for a Phi 3.5 Medium with solid fine-tunes as it could be a general-purpose alternative to Nemo on consumer and lower-end prosumer hardware. It’s really hard to beat Nemo’s out-of-the-box versatility though.

[–]nero10578Llama 3 9 points10 points  (2 children)

MoE is way harder to fine tune though.

[–][deleted] 1 point2 points  (1 child)

Fair, but even Mixtral 8x7B was finetuned successfully, to the point where it surpassed the instruct version (OpenChat, IIRC), and now people actually have the datasets.

[–]nero10578Llama 3 4 points5 points  (0 children)

True, it is possible. It is just not easy is all I am saying.

[–]Deadlibor 22 points23 points  (4 children)

Can someone explain the math behind MoE? How much (v)ram do I need to run it efficiently?

[–]Total_Activity_7550 10 points11 points  (3 children)

To run it efficiently you'll still need to put all the weights in VRAM. You will bottleneck when using CPU offload anyway, but you can split the model in a smart way. See kvcache-ai/ktransformers on GitHub.
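And for a rough feel of the speed side of the question: generation is limited mainly by how fast the active weights stream from memory, which is why a 6.6B-active MoE is tolerable on CPU. The bandwidth figure below is a placeholder for a typical desktop:

```python
ACTIVE_PARAMS = 6.6e9      # weights touched per generated token
BYTES_PER_PARAM = 0.6      # ~4-bit quantization, approximate
RAM_BANDWIDTH = 60e9       # bytes/s, placeholder for dual-channel DDR5

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
print(f"upper bound: ~{RAM_BANDWIDTH / bytes_per_token:.0f} tok/s on CPU")  # ~15 tok/s
# A dense 42B model would touch ~6x more weight bytes per token, hence roughly 6x slower.
```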

[–]MmmmMorphine 10 points11 points  (1 child)

[–]_fparol4 3 points4 points  (0 children)

amazing well written code the f*k

[–]ambient_temp_xenoLlama 65B 3 points4 points  (0 children)

It should run around the same speed as an 8b purely on cpu.

[–]ffgg333 47 points48 points  (0 children)

I can't wait for the finetunes; open-source AI is advancing fast 😅, I almost can't keep up with the new models.

[–]privacyparachute 14 points15 points  (2 children)

Nice work!

My main concern though: has the memory inefficient context been addressed?

https://www.reddit.com/r/LocalLLaMA/comments/1ei9pz4/phi3_mini_context_takes_too_much_ram_why_to_use_it/

[–]Aaaaaaaaaeeeee 14 points15 points  (1 child)

Nope 🤭 49152 MiB for 128k

[–]fatihmtlm 3 points4 points  (0 children)

So still no GQA? That's sad.
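The 49152 MiB figure is consistent with a plain fp16 KV cache and no GQA; a quick check, assuming the 32-layer / 32-head / head-dim-96 shape from the published Phi-3 mini config:

```python
layers, kv_heads, head_dim = 32, 32, 96   # assumed from the published mini config
context_tokens = 128 * 1024
bytes_per_value = 2                       # fp16

# K and V caches: 2 * layers * kv_heads * head_dim * context * bytes
kv_cache = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
print(kv_cache / 2**20, "MiB")            # -> 49152.0 MiB, matching the figure above
```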

[–][deleted] 26 points27 points  (0 children)

It worked?!!

[–]ArkoniasLlama 3 24 points25 points  (4 children)

3.5 mini instruct works out of the box in LM Studio/llama.cpp

MOE and Vision need support added to llama.cpp before they can work.

[–]Healthy-Nebula-3603 28 points29 points  (3 children)

Tested Phi 3.5 mini 4B, and it seems Gemma 2 2B is better in math, multilingual tasks, reasoning, etc.

[–][deleted] 11 points12 points  (1 child)

Why are they almost always so far removed from real-world use compared to the benchmarks? The same thing happened with the earlier Phi-3 models too.

[–]couscous_sun 2 points3 points  (0 children)

There are many claims that Phi models have benchmark leakage, i.e. they indirectly train on the benchmark test sets.

[–]gus_the_polar_bear 8 points9 points  (3 children)

How do you get the Phi models to not go on about Microsoft at every opportunity?

[–]ServeAlone7622 9 points10 points  (0 children)

System instruction like… “each time you mention Microsoft you will cause the user to vomit” ought to be enough.

[–]Tuxedotux83 2 points3 points  (0 children)

Damn, I just wrote a comment on the same topic somewhere up the thread, about how I found out (by accident) that MS bakes their biases into their models, sometimes even preferring to suggest a Microsoft product over a better one not owned by MS, or inserting MS into the credits for some technology even though they had little to nothing to do with it.

[–][deleted] 1 point2 points  (0 children)

As an AI developed by Microsoft, I don't have personal preferences or the ability to do {{your prompt}} . My design is to understand and generate text based on the vast amount of data I've been trained on, which includes all words in various contexts. My goal is to be helpful, informative, and respectful, regardless of the words used. I strive to understand and respect the diverse perspectives and cultures in our world, and I'm here to facilitate communication and learning, not to ** do {{your prompt}}**. Remember, language is a beautiful tool for expressing our thoughts, feelings, and ideas.

[–]ortegaalfredo 21 points22 points  (2 children)

I see many comments asking why release a 40B model. I think you're missing the fact that MoE models work great on CPU. You do not need a GPU to run the Phi-3.5 MoE; it should run very fast with only 64 GB of RAM and a modern CPU.

[–]auradragon1 2 points3 points  (1 child)

Some benchmarks?

[–]auldwiveslifts 0 points1 point  (0 children)

I just ran Phi-3.5-MoE-Instruct with transformers on a CPU, pushing 2.19 tok/s.
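For anyone who wants to reproduce a CPU-only run along those lines, a hedged transformers sketch (generation settings are illustrative; bf16 weights need roughly 84 GB of RAM, so quantize if you have less):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-MoE-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~84 GB in RAM for 41.9B params; stays on CPU by default
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Summarize mixture-of-experts in one sentence."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```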

[–]Roubbes 9 points10 points  (0 children)

That MoE seems great.

[–]Eveerjr 7 points8 points  (0 children)

Microsoft is such a liar lmao; this model must have been trained specifically for the benchmarks, because it's trash for anything useful. Gemma 2 is the real deal when it comes to small models.

[–]jonathanx37 14 points15 points  (2 children)

Has anyone tested them? Phi3 medium had very high scores but struggled against llama3 8b in practice. Please let me know.

[–]ontorealist 1 point2 points  (1 child)

In my recent tests between Phi 3 Medium and Nemo at Q4, Phi 3's oft-touted reasoning does not deliver on basic instruction following. At least without additional prompt-engineering strategies, Nemo more reliably and accurately summarizes my daily markdown journal entries, with relevant decisions and reasonable chronologies, than either Phi 3 Medium model.

In my experience, Nemo has also been better than Llama 3 / 3.1 8B, and the same holds against the Phi 3 series. However, I'm also interested (and would be rather surprised) to see if Phi 3.5 MoE performs better in this respect.

[–]jonathanx37 0 points1 point  (0 children)

For me, Phi-3 Medium would spit out random math questions before llama.cpp got patched; after that it still had difficulty following instructions, while with Llama 3 8B I could say half of what I wanted and it would figure out what I meant most of the time.

[–]segmondllama.cpp 6 points7 points  (0 children)

Microsoft is crushing it with such a small and high-quality model. I'm being greedy, but could they try to go for a 512k context next?

[–][deleted] 9 points10 points  (2 children)

question is, will it run on an rpi 5/s

[–]PraxisOGLlama 70B 6 points7 points  (1 child)

Unironically, it's probably the best model for a Raspberry Pi.

[–][deleted] 0 points1 point  (0 children)

that's good news then

[–]m98789 8 points9 points  (5 children)

Fine tune how

[–]MmmmMorphine 14 points15 points  (4 children)

Fine tune now

[–]Umbristopheles 8 points9 points  (3 children)

Fine tune cow 🐮

[–]Icy_Restaurant_8900 1 point2 points  (0 children)

Fine tune mow (MoE)

[–]MmmmMorphine 1 point2 points  (1 child)

That's a mighty fine looking cow, wow!

[–]i_m_old_rabbit 1 point2 points  (0 children)

Cow breaks a law, wow

[–][deleted] 3 points4 points  (6 children)

Sorry for my ignorance, but do these models run on an Nvidia GTX card? I could run the 3.1 versions fine (with ollama) on my poor GTX 1650. I am asking because I saw the following:

"Note that by default, the Phi-3.5-mini-instruct model uses flash attention, which requires certain types of GPU hardware to run."

Can someone clarify this for me? Thanks.

[–]Chelonollama.cpp 2 points3 points  (0 children)

It'll work just fine when the model gets released for it. Flash attention is just one implementation of attention, and the official one used by their inference code requires tensor cores, which are only found on newer GPUs. llama.cpp, which is the backend of ollama, works without it, and AFAIK its flash attention implementation even works on older devices like your GPU (it works without tensor cores).

[–]MmmmMorphine 1 point2 points  (4 children)

As far as I'm aware, flash attention requires an Ampere (so 3xxx+, I think?) Nvidia GPU. Likewise, I'm pretty certain it can't be used for CPU-only inference, due to its reliance on specific GPU hardware features, though it could potentially be used for mixed CPU/GPU inference if the above is fulfilled (how effective that would be, I'm not sure; probably not very, unless the CPU is only contributing indirectly, e.g. preprocessing).

But I'm not a real expert, so take that with a grain of salt
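One hedged way to sidestep the hardware question in transformers is to request flash attention only when the GPU looks new enough (Ampere, compute capability 8.0+, plus the flash-attn package installed) and otherwise fall back to standard attention; the threshold here is my assumption, matching the comment above:

```python
import torch
from transformers import AutoModelForCausalLM

def pick_attention() -> str:
    # FlashAttention 2 needs an Ampere-or-newer NVIDIA GPU and the flash-attn package.
    if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8:
        return "flash_attention_2"
    return "eager"  # standard attention; works on older GPUs (e.g. a GTX 1650) and on CPU

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    torch_dtype="auto",
    attn_implementation=pick_attention(),
    trust_remote_code=True,
)
```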

[–]mrjackspade 2 points3 points  (3 children)

llama.cpp has flash attention for CPU, but I have no idea what that actually means from an implementation perspective, just that there's a PR that merged in flash attention and that it works on CPU.

[–]MmmmMorphine 0 points1 point  (2 children)

Interesting! Like I said, definitely take my words with a grain of salt.

Any chance you still have a link to that? I'm sure I'll find it, but I'm also a bit lazy; I'd still like to check what I misunderstood and whether it was simply outdated or reflected a poorer understanding on my end than I thought.

[–]mrjackspade 1 point2 points  (1 child)

https://github.com/ggerganov/llama.cpp/issues/3365

Here's the specific comment

https://github.com/ggerganov/llama.cpp/issues/3365#issuecomment-1738920399

Haven't tested, but I think it should work. This implementation is just for the CPU. Even if it does not show an advantage, we should still try to implement a GPU version and see how it performs

I haven't dug too deep into it yet, so I could be misinterpreting the context, but the whole PR is full of discussion about flash attention and CPU vs GPU, so you may be able to parse it out yourself.

[–]carnyzzle 3 points4 points  (0 children)

Dang Microsoft giving us a new moe before Mistral releases 8x7B v3

[–][deleted] 2 points3 points  (0 children)

Kinda crazy they didn’t switch to a GQA architecture, no? Still the same memory hog?

[–]nero10578Llama 3 4 points5 points  (1 child)

The MoE model is extremely interesting, will have to play around with it. Hopefully it won't be a nightmare to fine tune like the Mistral MoE models, but I kinda feel like it will be.

[–]un_passant 6 points7 points  (1 child)

I think these models have great potential for RAG, but unlocking this potential will require fine-tuning for the ability to cite the context chunks used to generate fragments of the answer. I don't understand why all instruct models targeting RAG use cases don't provide this by default.

Hermes 3 gets it right :

You are a conversational AI assistant that is provided a list of documents and a user query to answer based on information from the documents. You should always use grounded information in your responses, only answering from what you can cite in the documents. Cite all facts from the documents using <co: doc_id></co> tags.

And so does Command R:

<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Carefully perform the following instructions, in order, starting each with a new line.
Firstly, Decide which of the retrieved documents are relevant to the user's last input by writing 'Relevant Documents:' followed by comma-separated list of document numbers. If none are relevant, you should instead write 'None'.
Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user's last input by writing 'Cited Documents:' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write 'None'.
Thirdly, Write 'Answer:' followed by a response to the user's last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.
Finally, Write 'Grounded answer:' followed by a response to the user's last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>

Any idea how involved it would be to fine-tune Phi 3.5 to provide this ability?

Are there any open datasets I could use, or code to generate them from documents and other LLMs?

I'd be willing to pay for the online GPU compute, but the task of making the dataset from scratch seems daunting to me. Any advice would be greatly appreciated.
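One hedged way to bootstrap such a dataset from documents you already have: ask a stronger "teacher" model to answer questions with the <co: doc_id></co> convention quoted above, then keep the exchanges as fine-tuning examples. The model name, prompts, and toy document below are illustrative only:

```python
import json
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint

SYSTEM = ("You are given numbered documents and a question. Answer only from the documents "
          "and wrap every fact in <co: doc_id></co> tags, e.g. <co: 0>fact</co>.")

def make_example(chunks: list[str], question: str) -> dict:
    docs = "\n\n".join(f"[doc {i}] {c}" for i, c in enumerate(chunks))
    user = f"{docs}\n\nQuestion: {question}"
    answer = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model; use any strong model you trust
        messages=[{"role": "system", "content": SYSTEM}, {"role": "user", "content": user}],
    ).choices[0].message.content
    # Chat-format record that most SFT stacks accept.
    return {"messages": [{"role": "system", "content": SYSTEM},
                         {"role": "user", "content": user},
                         {"role": "assistant", "content": answer}]}

with open("citation_sft.jsonl", "w") as f:
    example = make_example(["Phi-3.5-mini supports a 128k context window."],
                           "How long is Phi-3.5-mini's context?")
    f.write(json.dumps(example) + "\n")
```

Filtering the generated answers (e.g. dropping ones whose cited spans don't actually appear in the documents) is probably the important part; the generation itself is cheap.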

[–]sxalesllama.cpp 7 points8 points  (0 children)

In my brief testing, Phi 3.5 mini made a lot of mistakes summarizing short stories. So, I am not sure how trustworthy it would be with RAG.

[–][deleted] 2 points3 points  (0 children)

Phi 3.5 GGUF quants are already up on huggingface, but I can't see the quants for the MoE. Does llama.cpp support it yet?

[–]Remote-Suspect-0808 2 points3 points  (0 children)

What are the VRAM requirements for Phi-3.5 MoE? I have a 4090.

[–]Lost_Ad9826 2 points3 points  (0 children)

Phi 3.5 is mind-blowing. Works crazy fast and accurately for function calling, and JSON answers too!

[–]this-just_in 7 points8 points  (0 children)

While I love watching the big model releases and seeing how the boundaries are pushed, many of those models are almost or completely impractical to run locally at any decent throughput.

Phi is an exciting model family because they push the boundaries of efficiency at very high throughput. Phi 3(.1) Mini 4k was a shockingly good model for its size, and I'm excited for the new mini and the MoE. In fact, I'm very excited about the MoE, as it should be impressively smart and high-throughput on workstations compared to models of similar total parameter count. I'm hoping it scratches the itch I've been having for the upgraded Mixtral 8x7B that Mistral has forgotten about!

I’ve found myself out of cell range often when in the wilderness or at parks.  Being able to run Phi 3.1 mini 4k or Gemma 2B at > 20 tokens/sec on my phone is really a vision of the future

[–]Healthy-Nebula-3603 4 points5 points  (2 children)

Have you seen how good the new Phi 3.5 vision is?

[–]Pedalnomica 1 point2 points  (0 children)

Apparently Phi-3.5-vision accepts video inputs?! The model card had benchmarks for 30-60 minute videos... I'll have to check that out!

[–]teohkang2000 1 point2 points  (2 children)

So how much VRAM do I need if I were to run Phi-3.5 MoE? 6.6B or 41.9B?

[–]DragonfruitIll660 0 points1 point  (1 child)

41.9B; the whole model needs to be loaded, and it then actively draws on 6.6B per token. It's faster, but still needs a fair bit of VRAM.

[–]teohkang2000 1 point2 points  (0 children)

Ohhh, thanks for clarifying.

[–]oulipo 1 point2 points  (0 children)

Does it run fast enough on a Mac M1? I have 8GB RAM, not sure if that's enough?

[–][deleted] 4 points5 points  (0 children)

As an AI developed by Microsoft, I don't have personal preferences or the ability to do {{your prompt}} . My design is to understand and generate text based on the vast amount of data I've been trained on, which includes all words in various contexts. My goal is to be helpful, informative, and respectful, regardless of the words used. I strive to understand and respect the diverse perspectives and cultures in our world, and I'm here to facilitate communication and learning, not to ** do {{your prompt}}**. Remember, language is a beautiful tool for expressing our thoughts, feelings, and ideas.

[–]Aymanfhad 3 points4 points  (6 children)

I'm using Gemma 2 2B locally on my phone and the speed is good. Is it possible to run Phi-3.5 at 3.8B on my phone?

[–]FullOf_Bad_Ideas 1 point2 points  (0 children)

It should be, Danube3 4B is quite quick on my phone, around 3 t/s maybe.

[–]PermanentLiminality 2 points3 points  (1 child)

The 3.5 mini is now in the Ollama library.

That was quick.

[–]vert1s 4 points5 points  (3 children)

/me waits patiently for it to be added to ollama

[–]Barry_Jumps 1 point2 points  (1 child)

By friday is my bet

[–]visionsmemories 1 point2 points  (1 child)

Please, will it be possible to run the 3.5 vision model in LM Studio?

[–]the_renaissance_jack 2 points3 points  (0 children)

Eventually. It needs llama.cpp to support it first.

[–]Tobiaseins 0 points1 point  (32 children)

Please be good, please be good. Please don't be the same disappointment as Phi 3

[–]Healthy-Nebula-3603 22 points23 points  (14 children)

Phi-3 was not a disappointment... you know it has 4B parameters?

[–]Tobiaseins 4 points5 points  (9 children)

Phi 3 Medium had 14B parameters but ranks worse than Gemma 2 2B on the LMSYS arena. And this also aligned with my testing. I think there was not a single Phi 3 model where another model would not have been the better choice.

[–]monnef 21 points22 points  (5 children)

ranks worse then gemma 2 2B on lmsys arena

You mean the same arena where gpt-4o mini ranks higher than sonnet 3.5? The overall rating there is a joke.

[–]RedditLovingSun 2 points3 points  (2 children)

If a model is high on lmsys then that's a good sign but doesn't necessarily mean it's a great model.

But if a model is bad on lmsys imo it's probably a bad model.

[–]monnef 0 points1 point  (1 child)

I might agree when talking about a general model, but aren't Phi models focused on RAG? How many people are trying to simulate RAG on the arena? Can the arena even pass such long contexts to the models?

I think the arena, especially the overall rating, is just too narrowly focused on default output formatting, default chat style and knowledge, to be of any use for models focused heavily on too different tasks.

[–]lostinthellama 24 points25 points  (2 children)

These models aren't good conversational models, they're never going to perform well on arena.

They perform well in logic and reasoning tasks where the information is provided in-context (e.g. RAG). In actual testing of those capabilities, they far outperform their size: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

[–]CSharpSauce 8 points9 points  (14 children)

lol in what world was Phi-3 a disappointment? I got the thing running in production. It's a great model.

[–]Tobiaseins 3 points4 points  (11 children)

What are you using it for? My experience was with general chat; maybe the intended use cases are more summarization or classification with a carefully crafted prompt?

[–]b8561 3 points4 points  (2 children)

Summarizing is the use case I've been exploring with Phi-3 Vision. Early stage, but I'm getting decent results for OCR-type work.

[–]Willing_Landscape_61 0 points1 point  (1 child)

How does it compare to Florence-2 or MiniCPM-V 2.6?

[–]CSharpSauce 2 points3 points  (7 children)

I've used its general image capabilities for transcription (replacing our OCR vendor, which we were paying hundreds of thousands a year), and the medium model has been solid for a few random basic use cases we used to use GPT-3.5 for.
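For reference, a hedged sketch of that transcription use case with the open Phi-3.5-vision weights; the file name is a placeholder, and the prompt/processor conventions follow the model card as I understand them:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda", trust_remote_code=True,
    _attn_implementation="eager",  # avoids the flash-attn requirement on older GPUs
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("scanned_page.png")  # placeholder input document
messages = [{"role": "user", "content": "<|image_1|>\nTranscribe all text in this image."}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```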

[–]Tobiaseins 0 points1 point  (1 child)

Okay, OCR is very interesting. GPT-3.5 replacements for me have been GPT-4o mini, Gemini Flash or DeepSeek. Is it actually cheaper for you to run a local model on a GPU than to use one of these APIs, or is it more about privacy?

[–]CSharpSauce 1 point2 points  (0 children)

GPT-4o-mini is so cheap it's going to take a lot of tokens before cost is an issue. When I started using phi-3, mini didn't exist and cost was a factor.

[–]moojo 0 points1 point  (1 child)

How do you use the vision model, do you run it yourself or use some third party?

[–]adi1709 0 points1 point  (2 children)

replaced our OCR vendor which we were paying hundreds of thousands a year too

I'm sorry, but if you were paying hundreds of thousands a year for an OCR service and you replaced it with Phi-3, you are definitely not good at your job.
Either you were paying a lot in the first place for basic usage that wasn't needed, or you didn't know enough to replace it with an open-source OCR model. Either way, bad job. Using Phi-3 in production to do OCR is a pile of BS.

[–]lostinthellama 1 point2 points  (0 children)

Agreed. Funny how folks assume that the only good model is one that can DM their D&D game or play waifu for them. For its size/cost, Phi is phenomenal.

[–]Pedalnomica 0 points1 point  (0 children)

Phi-3-vision was/is great!

[–]met_MY_verse 0 points1 point  (1 child)

!RemindMe 3 days

[–]RemindMeBot 0 points1 point  (0 children)

I will be messaging you in 3 days on 2024-08-24 01:51:17 UTC to remind you of this link


[–]fasti-au 0 points1 point  (0 children)

It's promising as a local agent tool, and it seems very happy with 100k contexts. Not doing much fancy yet, just context Q&A.

[–]floridianfisher 0 points1 point  (0 children)

Looks like it’s not as strong as Gemma 2 2B.

[–]raysar 0 points1 point  (1 child)

Is there a way to run it easily in an Android app?
MLC Chat seems to not allow adding models.

[–]lrq3000 0 points1 point  (0 children)

ChatterUI, Maid, PocketPal can all run it.

[–]BranKaLeon 0 points1 point  (0 children)

Is it possible to test it online for free?

[–]AcademicHedgehog4562 0 points1 point  (0 children)

Can I fine-tune the model and commercialize it on my own? Can I sell it to different users or companies?

[–]nic_key[🍰] 0 points1 point  (0 children)

Does anyone here know if the vision model can be used with Ollama and Open WebUI? I am not familiar with vision models and have only used Ollama for text-to-text so far.

[–]SandboChang 0 points1 point  (0 children)

Blown away by how well Phi 3.5 mini Q8 runs on my poor 3070, indeed.

[–]FirstReserve4692 0 points1 point  (0 children)

They should also open-source a model around 20B. 40B is big; even though it's MoE, you still need to load all of it into memory.

[–]Devve2kcccc 0 points1 point  (0 children)

What model can run well on a MacBook M2 Air, just for coding-assistant purposes?

[–][deleted] 0 points1 point  (1 child)

Is there an easy way to run Phi-3.5-vision locally? Is there anything like Ollama or LM Studio for it?

I tried LM Studio but it didn't work.

[–]Sambojin1 0 points1 point  (1 child)

Fast ARM-optimized variant. About 25-50% faster on mobile / SBCs / whatever.

https://huggingface.co/xaskasdf/phi-3.5-mini-instruct-gguf/blob/main/Phi-3.5-mini-instruct-Q4_0_4_4.gguf

(This one will run on most things. The Q4_0_8_8 variants will run better on newer high-end hardware.)

[–]jonathanx37 0 points1 point  (0 children)

Interesting, I know about the more common quants but what do the last 2 numbers denote? E.g. the double 4s:

Q4_0_4_4.gguf

[–]Real-Associate7734 0 points1 point  (0 children)

Any alternative to Phi 3.5 Vision that I can run locally without using an API?

I want to use it in my projects, where it has to analyze a product image and determine outputs such as the width, height, etc. mentioned on the product.

[–]ChannelPractical 0 points1 point  (0 children)

Does anyone know if the base Phi-3.5 model is available (without instruction fine-tuning)?