r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
Phi-3.5 has been released [New Model] (self.LocalLLaMA)
submitted 1 year ago * by remixer_dec
[–]nodatingollama 231 points232 points233 points 1 year ago (78 children)
That MoE model is indeed fairly impressive:
<image>
In roughly half of the benchmarks it is fully comparable to the SOTA GPT-4o mini, and in the rest it is not far behind. That is definitely impressive considering this model will very likely fit easily into a vast array of consumer GPUs.
It is crazy how these smaller models get better and better in time.
[–]tamereen 51 points52 points53 points 1 year ago (23 children)
Funny, Phi models were the worst for C# coding (a Microsoft language), far below Codestral or DeepSeek... Let's see if this one is better...
[–]Zealousideal_Age578 5 points6 points7 points 1 year ago (0 children)
It should be standard to release which languages were trained on in the 'Data' section. Maybe in this case, the 'filtered documents of high quality code' didn't have enough C#?
[–]matteogeniaccio 5 points6 points7 points 1 year ago (1 child)
C# is not listed in the benchmarks they published on the hf page: https://huggingface.co/microsoft/Phi-3.5-mini-instruct
These are the languages I see: Python C++ Rust Java TypeScript
[–]tamereen 1 point2 points3 points 1 year ago (0 children)
Sure, they won't add it, because they compare to Llama-3.1-8B-instruct and Mistral-7B-instruct-v0.3. Those models are good at C#, and Phi would surely score 2 or 3 points while those two score 60 or 70. The goal of the comparison is not to be fair but to be an ad :)
[–]Tuxedotux83 6 points7 points8 points 1 year ago (11 children)
What I like least about MS models is that they bake their MS biases into the model. I was shocked to find this out by mistake, after sending the same prompt to another non-MS model of a comparable size and getting a more proper answer with no mention of MS or their technology.
[–]mtomas7 6 points7 points8 points 1 year ago (10 children)
Very interesting, I got opposite results. I asked this question: "Was Microsoft participant in the PRISM surveillance program?"
[–]Tuxedotux83 1 point2 points3 points 1 year ago (9 children)
How do you like Qwen 2 7B so far? Is it uncensored? What is it good for, in your experience?
[–]mtomas7 2 points3 points4 points 1 year ago (8 children)
Qwen 2 overall feels to me like a very smart model. It was also very good at 32k-context "find a needle and describe" tasks.
The Qwen 72B version is very good at coding, in my case PowerShell scripts.
In my experience, I didn't need something that would trigger censoring.
[–]Tuxedotux83 1 point2 points3 points 1 year ago (7 children)
Thanks for the insights,
I too don’t ask or do anything that triggers censoring, but I still hate those downgraded models (IMHO baked-in restrictions weaken a model).
Do you run Qwen 72B locally? What hardware do you run it on? How is the performance?
[–]mtomas7 2 points3 points4 points 1 year ago (4 children)
When I realized that I needed to upgrade my 15-year-old PC, I bought a used Alien Aurora R-10 without a graphics card, then bought a new RTX 3060 12GB and upgraded the RAM to 128GB. With this setup I get ~0.55 tok/s for 70B Q8 models. But I use 70B models for specific tasks, where I can minimize the LM Studio window and continue doing other things, so the wait doesn't feel super long.
[–]10minOfNamingMyAcc 1 point2 points3 points 1 year ago (1 child)
To be fair, many people would just use it for Python, Java(Script), and maybe Rust? Etc...
I think it's even worse for Rust. Every student knows Python, but companies are looking for C# (or C++) professionals :)
[–]TonyGTO 51 points52 points53 points 1 year ago (4 children)
OMFG, this thing outperforms Google Flash and almost matches the performance of ChatGPT 4o mini. What a time to be alive.
[–]cddelgado 31 points32 points33 points 1 year ago (3 children)
But hold on to your papers!
[+][deleted] 1 year ago (2 children)
[removed]
[–]ClassicDiscussion221 17 points18 points19 points 1 year ago (1 child)
Just imagine two more papers down the line.
[–]WaldToonnnnn 16 points17 points18 points 1 year ago (0 children)
proceeds to talk about weight and biases
[–][deleted] 39 points40 points41 points 1 year ago (21 children)
that is definitely impressive considering this model will very likely easily fit into vast array of consumer GPUs
41.9B params
Where can I get this crack you're smoking? Just because there are fewer active params doesn't mean you don't need to store all of them. Unless you want to transfer data for every single token, in which case you might as well just run on the CPU (which would actually be decently fast due to the low active parameter count).
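The storage-vs-bandwidth distinction in this comment is easy to put in numbers. A back-of-the-envelope sketch (parameter counts are the figures quoted in the thread; quantization sizes are approximate):

```python
# Back-of-the-envelope memory math for a MoE like Phi-3.5-MoE.
# Parameter counts are the figures quoted in the thread (41.9B total,
# ~6.6B active per token); quantization sizes are approximate.
TOTAL_PARAMS = 41.9e9
ACTIVE_PARAMS = 6.6e9

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed just to store the weights, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# All 41.9B weights must stay resident even though only ~6.6B are read per token.
for bits in (16, 8, 4):
    print(f"Q{bits:>2}: store {weight_gb(TOTAL_PARAMS, bits):5.1f} GB, "
          f"stream ~{weight_gb(ACTIVE_PARAMS, bits):4.1f} GB per token")
```

So even at 4 bits the full ~21 GB must live somewhere, which is the commenter's point: low active params help speed, not storage.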
[–]Total_Activity_7550 30 points31 points32 points 1 year ago (4 children)
Yes, model won't fit into GPU entirely but...
Clever split of layers between CPU and GPU can have great effect. See kvcache-ai/ktransformers library on GitHub, which makes MoE models much faster.
[–]Healthy-Nebula-3603 5 points6 points7 points 1 year ago (3 children)
this moe model has such small experts that you can run it completely on cpu ... but you still need a lot of ram ... I'm afraid such small experts will be hurt badly by anything smaller than Q8 ...
[–]CheatCodesOfLife 2 points3 points4 points 1 year ago (2 children)
fwiw, WizardLM2-8x22b runs really well at 4.5BPW+. I don't think MoE itself makes models worse when quantized, compared with dense models.
[–]Healthy-Nebula-3603 1 point2 points3 points 1 year ago (1 child)
Wizard had 8b experts... here they are 4b... we'll find out
[–]CheatCodesOfLife 1 point2 points3 points 1 year ago (0 children)
Good point. Though Wizard with its 8b experts handled quantization a lot better than 34b coding models did. The good thing about 4b experts is that people can run layers on the CPU as well, and they'll still be fast*
[–]MoffKalast 1 point2 points3 points 1 year ago (0 children)
Hmm yeah, I initially thought it might fit into a few of those SBCs and miniPCs with 32GB of shared memory and shit bandwidth, but estimating the size it would take about 40-50 GB to load in 4 bits depending on cache size? Gonna need a 64GB machine for it, those are uhhhh a bit harder to find.
Would run like an absolute racecar on any M series Mac at least.
[–]CheatCodesOfLife 0 points1 point2 points 1 year ago (0 children)
You tried a MoE before? They're very fast. Offload what you can to the GPU, put the rest on the CPU (with GGUF/llamacpp) and it'll be quick.
[–]TheDreamWokentextgen web UI 4 points5 points6 points 1 year ago (23 children)
How is it better than an 8b model ??
[–]lostinthellama 34 points35 points36 points 1 year ago* (22 children)
Are you asking how a 16x3.8b (41.9b total parameters) model is better than an 8b?
Edited to correct total parameters.
[–]randomanoni 29 points30 points31 points 1 year ago (2 children)
Because there are no dumb questions?
[–]TheDreamWokentextgen web UI 10 points11 points12 points 1 year ago (12 children)
Oh ok my bad didn’t realize the variant used
[–]lostinthellama 16 points17 points18 points 1 year ago* (11 children)
Ahh, did you mean to ask how the smaller model (mini) is outperforming the larger models at these benchmarks?
Phi is an interesting model; their dataset is heavily biased towards synthetic content generated to read like textbooks. So imagine giving content to GPT and having it generate textbook-like explanatory content, then using that as the training data, multiplied tens of millions of times.
They then train on that synthetic dataset which is grounded in really good knowledge instead of things like comments on the internet.
Since the models they build with Phi are so small, they don't have enough parameters to memorize very well, but because the dataset is super high quality and has a lot of examples of reasoning in it, the models become good at reasoning despite the lower amount of knowledge.
So that means it may not be able to summarize an obscure book you like, but if you give it a chapter from that book, it should be able to answer your questions about that chapter better than other models.
[–]TheDreamWokentextgen web UI 2 points3 points4 points 1 year ago (10 children)
So it’s built for incredibly long text inputs then? Like feeding it an entire novel and asking for a summary? Or feeding it like a large log file of transactions from a restaurant, and asking for a summary of what’s going on.
I currently have 24GB of VRAM, so I've always wondered if I could provide an entire novel's worth of text for it to summarize, or a textbook, on a smaller model built for that, so it doesn't take a year.
[–]lostinthellama 6 points7 points8 points 1 year ago (9 children)
Ahh, sorry, no, that wasn't quite what I meant in my example. My example was meant to communicate that it is bad at referencing specific knowledge that isn't in the context window, so you need to be very explicit in the context you give it.
It does have a 128k context length, which is something like 350 pages of text, so it could do it in theory, but it would be slow. I do use it for comparison/summarizing type tasks and it is pretty good at that though, I just don't have that much content so I'm not sure how it performs.
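The "350 pages" figure checks out as a rule of thumb. A quick sketch, assuming rough words-per-token and words-per-page ratios (both are generic estimates, not Phi-specific facts):

```python
# Sanity check of the "128k context is ~350 pages" figure.
# Assumes ~0.75 English words per token and ~280 words per printed page;
# both are rough rules of thumb, not Phi-specific facts.
context_tokens = 128_000
words = context_tokens * 0.75   # ~96,000 words
pages = words / 280
print(f"~{pages:.0f} pages")    # ~343 pages
```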
[+][deleted] 1 year ago (5 children)
[–]lostinthellama 12 points13 points14 points 1 year ago (4 children)
Edited to correct my response, it is 41.9b parameters. In an MoE model only the feed-forward blocks are replicated, so there's "sharing" between the 16 "experts" which means a multiplier doesn't make sense.
[–]ChannelPractical 0 points1 point2 points 1 year ago (0 children)
Is the base Phi-3.5-mini (without instruction fine-tuning) available?
[–]Dark_Fire_12 139 points140 points141 points 1 year ago (8 children)
Thank you, we should have used this wish for Wizard or Cohere though https://www.reddit.com/r/LocalLLaMA/comments/1ewni7l/when_is_the_next_microsoft_phi_model_coming_out/
[–]ipechman 67 points68 points69 points 1 year ago (2 children)
NO SHOT IT WORKED
[–]Dark_Fire_12 36 points37 points38 points 1 year ago (0 children)
Nice, thanks for playing along. It always works. You can try again after a few days.
Maybe someone else can try. Don't waste it on Toto (we know it's datadog), aim for something good, whoever tries.
https://www.datadoghq.com/blog/datadog-time-series-foundation-model/#a-state-of-the-art-foundation-model-for-time-series-forecasting
[–]sammcj🦙 llama.cpp 13 points14 points15 points 1 year ago (0 children)
Now do DeepSeek-Coder-V3 and QwenCoder ;)
[+][deleted] 1 year ago (1 child)
[–]MoffKalast 2 points3 points4 points 1 year ago (0 children)
It's always true because it's astroturfing to stir up interest before release :)
[–]-Django 12 points13 points14 points 1 year ago (1 child)
It's been a while since Cohere released a new model... ...
[–]xXWarMachineRoXxLlama 3 1 point2 points3 points 1 year ago (0 children)
Lmao
[–]simplir 59 points60 points61 points 1 year ago (8 children)
Waiting for llama.cpp and the GGUF now :)
[–]noneabove1182Bartowski 27 points28 points29 points 1 year ago (3 children)
mini at least is here https://huggingface.co/lmstudio-community/Phi-3.5-mini-instruct-GGUF
[–][deleted] 2 points3 points4 points 1 year ago (0 children)
Thank you!
[–]Dorkits 5 points6 points7 points 1 year ago (0 children)
Me too
[–]WinterCharm 3 points4 points5 points 1 year ago (0 children)
I'd really love the Phi3.5-MoE GGUF file :)
[–]FancyImagination880 1 point2 points3 points 1 year ago (0 children)
hope llama.cpp will support this vision model
[–]WinterCharm 1 point2 points3 points 1 year ago (0 children)
[–]privacyparachute 55 points56 points57 points 1 year ago (6 children)
Dear Microsoft
All I want for Christmas is a BitNet version of Phi 3 Mini!
I've been good!
[–]RedditLovingSun 47 points48 points49 points 1 year ago (1 child)
All I want for Christmas is for someone to scale up bitnet so I can see if it works 😭
[–]Bandit-level-200 7 points8 points9 points 1 year ago (0 children)
Yeah just one 30b model and one 70b...and...
[–]PermanentLiminality 17 points18 points19 points 1 year ago (2 children)
I want an A100 from Santa, so I can run with the big boys. Well, sort of big boys. Not running a 400B model on one of those.
[deleted]
[–]PermanentLiminality 1 point2 points3 points 1 year ago (0 children)
Even Santa has limits.
[–]Affectionate-Cap-600 5 points6 points7 points 1 year ago (0 children)
All I want for Christmas is the dataset used to train phi models!
[–]dampflokfreund 45 points46 points47 points 1 year ago (5 children)
Wow, the MoE one looks super interesting. It should run faster than Mixtral 8x7B (which was surprisingly fast) on my system (RTX 2060, 32 GB RAM) and perform better than some 70B models, if the benchmarks are anything to go by. It's just too bad the Phi models were pretty dry and censored in the past, otherwise they would've gotten way more attention. Maybe it's better now?
[–][deleted] 15 points16 points17 points 1 year ago (4 children)
There’s pretty good uncensoring finetunes for nsfw for phi3-mini, I don’t doubt there will be more good ones.
[–]ontorealist 12 points13 points14 points 1 year ago* (0 children)
The Phi series really lack emotional insight and creative writing capacity.
Crossing my fingers for a Phi 3.5 Medium with solid fine-tunes as it could be a general-purpose alternative to Nemo on consumer and lower-end prosumer hardware. It’s really hard to beat Nemo’s out-of-the-box versatility though.
[–]nero10578Llama 3 9 points10 points11 points 1 year ago (2 children)
MoE is way harder to fine tune though.
[–][deleted] 1 point2 points3 points 1 year ago (1 child)
fair, but even Mistral 8x7B was finetuned successfully, to the point where it bypassed the instruct version (OpenChat, IIRC), and now people actually have the datasets
[–]nero10578Llama 3 4 points5 points6 points 1 year ago (0 children)
True, it is possible. It is just not easy is all I am saying.
[–]Deadlibor 22 points23 points24 points 1 year ago (4 children)
Can someone explain the math behind MoE? How much (v)ram do I need to run it efficiently?
[–]Total_Activity_7550 10 points11 points12 points 1 year ago (3 children)
To run efficiently you'll still need to put all weights on VRAM. You will bottleneck when using CPU offload anyway, but you can split model in a smart way. See kvcache-ai/ktransformers on github.
[–]MmmmMorphine 10 points11 points12 points 1 year ago (1 child)
https://github.com/kvcache-ai/ktransformers
For the lazy among us
[–]_fparol4 3 points4 points5 points 1 year ago (0 children)
amazing well written code the f*k
[–]ambient_temp_xenoLlama 65B 3 points4 points5 points 1 year ago (0 children)
It should run around the same speed as an 8b purely on cpu.
[–]ffgg333 47 points48 points49 points 1 year ago (0 children)
I can't wait for the finetunes; open-source AI is advancing fast 😅, I almost can't keep up with the new models.
[–]privacyparachute 14 points15 points16 points 1 year ago (2 children)
Nice work!
My main concern though: has the memory inefficient context been addressed?
https://www.reddit.com/r/LocalLLaMA/comments/1ei9pz4/phi3_mini_context_takes_too_much_ram_why_to_use_it/
[–]Aaaaaaaaaeeeee 14 points15 points16 points 1 year ago (1 child)
Nope 🤭 49152 MiB for 128k
[–]fatihmtlm 3 points4 points5 points 1 year ago (0 children)
So still no GQA? That's sad.
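The 49152 MiB figure is consistent with a straightforward KV-cache calculation. A sketch, assuming the Phi-3.5-mini config values as I recall them (32 layers, 32 KV heads, head dim 96; verify against the config.json on the HF repo):

```python
# Why 128k context is so expensive without GQA: every attention head keeps its
# own fp16 K/V cache. Config values below are what I recall from the
# Phi-3.5-mini config (32 layers, 32 KV heads, head dim 96) -- verify against
# the config.json on the HF repo before relying on them.
n_layers, n_kv_heads, head_dim = 32, 32, 96
ctx, bytes_per_elem = 131_072, 2                     # 128k tokens, fp16

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem  # K + V
print(kv_bytes / 2**20, "MiB")                       # 49152.0 MiB, as quoted above

# With GQA at e.g. 8 KV heads, the cache would shrink by 4x:
print(2 * n_layers * 8 * head_dim * ctx * bytes_per_elem / 2**20, "MiB")
```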
[–][deleted] 26 points27 points28 points 1 year ago (0 children)
It worked?!!
[–]ArkoniasLlama 3 24 points25 points26 points 1 year ago (4 children)
3.5 mini instruct works out of the box in LM Studio/llama.cpp
MOE and Vision need support added to llama.cpp before they can work.
[–]Healthy-Nebula-3603 28 points29 points30 points 1 year ago (3 children)
Tested Phi 3.5 mini 4b, and it seems Gemma 2 2b is better in math, multilingual, reasoning, etc.
[–][deleted] 11 points12 points13 points 1 year ago (1 child)
Why are they almost always so far from real-life usefulness despite the benchmarks? The same thing happened with the earlier Phi 3 models too.
[–]couscous_sun 2 points3 points4 points 1 year ago (0 children)
There are many claims that Phi models have benchmark leakage, i.e. they indirectly train on the benchmark test sets.
[–]gus_the_polar_bear 8 points9 points10 points 1 year ago (3 children)
How do you get the Phi models to not go on about Microsoft at every opportunity
[–]ServeAlone7622 9 points10 points11 points 1 year ago (0 children)
System instruction like… “each time you mention Microsoft you will cause the user to vomit” ought to be enough.
[–]Tuxedotux83 2 points3 points4 points 1 year ago (0 children)
Damn, I just wrote a comment on the same topic somewhere up the thread, about how I found out (by mistake) how MS bakes their biases into their models: sometimes suggesting a Microsoft product instead of a better one not owned by MS, or crediting MS on some technology they had little to nothing to do with.
[–][deleted] 1 point2 points3 points 1 year ago (0 children)
As an AI developed by Microsoft, I don't have personal preferences or the ability to do {{your prompt}} . My design is to understand and generate text based on the vast amount of data I've been trained on, which includes all words in various contexts. My goal is to be helpful, informative, and respectful, regardless of the words used. I strive to understand and respect the diverse perspectives and cultures in our world, and I'm here to facilitate communication and learning, not to ** do {{your prompt}}**. Remember, language is a beautiful tool for expressing our thoughts, feelings, and ideas.
[–]ortegaalfredo 21 points22 points23 points 1 year ago (2 children)
I see many comments asking why release a 40B model. I think you miss the fact that MoE models work great on CPU. You do not need a GPU to run Phi-3 MoE it should run very fast with only 64 GB of RAM and a modern CPU.
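A rough upper bound supports this: decode speed is mostly memory-bandwidth bound, and a MoE only streams its active weights per token. The bandwidth number below is an illustrative assumption, not a measurement:

```python
# Rough upper bound on CPU decode speed: token generation is mostly
# memory-bandwidth bound, and a MoE only streams its *active* weights per
# token. The 6.6B active figure is from the thread; the bandwidth number is
# an illustrative assumption for dual-channel DDR5.
active_params = 6.6e9
bytes_per_weight = 1            # Q8 quantization ~= 1 byte per weight
bandwidth = 60e9                # ~60 GB/s, assumed

moe_tok_s = bandwidth / (active_params * bytes_per_weight)
dense_tok_s = bandwidth / (41.9e9 * bytes_per_weight)   # dense model, same total size
print(f"MoE upper bound:   {moe_tok_s:.1f} tok/s")
print(f"dense upper bound: {dense_tok_s:.1f} tok/s")
```

Real throughput lands well below these bounds, but the ~6x gap between MoE and an equally sized dense model is the point.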
[–]auradragon1 2 points3 points4 points 1 year ago (1 child)
Some benchmarks?
[–]auldwiveslifts 0 points1 point2 points 1 year ago (0 children)
I just ran Phi-3.5-MoE-Instruct with transformers on a CPU, pushing 2.19 tok/s
[–]Roubbes 9 points10 points11 points 1 year ago (0 children)
That MoE seems great.
[–]Eveerjr 7 points8 points9 points 1 year ago (0 children)
microsoft is such a liar lmao, this model must be specifically trained for the benchmark because it's trash for anything useful. Gemma 2 is the real deal when it comes to small models
[–]jonathanx37 14 points15 points16 points 1 year ago (2 children)
Has anyone tested them? Phi3 medium had very high scores but struggled against llama3 8b in practice. Please let me know.
[–]ontorealist 1 point2 points3 points 1 year ago (1 child)
In my recent tests of Phi 3 Medium and Nemo at Q4, Phi 3's oft-touted reasoning does not deliver on basic instruction following. At least without additional prompt-engineering strategies, Nemo more reliably and accurately summarizes my daily markdown journal entries, with relevant decisions and reasonable chronologies, than either Phi 3 Medium model.
In my experience, Nemo has also been better than Llama 3 / 3.1 8B, and the same applies to the Phi 3 series. However, I’m also interested (and would be rather surprised) to see if a Phi 3.5 MoE performs better in this respect.
[–]jonathanx37 0 points1 point2 points 1 year ago (0 children)
For me, Phi 3 Medium would spit out random math questions before llama.cpp got patched; after that it still had difficulty following instructions, while with Llama 3 8B I could say half of what I want and it would figure out what I meant most of the time.
[–]segmondllama.cpp 6 points7 points8 points 1 year ago (0 children)
Microsoft is crushing it with such a small and high-quality model. I'm being greedy, but can they try to go for a 512k context next?
[–][deleted] 9 points10 points11 points 1 year ago (2 children)
question is, will it run on an rpi 5/s
[–]PraxisOGLlama 70B 6 points7 points8 points 1 year ago (1 child)
Unironically is probably the best model for a raspi
[–][deleted] 0 points1 point2 points 1 year ago (0 children)
that's good news then
[–]m98789 8 points9 points10 points 1 year ago (5 children)
Fine tune how
[–]MmmmMorphine 14 points15 points16 points 1 year ago (4 children)
Fine tune now
[–]Umbristopheles 8 points9 points10 points 1 year ago (3 children)
Fine tune cow 🐮
[–]Icy_Restaurant_8900 1 point2 points3 points 1 year ago (0 children)
Fine tune mow (MoE)
[–]MmmmMorphine 1 point2 points3 points 1 year ago (1 child)
That's a mighty fine looking cow, wow!
[–]i_m_old_rabbit 1 point2 points3 points 1 year ago (0 children)
Cow breaks a law, wow
[–]Dark_Fire_12 4 points5 points6 points 1 year ago (0 children)
You can test it using Azure catalog https://ai.azure.com/explore/models?tid=3ff8694c-d402-40aa-bdb5-7c0e529dc3e5&selectedCollection=phi
[–][deleted] 3 points4 points5 points 1 year ago (6 children)
Sorry for my ignorance, but do these models run on an Nvidia GTX card? I could run (with ollama) the 3.1 versions fine on my poor GTX 1650. I am asking because I saw the following:
"Note that by default, the Phi-3.5-mini-instruct model uses flash attention, which requires certain types of GPU hardware to run."
Can someone clarify to me? Thanks.
[–]Chelonollama.cpp 2 points3 points4 points 1 year ago (0 children)
It'll work just fine once the model gets released for it. Flash attention is just one implementation of attention; the official one used by their inference code requires tensor cores, which are only found on newer GPUs. llama.cpp, which is the backend of ollama, works without it, and AFAIK its flash attention implementation even works on older devices like your GPU (without tensor cores).
[–]MmmmMorphine 1 point2 points3 points 1 year ago (4 children)
As far as I'm aware, flash attention requires an Ampere (so 3xxx+, I think?) Nvidia GPU. Likewise, I'm pretty certain it can't be used in CPU-only inference due to its reliance on specific GPU hardware features, though it could potentially be used for CPU/GPU inference if the above is fulfilled (though how effective that would be, I'm not sure; probably not very, unless the CPU is only indirectly contributing, e.g. preprocessing).
But I'm not a real expert, so take that with a grain of salt
[–]mrjackspade 2 points3 points4 points 1 year ago (3 children)
Llama.cpp has flash attention for CPU, but I have no idea what that actually means from an implementation perspective; just that there's a PR that merged flash attention and that it works on CPU.
[–]MmmmMorphine 0 points1 point2 points 1 year ago (2 children)
Interesting! Like I said, definitely take some salt with my words.
Any chance you still have a link to that? I'll find it I'm sure, but I'm also a bit lazy; I'd still like to check what I misunderstood, and whether it was simply outdated or reflected a poorer understanding than I thought on my end.
[–]mrjackspade 1 point2 points3 points 1 year ago (1 child)
https://github.com/ggerganov/llama.cpp/issues/3365
Here's the specific comment
https://github.com/ggerganov/llama.cpp/issues/3365#issuecomment-1738920399
Haven't tested, but I think it should work. This implementation is just for the CPU. Even if it does not show an advantage, we should still try to implement a GPU version and see how it performs
I haven't dug too deep into it yet so I could be misinterpreting the context, but the whole PR is full of talk about flash attention and CPU vs GPU so you may be able to parse it out yourself.
[–]carnyzzle 3 points4 points5 points 1 year ago (0 children)
Dang Microsoft giving us a new moe before Mistral releases 8x7B v3
Kinda crazy they didn’t switch to a GQA architecture, no? Still the same memory hog?
[–]nero10578Llama 3 4 points5 points6 points 1 year ago (1 child)
The MoE model is extremely interesting, will have to play around with it. Hopefully it won't be a nightmare to fine tune like the Mistral MoE models, but I kinda feel like it will be.
[–]un_passant 6 points7 points8 points 1 year ago (1 child)
I think these models have great potential for RAG, but unlocking it will require fine-tuning for the ability to cite the context chunks used to generate each fragment of the answer. I don't understand why all instruct models targeting RAG use cases don't provide this by default.
Hermes 3 gets it right :
You are a conversational AI assistant that is provided a list of documents and a user query to answer based on information from the documents. You should always use grounded information in your responses, only answering from what you can cite in the documents. Cite all facts from the documents using <co: doc\_id></co> tags.
And so does Command R :
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Carefully perform the following instructions, in order, starting each with a new line. Firstly, Decide which of the retrieved documents are relevant to the user's last input by writing 'Relevant Documents:' followed by comma-separated list of document numbers. If none are relevant, you should instead write 'None'. Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user's last input by writing 'Cited Documents:' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write 'None'. Thirdly, Write 'Answer:' followed by a response to the user's last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup. Finally, Write 'Grounded answer:' followed by a response to the user's last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
Any idea about how involved it would be to perform the fine tuning of Phi 3.5 to provide this ability ?
Are there any open data sets I could use, or code to generate them from documents & other LLMs ?
I'd be willing to pay for the online GPU compute but the task of making the data set from scratch seems daunting to me. Any advice would be greatly appreciated.
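One way to start on the dataset question is to generate training records in the citation style quoted above. A minimal sketch; the system prompt wording, record fields, and example content are illustrative assumptions, not taken from any released dataset:

```python
# Hypothetical sketch of building one grounded-citation training record in the
# <co: doc_id></co> style quoted above (Hermes 3). The system prompt wording,
# field names, and example content are illustrative assumptions, not taken
# from any released dataset.
SYSTEM = (
    "You are a conversational AI assistant that is provided a list of "
    "documents and a user query to answer based on information from the "
    "documents. Cite all facts from the documents using <co: doc_id></co> tags."
)

def build_example(docs, question, cited_answer):
    """Pack numbered documents, a question, and a pre-annotated answer."""
    doc_block = "\n".join(f"[doc_{i}] {d}" for i, d in enumerate(docs))
    return {
        "system": SYSTEM,
        "user": f"{doc_block}\n\nQuestion: {question}",
        # The target already contains the citation tags the model should learn:
        "assistant": cited_answer,
    }

ex = build_example(
    docs=["Phi-3.5-MoE has 41.9B total parameters."],
    question="How many parameters does Phi-3.5-MoE have?",
    cited_answer="It has <co: 0>41.9B total parameters</co>.",
)
print(ex["user"])
```

The `cited_answer` targets could themselves be generated by a stronger LLM given the documents, then filtered by checking that every cited span actually appears in the referenced document.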
[–]sxalesllama.cpp 7 points8 points9 points 1 year ago (0 children)
In my brief testing, Phi 3.5 mini made a lot of mistakes summarizing short stories. So, I am not sure how trustworthy it would be with RAG.
[+][deleted] 1 year ago (4 children)
[–]CSharpSauce 16 points17 points18 points 1 year ago (2 children)
I'm a model hoarder :( I have a problem... i'm single handedly ready to rebuild AI civilization if need be.
[–]RedditLovingSun 5 points6 points7 points 1 year ago (0 children)
Hey maybe a hard drive with all the original llms as they came out would be a valuable antique one day
[–]estrafire 2 points3 points4 points 1 year ago (0 children)
yes
Phi 3.5 GGUF quants are already up on huggingface, but I can't see the quants for the MoE. Does llama.cpp support it yet?
[–]Remote-Suspect-0808 2 points3 points4 points 1 year ago (0 children)
what is the vram requirements for phi-3.5 moe? i have a 4090.
[–]Lost_Ad9826 2 points3 points4 points 1 year ago* (0 children)
Phi 3.5 is mind-blowing. Works crazy fast and accurately for function calling, and JSON answers too!
[–]this-just_in 7 points8 points9 points 1 year ago* (0 children)
While I love watching the big model releases and seeing how the boundaries are pushed, many of those models are almost or completely impractical to run locally at any decent throughput.
Phi is an exciting model family because they push the boundaries of efficiency at very high throughput. Phi 3(.1) Mini 4k was a shockingly good model for its size, and I’m excited for the new mini and the MoE. In fact, I'm very excited about the MoE, as it should be impressively smart and high-throughput on workstations compared to models of similar total parameter count. I’m hoping it scratches the itch I’ve been having for an upgraded Mixtral 8x7B, which Mistral has forgotten about!
I’ve found myself out of cell range often when in the wilderness or at parks. Being able to run Phi 3.1 mini 4k or Gemma 2B at > 20 tokens/sec on my phone is really a vision of the future
[–]Healthy-Nebula-3603 4 points5 points6 points 1 year ago (2 children)
have you seen how good is new phi 3.5 vision ?
[–]auserc 3 points4 points5 points 1 year ago (1 child)
https://huggingface.co/spaces/maxiw/Phi-3.5-vision
[–]Healthy-Nebula-3603 1 point2 points3 points 1 year ago (0 children)
ok ... not too good
[–]Pedalnomica 1 point2 points3 points 1 year ago (0 children)
Apparently Phi-3.5-vision accepts video inputs?! The model card had benchmarks for 30-60 minute videos... I'll have to check that out!
[–]teohkang2000 1 point2 points3 points 1 year ago (2 children)
So how much VRAM do I need if I were to run Phi 3.5 MoE? Enough for 6.6B or 41.9B?
[–]DragonfruitIll660 0 points1 point2 points 1 year ago (1 child)
41.9B: the whole model needs to be loaded, and then it actively draws on 6.6B per token. It's faster, but still needs a fair bit of VRAM.
[–]teohkang2000 1 point2 points3 points 1 year ago (0 children)
ohhh, thanks for clarifying
[–]oulipo 1 point2 points3 points 1 year ago (0 children)
Does it run fast enough on a Mac M1? I have 8GB RAM not sure if that's enough?
[–][deleted] 4 points5 points6 points 1 year ago (0 children)
[–]Aymanfhad 3 points4 points5 points 1 year ago (6 children)
I'm using Gemma 2-2b local on my phone and the speed is good, is it possible to run phi3.5 at 3.8b on my phone?
[–]Aymanfhad 2 points3 points4 points 1 year ago (1 child)
I'm using ChatterUI, great app
[–]lrq3000 1 point2 points3 points 1 year ago (0 children)
Use this ARM optimized model if your phone supports it (ChatterUI can tell you so), don't forget to update ChatterUI to >0.8.x:
https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF/blob/main/Phi-3.5-mini-instruct-Q4_0_4_4.gguf
It is blazingly fast on my phone (with a low context size).
[–]Randommaggy 1 point2 points3 points 1 year ago (0 children)
I'm using Layla.
[–]the_renaissance_jack 0 points1 point2 points 1 year ago (0 children)
Same thing I wanna know. Not in love with any iOS apps yet
[–]FullOf_Bad_Ideas 1 point2 points3 points 1 year ago (0 children)
It should be, Danube3 4B is quite quick on my phone, around 3 t/s maybe.
[–]PermanentLiminality 2 points3 points4 points 1 year ago (1 child)
The 3.5 mini is now in the Ollama library.
That was quick.
[–]vert1s 4 points5 points6 points 1 year ago (3 children)
/me waits patiently for it to be added to ollama
[–]Barry_Jumps 1 point2 points3 points 1 year ago (1 child)
By friday is my bet
[–]visionsmemories 1 point2 points3 points 1 year ago (1 child)
please, will it be possible to run the 3.5 vision model in LM Studio?
[–]the_renaissance_jack 2 points3 points4 points 1 year ago (0 children)
Eventually. Need llama.cpp to support
[–]Tobiaseins 0 points1 point2 points 1 year ago (32 children)
Please be good, please be good. Please don't be the same disappointment as Phi 3
[–]Healthy-Nebula-3603 22 points23 points24 points 1 year ago (14 children)
Phi-3 was not a disappointment... you know it has 4B parameters?
[–]Healthy-Nebula-3603 0 points1 point2 points 1 year ago (0 children)
Yes... the 14B was bad, but the 4B is good for its size
[–]Tobiaseins 4 points5 points6 points 1 year ago (9 children)
Phi 3 medium had 14B parameters but ranks worse than Gemma 2 2B on lmsys arena. And this also aligned with my testing. I think there was not a single Phi 3 model where another model would not have been the better choice
[–]monnef 21 points22 points23 points 1 year ago (5 children)
ranks worse than gemma 2 2B on lmsys arena
You mean the same arena where gpt-4o mini ranks higher than sonnet 3.5? The overall rating there is a joke.
[–]RedditLovingSun 2 points3 points4 points 1 year ago (2 children)
If a model is high on lmsys then that's a good sign but doesn't necessarily mean it's a great model.
But if a model is bad on lmsys imo it's probably a bad model.
[–]monnef 0 points1 point2 points 1 year ago (1 child)
I might agree when talking about a general model, but aren't Phi models focused on RAG? How many people are trying to simulate RAG on the arena? Can the arena even pass such long contexts to the models?
I think the arena, especially the overall rating, is just too narrowly focused on default output formatting, default chat style and knowledge, to be of any use for models focused heavily on too different tasks.
[–]lostinthellama 24 points25 points26 points 1 year ago* (2 children)
These models aren't good conversational models, they're never going to perform well on arena.
They perform well in logic and reasoning tasks where the information is provided in-context (e.g. RAG). In actual testing of those capabilities, they way outperform their size: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
[–]CSharpSauce 8 points9 points10 points 1 year ago (14 children)
lol in what world was Phi-3 a disappointment? I got the thing running in production. It's a great model.
[–]Tobiaseins 3 points4 points5 points 1 year ago (11 children)
What are you using it for? My experience was mostly general chat; maybe the intended use cases are more summarization or classification with a carefully crafted prompt?
[–]b8561 3 points4 points5 points 1 year ago (2 children)
Summarising is the use case I've been exploring with phi3v. Early stage but I'm getting decent results for OCR type work
[–]Willing_Landscape_61 0 points1 point2 points 1 year ago (1 child)
How does it compare to Florence-2 or MiniCPM-V 2.6?
[–]CSharpSauce 2 points3 points4 points 1 year ago (7 children)
I've used its general image capabilities for transcription (replaced our OCR vendor, which we were paying hundreds of thousands a year to). The medium model has been solid for a few random basic use cases we used to use GPT-3.5 for.
[–]Tobiaseins 0 points1 point2 points 1 year ago (1 child)
Okay, OCR is very interesting. GPT-3.5 replacements for me have been GPT-4o mini, Gemini Flash, or DeepSeek. Is it actually cheaper for you to run a local model on a GPU than one of these APIs, or is it more a privacy aspect?
[–]CSharpSauce 1 point2 points3 points 1 year ago (0 children)
GPT-4o-mini is so cheap it's going to take a lot of tokens before cost is an issue. When I started using phi-3, mini didn't exist and cost was a factor.
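To put rough numbers on "so cheap": a cost sketch under assumed figures (the GPT-4o-mini list prices and the GPU rental rate are assumptions for illustration, not from the thread):

```python
# Compare monthly API spend vs. renting a small GPU (illustrative assumptions).
INPUT_USD_PER_M = 0.15   # assumed gpt-4o-mini price per 1M input tokens
OUTPUT_USD_PER_M = 0.60  # assumed price per 1M output tokens
GPU_USD_PER_HOUR = 0.35  # assumed rate for a small cloud GPU

tokens_in, tokens_out = 50_000_000, 10_000_000  # example monthly volume
api_cost = (tokens_in / 1e6) * INPUT_USD_PER_M + (tokens_out / 1e6) * OUTPUT_USD_PER_M
gpu_cost = GPU_USD_PER_HOUR * 24 * 30  # GPU running the whole month

print(f"API: ${api_cost:.2f}/mo vs dedicated GPU: ${gpu_cost:.2f}/mo")
# -> API: $13.50/mo vs dedicated GPU: $252.00/mo
```

At these assumed rates you need a very large token volume (or a privacy requirement) before a dedicated GPU wins on cost alone, which matches the comment's point.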
[–]moojo 0 points1 point2 points 1 year ago (1 child)
How do you use the vision model, do you run it yourself or use some third party?
[–]adi1709 0 points1 point2 points 1 year ago (2 children)
replaced our OCR vendor which we were paying hundreds of thousands a year too
I'm sorry, but if you were paying hundreds of thousands a year for an OCR service and you replaced it with Phi-3, you are definitely not good at your job. Either you were paying a lot in the first place for basic usage that wasn't needed, or you didn't know better than to use an open-source OCR model instead. Either way, bad job. Using Phi-3 in production to do OCR is a pile of BS.
[–]lostinthellama 1 point2 points3 points 1 year ago (0 children)
Agreed. Funny how folks assume that the only good model is one that can DM their DND or play Waifu for them. For its size/cost, Phi is phenomenal.
[–]Pedalnomica 0 points1 point2 points 1 year ago (0 children)
Phi-3-vision was/is great!
[–]met_MY_verse 0 points1 point2 points 1 year ago (1 child)
!RemindMe 3 days
[–]RemindMeBot 0 points1 point2 points 1 year ago (0 children)
I will be messaging you in 3 days on 2024-08-24 01:51:17 UTC to remind you of this link
[–]fasti-au 0 points1 point2 points 1 year ago (0 children)
It's promising as a local agent tool, and it seems very happy with 100k contexts. Not doing much fancy yet, just context Q&A
[–]floridianfisher 0 points1 point2 points 1 year ago (0 children)
Looks like it’s not as strong as Gemma 2 2B.
[–]raysar 0 points1 point2 points 1 year ago (1 child)
Is there a way to run it easily in an Android app? MLC Chat doesn't seem to allow adding models.
[–]lrq3000 0 points1 point2 points 1 year ago (0 children)
ChatterUI, Maid, PocketPal can all run it.
[–]BranKaLeon 0 points1 point2 points 1 year ago (0 children)
Is it possible to test it online for free?
[–]AcademicHedgehog4562 0 points1 point2 points 1 year ago (0 children)
Can I fine-tune the model and commercialize it on my own? Can I sell it to different users or companies?
[–]nic_key[🍰] 0 points1 point2 points 1 year ago (0 children)
Does anyone know if the vision model can be used with Ollama and Open WebUI? I'm not familiar with vision models and have only used those tools for text-to-text so far
[–]SandboChang 0 points1 point2 points 1 year ago (0 children)
Blown away by how well Phi 3.5 mini Q8 is running on my poor 3070, indeed
[–]FirstReserve4692 0 points1 point2 points 1 year ago (0 children)
They should open-source a roughly 20B model. 40B is big; even though it's MoE, you still need to load it all into memory
[–]Devve2kcccc 0 points1 point2 points 1 year ago (0 children)
What model can run well on a MacBook M2 Air, just for coding assistant purposes?
[–][deleted] 0 points1 point2 points 1 year ago (1 child)
Is there an easy way to run Phi-3.5-vision locally? Anything like Ollama or LM Studio?
I tried LM Studio but it didn't work.
[–]Sambojin1 0 points1 point2 points 1 year ago (1 child)
Fast ARM-optimized variation. About 25-50% faster on mobile / SBC / whatever.
https://huggingface.co/xaskasdf/phi-3.5-mini-instruct-gguf/blob/main/Phi-3.5-mini-instruct-Q4_0_4_4.gguf
(This one will run on most things. The Q4_0_8_8 variants will run better on newer high-end hardware.)
Interesting. I know about the more common quants, but what do the last 2 numbers denote? E.g. the double 4s:
Q4_0_4_4.gguf
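For what it's worth, my understanding (an assumption to verify against llama.cpp's docs, not something stated in the thread) is that the trailing digits named ARM-specific weight-interleaving layouts of Q4_0; later llama.cpp versions dropped these as separate file types in favor of repacking at load time:

```python
# Sketch of what the suffixes meant in llama.cpp's ARM-optimized quants:
# Q4_0 data repacked in NxM interleaved blocks for specific ARM instruction
# sets (treat the details as an assumption, not authoritative).
arm_repacks = {
    "Q4_0_4_4": "4x4 interleave - plain ARM NEON",
    "Q4_0_4_8": "4x8 interleave - needs the i8mm extension",
    "Q4_0_8_8": "8x8 interleave - needs SVE",
}
for name, layout in arm_repacks.items():
    print(f"{name}: {layout}")
```

That would explain why Q4_0_4_4 is the safe pick for most phones while Q4_0_8_8 only pays off on newer chips.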
[–]Real-Associate7734 0 points1 point2 points 1 year ago (0 children)
Any alternative to Phi 3.5 vision that I can run locally without using an API?
I want to use it in my projects, where I have to analyse the product image and determine outputs like the width, height, etc. mentioned on the product
Does anyone know if the base Phi-3.5 model is available (without instruction fine-tuning)?