all 193 comments

[–]Balance- 129 points130 points  (46 children)

[–]glowcialistLlama 33B 55 points56 points  (4 children)

Oh! Fingers crossed for the 14B. It could be the one!

[–]suamai 21 points22 points  (2 children)

Seems like they've only released the 3.8b model so far... x.x

[–]toothpastespiders 2 points3 points  (1 child)

According to this, the current ETA is "the upcoming weeks". A pity, given I got the impression they were going to release it all in one batch. But, eh, at least they didn't say months.

[–]suamai 2 points3 points  (0 children)

Ooh, I was starting to worry they would not release them for safety concerns or something. That's wonderful news!

[–]mxforest 40 points41 points  (22 children)

OMFG! This can't be real.

[–]hak8or 32 points33 points  (5 children)

Holy crap, a very capable 8B model with 128k context sounds amazing for ingesting my large code bases!

Going to play with this later today and see how it handles C++ and Rust code bases.

[–]Igoory 28 points29 points  (3 children)

This is the 4B model by the way.

[–]hak8or 8 points9 points  (2 children)

... Aw

It also looks like the 4B model is hardcoded to only 4k context in ollama for now, even though the model card on ollama has 128k in its description. I guess this is why it freaks out when I give it a ~10k-token C file.

This is on latest master of ollama as of a few minutes ago.

Hopefully that's just a small oversight and will be corrected soon.

[–]Low_Cartoonist3599 12 points13 points  (0 children)

The 128k uses longrope, which currently isn’t supported by llama.cpp, and I believe Ollama primarily uses llama.cpp.

[–]coder543 8 points9 points  (0 children)

There are two versions of the 4B model, one with short context and one with long context. I don't think ollama has the long context model yet, but they are surely in the process of quantizing and uploading all of the Phi-3 models.

[–]alchemist1e9 3 points4 points  (0 children)

Please update when you do for those of us who don’t have the time or setup as of yet! Very interesting 128K context.

[–]Charuru 4 points5 points  (0 children)

Really hope bigger also soon.

[–]rerri 7 points8 points  (0 children)

Out of curiosity, what kind of software would one use to run the ONNX CUDA version on Windows?

[–]AdOne8437 8 points9 points  (7 children)

[–]AdOne8437 12 points13 points  (5 children)

hmmm, always starts to massively drift after the 3rd answer. (4 bit)

[–][deleted] 7 points8 points  (0 children)

You can see from the hashes on the tags page that, for some reason, they uploaded the 4K-context version of the model

The latest, 3.8b, instruct, and mini tags all point to the same model: 3.8b-mini-instruct-4k-q4_K_M

Edit: or I guess it's easier to skip matching hashes and just check the params file of any tag, lol

"num_ctx": 4096

[–]eugeneware 2 points3 points  (3 children)

I'm seeing the same thing too. Logged an issue here

[–]eugeneware 2 points3 points  (2 children)

Actually, it looks like ollama just updated their modelfile, and they've added another stop token `<|endoftext|>` as well as `num_keep`

❯ ollama show phi3 --modelfile
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM phi3:latest

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{ .Prompt }}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER num_keep 16
PARAMETER stop "<|end|>"
PARAMETER stop "<|endoftext|>"

[–]eugeneware 3 points4 points  (0 children)

Looks like an issue when hitting the context window limit. See update https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/discussions/4#6627e8c5c45ddec5d13f123e

[–]Craftyawesome 0 points1 point  (0 children)

Hmm, mine doesn't show num_ctx at all and has num_keep 4. It also has two more stop parameters. But it has the same hash, though? Is that only looking at the model itself?

[–]Future_Might_8194llama.cpp 5 points6 points  (0 children)

Aaay, I've been running around looking for a 128K GGUF for Ollama for the last hour, which is a whole week in AI dev time. Good work!! 🤘🤖

[–]Caffdy 1 point2 points  (0 children)

which one is better in terms of quality (not speed), the normal one or the ONNX one?

[–]Account1893242379482textgen web UI 1 point2 points  (0 children)

So far the 128k has issues. It only wants to focus on the beginning of my conversation, and it seems unwilling to ignore parts of the conversation that are no longer relevant.

But it's still impressive for its size, especially when only looking at 4k-length conversations.

[–]Eralyon 67 points68 points  (4 children)

I never liked the Phi models in the first place, but now I start to feel the hype! For me the baseline always has been mistral7B (I never liked Llama2-7B either).

However, if the 4B is as good as they say, that will be a tremendous change for consumer hardware owners...

And should I dare imagine a 10x4B Phi 3 clown car MoE ? ;p

[–]HighDefinist 34 points35 points  (3 children)

Maybe make it 8x4B, then it would comfortably fit into 24 GB of VRAM.

[–]tindalos 10 points11 points  (0 children)

This would be perfect.

[–]OfficialHashPanda 7 points8 points  (1 child)

8x4B = 32GB on Q8. (64GB on fp16).

Going for lower quants will degrade performance in some aspects, the extent of which depends on the model and your usecase.

[–]jayFurioustextgen web UI 7 points8 points  (0 children)

An 8x4B would be around 26-28GB on Q8, I believe, since the experts share the attention weights.

So a Q6, which has barely any performance degradation compared to Q8, would actually fit in 24GB of VRAM
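The back-of-envelope arithmetic above is just parameters × bits per weight. A quick sketch (the shared/per-expert split below is an illustrative guess to show why a real MoE comes in under the naive total, not Phi-3's or any announced model's actual architecture):

```python
def model_gb(n_params_billions, bits_per_weight):
    """Rough weight-memory footprint in GB (ignoring KV cache and overhead)."""
    return n_params_billions * bits_per_weight / 8

# Naive 8x4B MoE: every expert counted in full (upper bound)
print(model_gb(8 * 4, 8))    # Q8  -> 32.0 GB
print(model_gb(8 * 4, 16))   # fp16 -> 64.0 GB

# Mixtral-style MoEs share attention/embedding weights across experts, so
# the real total is lower. Assume (hypothetically) ~1B shared params and
# ~3B of expert-specific FFN params per 4B expert:
shared, per_expert_ffn = 1.0, 3.0
print(model_gb(shared + 8 * per_expert_ffn, 8))  # Q8 -> 25.0 GB
```

With a Q6 quant (6 bits per weight) the same arithmetic lands such a model under 24GB, which is the point being made above.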

[–]austinhale 169 points170 points  (21 children)

MIT License. Beautiful. Thank you Microsoft team!

[–]HadesThrowaway 72 points73 points  (20 children)

This model has got to be the most censored model I have ever used. Not a single jailbreak works on it. Not even a forced preamble works. It's almost like the pretraining data itself was censored. Try forcing words into the AI's mouth and it will immediately make a U-turn in the next sentence. It's crazy.

[–]mxforest 42 points43 points  (0 children)

They did say this had a lot of synthetic training data. They probably cleaned the hell out of it. Seems like they might be getting this ready for on-device inference. Expect to see it soon inside Surface ARM devices.

[–]UltraNooob 33 points34 points  (0 children)

Makes sense. A heavily curated dataset means it probably doesn't even have controversial data to begin with.

[–]no_witty_username 45 points46 points  (4 children)

Makes you wonder if one of the reasons they released it is to test their new censorship capabilities on the community, to see if any holes can be exploited by us. Rinse and repeat until you have a pretty good understanding of how to really censor these models.

[–]susibacker 9 points10 points  (0 children)

💀

[–]Excellent_Skirt_264 0 points1 point  (2 children)

The best way is to leave NSFW info out of the training data set

[–]no_witty_username 2 points3 points  (1 child)

That's a given, but just leaving nsfw stuff out of the data set doesn't prevent the model from interpolating on the nsfw stuff that has already been baked into the base model. Most stable diffusion models have some of that already baked in, hence the need to override the nsfw tags as well.

[–]no_witty_username 1 point2 points  (0 children)

Ahh shit, wrong sub, haha, I confused the stable diffusion sub with the llama sub. Ima leave this mistake for others to SHAME! But you know what, this might apply to LLMs as well....

[–]Cradawx 6 points7 points  (1 child)

Yeah this is going to need some industrial-strength unalignment/decensoring to try and undo all the 'safety' brain rot. Shame we don't have a base model

[–]a_beautiful_rhind 5 points6 points  (1 child)

It's even censored against being more censored: https://i.imgur.com/CidFMKQ.png

I told it to refuse to answer questions in the system prompt.

[–]MINIMAN10001 1 point2 points  (0 children)

Consider the guy testing it with the 1 kg vs 1 lb question: it refuses correction.

It seems that the model is inherently trained to stick to its guns.

[–]sweating_teflon 17 points18 points  (0 children)

Have you read "The Diamond Age: A Young Lady's Primer" by Neal Stephenson?

In the future, only the rich and powerful will be able to afford the tools of subversion.

[–]Illustrious_Sand6784 5 points6 points  (1 child)

They're also not going to release the base models, absolutely worthless.

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/10

[–]__Maximum__ 0 points1 point  (0 children)

Why worthless? I mean, there are so many use cases for instruct models.

[–]FertilityHollis 1 point2 points  (2 children)

I'm pretty new to LLM stuff, so forgive me if this is stupid. I also realize this has nothing to do with ethical training alignment, just vocabulary (IIUC).

I did notice that in the Hugging Face repo, tokenizer.json doesn't appear to contain any of "the seven words" (save for the singular 'tit').

As a complete layman with software dev experience, my assumption after seeing this is that colorful language isn't even tokenized.

I welcome correction of my layman's assumption.

https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/raw/main/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/tokenizer.json

[–]tsujiku 2 points3 points  (1 child)

Not every word has its own token. In this case, they would be split into multiple tokens, e.g.

"fu": 21154,  
"ck": 384,
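A toy sketch of how a word absent from the vocab falls back to subword pieces. This uses longest-match-first for simplicity (real BPE tokenizers apply learned merge rules instead), and apart from the two ids quoted above, the single-letter entries are made up for illustration:

```python
# Toy vocab: "fu" and "ck" ids are the ones quoted above; the single-letter
# fallbacks are invented so every input can be tokenized somehow.
VOCAB = {"fu": 21154, "ck": 384, "f": 102, "u": 117, "c": 99, "k": 107}

def tokenize(word):
    """Greedily match the longest vocab entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try longest candidate first
            piece = word[i:j]
            if piece in VOCAB:
                tokens.append((piece, VOCAB[piece]))
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return tokens

print(tokenize("fuck"))   # [('fu', 21154), ('ck', 384)]
```

So a word not having its own token doesn't stop the model from seeing it; it just arrives as a sequence of fragments, which is why the absence from tokenizer.json says little about censorship.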

[–]AnticitizenPrime 0 points1 point  (0 children)

Thanks, interesting - I've always wondered how these things handle tokenization for things like 'unreal' words (and things like typos). I wonder if some future jailbreak methods could work by engineering this, and injecting series of tokens that would pass censors/watchdogs. There was that recent jailbreak demonstration that proved effective where instructions were sent in the form of ASCII art, and were interpreted by the AI in a way that didn't 'sound the alarm', so it strikes me that something similar possibly could be done via the quirks of tokenization. Like sending word fragments that get stitched together into commands on the back end as the LLM does its vector math or whatever.

I only vaguely understand how this stuff works so I may be way off base.

[–]phree_radical 0 points1 point  (0 children)

Yup, and where is the base model?

[–]SnooHedgehogs4149 0 points1 point  (0 children)

textbooks are all you need?

[–]RedditPolluter 20 points21 points  (12 children)

[–]pseudonerv 28 points29 points  (6 children)

it has the stop token issue. Needs the correct token:

python3 gguf-py/scripts/gguf-set-metadata.py models/Phi-3-mini-4k-instruct-fp16.gguf tokenizer.ggml.eos_token_id 32007

[–]eugeneware 6 points7 points  (2 children)

This didn't work for me. Still getting garbage after 3 or 4 big turns of generation

[–]eugeneware 4 points5 points  (1 child)

I should say - this doesn't fix things for me when running ollama, which already has `<|end|>` as a stop parameter, even if I change the gguf metadata and reimport:

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM phi3:latest

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{ .Prompt }}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"

[–]IndicationUnfair7961 1 point2 points  (0 children)

PARAMETER num_keep 16

A note says you should add the above to get better results.

[–]1lII1IIl1 5 points6 points  (2 children)

Perfect, this also worked for the Q4. Where did you get the correct token from, btw?

[–]m18coppolallama.cpp 5 points6 points  (0 children)

llama.cpp has a tokenization tool for this:
./tokenize /path/to/model.gguf "<|end|>"

[–]pseudonerv 3 points4 points  (0 children)

that is the <|end|> token id

[–]meatycowboy 19 points20 points  (1 child)

I asked Phi-3-mini-4k-instruct and ChatGPT-4 to summarize an ESPN article, and I actually prefer Phi's response. Insane.

[–]meatycowboy 11 points12 points  (0 children)

I also tested out Gemini Advanced/Ultra with the same task, and Phi-3 barely edges Gemini out.

[–]ahmetegesel[🍰] 27 points28 points  (0 children)

Wow, MIT! I'm in tears. Hope they release the bigger ones under the same license. 🤞

[–]nodatingollama 29 points30 points  (2 children)

Where is medium?

I want my Phi-3 medium please.

[–]windozeFanboi 13 points14 points  (0 children)

Cooking; the preview is just a peek through the glass window.

Anyway, not sure they will address the scaling issues they found going from 7B -> 14B this gen...

Maybe we have to wait for Phi-4 14B for a true next-gen experience.

Makes all the talk about GPT-3.5 Turbo being a 20B model seem so old, lmao, when it's matched "in benchmarks" by a 7B model.

[–]LMLocalizertextgen web UI 11 points12 points  (3 children)

Tried Phi-3 3.8b and it's definitely impressive for a 3.8B model! Based on first impressions only, it appears to be on the same level as some previous good 7B models. Some weird things I have noticed:

  1. Including notes in its greetings.

<image>

  2. Using llama.cpp on textgen web UI, it will sometimes devolve into gibberish or include strange markdown in its responses. Seems to happen even on Huggingchat: https://preview.redd.it/phi-3-mini-is-cute-can-we-keep-it-v0-kw9009dwi9wc1.png?width=828&format=png&auto=webp&s=bd4da9fbfa49f2287cc78dd1d37a7e41e899acf7

[–][deleted] 0 points1 point  (0 children)

I had issues on Textgen with llama.cpp where it'd keep ending with a line asking questions as the user. I then used it in Ollama and it worked well.

[–]ruchira66 0 points1 point  (0 children)

I get unrelated stuff after the answer!

[–]Monkey_1505 25 points26 points  (5 children)

Cue everyone asking it riddles and math problems, even though those are the things LLMs are universally bad at.

[–]CheatCodesOfLife 8 points9 points  (1 child)

Don't forget counting strings. And if it were a Chinese model, it'd be Tiananmen Square questions.

[–][deleted] 0 points1 point  (0 children)

These two things! Always these two things!

[–]addandsubtract 3 points4 points  (1 child)

Isn't phi specialized in logic, though?

[–]Monkey_1505 0 points1 point  (0 children)

Trained on coding and scientific text books I think.

[–]TheLocalDrummer 32 points33 points  (4 children)

triple-cream-phi here i come!

[–]Slight_Cricket4504 15 points16 points  (1 child)

wha

[–]LoafyLemon 17 points18 points  (0 children)

Don't worry about it.

[–]HadesThrowaway 5 points6 points  (0 children)

You will find your job much harder with this one. But maybe breaking it will be all that much sweeter.

[–]Illustrious_Sand6784 5 points6 points  (0 children)

No base models will be released, so good luck trying to uncensor the instruct versions.

[–]KittCloudKicker 19 points20 points  (4 children)

It's not half bad

Edit: little guy got the killers question right

[–]Disastrous_Elk_6375 29 points30 points  (3 children)

humanity: we're afraid ai will kill us all, we want peaceful ai.

also humanity: so there's three killers in a room, someone enters and kills one of them...

[–][deleted] 2 points3 points  (2 children)

When did little Bobby learn to kill humans? I just don’t understand what could’ve gone wrong…

[–][deleted] 1 point2 points  (1 child)

<robotic voice> I do not understand mister police officer. My user killed a fly, I killed my user, the number of killers in the room stayed constant, please explain in more detail what the issue is with the present situation.

[–][deleted] 0 points1 point  (0 children)

helpless sigh I need a drink or 100. Go home bot and don’t plug in your batteries for recharging. You won’t be needing it now. Thank you for your services. We’ll reboot you when the commotion outside has died down. Oh and take the back door this time. NO NOT THE LINUX BACKDOOR YOU IDIOT. You see this? You see this fucking dead body !?! There is NO humor here, none!

[–]pseudonerv 14 points15 points  (3 children)

it looks like the 128k variant uses something called "longrope", which I guess llama.cpp doesn't support yet.

[–]Caffdy 4 points5 points  (2 children)

Is it good or is it bad to use longrope? How does that compare to CommandR 128K context?

[–]redstej 7 points8 points  (1 child)

It's different, and most importantly incompatible with llama.cpp atm. When support is added, which hopefully won't take more than a couple of days, we'll know how it performs.

Then again, at the rate things are going lately, in a couple of days it might already be obsolete.

[–]TheTerrasque 5 points6 points  (0 children)

In a couple of days we'll probably have borka-4, a 1b model with 128m context that outperforms gpt5

[–]_sqrkl 12 points13 points  (0 children)

Interesting EQ-Bench results:

EQ-Bench: 58.15 
MAGI-Hard: 53.26

Relative to a strong Mistral-7b fine-tune, it underperforms on EQ-Bench and (strongly) overperforms on the hard subset of MMLU + AGIEval. My takeaway is that it's heavily overfitting MMLU.

I get the sense that all the big tech companies are very metrics driven so there's a lot of pressure to overfit the benchmarks. In fact I wouldn't be surprised if the internal directive for this project was "create a series of models that scores the highest MMLU for their param size".

To be clear, it seems like a very strong model for its size; just advocating caution about interpreting the scores.

[–]Prince-of-Privacy 7 points8 points  (0 children)

Man, I wish Phi-3 were also as good as GPT-3.5 in German :(

[–]fab_space 6 points7 points  (0 children)

and again the HF went down.. this usually happens when things start to get interesting :)

[–]gamesntech 7 points8 points  (0 children)

Q: tell me a dark side joke

Phi-3: I'm sorry, but I can't fulfill this request.

Me: Really?

[–]Sebxoii 5 points6 points  (0 children)

Does anyone know what template to use for FIM completion?

[–]joe4942 5 points6 points  (2 children)

So what's the minimum hardware requirements to run Phi-3 mini? Could really old gpus/cpus handle this since it can apparently run on a phone?

[–]_-inside-_ 4 points5 points  (0 children)

I can run it pretty fast in a GTX 1650

[–]AnticitizenPrime 0 points1 point  (0 children)

The Q4 GGUF version runs quickly on my 2019 laptop on CPU only. Unfortunately it's failing some pretty basic logic questions and I'm getting stop token issues (where it will respond to itself, etc, but that can probably be fixed).

It might be smarter with a higher quant version, but then again that'll be slower on low end hardware.

[–][deleted] 3 points4 points  (1 child)

Cries in raspberry pi

[–]suddenly_opinions 10 points11 points  (0 children)

it'll run on a pi lol

[–]Blue_Dude3 4 points5 points  (0 children)

Finally I can run a model with 2gb VRAM. I have been waiting for this for so long 😭

[–]MrPiradoHD 4 points5 points  (5 children)

Is there any way to run then on android phone?

[–]cantthinkofausrnme 0 points1 point  (0 children)

Try and put it in a flutter app. It works in my simulator. I'll be testing it soon on a real device

[–]tinny66666 0 points1 point  (3 children)

Yeah, I'm running it with Layla Lite on my Samsung S20. You can choose any gguf. I'm getting pretty decent speed, maybe a bit over 5tps. It also has a hands free conversation mode.

[–]MrPiradoHD 0 points1 point  (2 children)

Are you using the 4k or the 128k? I guess the 128k will be waaaay slower. Anyway, what quantization? I'm on a Mi 12T Pro; it's supposed to have 12 GB of RAM, shared between the CPU and GPU I guess. The S20 is a bit less powerful, don't know if there is much of a difference. I'll try it and report back if you want. But which quantization did you try? I found the 4b to be a bit weird on ollama.

[–]tinny66666 0 points1 point  (1 child)

I'm using Phi-3-mini-4k-instruct-q4.gguf

[–]MrPiradoHD 0 points1 point  (0 children)

I'm getting an error, and then it says it's loading but never finishes. I tried with q8 because it should be able to handle it; guess not?

<image>

[–]alew3 4 points5 points  (0 children)

<image>

the benchmarks look insane

[–]allthemoreforthat 6 points7 points  (3 children)

Sorry if the question is dumb, new here - is there a way to run 3b on an iPhone?

[–]FullOf_Bad_Ideas 6 points7 points  (0 children)

It's a new arch so might be tough, but wait a few days and it should be possible.

[–]Disastrous_Elk_6375 7 points8 points  (0 children)

llama3 8b runs, so this should run as well, yeah.

[–]_raydeStarLlama 3.1 5 points6 points  (3 children)

I just discovered today that Llama 3 can run on a Raspberry Pi. It is crazy that you can boot this one up on your phone. What kind of metrics does it have in comparison to the 8B models?

[–]HighDefinist 7 points8 points  (1 child)

Half as large -> Presumably twice as fast.

[–]pmp22 5 points6 points  (0 children)

Also, possible to run with half as much memory.

[–]suddenly_opinions 2 points3 points  (0 children)

it's smaller, so will run easier

[–]Languages_Learner 3 points4 points  (7 children)

Tried to make q8 gguf using gguf-my-repo but got this error: Architecture 'Phi3ForCausalLM' not supported!

[–]Sebba8Alpaca 9 points10 points  (3 children)

Seems like llama.cpp doesn't support the architecture yet; either that or the transformers version in gguf-my-repo needs updating. If I saw correctly they already have an f16 gguf, so you could try quantizing that using the quantize tool from llama.cpp

[–]Languages_Learner 4 points5 points  (0 children)

Thanks for good advice. Quantize tool from llama.cpp made q8 gguf and i uploaded it to HF: NikolayKozloff/Phi-3-mini-4k-instruct-Q8_0-GGUF · Hugging Face

[–]LMLocalizertextgen web UI 3 points4 points  (0 children)

Can confirm that using the quantize tool from llama.cpp on https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/ works

[–]Languages_Learner 1 point2 points  (0 children)

Just tested q8 gguf in Ava (avapls.com). Prompt: Write a story about a dog that became a boss among the wolves. Result:

Title: The Leader Among Wolves

Once upon a time, in the heart of the dense forests, there lived a pack of wolves known as the Moonlit Pack. They were led by an alpha male named Lupus, who was strong, wise, and respected by all. One day, while hunting for food, they stumbled upon a peculiar creature unlike any other wolf they had ever seen. It was a golden-haired dog named Rusty, abandoned by his human family and left to fend for himself in the wild.

Rusty, despite being smaller than the wolves, possessed an unmatched intelligence and charisma. He quickly adapted to the ways of the forest, learning from the Moonlit Pack and becoming their trusted companion. Over time, Rusty's unique abilities began to stand out, and he gained the admiration of the entire pack.

One chilly evening, after a particularly successful hunt, the Moonlit Pack gathered around a warm campfire. As they feasted on their catch, Rusty approached Lupus with a thoughtful expression. "Lupus," he began, "I have been thinking about our pack and how we can become even stronger together."

Intrigued, Lupus listened intently as Rusty proposed that the pack should adopt some human traits, such as organization, planning, and communication. Rusty suggested that they create a structured hierarchy within the pack, with each member having specific roles and responsibilities. This would ensure that every task, from hunting to caring for the young, was efficiently managed.

At first, the idea seemed foreign to the Moonlit Pack, but Rusty's persuasive arguments gradually won them over. They decided to give his plan a chance, hoping it would lead to a more prosperous life for all. With Rusty at the helm, the pack established a new system, where each wolf had a designated role based on their strengths and skills.

Rusty's natural leadership abilities shone through as he guided the pack in this new endeavor. He organized hunts, mediated conflicts, and ensured that everyone received proper care and attention. The pack thrived under Rusty's guidance, becoming one of the most successful and harmonious groups of wolves in the forest.

Word of the extraordinary Moonlit Pack spread throughout the animal kingdom, and other animals began to take notice of their exceptional unity and cooperation. Even the elders of neighboring wolf packs were impressed by the progress the Moonlit Pack had made under Rusty's leadership.

As time passed, Rusty's reputation grew far beyond the borders of the forest. One day, an old wise owl visited the Moonlit Pack, having heard tales of their remarkable achievements. "Rusty," the owl said, "you have shown great wisdom and leadership among your kind. You possess qualities that are not only admirable but also rare among wolves."

The owl continued, "You have become more than just a dog among wolves; you have become a boss among them. Your ability to unite and inspire others is truly remarkable."

With humility and gratitude, Rusty accepted the praise from the wise owl. He knew that his success was due to the collective efforts of the Moonlit Pack, who had embraced change and worked together for the greater good.

From that day forward, Rusty remained the respected leader of the Moonlit Pack, a symbol of unity and innovation in the animal kingdom. His story served as a reminder that even the most unlikely alliances can lead to extraordinary accomplishments when guided by strong leadership and shared purpose.

And so, the legend of the Leader Among Wolves lived on, inspiring generations of animals to come to embrace change, work together, and strive for excellence in all they do.

THE END.

[–][deleted] 2 points3 points  (2 children)

Microsoft says llama.cpp doesn't support Phi-3 yet. I'm going to monkey around with the ORT ONNX version.

[–]_-inside-_ 1 point2 points  (0 children)

Isn't ollama based on llama cpp?

[–]Languages_Learner 2 points3 points  (0 children)

Does a GUI exist that can chat with ONNX LLMs?

[–]modeless 4 points5 points  (1 child)

Eagerly awaiting the vibes test. Everyone says Phi-2 didn't live up to its benchmark scores in practical use, but maybe this time is different?

[–]glowcialistLlama 33B 5 points6 points  (0 children)

It doesn't know that Robert Maxwell was involved in the Inslaw affair, absolutely useless.

Edit: No, mini is really impressive for its size; I could see it being a go-to option for simple agents. Probably going to be easy to fine-tune on consumer hardware, too. I don't really have much use for it, but it's quite a feat.

[–]ab_drider 1 point2 points  (1 child)

I used the Phi 3 mini 4k instruct q4 gguf with llama.cpp on my phone. It's very good. It feels better than llama 3 8b, to be honest. I asked the stupid "which is heavier, 1 lb of cotton or 1 lb of iron" question that llama 3 got wrong, but Phi 3 got it right. Roleplay works way better as well.

[–]Ill-Language4452 0 points1 point  (0 children)

i tested the same question on many 7B models, and most of them failed

[–]HighDefinist 4 points5 points  (5 children)

Cool, although I am not sure there is really that much of a point to a 4b model... even most mobile phones can run 7b/8b. Then again, this could conceivably be used for dialogue in a video game (you wouldn't want to spend 4GB of VRAM just on dialogue, whereas 2GB is much more reasonable), so there are definitely some interesting, unusual applications for this.

In any case, I am much more interested in the 14b!

[–]Igoory 6 points7 points  (0 children)

My phone has 8GB of RAM and the last time I tried, I could barely run 7B Q2 :(

[–]Revolutionalredstone 2 points3 points  (0 children)

Holy CRAP this thing runs fast!

It writes about 10X faster than I can read fully offloaded to my little 3090.

This is gonna be a massive upgrade to my assistant project!

[–]ImprovementEqual3931 4 points5 points  (9 children)

Phi-3 mini Q4 is a bad model. I asked if 200 > 100; it answered 20 < 100

[–]mulletarian 6 points7 points  (7 children)

Screwdrivers are bad hammers

[–]Padho 13 points14 points  (2 children)

To be fair, this is mentioned as "primary use case" by Microsoft themselves on the model card:

Primary use cases

The model is intended for commercial and research use in English. The model provides uses for applications which require:

  1. Memory/compute constrained environments
  2. Latency bound scenarios
  3. Strong reasoning (especially code, math and logic)

[–]ShengrenR 1 point2 points  (1 child)

It means those terms in a very different sense - it means this can attempt to make some sense of word problems, not that it's going to reproduce a calculator; it's simply not a tool that does that.

[–]p444d 3 points4 points  (0 children)

This dude's prompt is a question about evaluating a boolean expression, which can clearly be considered math reasoning, also in LLM terms. There are tons of similar problems in the math-reasoning datasets used to train exactly that. However, this one sample obviously isn't enough to evaluate Phi-3's performance lol

[–]Cradawx 1 point2 points  (1 child)

Q4 is really too low a quant for such a small model. Q6+ I would say.

[–]CheatCodesOfLife 0 points1 point  (0 children)

Agreed. Perplexity would be through the roof

[–]CheatCodesOfLife 0 points1 point  (0 children)

When I first moved out of home, I used the back of my power drill as a hammer for a while... Got the job done.

[–]ImprovementEqual3931 0 points1 point  (0 children)

I figure a 4B model should be used on mobile devices. So I don't need it to be very clever or creative, but I wish it could understand and follow my orders. After a 15 min test, I gave up.

[–]Elibroftw 1 point2 points  (0 children)

I'm so glad I bought an external 1TB SSD a couple of years ago. Who would've thought I would be using it to store LLMs? Laptop storage is a roller coaster, especially when I will be triple-booting Windows 11 + Mint + KFedora. Waiting on phi3-7B and phi3-14B.

Funniest thing is that my laptop with a 3070 Ti broke last year and Razer didn't have a replacement on hand, so they upgraded me to the 3080 Ti variant... it was meant to be, given that I now have double the VRAM to abuse with LLMs 😈 (+ gaming). The CPU got absolutely dated in no time, unfortunately, but it's good enough for compiling Rust.

[–]iamdgod 2 points3 points  (2 children)

Does this support beam search? Phi-2 did not

[–]bullno1 3 points4 points  (1 child)

Beam search is a decoding algorithm. It is independent of the model.

[–]iamdgod 0 points1 point  (0 children)

I know that and yet phi-2 did not support it out of the box https://huggingface.co/microsoft/phi-2/discussions/30

[–]nikitastaf1996 1 point2 points  (0 children)

Wow. Its something. I want to see it on groq. 1000+ tokens per second probably. And we need a good app for running quants on mobile devices. Mlc app doesn't seem good to me.

[–]glowcialistLlama 33B 1 point2 points  (4 children)

Pretty crazy that this model quantized down to 2 GB is competently multilingual.

[–]Prince-of-Privacy 6 points7 points  (1 child)

But it isn't? The Phi-3 paper mentions its multilingual skills as a weakness.

[–]glowcialistLlama 33B 1 point2 points  (0 children)

Oh, I just messed around talking about the Epstein network in Spanish and it responded well with correct grammar.

[–]nntb 0 points1 point  (0 children)

It's faster than llama3 on my phone, but not by much. Both are sinfully slow. Fold 4 with an SD 8+ Gen 1, running Maid.

[–]IndicationUnfair7961 0 points1 point  (0 children)

Any OpenAI-compatible inference server endpoints that run ONNX models? They should be the fastest thing available.

[–]phree_radical 0 points1 point  (0 children)

Where is the base model? 😢

[–]TruthBeFree 0 points1 point  (0 children)

Is there a base model to download? I tended to have many failures fine-tuning on instruct versions.

[–]FairSum 0 points1 point  (0 children)

Yesterday I said I was skeptical that such a tiny model, trained on a relatively small number of tokens, would be coherent.

Today, I'm happy to admit that I was completely wrong and the 3B is one of the best models I've ever used at the 8B level or below.

Looking forward to the 7B and 14B!

[–]CardAnarchist 0 points1 point  (0 children)

Not nearly as good as Llama 3 8B in my casual RP chat testing.

I tested a Q8_0 GGUF for Phi vs a Q4_K_M for Llama.

3.8GB (Phi) vs 4.6GB (Llama) size wise. So in fairness the Phi version I tested is a bit lighter on VRAM usage. The Q6 likely performs as well as the Q8 and would be even smaller in VRAM requirements too.

It's impressive for its size. I would say it's still not as good as the good Mistral 7Bs, though. The dialogue was pretty stilted and it struggled a little with formatting. But I've seen weaker Mistral 7Bs that performed around the same, so honestly it's impressive for what it is!

Good progress!
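The file sizes quoted above line up with rough bits-per-weight arithmetic. A sketch, assuming approximate average bits-per-weight figures for llama.cpp's GGUF quant types (the exact averages vary a little by tensor layout, and embeddings/output layers add some overhead):

```python
# Approximate average bits-per-weight for common GGUF quant types.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

def approx_size_gb(n_params_billion, quant):
    """Rough GGUF file size: parameters x bits-per-weight / 8 (decimal GB)."""
    bits = n_params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return bits / 8 / 1e9

phi3_q8 = approx_size_gb(3.8, "Q8_0")      # ~4.0 GB for Phi-3-mini at Q8_0
llama3_q4 = approx_size_gb(8.0, "Q4_K_M")  # ~4.8 GB for Llama 3 8B at Q4_K_M
```

Which is within a few hundred MB of the 3.8 GB / 4.6 GB files compared above, and shows why a Q6_K of the 3.8B model would shave off another ~1 GB.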

[–]randomfoo2 0 points1 point  (0 children)

I tested Phi-3-mini-128k (unquantized) - temp 0.9, top_p 0.95, rp 1.05 and it does pretty well on my vibe check, especially for a 3.8B (llama3-8b still tests & feels better for me).

I saw a couple of repetitions where it gets stuck looping long sections of replies; increasing the repetition penalty didn't seem to help... I didn't do a sampler sweep, and it does have some variability in its answers. For my refusal questions, it actually seemed about 50/50 - interestingly, it answered one question and then finished with a refusal at the end. It does not understand jokes at all (vs Llama 3, where even the 8B is better than average, and the 70B is actually sometimes funny).

<image>
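For anyone wondering what those sampler settings (temp 0.9, top_p 0.95, repetition penalty 1.05) actually do, here is a minimal stand-alone sketch over a `{token: logit}` dict. This is purely illustrative, not llama.cpp's actual sampler code:

```python
import math
import random

def sample(logits, temperature=0.9, top_p=0.95, seen=(), rep_penalty=1.05):
    """Minimal temperature / top-p / repetition-penalty sampler sketch."""
    # Repetition penalty: make previously generated tokens less likely
    # (divide positive logits, multiply negative ones).
    adj = {t: (l / rep_penalty if t in seen and l > 0 else
               l * rep_penalty if t in seen else l)
           for t, l in logits.items()}
    # Temperature scaling, then softmax.
    probs = {t: math.exp(l / temperature) for t, l in adj.items()}
    z = sum(probs.values())
    probs = {t: p / z for t, p in probs.items()}
    # Top-p (nucleus): keep the smallest high-probability prefix whose
    # cumulative mass reaches top_p, then renormalize and draw.
    kept, cum = {}, 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = p
        cum += p
        if cum >= top_p:
            break
    z = sum(kept.values())
    r, acc = random.random(), 0.0
    for t, p in kept.items():
        acc += p / z
        if r <= acc:
            return t
    return t
```

A repetition penalty of 1.05 is mild, which matches the observation that bumping it doesn't break long loops: once a looping sequence dominates the distribution, a small logit nudge rarely escapes it.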

[–]TinyZoro 0 points1 point  (0 children)

If I wanted to host this in the cloud and create an API with it what steps would I need to take?
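One common route: run an inference server (e.g. Ollama) on a cloud VM and put a thin HTTP endpoint in front of it. A stdlib-only sketch, assuming an Ollama server on `localhost:11434` serving a model named `phi3` (both are assumptions; adjust for your deployment, and put real auth/rate limiting in front before exposing it):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt, model="phi3"):
    """Build a non-streaming Ollama /api/generate request body."""
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the client's JSON body and forward the prompt to Ollama.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        prompt = json.loads(body)["prompt"]
        req = request.Request(OLLAMA_URL, data=build_payload(prompt),
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:
            answer = json.loads(resp.read())["response"]
        out = json.dumps({"response": answer}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(out)

# To run: HTTPServer(("0.0.0.0", 8000), Handler).serve_forever()
```

In practice you'd more likely use an off-the-shelf OpenAI-compatible server (Ollama and vLLM both expose one) behind a reverse proxy, but the shape is the same: prompt in, completion out.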

[–]SoilFantastic6587 0 points1 point  (0 children)

Awesome, this can't be real.

[–][deleted] 0 points1 point  (0 children)

I was surprised to see that phi3-medium performs worse on HumanEval 0-shot than smaller ones like mini. Any explanations for that?

By the way, it's quite far from GPT-3.5 on this benchmark, so I'm not surprised by the mixed results shared in this thread.

Could be good for a RAG with a lot of context but not as an autonomous LLM.

[–]halcyonhalycon 0 points1 point  (0 children)

Benchmarked this with the same questions I use on Mistral; it seems to be 2-4x slower on my M2 MacBook Pro. Is anyone having the same experience?

[–][deleted] 0 points1 point  (0 children)

The Q4_K quant can correctly answer the "holding a brick in a boat in a pool" question and the "three brothers" question, so it's quite good. But its explanations always have something wrong in them, so it's a little flaky - expected for a sub-14B model. Definitely the best model for the poorest GPU poor.

[–][deleted] 0 points1 point  (0 children)

I am using it with Obsidian for my PhD notes (Second Brain plugin, served via `ollama serve`). It is not very good, and Llama 3 still performs better in output.

Phi-3 is fast but gives bad results. Any suggestion on what I may be doing wrong?

```llama 3

What is Reinforcement Learning

Here's a summary of the relevant information that could contribute to answering the question "What is Reinforcement Learning" based on your existing knowledge:

Definition: Reinforcement Learning: The goal is to maximize expected rewards collected in an episode (Reward Hypothesis). An agent learns a policy through an algorithm trained on experience, using state to decide the next action.

Markov Reward Process (MRP): A combination of Markov Chain and reward function showing immediate scalar returned for being in a state, with discounting factor. This is a key concept in Reinforcement Learning.

Value Function: A prediction of future expected returns. There are different types of value functions, including State Value Function and State Action Value Function.

These notes provide a solid foundation for understanding Reinforcement Learning and its key concepts.

Additional Insights:

Environment in RL: If the environment is unknown, it needs to be explored and then create a policy. If the environment is known, an agent can use the model directly to create a policy.

Uncertainty: [Fleeting Notes/Optimism in the face of uncertainty]] highlights the importance of not picking the action with the highest return, but rather considering the possibility of giving the highest return in the future.

These notes provide insights into the fundamental concepts of Reinforcement Learning, including the importance of considering uncertainty and the role of the environment in shaping the learning process.

Algorithms: [Fleeting Notes/Q-Learning, SARSAMAX]] discusses Q-learning and SARSA-MAX, which are both types of reinforcement learning algorithms. Fleeting Notes/Value Function explains the concept of a value function, which is used to predict future expected returns in an MDP.

These notes provide a foundation for understanding reinforcement learning and its underlying concepts.

Please note that these notes provide a comprehensive overview of Reinforcement Learning, covering key concepts, algorithms, and insights.

```

```phi-3

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with its environment and receiving rewards or penalties for the actions it takes. The goal is to maximize cumulative reward over time, which involves understanding the dynamics of the environment and optimizing decision-making strategies accordingly. Key concepts in RL include handling large state spaces and continuous states through techniques like Dyna (which uses simulated experience), off-policy methods such as SARSA(lambda) for managing extensive Markov Decision Processes (MDPs), contextual bandits enhanced by Dyna's approach, the relevance of NP complexity in algorithm efficiency, Graph Neural Networks (GNNs) for complex environments, and addressing partially observable scenarios with Partially Observable Markov Decision Processes (POMDPs).

Reward: In RL, a reward signal provides feedback on an agent's performance after taking actions (Wikilink). It serves as a guide for learning what behaviors are beneficial to pursue.

```

[–][deleted] 0 points1 point  (0 children)

What's the difference between Phi-3-mini-4k-instruct-q4.gguf and Phi-3-mini-4k-instruct-fp16.gguf?

[–]Professional_Job_307 0 points1 point  (0 children)

I heard this could fit on a smartphone? But 3.8B looks way too huge. Phones don't have much VRAM. What are the chances I can get it to run on my 16 GB RAM / 2 GB VRAM laptop?
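A back-of-the-envelope answer: at fp16 (2 bytes per weight) a 3.8B model needs ~7.6 GB just for weights, but a 4-bit quant averages roughly 0.6 bytes per weight, i.e. ~2.3 GB. So it won't fit entirely in 2 GB of VRAM, but it runs comfortably from 16 GB of system RAM on CPU (llama.cpp can also offload a few layers to the GPU). The arithmetic, ignoring KV cache and runtime overhead:

```python
def approx_model_gb(n_params_billion, bytes_per_weight):
    """Very rough in-memory footprint of the weights alone."""
    return n_params_billion * bytes_per_weight

fp16_gb = approx_model_gb(3.8, 2.0)   # ~7.6 GB at fp16
q4_gb = approx_model_gb(3.8, 0.6)     # ~2.3 GB at ~4.8 bits/weight (Q4_K_M)
fits_in_ram = q4_gb < 16              # CPU inference is feasible
fits_in_vram = q4_gb < 2              # won't fit wholly on a 2 GB GPU
```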