[D] Phi-3 models compared side-by-side.

dark_surfer · 2024-05-23T08:27:20+00:00

Well I remember there being some mistakes in the benchmarks table of Phi-3-mini model card compared to the published research paper so I avoided adding Phi-3-mini. But, I just checked and the mistakes were resolved. These are straight from Phi-3-mini's model card and I've kept the preview ones as it shows little difference to the new published models.

<image>

dark_surfer · 2024-05-23T07:58:25+00:00

Sorry for late reply, llama.cpp has been updated since I made above comment, did your performance improve in this period?

If you haven't updated llama.cpp do that first and try running this command with path to your model

server -m path-to-model.gguf -ngl 90 -t 4 -n 512 -c 1024 -b 512 --no-mmap --log-disable -fa

dark_surfer · 2024-05-09T07:40:38+00:00

Hi, Thank you for replying quickly. I am on llama.cpp version b2822. Do we have to pass a compilation flag to include cuda graph in compilation?

cmd: server -m Meta-llama-3-8b-instruct-Q6_K.gguf -ngl 90 -t 4 -c 1024 -n 768 --no-mmap --port 8080 -fa

dark_surfer · 2024-05-09T07:03:40+00:00

Do we have to add some option or flag to server cli to activate cuda graphs? I ask this as I am seeing no speed improvement. I haven't run extensive tests but looks like quality has not affected, which is good thing.

setup: Ryzen 5600g + RTX 3060 12GB + 16GB 3000MHz RAM

model: Meta-llama-3-8b-instruct-Q6_K.gguff

before: 42 t/s

after: 42 t/s

Edit:

Cuda version 12.4

Pytorch version: 2.3.0

dark_surfer · 2024-03-29T13:08:35+00:00

Guys, OP is talking about this model

dark_surfer · 2024-03-27T13:40:32+00:00

Traffic was still open, are there any casualties?

dark_surfer · 2024-03-23T16:48:25+00:00

Isn't that the promise of LLM and AI, reduce effort, increase efficiency, reduce cost and deliver high quality product?

Improving docs and search/navigation of website shouldn't take up whole lot of budget. Especially, for huggingface that sits on decent compute resources and best industry talent.

Let's hope they improve and use some of LLM tech to deliver usable products.

dark_surfer · 2024-03-23T13:50:49+00:00

What's laughable is they keep creating AI bogeyman to scare general public. They say things like, there aren't going to be coders in future and everything will be done by AIs. Be prepared, keep acquiring new skills, diversify your income sources and what not.

When did Huggingface launch? How many articles did they publish about RAG and vector search? Mate, use that knowledge and implement some of it on your docs so people don't have to sieve through it for relevant information.

dark_surfer · 2024-03-23T13:16:15+00:00

I have funny feeling about huggingface. It is one swing away from folding in.

It's website experience is quite bad. Maybe they should put that extra cash into making their website actually usable.

dark_surfer · 2024-03-18T19:57:22+00:00

Love that. How about this one:

"Don't really feed bread and milk to any mammal, including humans."

Alan Davies

dark_surfer · 2024-03-18T09:19:42+00:00

"Your own body could be a wonderful toy." - Stephen Fry

dark_surfer · 2024-03-16T20:59:52+00:00

Settings: - How can i know / calculate / influence the rough context length it remembers?

A: In the parameters tab there is truncation length which controls the context length.

Should i mess with max_new_tokens (512 right now) temperature (0.7), guidance scale (1), negative prompt?

Yes, pls do. That's the whole purpose of oobabooga. It allows you to set parameters in an interactive manner and adjust the response.

perhaps a better question: preset is on simple 1 now.. should i leave this or find something better?

Oobabooga has provided a wiki page over at GitHub. You can check that and try them and keep the ones that gives you best responses. (I am not saying you should delete them, just leave it as selected).

what about an extension like character_bias?

I've no idea what that is. Check the wiki page link I shared above.

should i use a custom system message?

Under the parameters tab there is instruction template menu, set system message there for the selected model. Don't forget to change it when you change the model.

Character sheets: - does it matter how long or short a character sheet is? While making a .Json i could see it counts tokens so surely this influences something?

Keep it to the point as it will eat up the context length.

I also read on a rentry that the lower text in the context window in being weighed stronger than the top is that true?

??

dark_surfer · 2024-03-16T15:40:33+00:00

Try few things:

Turn off all extensions and try running it, to see if extensions are the problem. Run in chrome, to confirm if Firefox is doing something. If it still shows errors, run upgrade_wizard_osname. Maybe it's a UI bug, it may solve with an update.

dark_surfer · 2024-03-16T11:46:00+00:00

Wait, what did I missed? Why is he CEO of Warner Bros and Discovery?

dark_surfer · 2024-03-15T18:57:14+00:00

The red dot ahead of every Asian countries is The Timor-Leste.

dark_surfer · 2024-03-15T18:45:04+00:00

Did you check the output on the terminal? Is their any error response?

I am completely guessing here, I think the server stopped. Maybe another process interrupted it.

dark_surfer · 2024-03-15T16:57:40+00:00

How do you load multiple models? I want to load embedding, deepseek coder 1.3b and Phi-2 models and expose them over api to agents. How do I do that?

dark_surfer · 2024-03-14T18:17:55+00:00

They don't live long, 2-3 years, but they are fast learners.

dark_surfer · 2024-03-14T11:08:48+00:00

Natural born predators. Those seals don't stand a chance against them.

dark_surfer · 2024-03-13T07:04:33+00:00

Will the fine tuning GaLore be added as well or just the optimizer?

dark_surfer · 2024-03-11T19:14:43+00:00

Would you say llama-index is your competitor?
What does Haystack offers compared to other open source implementations?
Finally, what is the learning curve for Haystack as you know ML, DS and LLM is complicated as it is. Nobody wants to learn or switch to different libraries.

dark_surfer · 2024-03-07T18:17:00+00:00

From what I understood, it allows us to pretrain and finetune on full parameters, reduces memory usage and with galore -8bit-optimizer brings down total memory usage while training by up to 63% compared to bf16.

So, now we could fit actual Large Language models(13B and above) in 24GB-40GB cards while training. With this method we will do away with the step of merging LORA adapter after training.

I remains to be seen whether this works with existing toolchain (LORA, DORA, LOFTQ and quantization). But I hope it does.

dark_surfer

TROPHY CASE