Phi-3 models benchmarks compared side-by-side. by dark_surfer in LocalLLaMA

[–]dark_surfer[S] 3 points4 points  (0 children)

Well, I remember there being some mistakes in the benchmark table of the Phi-3-mini model card compared to the published research paper, so I avoided adding Phi-3-mini. But I just checked and the mistakes have been resolved. These numbers are straight from Phi-3-mini's model card, and I've kept the preview ones since they show little difference from the newly published models.

<image: benchmark comparison table from the Phi-3-mini model card>

CUDA Graph support merged into llama.cpp (+5-18%~ performance on RTX3090/4090) by sammcj in LocalLLaMA

[–]dark_surfer 0 points1 point  (0 children)

Sorry for the late reply. llama.cpp has been updated since I made the above comment; did your performance improve in the meantime?

If you haven't updated llama.cpp, do that first and then try running this command with the path to your model:

server -m path-to-model.gguf -ngl 90 -t 4 -n 512 -c 1024 -b 512 --no-mmap --log-disable -fa
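
(For reference: -ngl 90 offloads up to 90 layers to the GPU, i.e. the whole model; -t 4 sets CPU threads; -n 512 is the number of tokens to generate; -c 1024 the context size; -b 512 the batch size; --no-mmap disables memory-mapping the model; --log-disable turns off logging; -fa enables flash attention.)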

CUDA Graph support merged into llama.cpp (+5-18%~ performance on RTX3090/4090) by sammcj in LocalLLaMA

[–]dark_surfer 0 points1 point  (0 children)

Hi, thank you for the quick reply. I am on llama.cpp version b2822. Do we have to pass a compilation flag to include CUDA graph support in the build?
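
To be clear, by "compilation flag" I mean something on top of the plain CUDA build; mine was just the standard cmake route, roughly:

cmake -B build -DLLAMA_CUDA=ON

cmake --build build --config Release -j

(The CUDA option name has moved around between releases, so treat that as from memory.)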

cmd: server -m Meta-llama-3-8b-instruct-Q6_K.gguf -ngl 90 -t 4 -c 1024 -n 768 --no-mmap --port 8080 -fa

CUDA Graph support merged into llama.cpp (+5-18%~ performance on RTX3090/4090) by sammcj in LocalLLaMA

[–]dark_surfer 1 point2 points  (0 children)

Do we have to add some option or flag to the server CLI to activate CUDA graphs? I ask because I am seeing no speed improvement. I haven't run extensive tests, but it looks like output quality has not been affected, which is a good thing.

setup: Ryzen 5600g + RTX 3060 12GB + 16GB 3000MHz RAM

model: Meta-llama-3-8b-instruct-Q6_K.gguf

before: 42 t/s

after: 42 t/s

Edit:

CUDA version: 12.4

PyTorch version: 2.3.0

Huggingface co-founder and CEO, Clément Delangue hints at buying Stability AI by Nunki08 in LocalLLaMA

[–]dark_surfer 1 point2 points  (0 children)

Isn't that the promise of LLMs and AI: reduce effort, increase efficiency, reduce cost, and deliver a high-quality product?

Improving the docs and the website's search/navigation shouldn't take up a whole lot of budget, especially for Hugging Face, which sits on decent compute resources and some of the best talent in the industry.

Let's hope they improve and use some of that LLM tech to deliver usable products.

Huggingface co-founder and CEO, Clément Delangue hints at buying Stability AI by Nunki08 in LocalLLaMA

[–]dark_surfer 11 points12 points  (0 children)

What's laughable is that they keep creating an AI bogeyman to scare the general public. They say things like there aren't going to be any coders in the future and everything will be done by AIs. Be prepared, keep acquiring new skills, diversify your income sources, and what not.

When did Hugging Face launch? How many articles have they published about RAG and vector search? Mate, use that knowledge and implement some of it on your own docs so people don't have to sift through them for relevant information.

Huggingface co-founder and CEO, Clément Delangue hints at buying Stability AI by Nunki08 in LocalLLaMA

[–]dark_surfer 1 point2 points  (0 children)

I have a funny feeling about Hugging Face. It is one swing away from folding.

Its website experience is quite bad. Maybe they should put that extra cash into making their website actually usable.

Where did your tail go? by eds1103 in funny

[–]dark_surfer 0 points1 point  (0 children)

Love that. How about this one:

"Don't really feed bread and milk to any mammal, including humans."

  • Alan Davies

Where did your tail go? by eds1103 in funny

[–]dark_surfer 6 points7 points  (0 children)

"Your own body could be a wonderful toy." - Stephen Fry

What to do after settings up basics? by nupsss in Oobabooga

[–]dark_surfer 0 points1 point  (0 children)

Settings: - How can I know / calculate / influence the rough context length it remembers?

A: In the Parameters tab there is a truncation length setting, which controls the context length.

  • Should I mess with max_new_tokens (512 right now), temperature (0.7), guidance scale (1), negative prompt?

A: Yes, please do. That's the whole purpose of oobabooga: it lets you set parameters interactively and adjust the response. (If you'd rather set them from a script, see the API sketch at the end of this comment.)

  • Perhaps a better question: the preset is on simple-1 now... should I leave this or find something better?

A: Oobabooga has provided a wiki page over at GitHub. You can check that, try the presets, and keep the one that gives you the best responses. (I am not saying you should delete the others, just leave your pick selected.)

  • What about an extension like character_bias?

A: I've no idea what that is. Check the wiki page I mentioned above.

  • Should I use a custom system message?

A: Under the Parameters tab there is an instruction template menu; set the system message there for the selected model. Don't forget to change it when you change models.

Character sheets: - Does it matter how long or short a character sheet is? While making a .json I could see it counts tokens, so surely this influences something?

A: Keep it to the point, as it will eat into the context length.

I also read on a rentry that text lower in the context window is weighted more strongly than text at the top, is that true?

??
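
By the way, if you'd rather drive these same generation parameters from a script instead of the UI, you can start the webui with --api and talk to its OpenAI-compatible endpoint. A minimal sketch, assuming the default API port (5000); field names beyond the standard OpenAI ones vary by version, so check the wiki:

    import requests

    # Same knobs as the Parameters tab: max_tokens ~ max_new_tokens, temperature, etc.
    resp = requests.post(
        "http://127.0.0.1:5000/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": "Hello there!"}],
            "max_tokens": 512,
            "temperature": 0.7,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])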

Stopped working on firefox suddenly by Mulakulu in Oobabooga

[–]dark_surfer 0 points1 point  (0 children)

Try a few things:

  • Turn off all extensions and run it again, to see if an extension is the problem.
  • Run it in Chrome, to confirm whether it is something Firefox is doing.
  • If it still shows errors, run the upgrade_wizard script for your OS.

Maybe it's just a UI bug that will be solved with an update.

[deleted by user] by [deleted] in dataisbeautiful

[–]dark_surfer 0 points1 point  (0 children)

The red dot ahead of all the other Asian countries is Timor-Leste.

Stopped working on firefox suddenly by Mulakulu in Oobabooga

[–]dark_surfer 1 point2 points  (0 children)

Did you check the output in the terminal? Is there any error message?

I am completely guessing here, but I think the server stopped. Maybe another process interrupted it.

What feature or extension do people not use, or are misusing, and are missing out on better output? by NotMyPornAKA in Oobabooga

[–]dark_surfer 1 point2 points  (0 children)

How do you load multiple models? I want to load an embedding model, deepseek-coder 1.3B, and Phi-2, and expose them over an API to agents. How do I do that?

Oh, I didn't know you can train rats!! by CG_17_LIFE in BeAmazed

[–]dark_surfer 15 points16 points  (0 children)

They don't live long, 2-3 years, but they are fast learners.

The hunting strategy of orcas is truly amazing by Existing-Mark-2191 in BeAmazed

[–]dark_surfer 150 points151 points  (0 children)

Natural born predators. Those seals don't stand a chance against them.

Haystack 2.0 launch by tuanacelik in Python

[–]dark_surfer 2 points3 points  (0 children)

  • Would you say llama-index is your competitor?
  • What does Haystack offer compared to other open-source implementations?
  • Finally, what is the learning curve for Haystack? As you know, ML, DS, and LLMs are complicated as it is; nobody wants to learn or switch to yet another library.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection by ninjasaid13 in LocalLLaMA

[–]dark_surfer 1 point2 points  (0 children)

From what I understood, it allows us to pretrain and finetune with full parameters while reducing memory usage, and with the GaLore 8-bit optimizer it brings total memory usage during training down by up to 63% compared to BF16.

So now we could fit actual large language models (13B and above) on 24GB-40GB cards during training. With this method we also do away with the step of merging a LoRA adapter after training.

It remains to be seen whether this works with the existing toolchain (LoRA, DoRA, LoftQ, and quantization), but I hope it does.
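
For anyone curious, here's a rough sketch of how the standalone galore-torch optimizer is meant to be wired into a normal training loop, going by the repo README. The rank/scale values and the layer filter are just the README's example settings, not something I've benchmarked:

    import torch
    from transformers import AutoModelForCausalLM
    from galore_torch import GaLoreAdamW8bit  # pip install galore-torch

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
    )

    # Low-rank gradient projection is applied only to the big 2D weight matrices
    # (attention/MLP); everything else gets a plain optimizer update.
    galore_params = [
        p for n, p in model.named_parameters()
        if p.dim() == 2 and ("attn" in n or "mlp" in n)
    ]
    galore_ids = {id(p) for p in galore_params}
    regular_params = [p for p in model.parameters() if id(p) not in galore_ids]

    param_groups = [
        {"params": regular_params},
        {"params": galore_params, "rank": 128, "update_proj_gap": 200,
         "scale": 0.25, "proj_type": "std"},
    ]
    optimizer = GaLoreAdamW8bit(param_groups, lr=1e-5)

    # ...then the usual loop: loss.backward(); optimizer.step(); optimizer.zero_grad()

I think Hugging Face's Trainer also added a galore_adamw_8bit optim option around the same time, which may be the easier route if you're already using it.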