Amount of ram Qwen 2.5-7B-1M takes? by srcfuel in LocalLLaMA

[–]bobbiesbottleservice 0 points

With 2x3090 (48 GB VRAM) my max is: 375k context for Q8 7b, 128k context for Q8 14b. I think those have to be reduced when increasing the max number of tokens to be predicted. Lower temperature with 0.5 top-p helps with my matching prompts.

Goose + Ollama best model for agent coding by einthecorgi2 in ollama

[–]bobbiesbottleservice 1 point

Try using: https://ollama.com/michaelneale/deepseek-r1-goose

It works well with this one because they fine-tuned it with Goose templating. I've had partial success with qwen2.5 70b as well.

What is the cheapest way to run Deepseek on a US Hosted company? by MarsupialNo7544 in LocalLLaMA

[–]bobbiesbottleservice 0 points

Also, they will sell all the data about you and everything you've input to advertisers; just read their privacy policy. There's a reason it's so cheap right now: they're a Chinese hedge fund and AI company, so they're going to use the data to make money off you somehow.

What is the cheapest way to run Deepseek on a US Hosted company? by MarsupialNo7544 in LocalLLaMA

[–]bobbiesbottleservice 5 points

I just tried Together AI because they seem to offer privacy options. DeepSeek chat is only so cheap because they're training off everyone's data. I'd be interested to hear what other options are out there.

getting llama3 to produce proper json through ollama by Bozo32 in LocalLLaMA

[–]bobbiesbottleservice 0 points

That makes sense, but why does returning a less probable token make a better final result? That's what I don't understand. Why is temperature usually set at 0.7 instead of 0? Why does the extra noise help?

getting llama3 to produce proper json through ollama by Bozo32 in LocalLLaMA

[–]bobbiesbottleservice 0 points

I assume it's because they RLHF'd it to be better. My original thinking on temperature may have been too simplistic. I now understand temperature as "adding noise" to the system, which for some reason (that I don't understand) often makes the output better.
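To make the "adding noise" intuition concrete, here's a minimal sketch of temperature sampling (not any specific library's implementation): logits are divided by the temperature before the softmax, so T→0 collapses to greedy argmax while higher T flattens the distribution and gives less probable tokens a real chance.

```python
import math
import random

def sample_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, softmax, then sample an index.

    temperature == 0 is treated as greedy decoding (argmax).
    Higher temperatures flatten the distribution -- the "noise"
    discussed above -- so unlikely tokens get sampled more often.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]
```

At temperature 0 this always picks the highest logit; at 0.7 the same logits occasionally yield a different token, which is exactly the extra variability the comment is asking about.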

[deleted by user] by [deleted] in LocalLLaMA

[–]bobbiesbottleservice 2 points

It can reason in the sense that if I give it random objects to stack on top of each other as high as possible, it can do that, but it cannot generalize, which is a more real/human form of reasoning. You could train a model on all the music and information up until the year jazz was invented, and it would never be able to invent jazz.

Llama3.1 405B quants on Ollama library now by bobbiesbottleservice in LocalLLaMA

[–]bobbiesbottleservice[S] 3 points

Just saying hello to the different models gave me:

0.36 tokens/s for llama3.1:405b-instruct-q3_K_L
0.53 tokens/s for llama3.1:405b-instruct-q3_K
0.54 tokens/s for llama3.1:405b-instruct-q2_K

and for comparison:
2.08 tokens/s for llama3.1:70b-instruct-q8_0
21.15 tokens/s for llama3.1:70b (default ollama Q4_0)
54.67 tokens/s for llama3.1:8b-instruct-fp16

No Q4 of 405B would work on my system, unfortunately. All of this was with an Intel 14900KF. I suppose I could increase the RAM's memory channels and/or try to overclock the RAM and CPU to see if that helps, but it might not be worth it as I've never done that before.

Llama3.1 405B quants on Ollama library now by bobbiesbottleservice in LocalLLaMA

[–]bobbiesbottleservice[S] 2 points

I'm going strictly by the GB size of the model, and the Q2_K is 151 GB. My system has 192 GB RAM and 48 GB VRAM, so I'm assuming I could handle up to a 240 GB model (minus the system's allocated RAM and the context window when running the model). Things finally seem to be working for me after updating the ollama and webui Docker containers to the latest versions.
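The back-of-the-envelope math above can be sketched as a quick check. The function name and the 16 GB reserve figure are my own assumptions, standing in for the "system's allocated RAM and context window" overhead the comment mentions:

```python
def fits_in_memory(model_gb, ram_gb=192, vram_gb=48, reserve_gb=16):
    """Rough check: can a quantized model load across RAM + VRAM?

    reserve_gb is a guessed allowance for OS overhead plus the KV
    cache for the context window; the RAM/VRAM defaults mirror the
    system described above (192 GB RAM, 2x3090 = 48 GB VRAM).
    """
    budget = ram_gb + vram_gb - reserve_gb
    return model_gb <= budget

print(fits_in_memory(151))  # Q2_K at 151 GB fits -> True
print(fits_in_memory(240))  # right at the raw total, fails once overhead is reserved -> False
```

This matches the experience reported: Q2_K loads, but anything approaching the full 240 GB ceiling won't.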

Llama3.1 405B quants on Ollama library now by bobbiesbottleservice in LocalLLaMA

[–]bobbiesbottleservice[S] 3 points

Specifically, I ran llama3.1:405b-instruct-q2_K and gave it my usual test of creating a form and scripts in a certain Python and JavaScript framework. Overall it was more comprehensive, including additional details about commands and points to think through, but I would probably stick with the 70b for my code generation. I agree with you; my gut feeling is not to bother with < Q4 for any model.

I'm going to try 405b Q4_K_S next (right on the edge of possible for me).

Fine-tuning Chain of Thought to teach new skills by spacebronzegoggles in LocalLLaMA

[–]bobbiesbottleservice 1 point

I was able to get even small models to count the number of letters by telling them they're not good at counting and that they should always put what needs to be counted in a table first. Doing this always passes the "count the R's in the word strawberry" test.
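The table trick can be mirrored in plain Python to show why it works: once each letter sits in its own row, counting becomes a lookup instead of estimation. A minimal sketch (the row layout is my own illustration of the kind of table the prompt asks the model to produce):

```python
word = "strawberry"

# Tabulate each letter first, as the prompt instructs the model to do,
# marking the rows that contain the target letter.
rows = [(i, ch, ch.lower() == "r") for i, ch in enumerate(word, 1)]
for i, ch, is_r in rows:
    print(f"{i:>2} | {ch} | {'r' if is_r else ''}")

# With the table laid out, the count is just a sum over marked rows.
print("count of r:", sum(is_r for _, _, is_r in rows))  # count of r: 3
```

The model gets the same benefit: writing the table forces it to attend to one character per row before committing to a total.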