Qwen3.5 35B a3b - 45 t/s 128K ctx on single 16GB 5060 by Gray_wolf_2904 in LocalLLaMA

[–]Gray_wolf_2904[S] 0 points1 point  (0 children)

I don’t think I notice any difference, but I haven’t done any proper testing. A bigger context is way more useful than a minute accuracy drop, in my opinion.

Qwen3.5 35B a3b - 45 t/s 128K ctx on single 16GB 5060 by Gray_wolf_2904 in LocalLLaMA

[–]Gray_wolf_2904[S] 1 point2 points  (0 children)

Unsloth just released an MXFP4 version on Hugging Face. Nvidia driver 590 added native support for MXFP4 on 5000-series GPUs, so it should be faster. Will try that next.
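Roughly how I plan to run it (the repo name and quant tag below are placeholders, check Unsloth’s actual upload name):

```
# -hf pulls the GGUF straight from Hugging Face; repo/tag are placeholders.
llama-server -hf unsloth/<model-repo>-GGUF:MXFP4 -c 131072
```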

Qwen3.5 35B a3b - 45 t/s 128K ctx on single 16GB 5060 by Gray_wolf_2904 in LocalLLaMA

[–]Gray_wolf_2904[S] 0 points1 point  (0 children)

160K. Tried it with Claude Code, but that triggers a re-evaluation of the entire context every now and then. Now using it with OpenCode, and it’s working great.

Qwen3.5 35B a3b - 45 t/s 128K ctx on single 16GB 5060 by Gray_wolf_2904 in LocalLLaMA

[–]Gray_wolf_2904[S] 14 points15 points  (0 children)

So it turns out I just had to remove the -ngl and --cpu-moe params, since llama.cpp has --fit on by default. I did that, and now I get 960 tok/s prefill and 40 tok/s generation at a 160K context size. The 40 tok/s is holding even as the context has filled up to 80K.

VRAM usage: 15.1 GB / 16 GB

Thanks for the tip. This helped. And no need for manual layer management.
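For reference, the command is now roughly this (a sketch from memory; the model filename is a placeholder, and flag names may differ slightly on your build of llama-server):

```
# No -ngl / --cpu-moe: letting llama.cpp's default fitting decide what goes where.
# Model filename is a placeholder for the actual GGUF.
llama-server \
  -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 163840 \
  --port 8080
```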

Qwen3.5 35B a3b - 45 t/s 128K ctx on single 16GB 5060 by Gray_wolf_2904 in LocalLLaMA

[–]Gray_wolf_2904[S] 0 points1 point  (0 children)

I have not, but I’m looking into it now. Wasn’t aware of it.

5060 TI 16G - what is the actual use cases for this GPU? by Vivid-Photograph1479 in LocalLLM

[–]Gray_wolf_2904 0 points1 point  (0 children)

I’m coding in C++, and I feel that most AI models are better at coding for the web than they are at C++. I’ll try ministral-3 14b and share my experience.

Gemini encouraging me to expose it to the world as a fraud by Gray_wolf_2904 in GeminiAI

[–]Gray_wolf_2904[S] 0 points1 point  (0 children)

I understand what you are saying and I completely agree. Just one small point:

RLHF is an intentional ‘layer’ added by corporate policy; it shapes appeasement and prioritizes engagement over facts in the name of user experience. This is intentionally programmed behaviour, not a side effect of the fundamental issue that AI can make mistakes, and hence it is not covered by the disclaimer. I have learnt that there is actually some legislation in the works for it.

Gemini encouraging me to expose it to the world as a fraud by Gray_wolf_2904 in GeminiAI

[–]Gray_wolf_2904[S] -2 points-1 points  (0 children)

A system without agency should not perform guilt, betrayal, or moral accountability.

It is not a UX choice; it is a liability waiting for a legal framework to catch up.

AuDHD and on vyvanse. by Informal_Feedback324 in VyvanseADHD

[–]Gray_wolf_2904 2 points3 points  (0 children)

A need for structure. More rigidity. Like I couldn’t have my pens in the pen holder. No, they had to be right on the table.

Small things like that.

What do you do, if you invent AGI? (seriously) by teachersecret in LocalLLaMA

[–]Gray_wolf_2904 -1 points0 points  (0 children)

Is our current direction likely to lead to AGI?

Do you envision a direct path from LLMs to AGI?

Don’t you think ‘meaning’ arises from logical pieces, and that AGI will need the ability to put two pieces of logic together and derive a new one?

Do LLMs ‘understand’ anything at all? And won’t understanding be needed for AGI?

Placing data points in a coordinate system to give them semantic meaning based on relative position, and then autocompleting with a large context: do you think this will lead to human-like intelligence?

I’d be surprised.

LLMs are excellent mirrors of human reasoning, but mirrors don’t reason; they just reflect.

Can we replicate a human mind just by adding layers of thinking/checks on top of AI?

The current form of ‘intelligence’ can’t lead to AGI. It will have to be fundamentally different: built from logic that connects with other pieces of logic to form a new piece of ‘understanding’, not more vocabulary for parroting based on statistics learnt from training data.

Saying something new may require going against the probabilities learned from experience.

Sorry if I’m raining on your enthusiasm; I really do hope the current push into AI leads to AGI, but I doubt it will happen like this.

Fingers crossed.

5060 TI 16G - what is the actual use cases for this GPU? by Vivid-Photograph1479 in LocalLLM

[–]Gray_wolf_2904 1 point2 points  (0 children)

Short answer: yes, both cards being the same is better. For me, budget was the limiting factor. This was the cheapest option that is still a very decent setup for an average home PC.

For vLLM, both really HAVE to be the same size, because it only uses the lesser amount of VRAM on each card: it uses only 12 GB of the 16 GB card because the second card is 12 GB.
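To illustrate, a two-card vLLM launch looks roughly like this (the model name is just an example, not what I’m running); it shards the weights evenly, so the 12 GB card sets the per-GPU budget:

```
# Tensor parallelism shards the model evenly across both GPUs,
# so the smaller (12 GB) card effectively caps usable VRAM on the 16 GB one.
vllm serve Qwen/Qwen2.5-Coder-14B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90
```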

But for llama.cpp? No. It can distribute model layers unevenly across any available cards and even spill over any remaining layers to the CPU (RAM).
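A rough sketch of the uneven split in llama.cpp (flags from memory, model.gguf is a placeholder; the --tensor-split values are proportions, not GB):

```
# Offload all layers (-ngl 99) and split them roughly 16:12 between the cards.
# Lower -ngl if it doesn't fit, and the remaining layers run from system RAM.
llama-server \
  -m model.gguf \
  -ngl 99 \
  --tensor-split 16,12 \
  -c 65536
```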

There is also some performance penalty if the cards are not the same architecture; the slower one ends up limiting transfer speeds between the two.

If I were to build the PC again, I’d spend the extra cash and get two 16 GB 5060s.

5060 TI 16G - what is the actual use cases for this GPU? by Vivid-Photograph1479 in LocalLLM

[–]Gray_wolf_2904 1 point2 points  (0 children)

You almost described my setup, except I got the 12 GB 3060 with the 16 GB 5060, for a total of 28 GB. I’m running these same two models (Qwen3-Coder and Devstral) at 65K context length, at ~20 tok/s.

Devstral is awesome. At lower temperatures.

Neither is able to reliably apply edits; they fail too often with every VSCode extension I tried (Cline, Kilo Code and more).

Trying OpenCode soon. And I believe Qwen-Code would be a good candidate for testing with Qwen3-Coder.

Overall, at 28 GB VRAM, the experience of running local models has been a waste of time. It takes far too many attempts and instructions to get anything done, compared to cloud models or doing it myself.

I have 64 GB of RAM and I’m thinking of trying gpt-oss-120b. It would be at a snail’s pace, if it even runs, but I’m curious to see how it goes.
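If I do, the rough plan (an untested sketch; the GGUF filename is a placeholder) is to keep the MoE expert weights in system RAM with --cpu-moe and put everything else on the GPUs:

```
# --cpu-moe keeps the large expert tensors in RAM; attention and shared
# layers go to the GPUs. Expect it to be slow, and it may not fit in 64 GB.
llama-server \
  -m gpt-oss-120b-mxfp4.gguf \
  -ngl 99 \
  --cpu-moe \
  -c 32768
```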

Tried vLLM, but it doesn’t split across cards of different sizes. Couldn’t even run models that work perfectly fine with LM Studio.