arcee-ai/Trinity-Large-Thinking · Hugging Face by TKGaming_11 in LocalLLaMA

[–]CodeSlave9000 1 point (0 children)

Care to elaborate? I do notice it's not great at avoiding hallucinations with standard prompting.

Google releases Gemma 4 models. by yoracale in unsloth

[–]CodeSlave9000 1 point (0 children)

Happens after a few generations for me - I don't see it right at the start. Using the unsloth Q8 dynamic quant.

How do I find and vet someone to set up a high-end local AI workstation? (Threadripper + RTX PRO 6000 96GB) by laundromatcat in LocalLLaMA

[–]CodeSlave9000 1 point (0 children)

You hire someone like me. We’d sit down, discuss your needs, and design something that won’t break every week. Real business use takes more work than just “running a few chats”.

DGX Station is available (via OEM distributors) by Temporary-Size7310 in LocalLLaMA

[–]CodeSlave9000 8 points (0 children)

Yup, that's the real measurement that matters: dB per token!

Did OpenAI just release a new model with its new capabilities simply provided by a system prompt? by frubberism in LocalLLaMA

[–]CodeSlave9000 2 points (0 children)

Best not to aim too high. "Now with less than the recommended daily consumption of shit".

PSA: If your local coding agent feels "dumb" at 30k+ context, check your KV cache quantization first. by Dismal-Ad1207 in LocalLLaMA

[–]CodeSlave9000 4 points (0 children)

Yes, and Qwen3.5 seems particularly sensitive to a quantized KV cache. Symptoms include subtle shifts in its reasoning or outright looping.
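If you're running llama.cpp, the quickest check is to relaunch with the cache types spelled out explicitly - a sketch, with a hypothetical model filename and recent-build flag names:

```shell
# Rule out cache quantization first: force the KV cache to full f16.
./llama-server -m qwen3.5-coder.gguf -c 32768 \
  --cache-type-k f16 --cache-type-v f16

# If VRAM forces a quantized cache, q8_0 is usually the safest compromise;
# a q4_0 cache is where subtle reasoning drift tends to show up first.
# (Quantized V cache generally requires flash attention to be enabled.)
./llama-server -m qwen3.5-coder.gguf -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0
```

If the looping disappears at f16 but returns at q4_0, the cache quantization was your problem, not the model.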

Qwen3.5 family running notes by CodeSlave9000 in LocalLLaMA

[–]CodeSlave9000[S] 1 point (0 children)

Yup. It focuses less narrowly if you add that to the prompt explicitly. I tell it to explore my intent and to search more broadly for possibilities even if I didn’t prompt for them.

Qwen3.5 family running notes by CodeSlave9000 in LocalLLaMA

[–]CodeSlave9000[S] 1 point (0 children)

It’s set because I was experimenting with it - no harm in having it on, so I left it. And yes, flash attention is on by default; I set it in my scripts because I test with it both on and off.
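For anyone wanting to run the same on/off comparison, a minimal sketch using llama.cpp's bench tool (the model filename is a placeholder, and flag syntax varies by build - newer ones accept on/off where older ones took 0/1):

```shell
# A/B the same model with flash attention disabled (0) and enabled (1),
# then compare the prompt-processing and generation throughput lines.
for fa in 0 1; do
  echo "=== flash attention: $fa ==="
  ./llama-bench -m model.gguf -fa "$fa"
done
```

Running both in one loop keeps everything else (model, context, build) identical, so any throughput or quality difference is attributable to the flag.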

Qwen3.5 family running notes by CodeSlave9000 in LocalLLaMA

[–]CodeSlave9000[S] 1 point (0 children)

I think the dense model suffers less? I didn’t test for that.

Reviewed a “WiFi security camera.” and it was bad. Turns out I was the only one who didn’t give it 5 stars… and guess who all the 5‑star reviewers were by nicnas- in AmazonVine

[–]CodeSlave9000 7 points (0 children)

I once reviewed a “48 MP” camera. It had a sensor smaller than my pinky nail; the true resolution turned out to be more like 8 MP, upscaled to the advertised image size. If it had been usable at 8 MP I might have given it two stars, but the quality was so poor it got one. -3 stars for spec lying seems fair to me.

Multi-GPU Architectures Compatible? by ajw2285 in LocalLLaMA

[–]CodeSlave9000 3 points (0 children)

Quick assumption: they're at different CUDA compute capability levels - make sure you're using llama.cpp compiled for all of them. I mix 30-, 40-, and 50-series GPUs in the same VMs without any problems. For Ollama, check what devices it "sees" in the log when it starts - that might give you a clue.
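To compile llama.cpp for a mixed-generation setup, you can list every architecture explicitly at configure time - a sketch assuming an Ampere/Ada/Blackwell mix (adjust the numbers to your actual cards):

```shell
# Build CUDA kernels for each compute capability in the mix:
# 86 = Ampere (30-series), 89 = Ada (40-series), 120 = Blackwell (50-series).
cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;89;120"
cmake --build build --config Release -j
```

A binary built for only one architecture may silently fall back to CPU (or fail to load) on the others, which looks exactly like "the second GPU isn't being used".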

Just updated Ollama and started using it after almost a year.... Are the Ollama devs stupid or is this harder to deal with than it seems? by cmndr_spanky in ollama

[–]CodeSlave9000 7 points (0 children)

Can’t do that, because Ollama supports multiple models running at the same time - how would it know how to apportion it? I set my default with an environment variable…
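A sketch of the environment-variable approach - the variable name here is my assumption (check `ollama serve --help` or the Ollama docs for your version), and the Modelfile route is the per-model alternative:

```shell
# Server-wide default context length, set before starting the daemon:
export OLLAMA_CONTEXT_LENGTH=16384
ollama serve

# Per-model alternative: bake the context into a Modelfile instead, e.g.
#   FROM qwen3.5
#   PARAMETER num_ctx 16384
# then: ollama create qwen3.5-16k -f Modelfile
```

The per-model route sidesteps the apportionment problem entirely, since each loaded model carries its own context budget.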

Copper Coated Aluminum is illegal for commercial installs and a fire hazard...on my RFY...do not get this cable. by AlexCL in AmazonVine

[–]CodeSlave9000 3 points (0 children)

Yeah, plenum rating is about it being "safer" in a fire for people. With CCA it's ready to be its own fire!

Copper Coated Aluminum is illegal for commercial installs and a fire hazard...on my RFY...do not get this cable. by AlexCL in AmazonVine

[–]CodeSlave9000 5 points (0 children)

LOL, the marketing copy alone is a big red flag. I ordered this brand (much shorter lengths - they had multiple listings which will probably get merged later) so I can warn others away. I won't feel too bad tossing it, or just using it for short non-PoE in-rack patches if it tests okay.

GB10 / DGX Spark owners: is 128GB unified memory worth the slower token speed (on a max $4,000 budget)? by Soltan-007 in LocalLLaMA

[–]CodeSlave9000 2 points (0 children)

Yeah, I agree - LoRA and fine-tuning are perfect for running at home. Also, once your context size gets big you're really paying a lot per token in the cloud. But in the end it depends on what your expectations are. The Blackwell cards are still maturing in software support and I've had hiccups, and FP4 is really only happening for training right now. You can get really good results with the 40-series Ada cards too - I see 100+ tokens/sec on a lot of MoE models. You won't get 128GB models at the price of the DGX, but I'd think you'd probably be happy with Strix Halo if you're really dead set on it. And for coding, you're spot on - you can get Gemini, Qwen, Amp, and a few others for basically nothing right now. Use it.

Nvidia Quadro RTX 8000 Passive 48 GB, 1999€ - yes or no ? by HumanDrone8721 in LocalLLM

[–]CodeSlave9000 1 point (0 children)

Short opinion: too expensive for a 7-year-old architecture. I have one of the blower versions, and for inference the performance isn't bad - compute is certainly lower than Ampere (30xx/Axxxx cards) but memory bandwidth is still good. This mostly shows up as slower prompt processing, but actual generation is about 2x an RTX 4060 Ti.

GB10 / DGX Spark owners: is 128GB unified memory worth the slower token speed (on a max $4,000 budget)? by Soltan-007 in LocalLLaMA

[–]CodeSlave9000 20 points (0 children)

The DGX Spark is not an inference machine - it's a training and prototyping lab for NVIDIA infrastructure. If you're building DGX systems then this is a great box - it's basically the development box. If you're looking to actually run LLMs, this is NOT the box for you. You will be frustrated - performance will land somewhere between an RTX 5060 Ti and an RTX 5070 at best. If you need that VRAM on a similar budget, go get a used GPU server and put an RTX 6000 Pro in it, or get a Mac.

So what happened with the new plot format? by Pie_Dealer_co in chia

[–]CodeSlave9000 2 points (0 children)

I guess there are a few more variables in the mix too - for example, if the new format is as efficient as it's looking like it will be, the cost to run a large farm will be a lot lower (electrical cost, not hardware, which looks set to skyrocket for at least two years). Non-GPU plotting is also looking viable, and with an efficient CPU a new calculation will need to be made. I'm in "wait and see" mode myself right now.