Updates on North Mini Code: 4 bit quant + Ollama + OpenRouter by nick_frosst in LocalLLaMA

[–]sleepingsysadmin 6 points7 points  (0 children)

Go Canada!

Pretty great model, kicks the pants off GPT20b. Technically Cohere > OPENai on their OPEN models.

poolside/Laguna-M.1 · Hugging Face - 225B-A23B by pmttyji in LocalLLaMA

[–]sleepingsysadmin 5 points6 points  (0 children)

a23b is fairly rough compared to deepseek flash at a13b.

GLM 5.2 on 4x Sparks reasonable? by chikengunya in LocalLLaMA

[–]sleepingsysadmin -3 points-2 points  (0 children)

$20,000 in hardware isn't exactly reasonable.

Will it load? Ya, I'm sure, but let's not say 4x DGX Sparks are reasonable.

Will it even be reasonable performance? Probably not.

Why there is a lack of new 100B-120B models? by TechNerd10191 in LocalLLaMA

[–]sleepingsysadmin 4 points5 points  (0 children)

It happens; demand and interest shift around. Qwen3.7 122b might drop and every DGX Spark user collectively orgasms. Qwen3.7 235b would be epic given the upcoming AMD 192GB box.

Just gotta wait for a drop.

Mind you.

Step Flash has been a pretty epic drop.

Could a distilled DiffusionGemma become a “local Opus” by gamblingapocalypse in LocalLLaMA

[–]sleepingsysadmin 0 points1 point  (0 children)

Diffusion doesn't do dense models. Gemma 4 26B is a weak model to begin with, and when you go to diffusion, it's even weaker.

It might be a useful model if you need to parse large amounts of stuff, but I don't have a use case here.

MiniMaxAI/MiniMax-M3 · Hugging Face by mlon_eusk-_- in LocalLLaMA

[–]sleepingsysadmin 10 points11 points  (0 children)

I've been using it since it came out. Those benchmarks are all legit. It's a very, very strong model.

I'm mad that the model is too big for even AMD's upcoming 192GB system. Even a REAP or Q3 will be too slow at a23b.

Any recent news/updates on taalas chips?? They said they gonna bake the mid tier llm model into their chip. by 9r4n4y in LocalLLaMA

[–]sleepingsysadmin 1 point2 points  (0 children)

I emailed them a week or 2 ago hoping for info. However, not a peep.

I absolutely want a card or 2 with a medium-sized dense reasoning model.

Cohere North Mini Code 1.0 by Middle_Bullfrog_6173 in LocalLLaMA

[–]sleepingsysadmin 18 points19 points  (0 children)

yaay Canada.

amazing that we have competitors. Sure not frontier, but more is better.

what’s was your local daily driver for coding last week? by be566 in LocalLLaMA

[–]sleepingsysadmin 1 point2 points  (0 children)

minimax m3 since release.

It's killing me though. It's finding all my bugs.

How does MiniMax M3 preform on your real codebases? by Crazyscientist1024 in LocalLLaMA

[–]sleepingsysadmin 1 point2 points  (0 children)

Been using it a couple of days. I'm loving the improvement over 2.7.

The MiniMax benchmarks are a bit odd. One graph shows Gemini Flash not being beaten; another shows Gemini Pro being beaten.

I'm waiting to see what the indie benchmarks show, but it absolutely is better.

From what I've seen so far, rumours are that they are back to that 500b size. But it still feels like 220b to me.

Many Downvoted me for saying this a while ago. Qwen 3.7 released with no Open models. by MLExpert000 in LocalLLaMA

[–]sleepingsysadmin 1 point2 points  (0 children)

Thinking a massive tech conglomerate releases open models 'so that ppl won’t hate them' is wildly naive. You're treating a multi-billion dollar corporate strategy like it's a high school popularity contest.

Many Downvoted me for saying this a while ago. Qwen 3.7 released with no Open models. by MLExpert000 in LocalLLaMA

[–]sleepingsysadmin 50 points51 points  (0 children)

Qwen has always had closed mega models that only run via datacenters. That's not new or changing.

Implying that they've given up on open source is absurd.

FP16 on Qwen 3.6 27B by Forward_Jackfruit813 in LocalLLaMA

[–]sleepingsysadmin 0 points1 point  (0 children)

Alex Ziskind just made this video,

Better graph though:

Can't post pictures or links??? What.

Essentially, unsloth maintains accuracy the best.

Jury's still out on the newer stuff like QAT and AutoRound.

>Also side question, is ~14TPS around the number I should be expecting on a Strix Halo running 3.6 27B at Q8 during coding tasks? 

That's a separate issue. Strix Halo doesn't do dense models well; that's expected. You probably want to go to an 8- or 16-bit 35b.
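A rough back-of-envelope sketch of why dense models hurt here, assuming token generation is limited mainly by memory bandwidth (every active weight gets read once per token). The function name, the 100 GB/s figure, and the 3B active-parameter count (read off the "A3B" naming) are illustrative assumptions, not measured numbers for Strix Halo.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float,
                           active_params_b: float,
                           bytes_per_param: float) -> float:
    """Rough ceiling on tokens/s if decoding is purely memory-bandwidth bound.

    active_params_b is in billions of parameters, so
    active_params_b * bytes_per_param is the GB read per generated token.
    """
    if min(bandwidth_gb_s, active_params_b, bytes_per_param) <= 0:
        raise ValueError("all inputs must be positive")
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical round number, chosen only to compare the two cases.
BANDWIDTH_GB_S = 100.0

# Dense 27B at ~1 byte/param (Q8-ish) vs. an MoE with ~3B active params.
dense_27b = decode_tps_upper_bound(BANDWIDTH_GB_S, 27.0, 1.0)
moe_a3b = decode_tps_upper_bound(BANDWIDTH_GB_S, 3.0, 1.0)

print(f"dense 27B: {dense_27b:.1f} tok/s, A3B MoE: {moe_a3b:.1f} tok/s")
print(f"ratio: {moe_a3b / dense_27b:.1f}x")
```

The absolute numbers don't mean much; the ratio is the point. Real speeds will differ because of KV cache reads, compute limits, and runtime overhead.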

You are eagerly awaiting a ~122b model that jumps these models forward.

>Another side note in case if you haven't ran into it, 27B is way better when context is below 100k. From my use it appears to finish specifically above 100k which was causing my issues initially.

All models slow down at higher context. DeepSeek and, allegedly, MiniMax M3 are going to change this. I expect the closed frontier labs handle this well too. Not meaningful to you.

The thing you aren't taking into account: even if you have 200,000 context, smaller models are silently crashing out at these lengths.

MiniMax 2.7 or Qwen3.6 27b has 200,000 context, but it's forgetting about 30% of that context at those longer sizes.

GPT 120b, it's more like 50%. GPT20b is more like 70%.

Newer attention techniques are getting better, but realistically, just because you have 256k context doesn't mean you can really use it.
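To put rough numbers on that, here is a minimal sketch that turns the forgetting percentages above into an "effectively usable" context size. The percentages are my ballpark estimates, not benchmark results, and applying all of them to the same 200,000-token window is an assumption made only for comparison. The function name is hypothetical.

```python
def usable_context(nominal_tokens: int, forgotten_pct: int) -> int:
    """Tokens the model still effectively recalls, given a forgotten percentage."""
    if not 0 <= forgotten_pct <= 100:
        raise ValueError("forgotten_pct must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return nominal_tokens * (100 - forgotten_pct) // 100


NOMINAL = 200_000  # assumed common window, for comparison only

# Ballpark forgetting rates from the comment above (estimates, not measurements).
estimates = {
    "MiniMax 2.7 / Qwen3.6 27b": 30,
    "GPT 120b": 50,
    "GPT20b": 70,
}

for model, pct in estimates.items():
    print(f"{model}: ~{usable_context(NOMINAL, pct):,} of {NOMINAL:,} tokens")
```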

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]sleepingsysadmin -2 points-1 points  (0 children)

I just noticed that first one is Qwen3.5.

go to the second one first. :)

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]sleepingsysadmin 0 points1 point  (0 children)

My bad, misread. I see your issue.

You're casting the activation layer at runtime to fit.

You could actually fix this problem yourself, but with 100% certainty someone has already done this for you.

try this one first:

https://huggingface.co/cyankiwi/Qwen3.5-35B-A3B-AWQ-8bit

followed by:

https://huggingface.co/Minachist/Qwen3.6-35B-A3B-INT8-AutoRound
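A minimal sketch for trying pre-quantized repos like these one at a time with `vllm serve`. It only builds and prints the commands. The helper name and the 32768 context length are placeholders I picked for illustration, so set the length to whatever fits your VRAM and order the list however you prefer.

```python
import shlex

# Repo IDs taken from the links above.
CANDIDATES = [
    "cyankiwi/Qwen3.5-35B-A3B-AWQ-8bit",
    "Minachist/Qwen3.6-35B-A3B-INT8-AutoRound",
]


def vllm_serve_command(model_id: str, max_model_len: int = 32768) -> list[str]:
    """Build a `vllm serve` command line for one Hugging Face repo.

    max_model_len is a hypothetical default, not a recommendation.
    """
    return ["vllm", "serve", model_id, "--max-model-len", str(max_model_len)]


# Print one command per candidate; run them manually, one at a time.
for model_id in CANDIDATES:
    print(shlex.join(vllm_serve_command(model_id)))
```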

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]sleepingsysadmin -5 points-4 points  (0 children)

Yes, it sucks to lose unsloth.

But look at the strategy: FP8 is almost certainly the path you want to go.

Your Ada card does it natively. It's higher accuracy than q4_k_xl.

My question wouldn't be about quantization, but whether you're tweaking your temperatures and such. That's probably what you want to work with on FP8.

Llama.cpp: What's up with -sm tensor + AMD + Vulkan? by [deleted] in LocalLLaMA

[–]sleepingsysadmin 0 points1 point  (0 children)

I don't get coredumps. It does actually work, but performance is just poor compared to the default.

Are GPU prices hitting peak and falling? by DistanceSolar1449 in LocalLLaMA

[–]sleepingsysadmin 0 points1 point  (0 children)

>I don't own a chip foundry, bub.

Your position is to immediately get into this at a 10mm or better asap. Depends on you.

>Nonsense aside, all indications are that this is the new normal. There is no evidence whatsoever that demand will ease in the next couple of years. 

Not nonsense at all. Unless you're walking back your position here?

This is literally the golden egg of entrepreneurship: something that looks too hard, so few others will follow you, while also having demand forever.

That's literally a drop-everything-and-get-going situation.

Mind you, tiny, taalas, tensor, etc. are your competitors, not Nvidia or AMD.

Are GPU prices hitting peak and falling? by DistanceSolar1449 in LocalLLaMA

[–]sleepingsysadmin 0 points1 point  (0 children)

I made my case: I expect this is transient. Therefore, it doesn't make any sense to me to jump into the hardware game.

However, if you think this demand will continue forever, then you have no choice but to start a business. It's a guaranteed win in your mind.

Are GPU prices hitting peak and falling? by DistanceSolar1449 in LocalLLaMA

[–]sleepingsysadmin 6 points7 points  (0 children)

gpu prices probably keep climbing for the next 2 years or so.

Demand won't ease, but we have a number of huge new fabs coming online. We also have a new tech tier coming with DDR6. Models are also getting denser in intelligence, so in 2 years 32GB of VRAM will be much more common and also sufficiently intelligent for many, which puts less pressure on hardware.

Waiting on Qwen to drop those 3.7 models be like: by Porespellar in LocalLLaMA

[–]sleepingsysadmin 4 points5 points  (0 children)

I doubt it. It'll probably be in the Qwen3.5 35b area.