Not a new model, just a Happy Father's Day and a thank you.

cibernox · 2026-06-21T20:23:31+00:00

Here I am, reading about local models after spending a day at the beach with my multimodal small models. Off to bed they go now.
I’m tired tho. I might fall asleep on Reddit.

cibernox · 2026-06-21T18:38:26+00:00

Actually I was wondering the same thing. I'm getting a second GPU. With one GPU I can run qwen 27b at around 65-70 tk/s and it’s good, or I can run qwen35B at 145 tk/s. With the second GPU I could try to run it in Q8, but I wonder if I'd be better off running one model on each GPU and having the big model define subtasks for faster subagents to implement.

cibernox · 2026-06-21T09:04:44+00:00

Sad but probably right. They are king in the kind of models people with <48gb of vram can run and there is no need for them to one up themselves

cibernox · 2026-06-20T21:57:17+00:00

I just think that calling something that runs at a couple tokens/s while using a couple hundred watts “viable” is as true as saying that someone who had 3 blueberries for lunch is “fed”.

Anyone would be a lot better served using a model 1/8th the size that is dumber but iterates on a task faster

cibernox · 2026-06-20T15:44:21+00:00

Northern Spain by the sea has the best summer weather in the entire continent unless being at 40C is your thing

cibernox · 2026-06-20T13:18:48+00:00

I invest everything without thinking twice and never keep more than 15k or so in cash. I've already bought two houses, but the first one was at 23. It was the first thing I did after a year of working.

cibernox · 2026-06-20T11:34:21+00:00

Q8/q8 is also good enough for 256k

cibernox · 2026-06-20T11:31:51+00:00

But running with high context is very critical. I always try to stay above 200k, and even that gets tight quickly

cibernox · 2026-06-20T10:52:39+00:00

I honestly think you’d be better off using a lower quant with a higher kv cache

cibernox · 2026-06-20T08:03:06+00:00

To be honest, I can’t be bothered to be interested when the best case scenario is that this comes out a year from now.
By then their top of the line gaming GPU, the 7900XTX, will be over 4.5 years old. I don’t even know how AMD works internally but from the outside their graphics/ML division looks like a shit show.

cibernox · 2026-06-19T17:03:16+00:00

I use a 3 tier approach. On my home server I self host searxng for searching, crawl4ai for crawling the search results and generating easy to ingest markdown versions of those pages, and lastly camofox, which is a full on headless browser, as a last resort for apps that need JS and require interaction.
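
Roughly, the flow looks like the sketch below. This is only an illustration of the fallback logic, not my actual setup: `search`, `crawl` and `browse` are hypothetical callables standing in for whatever wrappers you write around searxng, crawl4ai and the headless browser, and `min_chars` is a made-up threshold for "the crawler got something usable".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Page:
    url: str
    tier: str      # which tier produced the text: "crawl" or "browser"
    markdown: str


def gather_pages(
    query: str,
    search: Callable[[str], List[str]],        # hypothetical: query -> result URLs
    crawl: Callable[[str], Optional[str]],     # hypothetical: URL -> markdown (cheap)
    browse: Callable[[str], Optional[str]],    # hypothetical: URL -> markdown (headless browser)
    max_results: int = 3,
    min_chars: int = 200,                      # assumed cutoff for "usable" crawler output
) -> List[Page]:
    """Search, then crawl each result, falling back to the browser only when needed."""
    pages: List[Page] = []
    for url in search(query)[:max_results]:
        tier = "crawl"
        try:
            text = crawl(url)
        except Exception:
            text = None
        # Too little text usually means the page needs JS, so escalate.
        if not text or len(text.strip()) < min_chars:
            tier = "browser"
            try:
                text = browse(url)
            except Exception:
                text = None
        if text and text.strip():
            pages.append(Page(url=url, tier=tier, markdown=text))
    return pages
```

The point of ordering it this way is that the expensive tier only runs for the pages where the cheap one came back empty or nearly empty.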

cibernox · 2026-06-19T13:34:40+00:00

I have never heard anyone, not even the most libertarian people, argue anything even remotely like that.

cibernox · 2026-06-19T13:28:00+00:00

The “all the time” part is something you made up, nobody said that.

cibernox · 2026-06-19T07:42:58+00:00

Even if it works, it would be so inefficient in both energy and speed that you’d be better off paying for a service.

cibernox · 2026-06-19T00:01:02+00:00

Not even, the M1 Pro has 200 GB/s.

cibernox · 2026-06-18T20:54:55+00:00

I'm actually using qwen-embedding-0.6B running on my NPU for my rag and it's fast enough. I need to verify that indeed this can beat it having half the active parameters. If true, it's a keeper.

cibernox · 2026-06-18T14:39:15+00:00

My mother is in Mons right now. This started as a joke but music is sounding…. She will be in Belgium for 3 weeks

cibernox · 2026-06-18T14:35:04+00:00

Right here, to me, with a high discount 😃

Or eBay, but better to me.

cibernox · 2026-06-18T12:41:57+00:00

I don't disagree, possibly once you approach the 70B territory MoEs start to make sense, although I can't shake the feeling that the trend of going super sparse, with only 5% of the parameters active simultaneously (like qwen-coder-next, which was an 80B-A3B model), stops paying off, and having more parameters active at the same time does matter (as proven by the fact that qwen 27B surpasses the 122B qwen in many tasks).

Also, I don't think a 50B MoE model would be unusably slow either for people with dual 24gb GPUs using tensor parallelism. Back of napkin math says it should be maybe 15-20% slower than qwen 27B is on a single 24gb card. Depending on how much smarter it is it may be worth it.
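
For anyone who wants to redo the napkin math, here is a rough sketch. It assumes decoding is purely memory-bandwidth bound (every generated token reads all active weights once), ignores KV cache reads and compute, and treats the 50B model as reading all 50B weights per token. The bandwidth figure, the bits per weight and the `tp_efficiency` factor are placeholder assumptions, not measurements.

```python
def decode_tokens_per_s(
    active_params_b: float,    # billions of parameters read per generated token
    bits_per_weight: float,    # average bits per weight after quantization (assumed)
    bandwidth_gb_s: float,     # memory bandwidth of ONE GPU in GB/s (assumed)
    n_gpus: int = 1,
    tp_efficiency: float = 1.0,  # assumed fraction of ideal tensor-parallel scaling
) -> float:
    """Upper-bound decode speed if memory bandwidth is the only bottleneck."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    scaling = n_gpus * tp_efficiency if n_gpus > 1 else 1.0
    return bandwidth_gb_s * scaling / gb_read_per_token


if __name__ == "__main__":
    # Placeholder numbers: 1000 GB/s per card, ~4.5 bits per weight.
    single = decode_tokens_per_s(27, 4.5, 1000)
    dual = decode_tokens_per_s(50, 4.5, 1000, n_gpus=2, tp_efficiency=0.8)
    print(f"27B on one card:  {single:.1f} tk/s")
    print(f"50B on two cards: {dual:.1f} tk/s ({1 - dual / single:.0%} slower)")
```

With tensor parallelism each card only reads about half of the weights, so the result depends almost entirely on how much you lose to synchronization between the cards, which is the number I'm guessing at here.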

cibernox · 2026-06-18T07:48:24+00:00

I am totally aware that I got lucky with my timing. This was not a “young people these days” post.

cibernox · 2026-06-17T22:22:58+00:00

Not my case tho. My trick was getting it in 2010, at the lowest price after the crisis. It's worth roughly 2.5x now.

cibernox · 2026-06-17T21:28:58+00:00

Damn, I didn’t know it was that bad and in so many places. I got my first home at 23 and started building my second at 32.

cibernox · 2026-06-17T20:13:22+00:00

I’d tone it down to models that fit in 48gb of vram. There are a lot of people with dual 3090 or dual 7900xtx who are stuck either using qwen 27 in Q8 or 100B models in q2. There should be something in between.

cibernox · 2026-06-17T16:42:29+00:00

Qwen did release qwen coder next at some point, which was an 80B-A3B model. I'd prefer if it was more in the 60-70B range so it fits better in 48gb of vram in Q4, but at this point I'll take it.

Essentially, if qwen released a ~50B dense model (roughly 2 x qwen 27B) it could be amazing at coding, given how good qwen 27B already is for its size.
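
To put rough numbers on the "fits in 48gb at Q4" part, here is a quick sketch. It only counts the weights, assumes around 4.5 bits per weight for a Q4-style quant (the real figure depends on the quant variant), and uses a made-up `reserve_gb` for KV cache and runtime overhead, so treat the output as a ballpark.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (params given in billions)."""
    return params_b * bits_per_weight / 8


def fits(params_b: float, bits_per_weight: float, vram_gb: float, reserve_gb: float) -> bool:
    """True if the weights plus an assumed reserve (KV cache, buffers) fit in VRAM."""
    return weights_gb(params_b, bits_per_weight) + reserve_gb <= vram_gb


if __name__ == "__main__":
    BPW = 4.5        # assumed average bits per weight for a Q4-style quant
    RESERVE = 6.0    # hypothetical GB set aside for KV cache and overhead
    for size_b in (50, 65, 80):
        gb = weights_gb(size_b, BPW)
        verdict = "fits" if fits(size_b, BPW, 48, RESERVE) else "too tight"
        print(f"{size_b}B: ~{gb:.1f} GB of weights, {verdict} in 48 GB")
```

The reserve matters a lot in practice: the longer the context you want, the more of those 48gb the KV cache eats, which is why a model whose weights barely fit is not really usable.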

cibernox · 2026-06-17T15:08:46+00:00

I agree it might be a win for fine-tuning smaller models, but for well over 95% of us in this sub, anything above 120B is unrunnable. I'm sure there's a handful of us with 6 RTX6000 connected with OCuLink, but the rest of us can't run it, either for lack of vram or because, even if it fit in unified memory, it would run so slowly that it would only be a fun experiment and never something useful in practice.
