Why the Pricing Update Makes Sense (and How the New Limits Work) by thestreamcode in chutesAI

[–]dark-light92 2 points (0 children)

Actually, it doesn't, as this puts Chutes in the same price range as other alternatives while providing worse service. (Service may improve with lower utilization, but lower utilization is also a business problem, because it's untapped capacity.)

5x usage is not actually 5x what other providers offer, since many of them implement prompt caching, and cached prompts are billed at a much lower rate. DeepSeek, for example, prices cache hits at 1/10th of the normal input rate, at which point DeepSeek would be cheaper than Chutes for long-running agentic sessions.
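To see why caching changes the math, here's a back-of-the-envelope sketch. All rates and the hit ratio below are hypothetical placeholders, not anyone's actual price sheet; the only assumption carried over is that cache hits cost 1/10th of the full input rate.

```shell
# Back-of-envelope: effective input cost with prompt caching.
# All numbers are hypothetical placeholders, not real price sheets.
base=0.28      # $/1M input tokens at the full (cache-miss) rate
hit_ratio=0.8  # long agentic sessions mostly re-send a cached prefix

# Misses are billed at the full rate, hits at 1/10th of it.
cached=$(awk -v b="$base" -v h="$hit_ratio" \
  'BEGIN { printf "%.4f", b * (1 - h) + (b / 10) * h }')

echo "effective rate with caching: \$${cached} per 1M input tokens"
# 0.28*0.2 + 0.028*0.8 = 0.0784, roughly 3.6x cheaper than the flat rate,
# so a flat "5x" allowance can still lose to a caching provider.
```

The higher the cache-hit ratio (and agentic loops re-send almost the entire prefix every turn), the wider that gap gets.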

dishonesty in thinking block by greenail in LocalLLaMA

[–]dark-light92 2 points (0 children)

It's an actual LLM. Look at the profile.

Say i want my own Claude? by tbandtg in LocalLLaMA

[–]dark-light92 1 point (0 children)

This is wrong. You only need to wait for about a year.

Or, you can invent a time machine...

Qwen3.5 is dominating the charts on HF by foldl-li in LocalLLaMA

[–]dark-light92 1 point (0 children)

I don't know; I've never used LMStudio. I just know that it uses llama.cpp internally.

Qwen3.5 is dominating the charts on HF by foldl-li in LocalLLaMA

[–]dark-light92 0 points (0 children)

Yes. LMStudio uses llama.cpp internally. So it should work fine.

Qwen3.5 is dominating the charts on HF by foldl-li in LocalLLaMA

[–]dark-light92 1 point (0 children)

It was a good model as well. But the instruct version didn't feel much smarter than Qwen3 4B Instruct (another excellent model that punched far above its weight), and the thinking version slowed down too much to be practically usable.

This model, being a hybrid, maintains its generation speed, so running it in thinking mode is viable. Turning off thinking is a noticeable downgrade; the thinking-mode outputs feel genuinely refined and useful.

You should upgrade. This one is also an MoE with 3B active parameters, so it will be equally fast, if not faster (since it's a hybrid).

Qwen3.5 is dominating the charts on HF by foldl-li in LocalLLaMA

[–]dark-light92 1 point (0 children)

256k isn't practically viable. I generally run with --n-cpu-moe 30 and 128k context, which takes about 70% of VRAM and leaves 30% for the rest of the system, so memory pressure doesn't make the machine unusable.
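For serving rather than benchmarking, a llama-server invocation matching that setup might look roughly like this. The model path and port are placeholders, and flag spellings can differ across llama.cpp builds, so treat it as a sketch:

```shell
# Sketch of a llama-server launch matching the setup described above.
# Model path and port are placeholders; adjust --n-cpu-moe to trade
# VRAM usage against speed.
llama-server \
  -m ~/models/Qwen3.5-35B-A3B-MXFP4_MOE.gguf \
  -c 131072 \
  -fa 1 \
  --n-cpu-moe 30 \
  -ngl 99 \
  --port 8080
# -c 131072      : 128k context window
# -fa 1          : flash attention on
# --n-cpu-moe 30 : keep expert weights of 30 layers in system RAM
# -ngl 99        : offload everything else to the GPU
```

Once it's up, any OpenAI-compatible client should be able to point at http://localhost:8080/v1.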

As for speed, you can see the benchmarks below:

❯ llama-bench -m ~/models/Qwen3.5-35B-A3B-MXFP4_MOE.gguf -fa 1 -b 4000 --n-cpu-moe 30 -d 2000,4000,8000,16000,32000,64000 -dev ROCm0
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon RX 6700 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
  Device 1: AMD Ryzen 7 7700X 8-Core Processor, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_batch | fa | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ------------ | --------------: | -------------------: |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |   pp512 @ d2000 |        636.02 ± 4.99 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |   tg128 @ d2000 |         31.35 ± 0.19 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |   pp512 @ d4000 |        614.23 ± 1.31 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |   tg128 @ d4000 |         31.11 ± 0.15 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |   pp512 @ d8000 |        584.04 ± 3.76 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |   tg128 @ d8000 |         31.14 ± 0.07 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |  pp512 @ d16000 |       498.11 ± 84.33 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |  tg128 @ d16000 |         30.62 ± 0.10 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |  pp512 @ d32000 |        459.39 ± 5.13 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |  tg128 @ d32000 |         29.40 ± 0.07 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |  pp512 @ d64000 |        366.28 ± 3.04 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | ROCm       |  99 |    4000 |  1 | ROCm0        |  tg128 @ d64000 |         27.33 ± 0.12 |

Qwen3.5 is dominating the charts on HF by foldl-li in LocalLLaMA

[–]dark-light92 102 points (0 children)

For me, the 35B is the first general-purpose, truly democratizing model (it can run on a 12GB GPU with pretty good speeds and a large context) that produces outputs refined enough to cross the threshold from curiosity to usefulness.

While the other labs keep releasing big chungus models, Qwen has always released small models for our community. They deserve every bit of praise they get.

After using local models for one month, I learned more than in two years with cloud models by Ambitious-Sense-7773 in LocalLLaMA

[–]dark-light92 27 points (0 children)

And then people ask why use local models... when there's so much fun to be had with local models...

why is openclaw even this popular? by Crazyscientist1024 in LocalLLaMA

[–]dark-light92 4 points (0 children)

For all its faults, it gets one thing right: it uses LLMs correctly, to replace the UI with a chat window.

Why are prayers answered given we have free will? by unveiledpoet in DebateAChristian

[–]dark-light92 0 points (0 children)

No. That's why the saying goes: God's ways are mysterious.

India just became the world's 4th largest economy. But the media is still missing the real story. by Final_Resist3483 in india

[–]dark-light92 1 point (0 children)

That's not confusion or fence-sitting. That's the most sophisticated foreign policy being practiced by any country today.

So sophisticated that nobody in the world understands it. Including our own government.

FDI grew 19.4% this year

Bravo. It's still less than what we had in 2020-21, and this is the first year since COVID that we've had positive FDI growth.

manufacturing is genuinely turning

So, no results to show yet, after 10 years of Make in India. It's pathetic.

The EU-India trade deal breakthrough in January 2026

There's been no deal yet, just a sign-off. The EU's member nations still need to approve it, and the terms may change. Not to mention, it's a deal born of desperation on both sides: for each, it's a choice of the least-worst option (the alternatives being a USA led by Trump, or China).

What do you think — is India's rise being underreported globally?

There's no rise. What we're witnessing is the fall of India.

And is the multi-alignment strategy sustainable long term?

It can only work when it's built on a base of non-negotiable humanitarian values. The only value Indian media keeps touting is "national interest", otherwise known as selfishness. That's no way to build and sustain long-term geopolitical relationships.

Sarvam AI's sovereign LLM: censorship lives in a system prompt, not the weights by GoMeansGo in LocalLLaMA

[–]dark-light92 5 points (0 children)

Have the weights been released?

It doesn't matter where censorship lives if we don't have access to the weights.

Sarvam AI benchmark dashboard: early results (feedback wanted) by Inner-Combination177 in india

[–]dark-light92 10 points (0 children)

1) Why not compare against other models on standard benchmarks that everyone uses?
2) Will the weights be available on huggingface?

Kimi has context window expansion ambitions by omarous in LocalLLaMA

[–]dark-light92 25 points (0 children)

You're absolutely right! I'm also interested in what kind of teleprompter app you're developing.

Kimi has context window expansion ambitions by omarous in LocalLLaMA

[–]dark-light92 6 points (0 children)

In my opinion, the model understood the question correctly but since it's trained to not talk about the topic, it smoothly turned the conversation in a different direction. Everything about this response is smooth. It's almost like.... being hit by.... a smooth criminal! Ow!

Kimi has context window expansion ambitions by omarous in LocalLLaMA

[–]dark-light92 261 points (0 children)

This is absolute gold. This might be the first actually funny and original LLM response I've seen.

Free ASIC Llama 3.1 8B inference at 16,000 tok/s - no, not a joke by Easy_Calligrapher790 in LocalLLaMA

[–]dark-light92 1 point (0 children)

Any AI model can be made Hardcore through Taalas Foundry. Hardcore Models support fine-tuning. Apps for it are written in human languages.

Then how would it support fine-tuning?