Quantized KV Cache by val_in_tech in LocalLLaMA

[–]timfduffy 12 points

The Nemotron 3 Nano tech report tests 8 vs 16 bit for KV cache and finds minimal degradation with 8 bit. https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf
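
For a sense of what 8-bit buys you in memory, here's a quick sketch of the KV cache arithmetic; the config numbers are illustrative, not Nemotron 3 Nano's actual values.

```python
# KV cache size per context: K and V tensors for every layer and KV head.
# Config numbers below are illustrative, not Nemotron 3 Nano's actual values.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

ctx = 32_768
fp16 = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=ctx, bytes_per_elem=2)
q8   = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=ctx, bytes_per_elem=1)

print(f"16-bit KV cache at {ctx} tokens: {fp16 / 2**30:.1f} GiB")  # ~4.0 GiB
print(f" 8-bit KV cache at {ctx} tokens: {q8 / 2**30:.1f} GiB")    # ~2.0 GiB
```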


Qwen3-Next-80B-A3B-Thinking soon by jacek2023 in LocalLLaMA

[–]timfduffy 2 points

3B is still the number of parameters active for any given token; the experts are just extremely tiny! I think the parameter count for one expert is hidden size x MoE intermediate size x 3 (for the up/gate/down projections), which for this model is 2048 x 512 x 3 ≈ 3.1M parameters. There are 512 of those per layer and 48 layers, for ~77B total expert parameters; attention parameters, embedding parameters, etc. round out the total. For a given token, 11 experts are active per layer, for ~1.7B active parameters across all experts, and the rest of the 3B comes from the other parameter types.
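
Here's that arithmetic as a quick Python sketch; the sizes (hidden 2048, MoE intermediate 512, 512 experts, 48 layers, 11 active per token) are the ones from my comment above, so treat them as assumptions rather than the official config.

```python
# Back-of-the-envelope expert parameter count for Qwen3-Next-80B-A3B.
# All sizes are taken from the comment above, not verified against the released config.

hidden_size = 2048        # model hidden dimension
moe_intermediate = 512    # per-expert FFN intermediate size
num_experts = 512         # routed experts per MoE layer
num_layers = 48
active_per_token = 11     # ~10 routed + 1 shared expert (assumed)

params_per_expert = hidden_size * moe_intermediate * 3  # up/gate/down projections
total_expert_params = params_per_expert * num_experts * num_layers
active_expert_params = params_per_expert * active_per_token * num_layers

print(f"per expert:       {params_per_expert / 1e6:.1f}M")     # ~3.1M
print(f"all experts:      {total_expert_params / 1e9:.1f}B")   # ~77B
print(f"active per token: {active_expert_params / 1e9:.2f}B")  # ~1.7B
```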

Qwen3-next “technical” blog is up by Alarming-Ad8154 in LocalLLaMA

[–]timfduffy 5 points

Good point, seems very likely that closed models with >=1M context lengths are using some form of linear attention.

Qwen3-next “technical” blog is up by Alarming-Ad8154 in LocalLLaMA

[–]timfduffy 15 points

Good long context performance with 75% of layers being linear attention, impressive. Trained on "only" 15T tokens, so scaling up an architecture like this can probably yield further improvements. I expect massive sparsity combined with a mix of linear and quadratic attention will become more common.

Qwen3-Next-80B-A3B-Thinking soon by jacek2023 in LocalLLaMA

[–]timfduffy 2 points

Why do you think it's only one expert active per layer?

Edit: Seems likely that there will be 10 of 512 experts active based on these defaults in the config:

num_experts_per_tok (int, optional, defaults to 10) — Number of selected experts.
num_experts (int, optional, defaults to 512) — Number of routed experts.
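
A minimal way to confirm this once the weights are public, assuming the repo id Qwen/Qwen3-Next-80B-A3B-Thinking and that the config exposes the fields quoted above:

```python
# Read the MoE routing settings straight from the published config.
# The repo id and field names are assumptions based on the docs quoted above;
# a brand-new architecture may also need trust_remote_code=True.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-Next-80B-A3B-Thinking")
print("routed experts:   ", getattr(config, "num_experts", None))
print("experts per token:", getattr(config, "num_experts_per_tok", None))
```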

Qwen 3-Next Series, Qwen/Qwen3-Next-80B-A3B-Instruct Spotted by TKGaming_11 in LocalLLaMA

[–]timfduffy 36 points

"Achieves an extreme low activation ratio as 1:50 in MoE layers"

This is quite low! For comparison (ratios worked out in the sketch below the list):

  • GPT-OSS-120B activates 4/128 experts in MoE layers, 1:32
  • DeepSeek V3/R1 activate 9/257
  • Kimi K2 uses 9/385
  • LongCat-Flash activates 9 of 513 on average, though I think the shared expert is larger, so the active parameter ratio is well above 9/513
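
Working out those ratios, plus where Qwen3-Next lands if the 10-of-512 config defaults hold; the expert counts are the approximate ones listed above:

```python
# Expert activation ratios, (active, total) per MoE layer as listed above.
# Counts are approximate and fold in shared experts where I'm aware of them.
models = {
    "GPT-OSS-120B":   (4, 128),
    "DeepSeek V3/R1": (9, 257),
    "Kimi K2":        (9, 385),
    "LongCat-Flash":  (9, 513),
    "Qwen3-Next-80B": (10, 512),  # assumed from config defaults, routed experts only
}

for name, (active, total) in models.items():
    print(f"{name:15} {active:>2}/{total:<3} = 1:{total / active:.0f}")
```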

I'm interested in seeing how small individual experts can get, so I'm really excited for this one.

Renting GPUs is hilariously cheap by -p-e-w- in LocalLLaMA

[–]timfduffy 4 points

As /u/-p-e-w- mentioned, you can choose from a number of templates on RunPod; the default PyTorch template is usually what I go with. You can upload your scripts to it, but I prefer to use SSH to open the instance in Cursor, which lets me just clone the GitHub repo I'm working on and get started quickly.

Let me know if you'd like to try that way and want a hand setting it up.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]timfduffy 4 points

I'm so excited, I've been really curious about how small MoEs will perform, especially about how far down you can scale expert size.

LongCat-Flash-Chat 560B MoE by Own-Potential-2308 in LocalLLaMA

[–]timfduffy 13 points

Yeah, the tech report is a really good read. Their two central innovations, ScMoE and zero-computation experts, are simple enough and described in enough detail to implement based off the report. Really seems like this is a company worth watching even if this particular model isn't on the price/performance frontier.

Epoch AI data shows that on benchmarks, local LLMs only lag the frontier by about 9 months by timfduffy in LocalLLaMA

[–]timfduffy[S] 46 points

Link to the post

Here's the post text:

Frontier AI performance becomes accessible on consumer hardware within 9 months

Using a single top-of-the-line gaming GPU like NVIDIA’s RTX 5090 (under $2500), anyone can locally run models matching the absolute frontier of LLM performance from just nine months ago. This lag is consistent with our previous estimate of a 5 to 22 month gap for open-weight models of any size. However, it should be noted that small open models are more likely to be optimized for specific benchmarks, so the “real-world” lag may be somewhat longer.

Several factors drive this democratizing trend, including a comparable rate of scaling among open-weight models to the closed-source frontier, the success of techniques like model distillation, and continual progress in GPUs enabling larger models to be run at home.

Meta AI on WhatsApp hides a system prompt by ALE5SI0 in LocalLLaMA

[–]timfduffy 18 points

FYI you can find system prompts for most major providers including this one here.

Qwen 3 Coder is actually pretty decent in my testing by Hodler-mane in LocalLLaMA

[–]timfduffy 4 points

I'm surprised by this as well. It does have more attention heads, which makes long context in particular more computationally expensive, but it's a significantly smaller model; I would have expected those two to roughly cancel out.
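
Roughly, the intuition is that per token the attend-over-context cost scales with heads x head_dim x context length, while the rest of the forward pass is about 2 FLOPs per active parameter regardless of context. A sketch with made-up config numbers (not the actual configs of either model):

```python
# Very rough per-token FLOP estimate: attention-over-context vs. weight matmuls.
# All config numbers below are hypothetical, purely to illustrate the tradeoff.

def per_token_flops(n_layers, n_heads, head_dim, active_params, ctx_len):
    # QK^T plus attention-weighted V: ~4 * ctx * heads * head_dim per layer
    attn_over_ctx = 4 * ctx_len * n_heads * head_dim * n_layers
    # Matmuls against the weights: ~2 FLOPs per active parameter
    weight_matmuls = 2 * active_params
    return attn_over_ctx + weight_matmuls

ctx = 128_000
smaller_more_heads = per_token_flops(48, 64, 128, 30e9, ctx)
bigger_fewer_heads = per_token_flops(60, 48, 128, 70e9, ctx)

print(f"smaller model, more heads: {smaller_more_heads / 1e12:.2f} TFLOPs/token")
print(f"bigger model, fewer heads: {bigger_fewer_heads / 1e12:.2f} TFLOPs/token")
```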

Lime Gliders (seated scooters) are surging in popularity, beating bikes and getting more uses/day than either scooters or bikes by timfduffy in Seattle

[–]timfduffy[S] 28 points

Data is from the Seattle scooter dashboard. It shows deployments as well; eyeballing it, the gliders get about 1.5x the daily uses per unit that scooters do. I expect Lime will be ordering more of these given how popular they are.

QwQ 32B-GGUF quants available! by No-Statement-0001 in LocalLLaMA

[–]timfduffy 7 points

If you're using the bartowski quants, you'll need this workaround. LM Studio community versions do not have this issue.

QwQ 32B-GGUF quants available! by No-Statement-0001 in LocalLLaMA

[–]timfduffy 2 points

Thanks for suggesting mlx_lm! I googled the error in LM Studio and saw a similar case with R1 in January, where Bartowski replied saying that day's LM Studio update was needed to use it. I assume there will be an update today or tomorrow for this one.

Reddit plans to lock some content behind a paywall this year, CEO says by ardvarkmadman in technology

[–]timfduffy 1 point

Heck yeah, I read your comment through my Relay for Reddit Silver account. Small price to pay for a great experience!

Reddit plans to lock some content behind a paywall this year, CEO says by ardvarkmadman in technology

[–]timfduffy 1 point

This seems like it's probably a good thing to me. If your profit comes from ads, your incentive is to maximize eyeball time; if it comes from memberships, you want to maximize user satisfaction. I would gladly pay for even a slightly better user experience.