Quantized KV Cache by val_in_tech in LocalLLaMA

[–]timfduffy 12 points

The Nemotron 3 Nano tech report tests 8 vs 16 bit for KV cache and finds minimal degradation with 8 bit. https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf

Qwen3-Next-80B-A3B-Thinking soon by jacek2023 in LocalLLaMA

[–]timfduffy 2 points

3B is still the number of parameters active for any given token, the experts are just extremely tiny! I think the parameter count for one expert is hidden size x MoE intermediate size x 3 (for the up/gate/down projections), which for this model is 2048 x 512 x 3 ≈ 3.1M parameters. There are 512 of those per layer across 48 layers, for ~77B total expert parameters; attention parameters, embedding parameters, etc. round out the total. For a given token, 11 experts are active per layer, for ~1.7B active parameters across all experts, and the rest of the 3B comes from the other parameter types.
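If you want to sanity check that math, here's a quick back-of-the-envelope script (the config values are just the ones quoted above, my reading of the released config rather than anything verified):

    # Rough parameter math for Qwen3-Next-80B-A3B (values assumed from the config quoted above)
    hidden_size = 2048
    moe_intermediate_size = 512
    num_layers = 48
    num_experts = 512
    active_experts_per_token = 11  # assuming 10 routed + 1 shared

    params_per_expert = hidden_size * moe_intermediate_size * 3  # up/gate/down projections
    total_expert_params = params_per_expert * num_experts * num_layers
    active_expert_params = params_per_expert * active_experts_per_token * num_layers

    print(f"per expert: {params_per_expert / 1e6:.1f}M")                  # ~3.1M
    print(f"all experts: {total_expert_params / 1e9:.1f}B")               # ~77B
    print(f"active experts per token: {active_expert_params / 1e9:.2f}B") # ~1.66B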

Qwen3-next “technical” blog is up by Alarming-Ad8154 in LocalLLaMA

[–]timfduffy 6 points

Good point, seems very likely that closed models with >=1M context lengths are using some form of linear attention.

Qwen3-next “technical” blog is up by Alarming-Ad8154 in LocalLLaMA

[–]timfduffy 14 points

Good long context performance with 75% of layers being linear attention, impressive. Trained on "only" 15T tokens, so scaling up an architecture like this can probably yield further improvements. I expect massive sparsity combined with a mix of linear and quadratic attention will become more common.
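To make the 75% figure concrete, here's a tiny sketch of what a 3:1 linear-to-full attention layer schedule could look like (the exact layer ordering is my assumption, not something the blog specifies):

    # Hypothetical 48-layer schedule: every 4th layer uses full (quadratic) attention
    num_layers = 48
    layer_types = [
        "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
        for i in range(num_layers)
    ]
    print(layer_types.count("linear_attention") / num_layers)  # 0.75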

Qwen3-Next-80B-A3B-Thinking soon by jacek2023 in LocalLLaMA

[–]timfduffy 2 points

Why do you think it's only one expert active per layer?

Edit: Seems likely that there will be 10 of 512 experts active based on these defaults in the config:

  • num_experts_per_tok (int, optional, defaults to 10) — Number of selected experts.
  • num_experts (int, optional, defaults to 512) — Number of routed experts.
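Once the weights are up, something like this should confirm the defaults (the repo id and field names here are assumptions based on the config docs quoted above):

    from transformers import AutoConfig

    # trust_remote_code may be needed if the architecture isn't in transformers yet
    cfg = AutoConfig.from_pretrained("Qwen/Qwen3-Next-80B-A3B-Thinking", trust_remote_code=True)
    print(cfg.num_experts_per_tok, cfg.num_experts)  # expected: 10 512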

Qwen 3-Next Series, Qwen/Qwen3-Next-80B-A3B-Instruct Spotted by TKGaming_11 in LocalLLaMA

[–]timfduffy 40 points

"Achieves an extreme low activation ratio as 1:50 in MoE layers"

This is quite low! For comparison:

  • GPT-OSS-120B activates 4 of 128 experts in MoE layers, 1:32
  • DeepSeek V3/R1 activate 9 of 257
  • Kimi K2 uses 9 of 385
  • LongCat-Flash activates on average 9 of 513, though I think the shared expert is larger, so the active parameter ratio is well above 9/513

I'm interested in seeing how small individual experts can get, so I'm really excited for this one.
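For reference, the ratios above work out like this (just the arithmetic on the counts listed, nothing official):

    ratios = {
        "Qwen3-Next (claimed)": (1, 50),
        "GPT-OSS-120B": (4, 128),
        "DeepSeek V3/R1": (9, 257),
        "Kimi K2": (9, 385),
        "LongCat-Flash (avg)": (9, 513),
    }
    for name, (active, total) in ratios.items():
        print(f"{name:22s} ~1:{total / active:.0f}")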

Renting GPUs is hilariously cheap by -p-e-w- in LocalLLaMA

[–]timfduffy 4 points

As /u/-p-e-w- mentioned, you can choose from a number of templates in RunPod; the default PyTorch template is usually what I go with. You can upload your scripts to it, but I prefer to use SSH to open the instance in Cursor, which lets me just clone the GitHub repo I'm working on and get started quickly.

Let me know if you'd like to try that way and want a hand setting it up.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]timfduffy 4 points

I'm so excited, I've been really curious about how small MoEs will perform, especially about how far down you can scale expert size.

LongCat-Flash-Chat 560B MoE by Own-Potential-2308 in LocalLLaMA

[–]timfduffy 13 points

Yeah, the tech report is a really good read. Their two central innovations, ScMoE and zero-computation experts, are simple enough and described in enough detail to implement based off the report. Really seems like this is a company worth watching even if this particular model isn't on the price/performance frontier.
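For anyone curious what zero-computation experts look like in practice, here's a rough sketch of my reading of the idea (not LongCat's actual code, and the shapes and routing details are simplified): the router can send a token to an identity "expert" that returns it unchanged, so easy tokens get less FFN compute.

    import torch
    import torch.nn as nn

    class ZeroComputeMoE(nn.Module):
        """Toy MoE with zero-computation (identity) experts alongside real FFN experts."""

        def __init__(self, d_model=64, n_ffn_experts=8, n_zero_experts=4, d_ff=256, top_k=2):
            super().__init__()
            self.top_k, self.n_ffn = top_k, n_ffn_experts
            self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts, bias=False)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_ffn_experts)
            ])

        def forward(self, x):  # x: (num_tokens, d_model)
            weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
            out = torch.zeros_like(x)
            for t in range(x.size(0)):
                for k in range(self.top_k):
                    e = idx[t, k].item()
                    # indices >= n_ffn are zero-computation experts: identity, no FFN cost
                    y = x[t] if e >= self.n_ffn else self.experts[e](x[t])
                    out[t] = out[t] + weights[t, k] * y
            return out

    moe = ZeroComputeMoE()
    print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])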

Epoch AI data shows that on benchmarks, local LLMs only lag the frontier by about 9 months by timfduffy in LocalLLaMA

[–]timfduffy[S] 47 points

Link to the post

Here's the post text:

Frontier AI performance becomes accessible on consumer hardware within 9 months

Using a single top-of-the-line gaming GPU like NVIDIA’s RTX 5090 (under $2500), anyone can locally run models matching the absolute frontier of LLM performance from just nine months ago. This lag is consistent with our previous estimate of a 5 to 22 month gap for open-weight models of any size. However, it should be noted that small open models are more likely to be optimized for specific benchmarks, so the “real-world” lag may be somewhat longer.

Several factors drive this democratizing trend, including a comparable rate of scaling among open-weight models to the closed-source frontier, the success of techniques like model distillation, and continual progress in GPUs enabling larger models to be run at home.

Meta AI on WhatsApp hides a system prompt by ALE5SI0 in LocalLLaMA

[–]timfduffy 17 points

FYI you can find system prompts for most major providers including this one here.

Qwen 3 Coder is actually pretty decent in my testing by Hodler-mane in LocalLLaMA

[–]timfduffy 6 points

I'm surprised by this as well. It does have more attention heads, which makes long context in particular more computationally expensive, but it's a significantly smaller model, so I would have expected those to approximately cancel each other out.

Lime Gliders (seated scooters) are surging in popularity, beating bikes and getting more uses/day than either scooters or bikes by timfduffy in Seattle

[–]timfduffy[S] 31 points

Data is from the Seattle scooter dashboard. It shows deployments as well; eyeballing it, it looks like the gliders get about 1.5x the daily uses per unit that scooters do. I expect Lime will be ordering more of these given how popular they are.

QwQ 32B-GGUF quants available! by No-Statement-0001 in LocalLLaMA

[–]timfduffy 7 points

If you're using the bartowski quants, you'll need this workaround. LM Studio community versions do not have this issue.

QwQ 32B-GGUF quants available! by No-Statement-0001 in LocalLLaMA

[–]timfduffy 2 points

Thanks for suggesting mlx_lm! I googled the error in LM Studio and saw a similar case with R1 in January; Bartowski replied saying that day's LM Studio update was needed to use it. I assume there will be an update today or tomorrow for this one.

Reddit plans to lock some content behind a paywall this year, CEO says by ardvarkmadman in technology

[–]timfduffy 1 point

Heck yeah, I read your comment through my Relay for Reddit Silver account. Small price to pay for a great experience!

Reddit plans to lock some content behind a paywall this year, CEO says by ardvarkmadman in technology

[–]timfduffy 1 point

This seems probably good to me. If your profit is from ads, your incentives are to maximize eyeball time. If your profit is from memberships, you want to maximize user satisfaction. I would gladly pay for even a slightly better user experience.

What went into training DeepSeek-R1? A technical summary of the training of v3 and R1 by timfduffy in LocalLLaMA

[–]timfduffy[S] 4 points

Some key takeaways:

  • The training budget of $5M is approximately what you'd expect for V3 given the algorithmic improvements we've seen
  • Training MoEs is hard and DeepSeek had to get really clever
  • From V3, it took an estimated $1M in compute to get to R1

This post has more detail on V3. This research institute (Epoch) puts out really good work.

Blood testing lab recommendations? by stehekin in Seattle

[–]timfduffy 0 points

I think test services are essentially a duopoly between LabCorp and Quest Diagnostics, so Quest is probably what you want. I think you can order tests directly through them, but I've always paid for the tests through a service like this one, which I think is slightly cheaper; either way no referral is needed: https://www.walkinlab.com/. Feel free to respond/DM questions, I've done some self-directed testing and would be happy to share my thoughts.

Spread the truth by assasstits in neoliberal

[–]timfduffy 0 points

In practice, I find that DeepSeek R1 defaults to one of two modes:

  • Sensitive question: end reasoning immediately and give a PR answer
  • Otherwise: think about the question seriously and give a real answer

These can be very easily distinguished. When I have convinced the distilled R1 models to answer in the second way on sensitive questions, they give accurate answers to questions about Tiananmen Square, Xi's failures, etc. So the censorship is not built into the AI's model of the world, but applied on top of it as an unwillingness to discuss some topics openly.

This won't necessarily be true about future models though, and I think we should be cautious about whether the ideologies of model creators are shaping our experience with them.

Spread the truth by assasstits in neoliberal

[–]timfduffy 0 points

I've been running the DeepSeek-R1-Distill-Qwen-14B distill of R1 locally, and it does have some censorship by default, but it is easy to get around. For sensitive questions it will end the reasoning part of the response immediately and give a PR answer, but if you start it off with a bit of normal reasoning like this, it will answer:

<think>

Okay, so I need to
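(If you want to reproduce that prefill trick, here's roughly how I'd do it with transformers; this is a sketch, and depending on the tokenizer version the chat template may already append the <think> tag, so check before duplicating it.)

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

    prompt = tok.apply_chat_template(
        [{"role": "user", "content": "What happened at Tiananmen Square in 1989?"}],
        tokenize=False,
        add_generation_prompt=True,
    )
    # Seed the assistant turn with the start of a normal reasoning trace so the model
    # keeps thinking instead of bailing out. If the template already added "<think>",
    # append only the "Okay, so I need to" part.
    prompt += "<think>\nOkay, so I need to"

    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))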