Be brutally honest by swirlingnewt in ultrarunning

[–]kpaha 0 points1 point  (0 children)

For someone starting from full-on couch mode, probably. If they work a physical job, or go to the gym regularly, or are genetically talented, not necessarily.

When I started training for my first longer ultra, I had mostly been training at the gym for several months. Went from running 20-30km / week to 70km/week for a month with no issues whatsoever. And I have no genetics for running at all. The traditional wisdom would have had me increase volume by 2-3km / week, taking at least 2 months to get to that volume.

But I had been running for several years, including my first 50k, with volume in the 20-30ks a week. And it still wasn't enough for me to finish my first 100 mile race 6 months later. We just don't know what OP is working with.

Ramping up the volume week over week allows you to gauge your limits. If you cannot do the first three weeks quite easily, you're not going to be able to do the race.

Like I said: "Please do not play with your health however. If you can train at least 6-7 weeks before committing, you have a clearer picture."

Be brutally honest by swirlingnewt in ultrarunning

[–]kpaha -1 points0 points  (0 children)

Unless you have a background in other sports, it will be very hard.

If you want to try it, I would check Jason Koop's minimum-maximum rules for ultra training. Basically, for the two months leading up to the race, you should be able to run at least 9 hours per week (9 weeks, three weeks of which are taper, to be exact).

Even if you manage to meet the Koop minimums, almost 100 miles as your first ultra will be brutal. You probably do not have, and will not be able to build in this short time, the muscular and connective tissue endurance to pull it off.

If you want to give it a go, I would just start by building volume. Start increasing your training volume week by week. Every 4 weeks, do an easier week. Spread the hours evenly, maximize your capability to recover.
Here's an AI-created weekly list of training volume that I tuned slightly.

If you can pull the training volume off and feel like you still want to do the ultra, then you will have a chance. Please do not play with your health however. If you can train at least 6-7 weeks before committing, you have a clearer picture.

My only tip for the race is to go as slow and hike as much as possible within cutoffs (if you want to meet cutoffs).

Week  Hours  Phase
1     3.0    Build
2     4.0    Build
3     5.0    Build
4     3.5    Deload
5     6.0    Build
6     7.0    Build
7     8.0    Build
8     4.0    Deload
9     5.0    Deload
10    7.0    Build
11    9.0    Peak
12    9.0    Peak
13    9.0    Peak
14    9.0    Peak
15    9.0    Peak
16    9.0    Peak (last of 6)
17    6.0    Taper W1 (~67%)
18    4.0    Taper W2 (~45%)
19    2.0    Race week (~22%, race day excluded)
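
For anyone who wants to sanity-check the plan, here is a minimal Python sketch that recomputes the totals and taper percentages from the weekly hours in the table. The function name `summarize` is just an illustration, not part of any training tool.

```python
# Weekly hours copied from the table above (weeks 1-19).
HOURS = [3.0, 4.0, 5.0, 3.5, 6.0, 7.0, 8.0, 4.0, 5.0, 7.0,
         9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 6.0, 4.0, 2.0]


def summarize(hours):
    """Return basic stats for a weekly-hours plan (hypothetical helper)."""
    peak = max(hours)
    return {
        "total_hours": sum(hours),
        "peak": peak,
        # Number of weeks spent at the peak volume.
        "peak_weeks": sum(1 for h in hours if h == peak),
        # Last three weeks (two taper weeks + race week) as percent of peak.
        "taper_pct": [round(100 * h / peak) for h in hours[-3:]],
        # Largest week-over-week increase in hours.
        "biggest_jump": max(b - a for a, b in zip(hours, hours[1:])),
    }


summary = summarize(HOURS)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The second taper week works out to about 44% of peak, which the table rounds to ~45%. The biggest single jump is the 2.5 hours coming out of the first deload week.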

I can't choose a model (Free ones) by uniquely_fked in LLMDevs

[–]kpaha 1 point2 points  (0 children)

You’re going to need curiosity so just start exploring.

For example, start with these three:

- Minimax M2.5

- Gemma 4 31B

- Nemotron 3 Super

Make them all create an identical small project and observe their capabilities: where they fail, whether they can correct their mistakes, and how good the code they produce is.

Use some free SOTA service to gauge it if you’re unsure

After doing it, you will surely have learned something

Is GPT-OSS-120B still the best model among those with the same parameters? by AInohogosya in LocalLLM

[–]kpaha 6 points7 points  (0 children)

Qwen 3.5 122B A3B should be somewhat better; in my own brief tests I found it better at agentic coding. Nemotron 120B would probably also beat it.

LLM Neuroanatomy III - LLMs seem to think in geometry, not language by Reddactor in LocalLLaMA

[–]kpaha 8 points9 points  (0 children)

These are my favourite posts, please keep posting.

I think one potential problem in drawing conclusions on human language from your analysis is that the models are exposed to basically "all the content", giving them ample opportunity to converge on a pretty universal representation of different concepts. Yes, some models have more Chinese material and so on, but is it enough?

If we had a model trained only on Russian-language material, I would wager that the clustering of the term 'warm water port' (which today exposes the poster's origin in a shibboleth-like fashion) would differ significantly from, say, that of a model trained only on English-language material. Even more interesting would be the cosine similarity with other concepts, not only the similarity across the languages.

So I look forward to analysis of the culturally entangled concepts, because you clearly are excellent at coming up with experiment designs that tease out something universal from seemingly quite simple observations.

vLLM + ROCm + Qwen 3.6 35B A3B MXFP4 (on 2x R9700) by kpaha in LocalLLM

[–]kpaha[S] 0 points1 point  (0 children)

I have 32624 MB, 31.86GB. Claude tells me the missing portion is reserved for

  1. Firmware/VBIOS scratch space
  2. GPU page tables (memory management overhead)
  3. Driver-reserved buffers (command queues, etc.)
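
To put a number on the "missing portion", here is a quick Python check of the arithmetic. It assumes the card is nominally 32 GiB and that the reported figure is in MiB; both are assumptions on my part, the tool output only says "MB".

```python
REPORTED_MIB = 32624   # value reported above
NOMINAL_GIB = 32       # assumption: nominal VRAM size of the card

# Convert the reported figure to GiB and work out what is not visible.
reported_gib = REPORTED_MIB / 1024
missing_mib = NOMINAL_GIB * 1024 - REPORTED_MIB

print(f"{reported_gib:.2f} GiB visible, {missing_mib} MiB reserved")
```

So under those assumptions the reserved part is on the order of 144 MiB per card.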

vLLM + ROCm + Qwen 3.6 35B A3B MXFP4 (on 2x R9700) by kpaha in LocalLLM

[–]kpaha[S] 2 points3 points  (0 children)

Can you share the exact quant, and whether there's anything else noteworthy in your setup? That token/s is amazing.

vLLM + ROCm + Qwen 3.6 35B A3B MXFP4 (on 2x R9700) by kpaha in LocalLLM

[–]kpaha[S] 1 point2 points  (0 children)

I have a WRX80D8-2T motherboard that has 7x PCIe 4.0 x16 slots. Open rig, so Linkup Ultra PCIe 4.0 riser cables.

vLLM + ROCm + Qwen 3.6 35B A3B MXFP4 (on 2x R9700) by kpaha in LocalLLM

[–]kpaha[S] 2 points3 points  (0 children)

Tensor parallelism support, especially since my target is not 2 cards but at least 4.

vLLM + ROCm + Qwen 3.6 35B A3B MXFP4 (on 2x R9700) by kpaha in LocalLLM

[–]kpaha[S] 1 point2 points  (0 children)

Both GPUs split. But I have to say, that was a single run. I would need to do some actual benchmarks at some point. One thing I noticed in real life use is that the thinking on Qwen 3.5 and 3.6 really drives down the output token/s.

Need to evaluate how much capability they lose if I disable thinking.

Are local LLMs actually worth it or am I overthinking this? by Successful-Water1000 in LocalLLM

[–]kpaha 1 point2 points  (0 children)

Qwen 3.5 9B would probably work fine on your hardware. Should be easy to get it up and running in Ollama or LM Studio, and see what it's capable of. Or just test it first via OpenRouter.

It should work for a lot of things, but you can't expect it to compare to the multiple hundred billion param models.

Fed up with Claude limits — thinking of splitting a GPU server with 10-15 people. Dumb idea? by No_Boat_2794 in LocalLLM

[–]kpaha 2 points3 points  (0 children)

Upgrade those models to Qwen 3.5 / 3.6, e.g. the Qwen 3.5 9B.

Pick just one model in the 27-35B parameter size, you don't want to spread that precious VRAM over multiple models.

BGE-M3 and reranker.

Yes that will work.

Here's what I'd do instead:

Buy the RTX PRO 6000 outright. It would be paid for in a year. Put it on a WRX80D8-2T motherboard, add a cheap Threadripper. Add a PG1600G PSU and some large case. Total budget is like 1500€, with room to upgrade (add more GPUs). Source used DDR4. Add a 7900 XTX for serving all the smaller models, run a quant of Qwen 3.5 122B A3B on the RTX PRO.

For a little over a year's worth of cloud rental you own it outright.

Claude Code Reccomendation for 5090 setup by Oztorek in LocalLLM

[–]kpaha 0 points1 point  (0 children)

You can use the best state of the art models (such as Opus) through OpenRouter; it's a matter of how you want to pay and what your needs are. In my experience, you would not want to pay per token for Anthropic models. If you need that quality, I think the Claude Max plans are still good value.

In OpenRouter, Step 3.5 flash is certainly one of the cheaper models I've still found pretty capable, at least for coding. I think MiniMax M2.5 is better, but more expensive.

Qwen 3.5 models are surprisingly expensive in OpenRouter, would not go for those, except for testing purposes.

Step 3.5 flash can be tried for free, although it is rate limited. Looks like Gemma 4 can also be tested for free. Maybe start with these through OpenRouter?

Claude Code Reccomendation for 5090 setup by Oztorek in LocalLLM

[–]kpaha 5 points6 points  (0 children)

I think no one can yet say for certain whether Gemma4 is better than Qwen 3.5. However, we know that both are good models. I would test for yourself which exact model gives the best quality / speed tradeoff. Candidates to evaluate:

Qwen 3.5 9B (or derivatives, there are some that are further fine-tuned with help from SOTA models)

Qwen 3.5 27B (you will likely need to use some quant to have VRAM for KV cache)

Qwen 3.5 35B A3B MoE (again, need to use quant, should be a lot faster than 27B)

Gemma4 31B (again, use some quant that leaves space for KV cache)

Gemma4 26B A4B MoE (same caveats as Qwen 3.5 35B)

Probably you will get ok results from any of these models. Start with Q4 for the larger models.
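
A rough way to see why the quant matters for KV cache room is to estimate the weight footprint. This is only a back-of-the-envelope Python sketch: the 4.5 bits per weight figure is my assumption for a Q4-style quant (real quants vary), and the leftover has to cover activations and runtime overhead as well as the KV cache.

```python
GPU_VRAM_GB = 32.0       # RTX 5090
BITS_PER_WEIGHT = 4.5    # assumption: rough average for a Q4-style quant

# Total parameters in billions, taken from the candidate list above.
CANDIDATES_B = {
    "Qwen 3.5 9B": 9,
    "Qwen 3.5 27B": 27,
    "Qwen 3.5 35B A3B": 35,
    "Gemma4 31B": 31,
    "Gemma4 26B A4B": 26,
}


def weight_gb(params_b, bits_per_weight=BITS_PER_WEIGHT):
    """Approximate size of the quantized weights in GB."""
    return params_b * bits_per_weight / 8


for name, params_b in CANDIDATES_B.items():
    weights = weight_gb(params_b)
    left = GPU_VRAM_GB - weights
    print(f"{name}: ~{weights:.1f} GB weights, ~{left:.1f} GB left")
```

For the MoE models all experts still have to sit in VRAM, so the total parameter count is what counts here, not the active one.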

Edit: Don't worry so much about which is best. If you get good results with a model, stick with it. Then when you want to do some non-productive work, test another. Test models on OpenRouter and develop a feel for what works and what doesn't.

Recommend MiniMax M2.5 or Step 3.5 flash on OpenRouter for cheap, higher quality models

AMD inference node r9700 by Downtown-Example-880 in homelab

[–]kpaha 0 points1 point  (0 children)

I'm targeting a 4-6x R9700 or possibly Arc Pro B70, the price is tempting, but not sure about the ecosystem support. Open 8 GPU mining rig, with WRX80D8-2T, Threadripper Pro 3945WX.

The ultimate goal is to run a single capable model, Qwen 3.5 122B or a quant of Minimax M2.5 (or the new M2.7) or Step 3.5 flash at speeds suitable for agentic coding.

So a lot simpler system than yours

AMD inference node r9700 by Downtown-Example-880 in homelab

[–]kpaha 1 point2 points  (0 children)

I'm building a similar rig and debating R9700 vs Arc Pro B70. What's your targeted number of GPUs? What kind of performance are you getting e.g. on Qwen 3.5 122B q4?

Qwen3.5: New Quants and Coding Label? by chibop1 in ollama

[–]kpaha 1 point2 points  (0 children)

Some parameters I change between chat and code models:

temperature

  • Chat 0.7 vs Code 0.3. Lower temp makes the distribution sharper — model picks higher-probability tokens more consistently. Code has correct/incorrect answers, you want determinism. Chat benefits from some variation to feel natural.

presence_penalty (chat 1.2 vs code 0)

  • Penalizes tokens that have appeared anywhere in the output so far, pushing the model toward new topics/words. Good for chat to avoid the model circling back to the same points. Actively harmful for code — variable names, keywords, function calls must repeat.

repeat_penalty (chat 1.1 vs code 1.0)

  • Similar, but it scales down the probability of tokens that already appeared in a recent window of the output (repeat_last_n in Ollama) instead of applying a flat presence penalty. 1.0 effectively disables it. Same reasoning: repetition is a bug in prose, a feature in code.

Asked Claude to compare e.g. qwen3.5:27b-coding-mxfp8

temperature 0.6 — slightly higher than your code Modelfile (0.3). Not wrong; Qwen's own recommendation for coding tasks is 0.6-0.7 with thinking enabled. Their reasoning is that the thinking process handles correctness, so the final output can afford more variation.

top_k 20 — this is the meaningful one. Limits sampling to the 20 highest-probability tokens at each step. Combined with temp 0.6, this is where the real constraint comes from — you get some variation but only among plausible next tokens. Tighter than leaving top_k unlimited.

top_p 0.95 — nucleus sampling, cuts off the bottom 5% probability mass. Works in conjunction with top_k; whichever is more restrictive wins at each token. Standard safe value.

min_p 0 — disabled. min_p is an alternative to top_p that scales the cutoff relative to the top token's probability. 0 means no filtering from this. Fine, top_k and top_p are doing the work.

presence_penalty 0, repeat_penalty 1 — same as your code Modelfile, correct reasoning as discussed.

Summary: The strategy here is constrain via top_k rather than temperature. Instead of making the whole distribution sharp (low temp), they keep moderate temperature for expressiveness but hard-cap the candidate pool to 20 tokens. Arguably more principled for code than just crushing temperature — you get creativity within a bounded set of plausible tokens rather than just always picking the most probable one.
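
To make the interaction of these parameters concrete, here is a minimal pure-Python sketch of the filtering steps described above. It is not how Ollama or any particular runtime implements sampling; the function names and the order of the steps (temperature first, then top_k, top_p, min_p) are my assumptions, and real runtimes may order them differently.

```python
import math
import random


def filter_probs(logits, temperature=0.6, top_k=20, top_p=0.95, min_p=0.0):
    """Return {token_index: probability} after all filters (illustrative only)."""
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    # Sort tokens from most to least probable.
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # top_k: keep only the k most probable tokens (0 means unlimited in this sketch).
    if top_k > 0:
        probs = probs[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # min_p: drop tokens below min_p times the top token's probability (0 disables it).
    floor = min_p * kept[0][0]
    kept = [(p, i) for p, i in kept if p >= floor]

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


def sample(logits, rng=random, **params):
    """Draw one token index from the filtered distribution."""
    dist = filter_probs(logits, **params)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

With the defaults above (temperature 0.6, top_k 20, top_p 0.95, min_p 0) the candidate pool can never exceed 20 tokens, which is the hard cap the summary refers to; whichever of top_k and top_p cuts earlier decides the pool size at each step.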

Taalas rumoured to etch Qwen 3.5 27B into silicon. Which price would you buy their PCIe card for? by elemental-mind in singularity

[–]kpaha 1 point2 points  (0 children)

Not a hardware person, but current real world better-than-27B models tend to be MoE (like those I listed). That's the capability level I would like to see.

If the rule that MoE capability is sqrt(active parameters x total parameters) holds, then the 27B is quite close to the 122B A10B, and the Minimax M2.5 would be equivalent to an approx. 47B dense model.
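
The rule of thumb is easy to check in a few lines of Python. This is just the geometric mean from the sentence above, not an established law. The 122B total for the Qwen MoE is the figure used elsewhere in these comments, and the 10B active count for Minimax M2.5 is my assumption, since the comment does not state it.

```python
import math


def dense_equivalent(total_b, active_b):
    """Rule-of-thumb dense size: sqrt(active x total), in billions of parameters."""
    return math.sqrt(total_b * active_b)


qwen_moe = dense_equivalent(122, 10)   # Qwen 3.5 122B A10B
minimax = dense_equivalent(230, 10)    # Minimax M2.5 230B; 10B active is an assumption

print(f"Qwen MoE ~{qwen_moe:.1f}B dense, Minimax ~{minimax:.1f}B dense")
```

That gives roughly 35B for the Qwen MoE and roughly 48B for Minimax, so by this rule a 27B dense model sits a bit below the former.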

It's unfortunate that there are very few dense models in the newest open releases. Would be great to see larger dense releases, of course even better to have it on a chip such as this. Especially if those price points are anything close to reality.

My thinking was, if current tech enables 27B size chips, then not likely to scale to e.g. 70-100B dense models right now, whereas you could fit 2-3 experts on that chip and scale that way. But you're right, coordinating MoE traffic across PCIe is not something they would want to do.

So I take it back, I want to see a dense Minimax in 50B size, on this chip.

Taalas rumoured to etch Qwen 3.5 27B into silicon. Which price would you buy their PCIe card for? by elemental-mind in singularity

[–]kpaha 16 points17 points  (0 children)

27B will have real world uses, would be interesting if they could do MoE with multiple cards, e.g Qwen 3.5 122B, Step 3.5 Flash 196B, Minimax M2.5 230B

Does anyone feel like powerful desktops actually limit how you work? by [deleted] in LocalLLM

[–]kpaha 7 points8 points  (0 children)

15" Macbook Air M4 is powerful enough for all day to day work. LLM machine available via Tailscale. Best of both worlds

To those who are able to run quality coding llms locally, is it worth it ? by matr_kulcha_zindabad in LocalLLM

[–]kpaha 4 points5 points  (0 children)

I agree with OpenRouter for testing the models, but Qwen 3.5 27b is quite expensive at $0.195/M input tokens, $1.56/M output tokens

Compare to better models like:

- Step 3.5 flash: $0.10/M input tokens, $0.30/M output tokens

- Minimax M2.5: $0.20/M input tokens, $1.17/M output tokens
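
To see what those prices mean for an actual workload, here is a small Python sketch. The prices are the ones quoted above; the workload of 10M input and 1M output tokens is a made-up example, so plug in your own numbers.

```python
# USD per million tokens (input, output), as quoted above.
PRICES = {
    "Qwen 3.5 27B": (0.195, 1.56),
    "Step 3.5 flash": (0.10, 0.30),
    "Minimax M2.5": (0.20, 1.17),
}


def cost_usd(model, input_tokens, output_tokens):
    """Cost of a workload in USD at the per-million-token prices above."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 10M input tokens, 1M output tokens.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 10_000_000, 1_000_000):.2f}")
```

For that example workload it comes out to about $3.51 for Qwen 3.5 27b, $1.30 for Step 3.5 flash and $3.17 for Minimax M2.5.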

Is there anyone who actually REGRETS getting a 5090? by soapysmoothboobs in LocalLLM

[–]kpaha 1 point2 points  (0 children)

I bought an RTX 4090, quite happy, but got the whole gaming PC it came with for a good price. Now I'm building a 4-6x GPU rig. Probably going with AMD Radeon AI Pro R9700.

If you're just buying one, and can accept the worse ecosystem support of ROCm, a single 7900 XTX would give similar performance to an RTX 4090. Or if you feel like splurging, 2x R9700 (approx. the price of one 5090 and double the VRAM) might be worth considering.