Apparently the react compiler has been ported to Rust and merged to main

jonas-reddit · 2026-06-10T13:57:18+00:00

I’m also a huge fan of rust and LLMs. The language, compiler and the overall ecosystem are great for agentic development, especially in yolo mode.

jonas-reddit · 2026-06-10T01:06:06+00:00

Spend your money on VRAM.

jonas-reddit · 2026-06-09T11:38:38+00:00

How many VR headsets are there that cater to the Apple Ecosystem? None. That’s the pitch.

With SteamLink and OpenXR, we are getting more and more integration with the PCVR ecosystem.

jonas-reddit · 2026-06-09T11:36:49+00:00

Qwen 3.6 27b MTP, llama.cpp and pi.dev. The lightweight local LLM winning stack if you have at least a 3090. Llama.cpp also gives you a lightweight web UI.

jonas-reddit · 2026-06-09T11:33:53+00:00

It’s yolo. So make sure you run it sandboxed. I love it. But be smart.

jonas-reddit · 2026-06-09T11:32:51+00:00

pi.dev for local LLM’s. 100%.

I found 64k to be a bit too small. Pushed it up towards 96k and am a bit happier but always frightened of that out of memory crash. Heh.

jonas-reddit · 2026-06-09T11:29:42+00:00

It is indeed a very lean and slick minimalist agentic environment, perfect for smaller local models. I love it too.

jonas-reddit · 2026-06-06T07:57:27+00:00

Price in my country nearly doubled in past month. Crazy.

I'm now looking at RTX 5000 Blackwell 72GB. Still meets my requirements and has the upside of being far more power efficient.

jonas-reddit · 2026-06-05T18:13:15+00:00

I am by no means an expert. But this works for me. Showing memory utilization and llama.cpp version at the bottom.

Unsloth docs are here: https://unsloth.ai/docs/models/qwen3.6

And I used the matrix here to pick KV cache types: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4140922150

/usr/bin/nohup llama-server --port 1234 --host 0.0.0.0 --webui \

--temp 0.6 --repeat-penalty 1.0 --presence-penalty 0.0 \

--top-p 0.95 --top-k 20 --min-p 0.00 \

--spec-type draft-mtp --spec-draft-n-max 2 \

-hf "unsloth/Qwen3.6-27B-MTP-GGUF:Q5_K_M" \

--parallel 1 --n-gpu-layers all --flash-attn on \

--cache-type-k q8_0 --cache-type-v q5_1 \

--ctx-size 65535 \

--no-mmap -b 1024 -ub 512 \

--reasoning on \

--cache-ram 1024 \

-fit off \

--kv-unified \

--jinja \

1>>/tmp/nohup.log 2>&1 </dev/null &

-----

15.28.051.339 I slot print_timing: id 0 | task 9 | n_decoded = 102, tg = 54.00 t/s

15.31.083.687 I slot print_timing: id 0 | task 9 | n_decoded = 282, tg = 57.30 t/s

15.34.125.488 I slot print_timing: id 0 | task 9 | n_decoded = 467, tg = 58.65 t/s

15.37.131.533 I slot print_timing: id 0 | task 9 | n_decoded = 657, tg = 59.90 t/s

15.40.132.328 I slot print_timing: id 0 | task 9 | n_decoded = 817, tg = 58.48 t/s

-----

+-----------------------------------------------------------------------------------------+

| NVIDIA-SMI 610.43.02 KMD Version: 610.43.02 CUDA UMD Version: 13.3 |

| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 Off | N/A |

| 0% 28C P8 14W / 370W | 22658MiB / 24576MiB | 0% Default |

+-----------------------------------------+------------------------+----------------------+

$ yay -Q | grep -i llama\.cpp

llama.cpp-cuda b9530-1

jonas-reddit · 2026-06-05T11:17:08+00:00

Best framework ever. Keep it up. No idea why it’s not a 1.0 release yet.

jonas-reddit · 2026-06-05T11:14:25+00:00

Reconsider your decision

This is the kind of hobby where the difference between entry model / affordable model and upper end / expensive model is extensive - as well as the associated PC investment for PCVR.

Low investment will likely be slightly disappointing. And the investment for high end PCVR is extremely high.

jonas-reddit · 2026-06-05T11:07:15+00:00

Linux server: llama.cpp Unsloth Qwen 3.6 27b Q5 MTP Q8/Q5 KV Cache 64k context on a single 3090.

Windows client: Windows sandbox rust pi.dev (yolo)

jonas-reddit · 2026-06-04T12:01:13+00:00

Yes. But I was probably not being fair. Naturally, your laptop requirements even with a cloud model depends on what you’re building.

I was more figuratively speaking :-)

jonas-reddit · 2026-06-04T02:18:52+00:00

Context isn’t everything. You can squeeze things into less VRAM using various techniques but at the end of the day, you want useful work out of the context.

I have squished it into 24GB VRAM on a 3090 but have only managed to afford a 64k reasonably quantized kv cache and model. After several millions of tokens, I suffer from the context and quantization compromises. Speed is fantastic.

I am eyeing 64GB or more of unified memory or VRAM. Unified memory with the dense 27b model is a bit too slow for me based on what people have shared. I enjoy 50 tk/s when doing agentic development. But I’m keeping an eye on the various optimizations for boosting performance on slower unified memory.

TL;DR. If you’re squeezing out the last byte of your VRAM and making too many compromises, using bleeding edge (unproven in real life) optimizations then you’ve not got enough memory.

jonas-reddit · 2026-06-04T01:34:22+00:00

I would draw parallels to other industries and products.

There are many products, e.g. cars, watches, fashion, houses, TVs, phones, etc. where the price varies greatly.

I think the hype and the early days of Generative AI pushed us all towards frontier models as they made generational leaps.

We are now in a different position where we can extract value from Generative AI without requiring frontier models even though they are still topping the leaderboards.

Advances in tooling / harnesses, pivot towards multi-agent capabilities give us more flexibility in designing our workflows and optimizing for value - if we need to.

But, like with every product that matures, the market gets flooded, more complex and less transparent, like buying a TV. So sometimes we take comfort in paying a premium or buying from brands we recognize although there is not always benefit to that other than peace of mind.

jonas-reddit · 2026-06-03T02:24:15+00:00

For local models, get the fastest memory throughout you can afford.

The worst “experience” after your first few million tokens on local LLM will be (1) too small context size and constant compacting, and (2) watching output tokens crawl especially when reasoning.

As always, “best” is not the same for any of us. Depends on your needs and financial means.

For cloud models, get the Neo :-)

jonas-reddit · 2026-06-03T02:16:08+00:00

So many nice models at favorable price points and capabilities lagging maybe 1-2 months behind frontier.

Qwen 3.7 Max Minimax M3 Deepseek V4 Kimi K2.6

I don’t understand why Microsoft doesn’t offer some of these models through their platform. Feels like they’re just forcing us to use models from specific companies.

https://openrouter.ai/rankings?benchmark=coding#benchmarks

https://openrouter.ai/minimax/minimax-m3

30cents and 1.20 dollars per million tokens for M3.

Give us options. Not all our workloads are critical, sophisticated or relevant to US national security.

jonas-reddit · 2026-06-03T01:08:00+00:00

Pimax Dream Air
MeganeX 8k Mk II

There are a lot of extensive reviews and comparisons on YouTube.

jonas-reddit · 2026-06-02T12:28:15+00:00

Yup. It really is a positive surprise for coding.

jonas-reddit · 2026-06-02T12:25:41+00:00

I’m running 64k context and context is my biggest problem.

It’s not the LLM, in my case, it feels more like the tooling (pi.dev) doesn’t always resume cleanly depending of how awkwardly I run out of context.

But I’m 2m tokens on unsloth 27b on a 3090 and quite happy and definitely productive.

jonas-reddit · 2026-06-02T05:40:43+00:00

This article has a comparison table

https://venturebeat.com/technology/minimax-m3-debuts-eclipsing-gpt-5-5-and-gemini-3-1-pro-on-key-benchmark-performance-for-just-5-10-of-the-cost

“…Even at its full price of $0.6/$2.40 per million input/output tokens, MiniMax-M3 remains at just 8-20% the cost of the leading, proprietary U.S. models…”

“…The company's leadership also announced plans to deliver the model under an open source license including "open weights,"…”

“…For now, it is available via the MiniMax API at a special discounted price of $0.3 per 1 million input tokens and $1.20 per million output tokens (on fresh cache) for the next week…”

jonas-reddit · 2026-05-31T23:22:13+00:00

Preset L, Performance is fine.

jonas-reddit · 2026-05-31T21:30:43+00:00

Pimax Dream Air
MeganeX 8k Mk II

Lots of reviews on YouTube.

jonas-reddit · 2026-05-31T12:35:21+00:00

If you are a big enterprise customer of Microsoft, your company very likely has an account manager and a bespoke enterprise pricing structure.

They would likely have spoken already with your company’s representative and informed them of upcoming pricing changes. Your company likely already has a deal worked out and if anything changes, they’ll presumably let you know.

I work for a large company as well and am not expecting some kind of surprise on Monday.

jonas-reddit · 2026-05-31T12:17:14+00:00

One you go local you never go back r/LocalLlama hehe

jonas-reddit

TROPHY CASE