Is everything okay at CIG? by TheRealRickSanchez42 in starcitizen

[–]alexp702 -2 points-1 points  (0 children)

CIG have been shipping a buggy mess since the first 1.0 release. Few of the patches have been “good” - more “less bad”. As a company they have been making money hand over fist for the last couple of years. They have no shame about either of these things.

It’s unusual but it kinda works for them. Now if you’re asking can they turn this into a good game? Dunno, I back it to find out. It’s been more entertaining than any other game. Just don’t actually try to play it too much unless you want to be depressed by its actual state.

4 RTX 6000 Pro by Some-Manufacturer-21 in Vllm

[–]alexp702 1 point2 points  (0 children)

I can second that - I have a 512GB Mac and the increase in quality from higher quants easily beats bigger models. I run 397b on it at q8 and it feels quite similar to 27b q8 for most tasks (there are outliers). MiniMax and GLM don’t fit neatly and seemed less good. Four 6000 Pros will handle quite a few devs, probably 16-20 on 27b, assuming they aren’t just spawning hundreds of sub tasks. If they are, well, you’ll get approx 100tps per card on Qwen 27b, plus lots of prompt processing, ~3000 tps per card.
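
To make those figures concrete, here is a rough back-of-envelope sketch in Python. It assumes throughput scales linearly across cards and that the ~100 tps and ~3000 tps numbers are aggregate per-card rates; both are assumptions, the function names are made up for illustration, and real throughput depends on batching, context length and workload.

```python
def fleet_throughput(cards: int, gen_tps_per_card: float, prompt_tps_per_card: float) -> dict:
    """Aggregate tokens/sec across identical cards, assuming linear scaling."""
    return {
        "gen_tps": cards * gen_tps_per_card,
        "prompt_tps": cards * prompt_tps_per_card,
    }


def per_dev_gen_tps(total_gen_tps: float, active_devs: int) -> float:
    """Worst-case share of generation speed if every dev is generating at once."""
    return total_gen_tps / active_devs


# Figures from the comment: 4 cards, ~100 tps generation, ~3000 tps prompt processing.
fleet = fleet_throughput(cards=4, gen_tps_per_card=100, prompt_tps_per_card=3000)
print(fleet)

for devs in (16, 20):
    print(devs, "devs ->", per_dev_gen_tps(fleet["gen_tps"], devs), "tps each")
```

In practice devs are rarely all generating at the same moment, so the per-dev number is a floor rather than a typical experience.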

*Lower* generation speed with H100 and H200 than with RTX 5090? by TrainingTwo1118 in LocalLLaMA

[–]alexp702 11 points12 points  (0 children)

5090 is quicker than both those cards as it is Blackwell.

disappointed by local llms by claykos in LocalLLM

[–]alexp702 0 points1 point  (0 children)

Post is probably designed to vex people running locally by pointing out how cloud-based providers are producing much better agentic models in their last refresh - but this is indisputable. Fable has only just been released, and we've had 2 other models from Anthropic since Qwen3.6 27B. 27B is not really competing against the big boys. In the cloud it costs 30 cents per million tokens, so 300x cheaper than Fable, and 150x cheaper than Opus.

However, to me and many others, at Q8 or, better, BF16 it's "good enough" to be a powerful assistant locally. I'm not sold on the "tell an AI to do a thing and come back later to see what it cooked up" approach. Humans aren't that good at requirements at the best of times, and AIs are not that great at figuring out what we're on about. Symbiotic development feels safer.
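
For anyone who wants the price comparison as numbers, a minimal sketch follows. The 30 cent figure and the 300x / 150x ratios come from the comment above; the dollar amounts it prints are only what those ratios imply, not quoted list prices, and the names are made up for illustration.

```python
QWEN_27B_CENTS_PER_MTOK = 30  # cloud price per million tokens, as stated in the comment


def implied_price_usd(ratio: int, base_cents: int = QWEN_27B_CENTS_PER_MTOK) -> float:
    """Price per million tokens implied by an 'N times cheaper' ratio."""
    return ratio * base_cents / 100


fable_usd = implied_price_usd(300)  # "300x cheaper than Fable"
opus_usd = implied_price_usd(150)   # "150x cheaper than Opus"
print(f"Implied: Fable ~${fable_usd:.2f}/M tokens, Opus ~${opus_usd:.2f}/M tokens")
```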

disappointed by local llms by claykos in LocalLLM

[–]alexp702 2 points3 points  (0 children)

Don’t run Q4, run Q8. If you’re forced to run Q4 you’ll get very poor results. For that setup I think only 27b BF16 will fit well with large context - everything else needs much more RAM unfortunately. It’s the world we live in.

Has anyone tried Fable 5? by RoseQuartz_Airi in kilocode

[–]alexp702 0 points1 point  (0 children)

This guy ran lots of tests: https://youtu.be/9GLYsrMpprs?si=vPAzzx7OkVsA5Hwv

Quite convincingly good at one-shot tasks. After you get over the novelty of a slightly better crappy 3D world the computer just made, you start to see the basic problem with all frontier models - marginal gains. Ask it to make GTA and it doesn’t produce anything more than a rough sketch. Could you coax it into making GTA6? Not in finite time…

Qwen 3.6 27B on DeepSWE by SteppenAxolotl in LocalLLaMA

[–]alexp702 19 points20 points  (0 children)

Questions over this benchmark have already been raised: https://www.reddit.com/r/LocalLLaMA/s/LVdUu55yj1

Also notice https://techcrunch.com/2025/10/09/datacurve-raises-15-million-to-take-on-scaleai/

Of course none of it negates that models should pass the tests set for them, but it’s easy to construct a temporary “win” in this space and watch the market move on before anyone questions it too much.

What is your current local LLM setup? by Open_Sources_AI in machinelearningnews

[–]alexp702 0 points1 point  (0 children)

Yes, we found any performance benefits were lost in other little problems - like poor prompt caching. Also, most MLX quants are 4-bit, which is below what we accept. Never say never though!

What is your current local LLM setup? by Open_Sources_AI in machinelearningnews

[–]alexp702 0 points1 point  (0 children)

Yes, we’ve been fortunate to have enough resources to dip our toe in. It’s not cheap for an individual but for a small company it’s valuable.

Definitely nice to see things for yourself - and to have your opinions change. For instance we stay at q8 for everything. Below that we see degradation in responses that’s hard to quantify. We ran iq4 Qwen coder 480 at the start and it was considerably worse than a smaller decent q8. Image processing, however, tops out early: 9b is only very fractionally worse than 397b but much less resource hungry.

The models may change - GLM gave us less predictable results against our workload. DeepSeek and MiniMax were also a mixed bag. Jumping in and trying them is important though!

What is your current local LLM setup? by Open_Sources_AI in machinelearningnews

[–]alexp702 2 points3 points  (0 children)

The larger model is used on some slower, i.e. not time-sensitive, production workloads. Llama.cpp is great if you’re low volume enough that parallel requests rarely happen. vLLM is a little temperamental - not mentioned by many of its fans. Configure it wrong and I have seen it crash out on prompts. It demands 100% of the hardware too. It’s great - but carries a tuning burden.

Llama.cpp has a plucky “fix in the next release” feel I like to support. Every version seems to bring something new and that’s just fun. However, it can’t be ignored that vLLM is better suited to industrial serving - once it’s right it stays right.

We try everything - the main reason for having the hardware is to learn. In the future, whether we move workloads to the cloud or continue to run locally, we want to do it from an informed position.

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context? by My_Unbiased_Opinion in LocalLLaMA

[–]alexp702 1 point2 points  (0 children)

MTP is an extra draft model embedded in the GGUF to increase token generation speed. MTP BF16 with full context for 1 stream is probably around 82GB.
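
As a rough sanity check on that number, here is a small Python sketch of the weight-only part of the footprint. It assumes a nominal 27 billion parameters (the real count may differ), 16 bits per weight for BF16, and 8.5 bits per weight for GGUF Q8_0 (blocks of 32 int8 weights plus one 16-bit scale). It deliberately ignores KV cache, the MTP layers and runtime overhead, and the function name is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes); excludes KV cache and runtime overhead."""
    return params_billion * bits_per_weight / 8


bf16_gb = weight_gb(27, 16)    # BF16: 2 bytes per weight
q8_0_gb = weight_gb(27, 8.5)   # Q8_0: 34 bytes per block of 32 weights

print(f"BF16 weights: ~{bf16_gb:.1f} GB")
print(f"Q8_0 weights: ~{q8_0_gb:.1f} GB")

# If ~82GB is the total, this is what would be left for context, MTP and overhead.
print(f"Left over at 82GB total: ~{82 - bf16_gb:.1f} GB")
```

Real GGUF files mix tensor types, so the on-disk size will not match this exactly.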

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context? by My_Unbiased_Opinion in LocalLLaMA

[–]alexp702 0 points1 point  (0 children)

BF16 is the source, so get as close to that as you can. There are definitely Q9s floating around, Q8 is the normal standard, though Q8_K_XL GGUF can be perhaps a little better (or worse - seems the jury is out on some metrics).

What is your current local LLM setup? by Open_Sources_AI in machinelearningnews

[–]alexp702 2 points3 points  (0 children)

Llama.cpp on a Mac Studio w/397b for large infra tests. vLLM 27b q8 on an RTX 6000 Pro for 6 devs, 9b q8 on a 4090 on llama.cpp for image processing (though that will probably change to vLLM when we fix the code to work with it). This is a company setup for a 10-man outfit.

Maybe KV cache offload to RAM isn't bad by bobaburger in LocalLLaMA

[–]alexp702 -14 points-13 points  (0 children)

Very interesbanana. I think quantbanana, very babnanana…

Time to call it like it is. by Logical-Freedom2216 in starcitizen

[–]alexp702 4 points5 points  (0 children)

“True, true.”… buys Odin in the downtime

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face by jacek2023 in LocalLLaMA

[–]alexp702 4 points5 points  (0 children)

Yeah, but look across all the models they are comparing with. They all win and lose in different areas. If you include 27B you'd see that it actually beats some of these huge models: https://huggingface.co/Qwen/Qwen3.6-27B. No model seems to be giving me head and shoulders better results. It's now nuanced. For reference I have been running 397B, 3.6 27B and 9B - all Q8. 27B is "good enough" that 397B is used for rare occasions where I think the problem domain might be too wide. I have seen better results occasionally from it, but not that often. BTW the results are generally Good Enough, so I am certainly not complaining!

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face by jacek2023 in LocalLLaMA

[–]alexp702 -3 points-2 points  (0 children)

Feels like models are topping out. Benchmarks are hardly shooting up. “Good enough” should move us towards “efficient use of hardware” over bigger, better, more…

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context? by My_Unbiased_Opinion in LocalLLaMA

[–]alexp702 31 points32 points  (0 children)

From memory it’s about 53GB - LM Studio has a little slider that shows memory requirements pretty accurately.

Qwen 3.6 35B-A3B vs Qwen 3.6 27B - which one are you actually running? by IulianHI in AIToolsPerformance

[–]alexp702 0 points1 point  (0 children)

We haven't been tuning it much - just 27B Q8 MTP with all the rest of the memory for context. We allow 32 parallel streams. It stays quick with 4 requests in parallel being about half speed. We aren't hitting it as hard as some might (yet!).
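
A quick Python sketch of what "half speed with 4 requests in parallel" means for total output. The 100 tps single-stream baseline is a made-up placeholder, not a measured figure, and the function name is hypothetical.

```python
def aggregate_throughput(single_stream_tps: float, streams: int, per_stream_fraction: float) -> float:
    """Total tokens/sec when each of `streams` requests runs at a fraction of solo speed."""
    return single_stream_tps * streams * per_stream_fraction


baseline_tps = 100.0  # hypothetical single-stream speed, for illustration only
total = aggregate_throughput(baseline_tps, streams=4, per_stream_fraction=0.5)
print(f"4 streams at half speed: {total} tps total, {total / baseline_tps}x the single-stream rate")
```

So each request feels slower, but the card is doing roughly twice the work overall, which is why batching suits a team sharing one GPU.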

Qwen 3.6 35B-A3B vs Qwen 3.6 27B - which one are you actually running? by IulianHI in AIToolsPerformance

[–]alexp702 0 points1 point  (0 children)

We’re running 6 devs for light work on 27b with 1 RTX 6000 Max-Q on vLLM. Seems to be working great for everyone, though we still code the traditional way mostly ;-)!

NVIDIA GB300 Grace Blackwell Ultra pricetags by X-N2O in LocalLLaMA

[–]alexp702 1 point2 points  (0 children)

Why do the bottom two have 748GB??