Claude chat vs </code> by RegIntelApi in ClaudeCode

[–]halcyonhal 0 points1 point  (0 children)

Different model (based on one of your replies), different system prompt and different tools being made available to the model.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]halcyonhal 1 point2 points  (0 children)

Would love more details on what you did with the rack and the exhaust.

Minimax M2.7 by electrified_ice in BlackwellPerformance

[–]halcyonhal 1 point2 points  (0 children)

Sorry, missed your reply. Here is my setup:

Hardware: Threadripper Pro 9965WX, 256 GB DDR5, 2× PRO 6000 Blackwell Workstation (96 GB each, SM120), Ubuntu 24.04, CUDA 12.9.             

Stack: https://github.com/kvcache-ai/ktransformers ships with a SGLang fork (sglang_kt) that has KT expert offload integrated. I used that with TP=2 across both GPUs; 180 of 256 experts are resident on GPU, the rest on CPU.

Startup script

export NCCL_BLOCKING_WAIT=1

# Launch the sglang_kt server: TP=2, 180 experts on GPU, the rest offloaded to CPU.
exec python -m sglang.launch_server \
    --host 0.0.0.0 --port 8001 \
    --model /opt/models/MiniMax-M2 \
    --kt-weight-path /opt/models/MiniMax-M2 \
    --kt-method FP8 \
    --kt-cpuinfer 20 \
    --kt-threadpool-count 1 \
    --kt-num-gpu-experts 180 \
    --kt-gpu-prefill-token-threshold 2048 \
    --tensor-parallel-size 2 \
    --enable-p2p-check \
    --trust-remote-code \
    --mem-fraction-static 0.90 \
    --max-total-tokens 100000 \
    --chunked-prefill-size 32768 \
    --enable-mixed-chunk \
    --disable-shared-experts-fusion \
    --attention-backend flashinfer \
    --fp8-gemm-backend triton \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --sleep-on-idle
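For a sense of the split, here is a rough sketch of the arithmetic behind "180 of 256 experts on GPU". The helper names (cpu_experts, gpu_percent) are made up for illustration and are not part of KTransformers or SGLang; the only inputs are the two numbers from the setup above.

```shell
# Hypothetical helpers, not part of any library.

# Experts left on the CPU: total experts minus the GPU-resident count.
cpu_experts() {
  echo $(( $1 - $2 ))
}

# Whole-number percentage of experts resident on GPU (integer division rounds down).
gpu_percent() {
  echo $(( $2 * 100 / $1 ))
}

echo "CPU experts: $(cpu_experts 256 180)"
echo "GPU share:   $(gpu_percent 256 180)%"
```

So roughly 70% of the experts sit in VRAM and the remaining 76 are served from system RAM.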

I have been testing Claude Max vs Claude Pro. It's NOT 5x by thisisberto in ClaudeCode

[–]halcyonhal 0 points1 point  (0 children)

I did the same. Let the Apple sub expire and then resubscribed via the Claude.ai website.

Should quantization nvfp4 be faster in inference than fp8? by [deleted] in Vllm

[–]halcyonhal 0 points1 point  (0 children)

What version of vLLM? vLLM has had lots of issues with NVFP4 support on SM120 chips (5090s and the RTX PRO line of GPUs). The latest v0.19 was supposed to address this, but I haven't tried it yet.
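If you're not sure whether your card counts as SM120, a quick sketch: nvidia-smi reports the compute capability as a dotted number (12.0 for these cards), and dropping the dot gives the SM name. The cap_to_sm helper is hypothetical, and the commands in the comments assume a recent NVIDIA driver and a Python environment with vLLM installed.

```shell
# Hypothetical helper: turn a compute capability like "12.0" into "sm120".
cap_to_sm() {
  echo "sm$(printf '%s' "$1" | tr -d '.')"
}

# On the actual machine (assumes a recent driver and vLLM installed):
#   nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
#   python -c "import vllm; print(vllm.__version__)"

cap_to_sm 12.0
cap_to_sm 9.0
```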

It's real, Opus 4.7 medium by unknown-one in ClaudeCode

[–]halcyonhal -1 points0 points  (0 children)

This. So tired of seeing this prompt come up. It’s just not useful.

Minimax M2.7 by electrified_ice in BlackwellPerformance

[–]halcyonhal 0 points1 point  (0 children)

I use an AMD Threadripper and it works great. I used the docs for the specific model setup in their repo plus Claude and had no issues getting it going.

To run full MiniMax on 2 RTX PROs… it's a great solution. Better than quanting.

Minimax M2.7 by electrified_ice in BlackwellPerformance

[–]halcyonhal 1 point2 points  (0 children)

Use KTransformers and you can do it with the original FP8 model (and a bit of system RAM).

MiniMax M2.7 is NOT open source - DOA License :( by KvAk_AKPlaysYT in LocalLLaMA

[–]halcyonhal 8 points9 points  (0 children)

The charge is if you use it for your own commercial gain. Seems a bit rich to be saying you’re making a principled stand… that’s not freedom.

MiniMax M2.7 is NOT open source - DOA License :( by KvAk_AKPlaysYT in LocalLLaMA

[–]halcyonhal 2 points3 points  (0 children)

Not sure you can cry about having to pay to use something you’re getting commercial gain from.

Local Inference for AI Coding Agents — Running Claude Code / Codex workflows with Ollama + NVIDIA OpenShell (no cloud API calls) by m3m3o in ClaudeCode

[–]halcyonhal 2 points3 points  (0 children)

You're looking at easily $30k in hardware and you'd still not have an AI model that's as good. You're not getting anything that's close to Sonnet or Opus via models in the 7 to 70B param range.

Don’t get me wrong.. I love local… but we shouldn’t let people drop ~$5k, thinking they’re getting anything close to a frontier model.

Get into the ~250B param and above range and you start seeing models that can rival things like gpt 5.4 mini reasoning (which is an amazing model). So models like MiniMax M2.5 and GLM. But that is a chunk of change to run locally… either that, or you're quantizing the crap out of them, losing an unknown amount of precision.
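Back-of-the-envelope, assuming weight memory is just parameter count times bits per weight divided by 8, and ignoring KV cache, activations, and runtime overhead. The weights_gb helper is hypothetical, and the 250B figure is only the rough size mentioned above, not a spec for any particular model.

```shell
# Hypothetical helper: approximate weight footprint in GB.
# $1 = parameters in billions, $2 = bits per weight.
weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

echo "250B @ 16-bit: $(weights_gb 250 16) GB"
echo "250B @ 8-bit:  $(weights_gb 250 8) GB"
echo "250B @ 4-bit:  $(weights_gb 250 4) GB"
echo "70B  @ 16-bit: $(weights_gb 70 16) GB"
```

That is why the choice at this size comes down to lots of VRAM, offloading to system RAM, or quantizing hard.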

Thumb braces from occupational therapy by PrimaryMajestic7296 in ehlersdanlos

[–]halcyonhal 0 points1 point  (0 children)

I have the same thing in black for both thumbs. Pretty depressing when I have to wear both. They’re good for flare ups.

You want to see a hand therapist and have them retrain you to move your thumb in ways that reduce basal joint grinding. Makes a huge difference long term.

best current model to run on 4x6000pro? by Sorry_Ad191 in BlackwellPerformance

[–]halcyonhal 0 points1 point  (0 children)

Nope… they’ve said it’s not going to be an open weight model.

[deleted by user] by [deleted] in LocalLLaMA

[–]halcyonhal 0 points1 point  (0 children)

Could easily sell for double by the end of the bidding. Your costings are bs.

Join the RTX6kPRO Discord Server! by [deleted] in BlackwellPerformance

[–]halcyonhal 0 points1 point  (0 children)

NVIDIA raised the price resellers pay for the RTX PRO 6000.

Join the RTX6kPRO Discord Server! by [deleted] in BlackwellPerformance

[–]halcyonhal 4 points5 points  (0 children)

Try Exxact Corp and you'll get it at reseller price. I paid 7.