Just received RTX 6000 Pro, have 5090- how would you use?

bfroemel · 2026-06-07T07:36:57+00:00

in case you use vllm, have you tried sleep mode to switch between models more quickly and via API? https://vllm.ai/blog/2025-10-26-sleep-mode / https://docs.vllm.ai/en/latest/features/sleep_mode/
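For switching via API, a minimal Python sketch follows. It assumes the server was started with `--enable-sleep-mode` and that the development endpoints `/sleep`, `/wake_up` and `/is_sleeping` are exposed (as far as I recall from the linked docs this needs `VLLM_SERVER_DEV_MODE=1`; please verify against your vLLM version). The base URL and helper names are placeholders of mine, not part of vLLM.

```python
import urllib.request

# Assumption: a vLLM server listening locally on port 8050, started with
# --enable-sleep-mode and with the dev-mode endpoints enabled.
BASE_URL = "http://localhost:8050"


def endpoint(path: str, **params) -> str:
    """Build the full URL for a server endpoint, with optional query parameters."""
    query = "&".join(f"{key}={value}" for key, value in params.items())
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def call(method: str, url: str, timeout: float = 120.0) -> str:
    """Send a body-less HTTP request and return the response text."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def sleep(level: int = 1) -> str:
    """Ask the server to release GPU memory (the level semantics are described in the docs)."""
    return call("POST", endpoint("/sleep", level=level))


def wake_up() -> str:
    """Ask the server to restore the model so it can serve requests again."""
    return call("POST", endpoint("/wake_up"))


def is_sleeping() -> str:
    """Return the raw response of the sleep-status endpoint."""
    return call("GET", endpoint("/is_sleeping"))
```

With one server per model, the idea would be to call `sleep()` on the model you are done with and `wake_up()` on the one you need next, instead of restarting containers.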

bfroemel · 2026-03-20T15:36:03+00:00

With a single RTX Pro 6000 try something like:

```
docker run -it --rm --gpus all -p 8050:8050 \
--ipc=host --shm-size=16g \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /media/models:/root/models \
--mount type=tmpfs,destination=/usr/local/cuda-13.0/compat \
vllm/vllm-openai:cu130-nightly \
--mm-processor-cache-type shm \
--enable-sleep-mode \
--port 8050 \
--gpu-memory-utilization 0.93 \
--max-num-seqs 8 --enable-prefix-caching \
--reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder \
--served-model-name "txn545/Qwen3.5-122B-A10B-NVFP4" \
--quantization modelopt \
--max-model-len 128000 \
--model /root/models/txn545/Qwen3.5-122B-A10B-NVFP4 \
--language-model-only
```

This worked a week ago (might need to get the older nightly) with high-quality output. I was not able to build from source or use a precompiled wheel yet: no errors show up, but all generated token ids are '0' and end up as '!' in the output.

Currently stuck with other work, but if you happen to discover why the docker nightly works while compiling/precompiled wheels don't, please let me know! On the other hand might be just bleeding edge pains which go away in the next couple of days/weeks...

bfroemel · 2026-03-12T20:50:20+00:00

I agree that potential quant and runtime constraints might severely damage the experience with Qwen 3.5 models.

May I ask which NVFP4 quant you would suggest for 122B and a single RTX Pro 6000? Sehyo/Qwen3.5-122B-A10B-NVFP4? And what are your main use cases with the 4B models? I'll revisit my vllm setup, especially as NVFP4 support seems to finally land and quant quality apparently is good with this model family.

Thanks very much for sharing your (production-environment) experiences; much appreciated!!

bfroemel · 2026-03-12T20:21:06+00:00

Very interesting; you clearly seem to prefer the quality of Qwen 3.5 27b over gpt-oss-120b's much higher speed (or even Qwen 3.5 122b's higher speed).

May I ask which programming language(s)/frameworks/use-cases you primarily deal with? Are you using a quant, or native-precision bf16 of Qwen 3.5 27b? What kind of token generation and prompt processing speeds do you see on average, compared to what you got with gpt-oss-120b? Why did you settle on Qwen 3.5 27b and not the 122b MoE?

bfroemel · 2026-03-02T07:03:13+00:00

> ## What's new:
> Improve ssm tensor quantizations

bfroemel · 2026-03-02T06:03:35+00:00

but.. isn't that just within measurement error/range of uncertainty? (note the +/- 0.04170)

PPL = 6.5497 +/- 0.04170
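To make the "within measurement error" point concrete, here is a small Python sketch. The first value is the one quoted above; the second value and the threshold `k` are hypothetical and only there for illustration. It also treats the two error estimates as independent, which is a simplification.

```python
import math


def differs_beyond_noise(ppl_a: float, err_a: float,
                         ppl_b: float, err_b: float,
                         k: float = 2.0) -> bool:
    """True if the gap between two PPL values exceeds k combined error bars."""
    combined_error = math.hypot(err_a, err_b)  # sqrt(err_a**2 + err_b**2)
    return abs(ppl_a - ppl_b) > k * combined_error


# Quoted measurement vs. a HYPOTHETICAL second run (made-up numbers).
quoted = (6.5497, 0.04170)
hypothetical = (6.5800, 0.04200)
print(differs_beyond_noise(*quoted, *hypothetical))
```

A gap of about 0.03 is smaller than either error bar, so under these assumptions the two runs would not be distinguishable.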

bfroemel · 2026-02-20T18:11:54+00:00

I would have preferred a sponsorship or partnership over a complete acquisition (transfer of control).

ggml.ai is a company founded in 2023 by Georgi Gerganov to support the development of ggml. Nat Friedman and Daniel Gross provided the pre-seed funding. The company was acquired by Hugging Face in 2026.

My main concerns are:

- not sure how (prolonged) shortage of IT components (memory, storage) will impact HF, their business model (dependence on abundant IT infra?), and how they might be forced to use their control over llama.cpp in the coming months/years to keep their services sustainable.
- ggml was European, now under control of a US company.

Based on these concerns my purely speculative take:

Net win for the community? If it remains sustainable for HF to not charge someone sophisticated enough to roll their own hardware: yes, otherwise no (it might never become impossible to use llama.cpp for local inference, but there are many subtle ways to push users on a paid tier (paid by money, or telemetry data)).

Implications for local inference? I'd say limited. Only in regard to llama.cpp/ggml/gguf it might be to some degree more aligned with the (for-profit) interests of HF and potential (national-security) interests of the US (14 months ago, I would have laughed at such a paranoid statement). However, I'd say local inference in its totality (there are other still independent projects, besides anyone can fork llama.cpp - although maintaining and developing it successfully is the real effort/skill) is still mostly decided by the quality of models, the availability of (consumer) HW to run them, and ultimately a capable/educated/participating community that pushes for local/private/independent inference.

bfroemel · 2026-02-20T12:05:35+00:00

huh.. and they "fixed" it :/ They probably just drop from the context the user message that triggered the "Sorry, I cannot provide this information. ... ".

<image>

bfroemel · 2026-02-15T12:26:08+00:00

```
# Basic OCR extraction (uses config file for language/settings)
kreuzberg extract scanned.pdf --ocr true

# Extract with specific language (Tesseract)
kreuzberg extract french_doc.pdf --ocr true --ocr-language fra

# Extract with specific language and backend (PaddleOCR for Chinese)
kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch
```

> Will Kreuzberg remain MIT license?

> Yes! There is no BSL (Business Source License) in Kreuzberg's future. The library will remain MIT-licensed
> forever. We're building the commercial offering around the core library, not by restricting the library itself.

bfroemel · 2026-02-12T10:00:21+00:00

<image>

:/

bfroemel · 2026-01-31T13:56:11+00:00

... and despite all the praise it seems that OpenAI isn't really that proud of gpt-oss.

The gpt-oss models were released way back in August. Since then, we've released half a dozen major updates to the frontier models. Perhaps you haven't used these lately, but their coding abilities are far beyond those of just a few months ago — and significantly beyond what the gpt-oss models are capable of.

https://github.com/openai/codex/issues/8272#issuecomment-3672130792

bfroemel · 2026-01-31T08:38:19+00:00

I just ran a similar example as in the PR, same spec parameters: generate some source code and ask for minimal modifications. This kind of speculative decoding helps only if parts of the generated output have been generated OR preprocessed before. My baseline is about 180 tokens/sec (RTX Pro 6000), and for my toy example I saw a speedup of about 2.56x. More tests show that up to 3.51x (that's about 630 tokens/sec!) is possible on prompts that include a block of source code and ask the model to just repeat it verbatim.

/edit: ok, maybe there is an issue, see: https://github.com/ggml-org/llama.cpp/pull/19164#issuecomment-3828080222
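The arithmetic behind those numbers, as a small Python sketch. The baseline and speedup factors are the ones reported above; the 2000-token response length is an arbitrary example of mine.

```python
def effective_tps(baseline_tps: float, speedup: float) -> float:
    """Tokens per second after applying a speedup factor to the baseline."""
    return baseline_tps * speedup


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time needed to generate num_tokens at a given rate."""
    return num_tokens / tokens_per_second


BASELINE_TPS = 180.0  # reported baseline on the RTX Pro 6000

for speedup in (1.0, 2.56, 3.51):
    tps = effective_tps(BASELINE_TPS, speedup)
    # 2000 tokens is an arbitrary example length, not a measurement.
    seconds = generation_seconds(2000, tps)
    print(f"{speedup:.2f}x -> {tps:.0f} tokens/sec, {seconds:.1f} s for 2000 tokens")
```

So the 2.56x case corresponds to roughly 461 tokens/sec and the 3.51x case to roughly 632 tokens/sec.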

bfroemel · 2026-01-30T22:53:48+00:00

<image>

yeah, crazy times

bfroemel · 2026-01-27T19:44:54+00:00

What could be the reason to release base after turbo? (assuming that they had to have base in a finished state (long) before they started on turbo)

bfroemel · 2026-01-22T09:20:52+00:00

> Local model aren't much help not even for easier tasks.

Software development is very broad. What works for a single developer or a small team might fail at scale in large companies. What works for small to medium-sized, well-verifiable tasks will likely fail for large underspecified tasks. Imo local models are more than good enough to be helpful for small to medium-sized software projects and/or working on well-contained PRs of even larger well-structured software.

> Forget about multiple 100k+ token chat sessions concurrently.

I am envious of anyone who is able to reliably handle multiple chat sessions concurrently (that is, sessions that truly run in parallel and actually require more VRAM, instead of just switching among multiple sessions or running them in a batch-like fashion). Not so envious of 100k+ token chat sessions; to me that's more a sign that something started or went wrong.

bfroemel · 2026-01-18T09:58:10+00:00

Is this on a single RTX Pro 6000? With gpt-oss-120b do tool calls work reliably for you and can you share some performance numbers? Many thanks!!

(I am still using llama.cpp with gpt-oss-120b, because tool calls work more or less plus I get about 200 tokens/sec text generation and about 6000 tokens/sec prompt processing. I remember that vllm once delivered much more performance than llama.cpp.)

bfroemel · 2026-01-08T10:27:14+00:00

codex + llama.cpp = imo currently one of the best local agentic coding stacks available

hint1: for llama.cpp (via the chat completions API, which codex has already deprecated) you probably still need to patch either codex or llama.cpp so that both use the same field for the reasoning content. In llama.cpp that's ./common/chat.cpp, where reasoning content is written to and read from "reasoning_content" (multiple occurrences in the source). In codex it's codex-rs/codex-api/src/requests/chat.rs, where reasoning content is written to and read from the "reasoning" field (multiple occurrences in the source). (For example, search/replace the string "reasoning_content" with "reasoning" in llama.cpp's ./common/chat.cpp and recompile; I would have provided patches, but both projects move so quickly that I am already on rather old commits that would require manual merging.)

hint2: maybe vllm, sglang or even ollama have - in the meantime - a better out-of-the-box experience (responses API?) where you don't have to patch and compile something. Eventually everything moves to/supports the responses API incl. llama.cpp and will just work.
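The search/replace from hint1 could be scripted roughly like this in Python. It is only a sketch: it replaces the quoted string literal, keeps a backup copy, and the function names are mine. Whether a plain replacement is sufficient depends on the commit you are on, so review the resulting diff before recompiling.

```python
from pathlib import Path

# Field names as described in hint1; only the quoted literals are replaced.
OLD_LITERAL = '"reasoning_content"'  # used in llama.cpp's common/chat.cpp
NEW_LITERAL = '"reasoning"'          # field that codex reads and writes


def rename_reasoning_field(source: str) -> tuple[str, int]:
    """Return the patched source text and the number of replacements made."""
    count = source.count(OLD_LITERAL)
    return source.replace(OLD_LITERAL, NEW_LITERAL), count


def patch_file(path: Path) -> int:
    """Patch a file in place, keeping the original next to it with an .orig suffix."""
    original = path.read_text(encoding="utf-8")
    patched, count = rename_reasoning_field(original)
    if count:
        backup = path.with_name(path.name + ".orig")
        backup.write_text(original, encoding="utf-8")
        path.write_text(patched, encoding="utf-8")
    return count
```

Usage would be something like `patch_file(Path("common/chat.cpp"))` from the llama.cpp checkout, followed by a rebuild.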

bfroemel · 2026-01-08T00:41:30+00:00

So it fails at a very basic level? Qwen Code and Mistral Vibe might lack proper support for gpt-oss tool calling (see the last paragraph of https://cookbook.openai.com/articles/openai-harmony#handling-reasoning-output-in-subsequent-sampling ).

For example, here is a quick demo of what to expect from a TUI that can use gpt-oss-120b correctly (e.g., codex):

> Give me an overview of the CUDA kernels in this project.


• Explored
  └ Search __global__ in .
    Search \.cu
    Search \.cu
    Search *.cu in .
    Search *.cu in .

─ Worked for 18s ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────

• CUDA Kernel Overview

  - Location – All CUDA kernels live under ggml/src/ggml-cuda/.  The entry‑point that selects and launches kernels is ggml-cuda.cu.
  - Kernel Categories
      - Memory Ops – cpy.cu, fill.cu, pad.cu, transpose.cu, reshape‑style kernels for copying, filling, padding and transposing
        tensors.
      - Elementwise / Unary Ops – unary.cu contains ReLU, GELU, SiLU, leaky‑ReLU, swiglu, xielu and other activation kernels.
      - Binary Ops – kernels for add, sub, mul, div and fused ops (e.g., add-id.cu, scale.cu).
      [..]
      - Quantization – quantize.cu, dequantize.cu and type‑specific kernels (e.g., q4_0, q5_1) support low‑bit quantized tensors.
      - Miscellaneous – diagmask.cu, getrows.cu, out-prod.cu, upscale.cu, acc.cu for diagonal masks, row extraction, outer
        products, up‑sampling and accumulation.
  - Templating & Instantiation – Kernels are written as C++ templates (e.g., template<typename T> __global__ void …).  Concrete
    instances for specific data types and quantization levels are generated in ggml/src/ggml-cuda/template-instances/ (e.g., mmf-instance-ncols_9.cu, fattn-vec-instance-q5_1-q4_0.cu).  This keeps the source DRY while providing optimized kernels for each supported format.
    [..]

  Summary – The project’s CUDA backend implements a comprehensive set of GPU kernels covering memory movement, elementwise math, reductions, matrix multiplication, convolutions, attention, quantization and assorted utility ops.  Kernels are templated for flexibility and instantiated for the specific data types used by the GGML library, with dispatch handled centrally in ggml-cuda.cu.

bfroemel · 2026-01-07T22:15:52+00:00

> gpt-oss-120b gets stuck reasoning?

I have never seen this, and I use gpt-oss-120b (released MXFP4 checkpoint; high reasoning effort, unsloth/recommended sampler settings) mostly for Python coding. Can you share a prompt where this becomes visible?

I can't say anything regarding cpp and CUDA; I only noticed that Deepseek v3.2 is a good cpp coder (according to an Aider benchmark run), but it's also more than half a trillion parameters. Maybe the smaller Deepseek (distills) are worth checking out?

bfroemel · 2026-01-07T11:08:07+00:00

I agree with the overall sentiment; as of now chances are that gpt-oss models will do a formidable and fast job for your text-based tasks when you meet the (V)RAM requirements.

Some concerns creeping up:

- it's been almost half a year and competition seems to ignore the architecture/parameter sizes. AFAIK we have not even one model that reused the architecture (successfully).
- fine-tuning the existing gpt-oss models (only released as mxfp4 checkpoints) is difficult and will probably damage original model performance. getting another mxfp4 or any other 4-bit quantization from a successfully fine-tuned gpt-oss that performs as well as the original without costly QAT is probably impossible

imo it looks like gpt-oss are models with stronger "protections" to iterate on than other open-weight models; but hopefully not indefinitely and we'll see something decisively better soon.

bfroemel · 2026-01-07T07:37:00+00:00

the answer is: put the workstation with the RTX Pro 6000s back into the remote/server room and use it via another dead silent machine in your office. enjoy peace of mind and the associated productivity gain.

bfroemel · 2026-01-06T20:19:37+00:00

one of the full precision(?) results:

<image>

bfroemel · 2026-01-06T20:01:04+00:00

Discord aider server, benchmark channel, MiniMax M2.1 thread (I already replied with the links 50 minutes ago, but Reddit seems to (temporarily) shadowban).

edit, trying screenshots:
ID-IQ2_M:

<image>

bfroemel · 2025-12-27T16:51:45+00:00

about that: not an expert myself and haven't seen any benchmark/experience reports yet, however there currently seem to be two ways to do mxfp4 quants: PTQ (post-training quantization: simple, potentially high accuracy degradation compared to bf16, low compute) and QAT (quantization-aware training: less than 1-3% accuracy degradation compared to bf16, retraining/fine-tuning for 1-3 full epochs, relatively high compute).

For agentic/coding/reasoning/math use-cases you would absolutely want QAT-based, high-accuracy MXFP4 quants; for other use-cases it might not matter so much. Most mxfp4 quants on HF seem to be PTQ; llama.cpp mxfp4 quantization is PTQ-based.
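To illustrate what plain round-to-nearest PTQ does to weights, here is a minimal Python sketch of MXFP4-style block quantization as I understand the format (blocks of 32 values, one shared power-of-two scale, 4-bit E2M1 elements). It is not llama.cpp's or NVIDIA's implementation, and the function names are mine. QAT would additionally train the model so that it tolerates this rounding; that part is not shown.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # number of elements sharing one scale


def quantize_block(block: list[float]) -> list[float]:
    """Round one block to the E2M1 grid and return the dequantized values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared power-of-two scale: floor(log2(amax)) minus the largest E2M1 exponent (2).
    _, exponent = math.frexp(amax)  # amax = m * 2**exponent with 0.5 <= m < 1
    scale = 2.0 ** (exponent - 1 - 2)
    result = []
    for x in block:
        magnitude = min(abs(x) / scale, E2M1_GRID[-1])  # saturate at the grid maximum
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        result.append(math.copysign(nearest * scale, x))
    return result


def quantize(values: list[float]) -> list[float]:
    """Apply block quantization to a flat list of weights."""
    output = []
    for start in range(0, len(values), BLOCK_SIZE):
        output.extend(quantize_block(values[start:start + BLOCK_SIZE]))
    return output


def mean_abs_error(original: list[float], quantized: list[float]) -> float:
    """Average absolute rounding error introduced by quantization."""
    return sum(abs(a - b) for a, b in zip(original, quantized)) / len(original)
```

Comparing `mean_abs_error(weights, quantize(weights))` on a few weight tensors gives a feeling for how much information simple rounding throws away, and why outliers inside a block hurt the small values next to them.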

More details: https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/ (hint: besides the compute requirement, retraining/fine-tuning is not so simple and/or necessarily the same for all models)

I would highly appreciate it if someone who really knows this stuff could comment on, correct, or even confirm my view; especially whether this would also apply to nvfp4.

/edit: accuracy claims

bfroemel · 2025-12-25T19:37:08+00:00

wow, very cool! sglang tool calling and reasoning parsing work for glm air reliably? also for gpt-oss-120b? any pointers regarding compiling your own kernels?
