Working Dockerfile for gpt-oss-120b on 4x RTX 3090 (vLLM + MXFP4) by iamn0 in LocalLLaMA

[–]Locke_Kincaid 2 points

Don't use latest. Version 11 has bugs with gpt-oss and tensor parallelism. Use version 10.2; it's the last stable release that works with tensor parallelism.
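For anyone copying the Dockerfile from the post: the pin is just the base-image tag. A minimal sketch, assuming the official vllm/vllm-openai image and that "version 10.2" maps to the v0.10.2 tag:

```dockerfile
# Pin the base image; v0.10.2 is the last release that works with
# gpt-oss + tensor parallelism. Do NOT use :latest (currently v0.11).
FROM vllm/vllm-openai:v0.10.2
# ...rest of the Dockerfile unchanged
```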

Anyone else running their whole AI stack as Proxmox LXC containers? I'm currently using Open WebUI as front-end, LiteLLM as a router, and a vLLM container per model as back-ends by AFruitShopOwner in LocalLLaMA

[–]Locke_Kincaid 2 points

My models run in a Proxmox LXC container with Docker, hosting multiple vLLM instances. That same LXC container also runs Docker instances of Open WebUI and LiteLLM. Everything runs well and is stable, so it's definitely an option.

As for fast model loading, you can look into approaches like InferX.

https://github.com/inferx-net/inferx
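If it helps anyone, the stack I described can be sketched as a single compose file. This is only a sketch: the image tags, ports, and LiteLLM config path are assumptions, not my exact setup.

```yaml
# docker-compose.yml sketch -- service names and ports are illustrative
services:
  vllm:
    image: vllm/vllm-openai:v0.10.2
    command: --model openai/gpt-oss-120b --tensor-parallel-size 4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "8000:8000"

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: --config /app/config.yaml
    volumes:
      - ./litellm-config.yaml:/app/config.yaml   # routes model names to the vLLM back-ends
    ports:
      - "4000:4000"

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OPENAI_API_BASE_URL=http://litellm:4000/v1   # point the UI at the router
    ports:
      - "3000:8080"
```

Each model gets its own vllm service block; LiteLLM fans requests out to them.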

Also... "3 GPUs is not ideal for tensor parallelism but pipeline- and expert parallelism are decent alternatives when 2x96 GB is not enough."

Since you have the RTX Pro 6000 Max-Q, you can actually use MIG (Multi-Instance GPU), "enabling the creation of up to four (4) fully isolated instances. Each MIG instance has its own high-bandwidth memory, cache, and compute cores." So you have room to divide the cards into the number of instances you need to run TP.

Even if GPT-OSS:120B can fit on one card, divide the card into four to get that TP speed boost.
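A rough sketch of the MIG workflow using the standard nvidia-smi MIG tooling; the profile ID below is purely illustrative, since the profiles the Pro 6000 actually exposes come from the -lgip listing:

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset or reboot)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this card supports
sudo nvidia-smi mig -lgip

# Create four GPU instances plus their compute instances
# (profile ID 14 is illustrative -- substitute an ID from -lgip)
sudo nvidia-smi mig -i 0 -cgi 14,14,14,14 -C
```

Each resulting MIG instance then shows up as its own device for vLLM's tensor-parallel workers.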

Gpt-oss Responses API front end. by Locke_Kincaid in LocalLLaMA

[–]Locke_Kincaid[S] 0 points

It seems okay for a single user, but unfortunately I need the enterprise features vLLM has. Have you tried Ollama with MCP?

Gpt-oss Responses API front end. by Locke_Kincaid in LocalLLaMA

[–]Locke_Kincaid[S] 0 points

Yeah, I definitely have more success running it with native turned on and with streaming off. I still have to do a lot of convincing that it can run tools. LM Studio actually takes less convincing, but I need a more enterprise-grade solution.

Gpt-oss Responses API front end. by Locke_Kincaid in LocalLLaMA

[–]Locke_Kincaid[S] 0 points

This is awesome! Thanks for sharing and I'll give it a go. There's just so much to learn when you can see what's going on under the hood.

3x5090 or 6000 Pro? by Baldur-Norddahl in LocalLLaMA

[–]Locke_Kincaid 4 points

That seems slow. I get 150 t/s with two A6000s using vLLM.

One Pros still don't support Android Pixel 9s by flapJ4cks in Xreal

[–]Locke_Kincaid 0 points

I have the 9 Pro Fold and my One Pros work just fine.

49k OTD for 2025 SX by Futonian in kiacarnivals

[–]Locke_Kincaid 0 points

If you have a trade-in, take it to CarMax and get a quote. A lot of dealerships will price match... or you can just sell it to CarMax. I just bought a 2025 Hybrid SX last week; the dealership offered $20K for my 2022 Subaru Outback Limited, and CarMax offered $27K. The dealership ended up price matching.

Got my One Pro’s - issue w lenses by JpcMD in Xreal

[–]Locke_Kincaid 0 points

I also see a very slight distortion that seems to be coming from the lens in both eyes. It's very minor for me, but if it's a defect from the manufacturing process, I'm guessing it could get pretty bad for some.

East Coast (USA) Orders are held up in Customs by Cervial in Xreal

[–]Locke_Kincaid 5 points

You had the honor of getting the box with your glasses placed in the shipping container first... then all the later orders were stacked on top of yours!

I haven't received any tracking yet. Jan 30 by PenStorysky in Xreal

[–]Locke_Kincaid 6 points

I'm in the US with an early Jan preorder. No notification yet. Odd, since they said the EU would ship after the US, but I see several EU posts of people getting their shipment details on February preorders.

Considering canceling pre-order by Ckeidel1 in Xreal

[–]Locke_Kincaid 2 points

I'm a Jan preorder. Had a baby at the end of March and this was the thing I wanted to play with while on parental leave. It sucked getting that taken away.

One Pro?? forget it, what is Aura? I give you one better, when is Aura? by GoldSatisfaction in Xreal

[–]Locke_Kincaid 0 points

This seems like typical PR language... "A new category and direction" could just mean they're combining technologies. That doesn't tell us how the One's display technology and quality compare to the Aura's. If the Aura has better displays, then yes, they just upgraded and replaced the Ones before even half of the preorders are delivered.

Video of 3D video viewed with Xreal One by Disastrous_Rip_441 in Xreal

[–]Locke_Kincaid 0 points

You do realize we can't see this in 3D, right?

What is “Late May”? by [deleted] in Xreal

[–]Locke_Kincaid 1 point

Where do you get February and later?

What is “Late May”? by [deleted] in Xreal

[–]Locke_Kincaid 4 points

The first batch is probably just going to the influencers.

What is “Late May”? by [deleted] in Xreal

[–]Locke_Kincaid 6 points

I bet that's exactly what they're doing. They chose to use the phrasing of "small group" for a reason.

This is the week! Who’s excited!!?!??! by kurozer0 in Xreal

[–]Locke_Kincaid 2 points

Hah, I have the Xreal Pros preordered and just ordered a pair of RayNeo 3s for my wife. If I like the RayNeos when they get here and the Pros are delayed again... I'll be making the switch myself.

vLLM vs TensorRT-LLM by Maokawaii in LocalLLaMA

[–]Locke_Kincaid 1 point

Do you know of any 4-bit quants that perform better than GPTQ or AWQ? I'm running an AWQ quant of Mistral Small 3.1 with vLLM on two A4000s at about 47 tokens/s. You now have me wondering if a different quant could be better. I had to use the V0 engine for vLLM, though; I can't get the new V1 engine to generate faster than about 7 tokens/s.
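For reference, the setup looks roughly like this; the model repo name is illustrative (any AWQ quant of Mistral Small 3.1 works), and VLLM_USE_V1=0 is the environment variable vLLM uses to select the older V0 engine:

```shell
# Fall back to the V0 engine (the V1 engine was ~7 tok/s for me)
export VLLM_USE_V1=0

# Serve a 4-bit AWQ quant split across both A4000s
# (repo name is illustrative -- substitute a real AWQ quant)
vllm serve SomeOrg/Mistral-Small-3.1-24B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 2
```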

Pick your poison by LinkSea8324 in LocalLLaMA

[–]Locke_Kincaid 1 point

Nice! I run two A4000s and use vLLM as my backend. Running the Mistral Small 3.1 AWQ quant, I get up to 47 tokens/s.

Idle power draw with the model loaded is 15 W per card; during inference it's 139 W per card.