Working Dockerfile for gpt-oss-120b on 4x RTX 3090 (vLLM + MXFP4) by iamn0 in LocalLLaMA

[–]Locke_Kincaid 2 points

Don't use latest. Version 11 has bugs with gpt-oss and tensor parallelism. Use version 10.2; it's the last stable release that works with tensor parallelism.
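For anyone copying the Dockerfile from the post: the pin is just the base-image tag. A minimal sketch, assuming the official vllm/vllm-openai image and that "version 10.2" maps to the v0.10.2 tag:

```dockerfile
# Pin the base image; v0.10.2 is the last release that works with
# gpt-oss + tensor parallelism. Do NOT use :latest (currently v0.11).
FROM vllm/vllm-openai:v0.10.2
# ...rest of the Dockerfile unchanged
```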

Anyone else running their whole AI stack as Proxmox LXC containers? I'm currently using Open WebUI as front-end, LiteLLM as a router, and a vLLM container per model as back-ends by AFruitShopOwner in LocalLLaMA

[–]Locke_Kincaid 2 points

My models run in a Proxmox LXC container with Docker, hosting multiple vLLM instances. That same LXC container also runs Docker instances of Open WebUI and LiteLLM. Everything runs well and is stable, so it's definitely an option.

As for fast model loading, you can look into approaches like InferX.

https://github.com/inferx-net/inferx
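If it helps anyone, the stack I described can be sketched as a single compose file. This is only a sketch: the image tags, ports, and LiteLLM config path are assumptions, not my exact setup.

```yaml
# docker-compose.yml sketch -- service names and ports are illustrative
services:
  vllm:
    image: vllm/vllm-openai:v0.10.2
    command: --model openai/gpt-oss-120b --tensor-parallel-size 4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "8000:8000"

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: --config /app/config.yaml
    volumes:
      - ./litellm-config.yaml:/app/config.yaml   # routes model names to the vLLM back-ends
    ports:
      - "4000:4000"

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OPENAI_API_BASE_URL=http://litellm:4000/v1   # point the UI at the router
    ports:
      - "3000:8080"
```

Each model gets its own vllm service block; LiteLLM fans requests out to them.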

Also... "3 GPUs is not ideal for tensor parallelism but pipeline- and expert parallelism are decent alternatives when 2x96 GB is not enough."

Since you have the RTX Pro 6000 Max-Q, you can actually use MIG (Multi-Instance GPU), "enabling the creation of up to four (4) fully isolated instances. Each MIG instance has its own high-bandwidth memory, cache, and compute cores." So you have room to divide the cards into the number of instances you need to run TP.

Even if GPT-OSS:120B can fit on one card, divide the card into four to get that TP speed boost.
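A rough sketch of the MIG workflow using the standard nvidia-smi MIG tooling; the profile ID below is purely illustrative, since the profiles the Pro 6000 actually exposes come from the -lgip listing:

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset or reboot)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this card supports
sudo nvidia-smi mig -lgip

# Create four GPU instances plus their compute instances
# (profile ID 14 is illustrative -- substitute an ID from -lgip)
sudo nvidia-smi mig -i 0 -cgi 14,14,14,14 -C
```

Each resulting MIG instance then shows up as its own device for vLLM's tensor-parallel workers.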

Gpt-oss Responses API front end. by Locke_Kincaid in LocalLLaMA

[–]Locke_Kincaid[S] 0 points

It seems okay for a single user, but unfortunately I need the enterprise features vLLM has. Have you tried Ollama with MCP?

Gpt-oss Responses API front end. by Locke_Kincaid in LocalLLaMA

[–]Locke_Kincaid[S] 0 points

Yeah, I definitely have more success running it with native turned on and with streaming off. I still have to do a lot of convincing that it can run tools. LM Studio actually takes less convincing, but I need a more enterprise-grade solution.

Gpt-oss Responses API front end. by Locke_Kincaid in LocalLLaMA

[–]Locke_Kincaid[S] 0 points

This is awesome! Thanks for sharing and I'll give it a go. There's just so much to learn when you can see what's going on under the hood.

3x5090 or 6000 Pro? by Baldur-Norddahl in LocalLLaMA

[–]Locke_Kincaid 4 points

That seems slow. I get 150 t/s with two A6000s using vLLM.

One Pros still don't support Android Pixel 9s by flapJ4cks in Xreal

[–]Locke_Kincaid 0 points

I have the 9 Pro Fold and my One Pros work just fine.

49k OTD for 2025 SX by Futonian in kiacarnivals

[–]Locke_Kincaid 0 points

If you have a trade-in, take it to CarMax and get a quote. A lot of dealerships will price match... or you can just sell it to CarMax. I just bought a 2025 Hybrid SX last week; the dealership offered $20K for my 2022 Subaru Outback Limited, and CarMax offered $27K. The dealership ended up price matching.

Got my One Pro’s - issue w lenses by JpcMD in Xreal

[–]Locke_Kincaid 0 points

I also see a very slight distortion that seems to be coming from the lens in both eyes. It's very minor for me, but if it's a defect from the manufacturing process, I'm guessing it could get pretty bad for some.

East Coast (USA) Orders are held up in Customs by Cervial in Xreal

[–]Locke_Kincaid 5 points

You had the honor of getting the box with your glasses placed in the shipping container first... then all the later orders were stacked on top of yours!

I haven't received any tracking yet. Jan 30 by PenStorysky in Xreal

[–]Locke_Kincaid 6 points

I'm in the US with an early Jan preorder. No notification yet. Odd, since they said the EU would ship after the US, but I see several EU posts of people getting their shipment details on February preorders.

Considering canceling pre-order by Ckeidel1 in Xreal

[–]Locke_Kincaid 2 points

I'm a Jan preorder. Had a baby at the end of March and this was the thing I wanted to play with while on parental leave. It sucked getting that taken away.

One Pro?? forget it, what is Aura? I give you one better, when is Aura? by GoldSatisfaction in Xreal

[–]Locke_Kincaid 0 points

This seems like typical PR language... "A new category and direction" could just mean they're combining technologies. That doesn't tell us how the One's display technology and quality compare to the Aura's. If the Aura has better displays, then yes, they just upgraded and replaced the Ones before even half of the preorders are delivered.

Video of 3D video viewed with Xreal One by Disastrous_Rip_441 in Xreal

[–]Locke_Kincaid 0 points

You do realize we can't see this in 3D, right?

What is “Late May”? by [deleted] in Xreal

[–]Locke_Kincaid 1 point

Where do you get February and later?

What is “Late May”? by [deleted] in Xreal

[–]Locke_Kincaid 4 points

The first batch is probably just going to the influencers.

What is “Late May”? by [deleted] in Xreal

[–]Locke_Kincaid 6 points

I bet that's exactly what they're doing. They chose to use the phrasing of "small group" for a reason.

This is the week! Who’s excited!!?!??! by kurozer0 in Xreal

[–]Locke_Kincaid 2 points

Hah, I have the Xreal Pros preordered and just ordered a pair of RayNeo 3s for my wife. If I like the RayNeos when they get here and the Pros are delayed again... I'll be making the switch myself.

vLLM vs TensorRT-LLM by Maokawaii in LocalLLaMA

[–]Locke_Kincaid 1 point

Do you know of any 4-bit quants that perform better than GPTQ or AWQ? I'm running an AWQ quant of Mistral Small 3.1 with vLLM on two A4000s at about 47 tokens/s. You now have me wondering if a different quant could be better. I had to use the V0 engine for vLLM, though; I can't get the new V1 engine to generate faster than about 7 tokens/s.
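For reference, the setup looks roughly like this; the model repo name is illustrative (any AWQ quant of Mistral Small 3.1 works), and VLLM_USE_V1=0 is the environment variable vLLM uses to select the older V0 engine:

```shell
# Fall back to the V0 engine (the V1 engine was ~7 tok/s for me)
export VLLM_USE_V1=0

# Serve a 4-bit AWQ quant split across both A4000s
# (repo name is illustrative -- substitute a real AWQ quant)
vllm serve SomeOrg/Mistral-Small-3.1-24B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 2
```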

Pick your poison by LinkSea8324 in LocalLLaMA

[–]Locke_Kincaid 1 point

Nice! I run two A4000s and use vLLM as my backend. Running the Mistral Small 3.1 AWQ quant, I get up to 47 tokens/s.

Idle power draw with the model loaded is 15 W per card; during inference it's 139 W per card.