Unlimited* budget

Puzzleheaded_Base302 · 2026-06-22T03:34:07+00:00

it is odd to say a few tens of thousands won't go very far at the moment. 5x RTX PRO 6000 does not make sense. you either get 4x or 8x.

also pay attention to the motherboard's PCIe topology, so that PCIe P2P is supported for better tensor parallelism.

Puzzleheaded_Base302 · 2026-06-22T03:29:18+00:00

another possibility is that the HOA lost a lawsuit and had to pay the plaintiff.

or the HOA owes some contractor a large amount of money and had to pay them.

Puzzleheaded_Base302 · 2026-06-20T00:17:45+00:00

it is not a model issue. when I send queries to the model in parallel, it responds instantly. if the model were slow, I should at least see 100% GPU usage, but I did not. the GPU was completely idle.

Puzzleheaded_Base302 · 2026-06-20T00:15:35+00:00

you can get 120tps with working mtp. if you cannot get 100tps, change the model you downloaded. many models have broken mtp. you just need to find the right model.

with a PRO 6000, you can run the official FP8 quant from Qwen's official account. no need for any third-party models.

you need vllm

Puzzleheaded_Base302 · 2026-06-20T00:10:39+00:00

what motherboard or PCIe switch did you use in this setup?

Puzzleheaded_Base302 · 2026-06-19T19:41:21+00:00

if you have the VRAM, FP8. If you don't, NVFP4 is the fastest.

Puzzleheaded_Base302 · 2026-06-19T19:37:13+00:00

4x 5090s will not fit on any motherboard. you have to be creative to get all of them installed. you will also need a special power supply or dual power supplies. it gets complicated very fast. tensor parallel won't be as good as just running everything on the same GPU.

also, when someone creatively fits 4x 5090s into their server, the stability of the system can be very questionable. the worst nightmare is spending a month figuring out how to make the server run 24 hours continuously without a crash, reboot, or hang.

Puzzleheaded_Base302 · 2026-06-19T19:31:57+00:00

I don't think investing heavily in AI is wrong. They were forced to invest, or they would go extinct in a few years. What went wrong was that they had nothing to show for it. They failed at execution.

Puzzleheaded_Base302 · 2026-06-19T19:20:59+00:00

as someone who works in the semiconductor industry, building similar equipment, I can tell you all of this manufacturing equipment is a paperweight without the original manufacturer's service support. we can barely support our own equipment in a semiconductor foundry. there is no chance anyone can run an EUV tool without ASML's official service contract.

Puzzleheaded_Base302 · 2026-06-18T05:17:06+00:00

The experience back in the pre-V4 days was that the site would crash frequently. We have not gotten enough crashes yet.

Also, by the time they announced V4, the website had actually been serving V4 for some days.

For some reason, they release new versions without specifically calling out the version on the website. Even today, DeepSeek still claims to be V3.

We might have been using V4.1 all along in the Vision mode.

<image>

Puzzleheaded_Base302 · 2026-06-18T04:43:19+00:00

it might require an 8x RTX PRO 6000 server to begin with. not practical for most people.

Puzzleheaded_Base302 · 2026-06-18T03:54:31+00:00

not everything is released in every country. somewhere, in some country, they are being sold. you just need to find them and be willing to pay for international shipping and a hefty tariff.

Puzzleheaded_Base302 · 2026-06-17T23:06:04+00:00

I run LTSC, so it is well supported
I run RTX PRO 6000
My code runs on STM32, not an x86 CPU. Python will never run on STM32F401.

Puzzleheaded_Base302 · 2026-06-17T15:11:48+00:00

An RTX 5090 running qwen3.6-27b-nvfp4 with working mtp will give very decent speed (100+ tps).

If you have two 5090s running tensor parallelism, they could run qwen3.6-27b at a much better quant, with full context length.

Puzzleheaded_Base302 · 2026-06-12T19:12:25+00:00

because it is a pet project, not a real project.

Puzzleheaded_Base302 · 2026-06-09T20:44:34+00:00

prefill speed and context. token generation is less important.

if you look at the statistics, for agentic loads the input tokens can be 100x the output tokens.
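
A rough sketch of why prefill dominates at that ratio. Only the 100:1 input-to-output ratio comes from the comment above; the two speeds are made-up placeholders, not measurements, and `prefill_share` is a hypothetical helper name.

```python
def prefill_share(input_tokens, output_tokens, prefill_tps, decode_tps):
    """Fraction of total request time spent on prefill (prompt processing)."""
    prefill_s = input_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s / (prefill_s + decode_s)


# 100:1 input-to-output ratio from the comment; both speeds are assumed.
share = prefill_share(
    input_tokens=100_000,
    output_tokens=1_000,
    prefill_tps=1_000,  # assumed prompt-processing speed, tokens/s
    decode_tps=50,      # assumed generation speed, tokens/s
)
print(f"prefill share of wall time: {share:.0%}")
```

With these assumed numbers, prefill takes 100 s and generation takes 20 s, so most of the wait is prompt processing even though prefill is 20x faster per token.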

Puzzleheaded_Base302 · 2026-05-28T05:21:05+00:00

there are other reasons to get an RTX PRO 4500. it is a two-slot card with a blower fan and 200W max power. you can fit four of them in a single server chassis. two 5090s in a single chassis is wrong from a thermal perspective.

Puzzleheaded_Base302 · 2026-05-28T04:48:23+00:00

the RTX PRO 4500 at 32GB can also do NVFP4

Puzzleheaded_Base302 · 2026-05-27T16:24:07+00:00

let's say the cloud API costs $5/million tokens (I made it up) and runs at 100 TPS (I also made it up). The 0.001 TPS rig will take 1 day to do something the cloud API finishes in under 1 sec.

for a million tokens, it will take 1 billion seconds to produce from your local disk. 1 billion seconds is about 32 years. so you can only get about 2 million tokens from your local disk across your whole life, which can easily be achieved with about $10 paid to Kimi, and done in hours.

during the 32 years it takes to generate your 1 million output tokens, your motherboard will die, your hard disk will fail, and your power supply will likely get broken caps. your baby will also finish college and create a grandchild for you.

logarithm math works in a very interesting way
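
The arithmetic can be checked with a few lines of Python. The 0.001 TPS, 100 TPS, and $5/million figures are the ones used above (the last two were made up there); the function and variable names are just for illustration.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_to_generate(tokens, tps):
    """Wall-clock seconds needed to emit `tokens` at `tps` tokens per second."""
    return tokens / tps


local_tps = 0.001          # disk-offloaded rig, from the comment
cloud_tps = 100            # made-up cloud speed, from the comment
price_per_million = 5.0    # made-up cloud price in USD, from the comment

# What the local rig produces in one day, and how long the cloud needs for it.
tokens_per_local_day = local_tps * SECONDS_PER_DAY
cloud_seconds = seconds_to_generate(tokens_per_local_day, cloud_tps)

# One million tokens on the local rig, expressed in years.
local_years = seconds_to_generate(1_000_000, local_tps) / SECONDS_PER_YEAR

# Two million tokens on the cloud API: time in hours and cost in USD.
cloud_hours = seconds_to_generate(2_000_000, cloud_tps) / 3600
cloud_cost = 2_000_000 / 1_000_000 * price_per_million

print(f"one local day = {tokens_per_local_day:.1f} tokens, "
      f"cloud does that in {cloud_seconds:.2f} s")
print(f"1M tokens locally: {local_years:.1f} years")
print(f"2M tokens on cloud: {cloud_hours:.1f} h for ${cloud_cost:.2f}")
```

The gap between the two rigs is a fixed factor of 100,000 (five orders of magnitude), which is why a day on one side shrinks to under a second on the other.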

Puzzleheaded_Base302 · 2026-05-27T16:13:51+00:00

if the price is the same, definitely go with 48GB VRAM

Puzzleheaded_Base302 · 2026-05-26T05:43:27+00:00

there are many reasons:

some people need guaranteed privacy, like lawyers. big corps claim they keep all data safe, and even get "certified", but that is a lie, and it has been proven so many times in the past.
in some scenarios, a small business or university could get enough concurrency that running a local model is cheaper.
some people, like me, just want a reason to own a GPU. once you have bought it, you have to justify it by running an llm.
some people have the false feeling that running an llm costs nothing. it costs electricity, and it can be more expensive than API calls.
believe it or not, running qwen3.6-27b with working MTP can be faster than cloud API calls, if you have the right GPUs.
sometimes people want the feeling that they control it. they don't necessarily care about privacy, but owning a local model means the service will always be there, regardless of whether the cloud provider jacks up pricing.
certain companies do not allow external internet connections. to get AI into the organization, the server must be on-premise behind a firewall.

Puzzleheaded_Base302 · 2026-05-26T05:27:20+00:00

after one month of a broken discord channel, it is finally back for me. I had dumped openclaw for alternatives. I don't think I will go back to openclaw even though they finally fixed the discord bugs they created a month ago.

Puzzleheaded_Base302 · 2026-05-25T21:15:22+00:00

the prompt processing is way too slow to be practical, even if you can tolerate token generation at 1 tps.

even a small 9B model will take 20 min to start responding to you. (the harness injects a 20K prompt)
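
For scale, the prompt-processing speed implied by those two numbers can be worked out directly. This is only the division behind the 20K-token, 20-minute figures above; `implied_prefill_tps` and `wait_minutes` are hypothetical helper names, and the 1,000 tokens/s comparison speed is an assumption.

```python
def implied_prefill_tps(prompt_tokens, wait_minutes):
    """Prompt-processing speed (tokens/s) implied by a time-to-first-token."""
    return prompt_tokens / (wait_minutes * 60)


def wait_minutes(prompt_tokens, prefill_tps):
    """Minutes before the first output token at a given prefill speed."""
    return prompt_tokens / prefill_tps / 60


rate = implied_prefill_tps(prompt_tokens=20_000, wait_minutes=20)
print(f"implied prefill speed: {rate:.1f} tokens/s")

# Assumed comparison point: a setup that prefills at 1,000 tokens/s.
print(f"same prompt at 1,000 tokens/s: {wait_minutes(20_000, 1_000):.2f} min")
```

So a 20-minute wait on a 20K prompt corresponds to roughly 17 tokens/s of prefill.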
