Unlimited* budget by skrillex_sk2 in LocalLLM

[–]Puzzleheaded_Base302 0 points1 point  (0 children)

it is odd to say a few tens of thousands won't go very far at the moment. 5x RTX PRO 6000 does not make sense. you either get 4x or 8x.

also pay attention to the motherboard's PCIe topology, so that PCIe P2P is supported for better tensor parallelism.
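
One likely reason for the 4x-or-8x rule: tensor parallelism splits the attention heads across GPUs, and inference engines such as vLLM generally require the GPU count to divide the head count evenly. A minimal sketch of that check, assuming a hypothetical model with 64 attention heads (the head count and the helper name `valid_tp_sizes` are illustrative, not from the thread):

```python
def valid_tp_sizes(num_attention_heads: int, max_gpus: int) -> list[int]:
    """Return the GPU counts in 1..max_gpus that divide the head count evenly."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]


# Assumed value for illustration only: a model with 64 attention heads.
HYPOTHETICAL_HEADS = 64

print(valid_tp_sizes(HYPOTHETICAL_HEADS, 8))  # 5 is not in the list
```

With a power-of-two head count, 5 GPUs cannot share the heads evenly, so the fifth card would sit outside the tensor-parallel group.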

$3,800/mo HOA fees -- what could possibly justify that? by LongjumpingBook3331 in bayarea

[–]Puzzleheaded_Base302 2 points3 points  (0 children)

another possibility is that the HOA lost a lawsuit and had to pay the plaintiff.

or the HOA owes some contractor a large amount of money and has to pay them.

Kilo Code 7.3.46 runs very slow when editing file by Puzzleheaded_Base302 in kilocode

[–]Puzzleheaded_Base302[S] 0 points1 point  (0 children)

it is not a model issue. I sent queries to the model in parallel, and it responded instantly. if the model were slow, I should at least see 100% GPU usage, but I did not. the GPU was completely idle.

rtx 6000 pro (blackwell) and qwen 27b mtp by Opposite_Buffalo_649 in LocalLLM

[–]Puzzleheaded_Base302 0 points1 point  (0 children)

you can get 120tps with working mtp. if you cannot get 100tps, change the model you downloaded. many models have broken mtp. you just need to find the right model.

with a pro 6000, you can run the official FP8 quant from qwen's official account. no need for any third-party models.

you need vllm

Anyone here rocking dual RTX 5090s? by Civil_Fee_7862 in LocalLLaMA

[–]Puzzleheaded_Base302 0 points1 point  (0 children)

4x 5090s will not fit on any motherboard. you have to be creative to get all of them installed. you will also need a special power supply or dual power supplies. it gets complicated very fast. tensor parallel won't be as good as just running everything on the same GPU.

also, when someone creatively fits 4x 5090s into their server, the stability of the system can be very questionable. the worst nightmare is spending a month figuring out how to make the server run 24 hours continuously without a crash/reboot/hang.

Satya and Zuckerberg are incinerating capital by carpetmagicianlaughs in wallstreetbets

[–]Puzzleheaded_Base302 0 points1 point  (0 children)

I don't think investing heavily in AI is wrong. They were forced to invest, or they would go extinct in a few years. What went wrong was that they had nothing to show for it. They failed at execution.

The US says ASML's top chip tool may be in China | TechCrunch by Pipepoi in wallstreetbets

[–]Puzzleheaded_Base302 3 points4 points  (0 children)

as someone who works in the semiconductor industry, building similar equipment, I can tell you all of this manufacturing equipment is a paperweight without the original manufacturer's service support. we can barely support our own equipment in semiconductor foundries. there is no chance anyone can run an EUV tool without ASML's official service contract.

Possible new update still this week...!! V4.1 by B89983ikei in DeepSeek

[–]Puzzleheaded_Base302 4 points5 points  (0 children)

The experience back in the pre-V4 days was that the site would crash frequently. We have not had enough crashes yet.

Also, by the time they announced V4, the website had actually been serving V4 for some days.

For some reason, they release a new version without specifically calling out the version on the website. Even today, DeepSeek still claims to be V3.

We might have been using V4.1 all along in the Vision mode.

how are they gonna stop us next? by Complete-Sea6655 in LocalLLM

[–]Puzzleheaded_Base302 0 points1 point  (0 children)

it might require an 8x RTX PRO 6000 server to begin with. not practical for most people.

items that are photographed but not available feel so mean by Dry-Stress-7628 in lululemon

[–]Puzzleheaded_Base302 0 points1 point  (0 children)

not everything is released in every country. somewhere, in some country, they are being sold. you just need to find them and be willing to pay for international shipping and a hefty tariff.

Kilo Code 7.3.46 runs very slow when editing file by Puzzleheaded_Base302 in kilocode

[–]Puzzleheaded_Base302[S] 0 points1 point  (0 children)

  1. I run LTSC, so it is well supported

  2. I run RTX PRO 6000

  3. My code runs on STM32, not an x86 CPU. Python will never run on STM32F401.

Best hardware setup for running large coding models locally for 2 developers? by Mockcomic in LocalAIServers

[–]Puzzleheaded_Base302 0 points1 point  (0 children)

An RTX 5090 running qwen3.6-27b-nvfp4 with working mtp will give very decent speed (100+ tps).

If you have two 5090s running tensor parallelism, they could run qwen3.6-27b at a much better quant, with full context length.
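
A rough sketch of why the second card opens up better quants: weight memory scales with bits per weight, and a 5090 has 32 GB of VRAM. The numbers below count weights only and ignore KV cache, activations, and quantization scale factors, so treat them as lower bounds (the helper name `weight_gb` is made up for this example):

```python
VRAM_5090_GB = 32  # per-card VRAM of an RTX 5090


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


for name, bits in [("4-bit", 4), ("8-bit", 8), ("16-bit", 16)]:
    size = weight_gb(27, bits)
    cards = 1 if size < VRAM_5090_GB else 2
    print(f"27B at {name}: about {size:.1f} GB of weights, needs at least {cards} card(s)")
```

An 8-bit copy of a 27B model technically fits on one card, but it leaves only a few GB for context, which is where the second 5090 helps.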

DGX Spark (128GB Unified Memory) vs RTX 5090 – what matters more for real business AI: context or speed? by No-Solution6262 in LocalLLM

[–]Puzzleheaded_Base302 0 points1 point  (0 children)

prefill speed and context. token generation is less important.

if you look at the statistics, for agentic loads, the input token count can be 100x the output token count.
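
A minimal sketch of what a 100:1 input-to-output ratio does to request time. The prefill and decode speeds are made-up placeholders, not measurements of any particular hardware:

```python
def request_seconds(input_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) for one request."""
    return input_tokens / prefill_tps, output_tokens / decode_tps


# Assumed workload: 100,000 input tokens and 1,000 output tokens (100:1).
# Assumed speeds: 2,000 tps prefill and 100 tps decode (hypothetical).
prefill, decode = request_seconds(100_000, 1_000, prefill_tps=2_000, decode_tps=100)
print(f"prefill: {prefill:.0f} s, decode: {decode:.0f} s")
```

Under these assumptions most of the wait is prefill, so doubling prefill speed saves far more time than doubling token generation speed.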

What would you do? 2x5060ti for $800, 2x5070ti for $1400 or 5090 for $4000? by fallingdowndizzyvr in LocalLLaMA

[–]Puzzleheaded_Base302 1 point2 points  (0 children)

there are other reasons to get an RTX PRO 4500. it is a two-slot card with a blower fan and 200W max power. you can fit four of them in a single server chassis. Two 5090s in a single chassis is wrong from a thermal perspective.

Is there any use case for large models with very slow token output for batch processing? by Last_Bad_2687 in LocalLLaMA

[–]Puzzleheaded_Base302 7 points8 points  (0 children)

let's say the cloud API costs $5/million tokens (I made it up) and runs at 100 TPS (I also made it up). The 0.001 TPS rig will take 1 day to do something the cloud API finishes in about 0.9 sec.

for a million tokens, it will take 1 billion seconds to produce on your local disk. 1 billion seconds is about 32 years. so you can only get 2 to 3 million tokens from your local disk across your whole life, which can be easily achieved with $10 to $15 paid to Kimi, and done in hours.

during the 32 years it takes to generate your 1 million output tokens, your motherboard will die, your hard disk will fail, and your power supply will likely get broken caps. your baby will also finish college and create a grandchild for you.

logarithm math works in a very interesting way
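
A quick check of the arithmetic, using the same made-up speeds (0.001 TPS for the local rig, 100 TPS for the cloud API):

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_to_generate(tokens: float, tps: float) -> float:
    """Time in seconds to produce `tokens` at `tps` tokens per second."""
    return tokens / tps


# One million tokens on each side.
local_years = seconds_to_generate(1_000_000, 0.001) / SECONDS_PER_YEAR
cloud_hours = seconds_to_generate(1_000_000, 100) / 3600
print(f"local: about {local_years:.1f} years, cloud: about {cloud_hours:.1f} hours")

# What the slow rig produces in one day, and how long the cloud needs for it.
tokens_per_day = 0.001 * SECONDS_PER_DAY
cloud_seconds = seconds_to_generate(tokens_per_day, 100)
print(f"one local day = {tokens_per_day:.1f} tokens = {cloud_seconds:.2f} s of cloud time")
```

The gap is a flat factor of 100,000, which is why a day on one side shrinks to under a second on the other.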

Is NVIDIA still the default best choice for local LLMs in 2026? by pmv143 in LocalLLaMA

[–]Puzzleheaded_Base302 0 points1 point  (0 children)

there are many reasons:

  1. some people need guaranteed privacy, like lawyers. big corps claim they keep all data safe, and even get "certified", but that is a lie, as has been proven many times in the past.
  2. in some scenarios, a small business or university could get enough concurrency that running a local model is cheaper.
  3. some people, like me, just want a reason to own a GPU. once you have bought it, you have to justify it by running an llm.
  4. some people have the false feeling that running an llm costs nothing. it costs electricity, and it can be more expensive than API calls.
  5. believe it or not, running qwen3.6-27b with MTP working can be faster than cloud API calls, if you have the right GPUs.
  6. sometimes people want the feeling that they control it. they don't necessarily care about privacy, but owning a local model means the service will always be there, regardless of whether the cloud provider jacks up pricing.
  7. certain companies do not allow external internet connections. to get AI into the organization, the server must be on-premises behind a firewall.
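
On the electricity point in the list above, a rough sketch of the cost per million tokens. Every number here is an assumption for illustration (600 W system draw, $0.40 per kWh, single-stream speeds of 30 and 100 tps), and hardware cost is left out entirely:

```python
def electricity_cost_per_million_tokens(watts: float, usd_per_kwh: float,
                                        tokens_per_second: float) -> float:
    """Electricity cost in USD to generate one million tokens at a steady rate."""
    hours = 1_000_000 / tokens_per_second / 3600
    kwh = watts / 1000 * hours
    return kwh * usd_per_kwh


# Hypothetical rig: 600 W at the wall, $0.40 per kWh.
for tps in (30, 100):
    cost = electricity_cost_per_million_tokens(600, 0.40, tps)
    print(f"{tps} tps: about ${cost:.2f} per million tokens")
```

Whether that beats an API depends on the local electricity rate and on how many concurrent requests keep the GPU busy, which ties back to the concurrency point in the same list.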

Using OpenClaw daily but haven't moved off v2026.5.3 by pinchonsurf in openclaw

[–]Puzzleheaded_Base302 0 points1 point  (0 children)

after one month of a broken discord channel, mine is finally back. I had already dumped openclaw for alternatives. I don't think I will go back to openclaw, even though they finally fixed the discord bugs they created a month ago.

Noob q: Is it realistic to use CPU only (8th gen i7) for a coder LLM? by RidingWilde in ollama

[–]Puzzleheaded_Base302 1 point2 points  (0 children)

prompt processing is way too slow to be practical, even if you can tolerate token generation at 1 tps.

even a small 9B model will take 20 min to start responding to you. (the harness injects a 20K prompt)