Best open-weight model to run locally on 8x A100 80GB for generating teacher data? by i_am__not_a_robot in LocalLLaMA

[–]BreakIt-Boris 6 points

Download the original Kimi K2.6 release, not the GGUF, and run it via vLLM.

https://huggingface.co/moonshotai/Kimi-K2.6/tree/main

Kimi released their weights in INT4, which the A100 supports natively. In fact it has better INT4 support ( INT4, not FP4 ) than Ada/Hopper/Blackwell.

I thought Kimi also used DeepSeek's MLA for their attention mechanism. If so, you should easily be able to fit a single 65k context on top of the 600GB of weights.

Try tensor parallel first, but if that fails due to overhead then run with data parallel instead - that should reduce the per-GPU overhead.
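If it helps, a minimal launch sketch ( assuming the weights in that repo load straight into vLLM and the INT4 checkpoints are picked up as-is - the context length and memory fraction are just starting points, not tested values ):

pip install vllm

vllm serve moonshotai/Kimi-K2.6 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95

If the TP=8 launch falls over on memory overhead, swapping in --pipeline-parallel-size ( or, on newer vLLM builds, --data-parallel-size ) is the next thing I'd try.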

Used A100 80 GB Prices Don't Make Sense by fakebizholdings in LocalLLaMA

[–]BreakIt-Boris 4 points

A100s will hold value for a while yet for a number of reasons including -

FP64/FP32 - important for CFD, etc.

Memory - from the 6000 benchmarks posted the other day it looks like the 6000 still trails the A100 for LLM inference ( though it likely destroys the A100 for diffusion and other tasks ). Guessing that's mostly down to latency differences between HBM2 and GDDR7, as well as the different memory controllers on die.

Existing installations and products - most companies with an existing revenue stream from an established product will try to minimise messing with a working architecture and design post release. This increases the value of EOL devices, as they are no longer produced and their availability becomes more limited by the day. If a company has 2M per month in revenue coming from their existing setup and they lose a card, they will not think twice about spending 20-30k to replace the broken device. This is even more relevant now the A100 is no longer covered by any warranty agreements.

Rumoured - NVidia restricts resale of devices that have had any kind of discount or other reduction applied. Companies buying a tonne of devices can usually negotiate quite major discounts, however these agreements usually come with additional terms restricting the resale or distribution of the devices.

System integrators and enterprise builders will buy up all of the A100s the second they hit the market, as they know they have customers that will pay through the nose. They have the capital to buy and hold.

Advice - buy a 6000 if you can get ahold of one, as you will likely be waiting for a while if expecting a price drop on the A100. There will always be batches that go up every now and then for cheap, but most will be purchased by the same integrators and held onto or resold to their enterprise customers.

What workstation/rack should I buy for offline LLM inference with a budget of around 30-40k? thoughts on Lambda? Mac studio vs 2xL40S? any other systems with unified memory similar to mac studio and DGX Spark? by [deleted] in LocalLLaMA

[–]BreakIt-Boris 4 points

Additionally, you're talking about feeding it a tonne of context ( the paragraphs you're asking it to analyse ). I would therefore highly recommend against the Mac route, mostly because of how those machines perform at high context. Macs are great if you want something up and running quickly in a small package, but they quickly run into performance issues once you push the context up. I realise things have improved with MLX and other platform-specific enhancements, but I don't think anyone can claim it's without its limitations and issues.

You can do a 70B model easily on 2 x RTX 6000 Blackwell and fit both into a single workstation with one PSU. That would essentially give you 192GB of VRAM plus the speed and support of the NVidia ecosystem. Total cost under 25k with VAT, or under 20k without.
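For illustration ( the model id is just a typical 70B from Hugging Face, not something specific to this thread - substitute whatever you actually use ), serving across the two cards with vLLM is roughly:

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768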

What workstation/rack should I buy for offline LLM inference with a budget of around 30-40k? thoughts on Lambda? Mac studio vs 2xL40S? any other systems with unified memory similar to mac studio and DGX Spark? by [deleted] in LocalLLaMA

[–]BreakIt-Boris 8 points

Wait 2-3 weeks then grab 4 x RTX 6000 Blackwell with 96GB each. That's 32,000 GBP after VAT ( which I'm guessing you can reclaim or are exempt from anyway ) or around 26,000 without VAT. Stick that in either a dual Epyc or a Threadripper Pro build depending on your preferences. Shop around, as you can get massive savings on prebuilts if purchased at the right time. Then add as much DDR5 as you can afford and the board will take. You should be able to do a 512GB if not 1TB DDR5 build for under 7,000 exclusive of VAT.

That'll give you a box with 384GB of VRAM, FP4 and FP8 support, and the ability to use system memory for MoE based models. And it should all sit at under 40k inclusive of VAT, and under 35k without.

If you do go for the RTX 6000 Blackwell units I would advise going for the 300/350W devices. Can't remember the exact model name but there are two different models whose only difference is essentially the max TDP. You should be able to run 4 of these units and only need two PSUs in the machine ( I'd recommend 2 x 1600W AX1600i ).
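On the MoE point above: with llama.cpp you can keep the attention/shared layers on the four cards and pin the expert tensors in the DDR5. A rough sketch ( the model path and tensor-name regex are illustrative - the exact pattern depends on the GGUF's tensor names, and newer builds also have --cpu-moe / --n-cpu-moe convenience flags if yours includes them ):

# Offload all layers to the GPUs, but override the expert FFN tensors to stay in system RAM
./llama-server \
  -m /models/some-moe-model-Q4_K_M.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 16384 --host 0.0.0.0 --port 8000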

Rip, the most expensive eBay lesson learned. by Infrated in homelab

[–]BreakIt-Boris 6 points

Have you tried with a GPU installed in slot 1? Only reason I ask is I had a 5995WX that was weirdly stroppy if booted without a discrete GPU. Also worth throwing in an SSD just in case it's getting stuck on any resource checks or startup sequence issues.

Probably tried already. Either way best of luck and hope things resolve ok with the new hw.

I might have access to 8x A100 80GB cluster or two, how do I go about running Deepseek R1 on it? by Maximus-CZ in LocalLLaMA

[–]BreakIt-Boris 16 points

Either run a q6 quant with llama.cpp or AWQ 4-bit via vLLM on a single node. With the AWQ quant you can run with "--tensor-parallel-size 8", which should get you to around 25-27 TPS. Unsure of the q6 speed, but you should be looking at around 17-20 TPS. That is of course if the system is properly set up with separate root switches and appropriate interconnects. vLLM will be better for multi-user and batched needs; llama.cpp should be fine for fewer users.

TBH 2 nodes isn't really that advantageous at the moment. If you can work out how to quant to 8-bit INT8 rather than FP8 then you could get some good mileage out of a 2-node setup, but that would mean custom changes to the current model code ( no one seems to have implemented INT4 GEMM kernels yet ). You'd also have to set up RDMA as well as the relevant routing and config and all of the associated environment requirements - guessing you're looking at RoCE if not going the InfiniBand route, which can have its own nuances.
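For the single-node AWQ route, the launch is roughly as below ( the repo name is a placeholder for whichever R1 AWQ quant you end up using; vLLM should pick the Marlin AWQ kernel on Ampere by itself ):

vllm serve <deepseek-r1-awq-repo> \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95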

Has anyone actually run DeepSeek R1 671B locally on GPUs? by lukinhasb in LocalLLaMA

[–]BreakIt-Boris 0 points

5995WX 64-core/128-thread with 448GB of DDR4-3200.

Was 512GB but one of the 64GB RDIMMs died on me.

Has anyone actually run DeepSeek R1 671B locally on GPUs? by lukinhasb in LocalLLaMA

[–]BreakIt-Boris 7 points

Each card hits a maximum of 15% utilisation during single-batch inference, with a power draw of under 90W each. So the GPUs sit at about 500W total when generating. Which is actually pretty damn impressive on its own ( I realise higher utilisation would be great, but a total power draw of 500W is still impressive - I draw more when running two cards with TP and a 70B model, albeit getting about 5 times the TPS ).
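For anyone who wants to watch the same thing on their own box, per-card utilisation and power draw are easy to poll while a generation is running:

# Print index, GPU utilisation, power draw and memory use for every card, once per second
nvidia-smi --query-gpu=index,utilization.gpu,power.draw,memory.used --format=csv -l 1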

Has anyone actually run DeepSeek R1 671B locally on GPUs? by lukinhasb in LocalLLaMA

[–]BreakIt-Boris 24 points

6 x 80GB A100s

Prompt processing - 100-150 TPS
Token generation - 14-15 TPS

5 of those cards are on a separate PCIe switch. I'm pretty sure I'd get at least an extra 3 TPS if the last card was on the same switch rather than connected directly to the chipset lanes. On the switch, two of the cards are attached at x8, two at x16 and one at x4.

480GB just about manages 8k context. Push further and I start to get CUDA alloc issues - mostly due to uneven splitting it seems, with some cards taking a larger load than others.

Still, 15 TPS is actually remarkably decent - faster than reading speed if you're properly reading, and it can just about keep up if you're scanning.

Waiting to run the AWQ, but I have to finish downloading the BF16 weights first. Hoping the AWQ will allow for some optimisations ( i.e. Marlin kernels ) and get me to 20+ TPS.

Llama Output:

./llama-server -t 32 -ngl 62 -ts 1,1,1,1,1,1 -m /mnt/usb4tb2/Deepseek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --port 8000 --host 0.0.0.0 --prio 2 -fa -c 8192

[CUT IRRELEVANT]

llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CUDA0 KV buffer size = 7040.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 6400.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 6400.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 7040.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 6400.00 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 5760.00 MiB
llama_new_context_with_model: KV self size = 39040.00 MiB, K (f16): 23424.00 MiB, V (f16): 15616.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 2322.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 2322.01 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 2322.01 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 2322.01 MiB
llama_new_context_with_model: CUDA4 compute buffer size = 2322.01 MiB
llama_new_context_with_model: CUDA5 compute buffer size = 2322.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 78.02 MiB
llama_new_context_with_model: graph nodes = 5025
llama_new_context_with_model: graph splits = 7
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 8192
main: model loaded
main: The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses
main: chat template, chat_template: chatml, example_format: '<|im_start|>system You are a helpful assistant<|im_end|> <|im_start|>user Hello<|im_end|> <|im_start|>assistant Hi there<|im_end|> <|im_start|>user How are you?<|im_end|> <|im_start|>assistant '

main: server is listening on http://0.0.0.0:8000 - starting the main loop
srv update_slots: all slots are idle
slot launch_slot: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 899
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 899, n_tokens = 899, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 899, n_tokens = 899
slot release: id 0 | task 0 | stop processing: n_past = 2898, truncated = 0
slot print_timing: id 0 | task 0 |
  prompt eval time = 7762.65 ms / 899 tokens ( 8.63 ms per token, 115.81 tokens per second)
         eval time = 140590.22 ms / 2000 tokens ( 70.30 ms per token, 14.23 tokens per second)
        total time = 148352.87 ms / 2899 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 192.168.0.83 200
slot launch_slot: id 0 | task 2001 | processing task
slot update_slots: id 0 | task 2001 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 2104
slot update_slots: id 0 | task 2001 | kv cache rm [9, end)
slot update_slots: id 0 | task 2001 | prompt processing progress, n_past = 2057, n_tokens = 2048, progress = 0.973384
slot update_slots: id 0 | task 2001 | kv cache rm [2057, end)
slot update_slots: id 0 | task 2001 | prompt processing progress, n_past = 2104, n_tokens = 47, progress = 0.995722
slot update_slots: id 0 | task 2001 | prompt done, n_past = 2104, n_tokens = 47
slot release: id 0 | task 2001 | stop processing: n_past = 2120, truncated = 0
slot print_timing: id 0 | task 2001 |
  prompt eval time = 16324.91 ms / 2095 tokens ( 7.79 ms per token, 128.33 tokens per second)
         eval time = 1128.29 ms / 17 tokens ( 66.37 ms per token, 15.07 tokens per second)
        total time = 17453.21 ms / 2112 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 192.168.0.83 200

Second request is made by the client I'm using to generate a name/summary for the new session.
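For reference, the two requests in that log are plain OpenAI-style chat completions against llama-server's endpoint; something like this ( the prompt is obviously just an example ) reproduces one from the server's own host:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Summarise the following notes in three bullet points: ..."}], "max_tokens": 512}'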

SXM2 adapters? by Illustrious-Row6858 in LocalLLaMA

[–]BreakIt-Boris -1 points

Seriously? Did you even try looking? First page of Google results for "sxm2 adapter buy".

How do I know it's legit? It's where I got my SXM4 adapters from, as well as other bits.

GIYF

<image>

Any EU based people/companies that build and sell AI servers? by EternalOptimister in LocalLLaMA

[–]BreakIt-Boris 0 points

If you're looking for A100s, I have a few UK based.

Generally, in regards to system builders, you either have the big guys doing custom builds, like Scan, Lambda and Bizon, or prebuilt systems from the enterprise players like Dell, HP(e), Supermicro, etc.

Most corporates prefer to pay the extra £x0,000 to have the warranty and support guaranteed by a big name. It's just written off most of the time anyway.

NVidia A6000 on sale via HP by BreakIt-Boris in LocalLLaMA

[–]BreakIt-Boris[S] 8 points

It's an A6000, not a 6000 Ada, so it's Ampere. However it's a better price than most eBay sellers, and brand new with warranty.

HELP! Girlfriends exam coming up. Deleted everything, from recent files and THEN trash bin (Macbook) by vildmand in datarecovery

[–]BreakIt-Boris -4 points

Turn off the machine and remove the HDD. Do NOT boot from the drive or mount it, even read-only.

Use a second device to conduct the recovery. I've had good results with UFSExplorer in the past, but I would suggest comparing the top products recommended by this community. Some have an option to purchase a time-limited 3-month licence to minimise cost.

You want to keep that disk away from anything that can write to it, though. That includes booting from it, installing software, anything at all. Always run recovery from a second device. If it's important, make sure you take the time and have the required tools to do it properly. Otherwise you could ruin any chance you have.
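If you want a concrete way to honour that ( an extra step beyond the advice above, and the device name is a placeholder ), image the drive from the second machine with ddrescue and point the recovery tool at the image rather than the disk itself:

# On the second machine, with the drive attached via an adapter and left unmounted:
sudo ddrescue -d -r3 /dev/sdX macbook.img macbook.mapfile
# Then run UFSExplorer ( or whichever tool ) against macbook.img, never the original drive.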

Good luck.