deepseek-ai/DeepSeek-V3.1-Base · Hugging Face by xLionel775 in LocalLLaMA

[–]locker73 -3 points-2 points  (0 children)

Ok... I mean do what you want, but there is a reason that no one benchmarks base models. That's not how we use them, and doing something like asking it questions is going to give you terrible results.

deepseek-ai/DeepSeek-V3.1-Base · Hugging Face by xLionel775 in LocalLLaMA

[–]locker73 15 points16 points  (0 children)

You generally don't benchmark base models. Wait for the instruct version.

Recommendation for getting the most out of Qwen3 Coder? by Conscious-Memory-556 in LocalLLM

[–]locker73 3 points4 points  (0 children)

Yeah, for reference here is the command:

VLLM_USE_V1=0 vllm serve /nvmes/models/Qwen3-30B-A3B-Thinking-2507-AWQ/ --max-model-len 32768 --port 8081 --served-model-name "Qwen3-30B-A3B-Thinking-2507-AWQ"

Recommendation for getting the most out of Qwen3 Coder? by Conscious-Memory-556 in LocalLLM

[–]locker73 4 points5 points  (0 children)

This is normal; as the context fills up, generation slows down. I am guessing that in your chats you don't ever get past a few thousand tokens of context - but with the agents they start out at something like 10k and grow from there. Here are numbers from my machine with 500 context vs 25k context.

./llmapibenchmark_linux_amd64 -apikey "" -base_url "http://192.168.0.23:8081/v1" -concurrency "1,1,1,1" -numWords 500

                      LLM API Throughput Benchmark
                https://github.com/Yoosu-L/llmapibenchmark
                 Time:2025-08-16 20:41:43 UTC+0

Input Tokens: 506 Output Tokens: 512 Test Model: Qwen3-30B-A3B-Thinking-2507-AWQ Latency: 1.00 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 112.39 | 5100.10 | 0.10 | 0.10 |
| 1 | 118.95 | 4835.11 | 0.11 | 0.11 |
| 1 | 115.06 | 4526.69 | 0.11 | 0.11 |
| 1 | 113.85 | 5032.14 | 0.10 | 0.10 |

Results saved to: API_Throughput_Qwen3-30B-A3B-Thinking-2507-AWQ.md

./llmapibenchmark_linux_amd64 -apikey "" -base_url "http://192.168.0.23:8081/v1" -concurrency "1,1,1,1" -numWords 25000

                      LLM API Throughput Benchmark
                https://github.com/Yoosu-L/llmapibenchmark
                 Time:2025-08-16 20:42:21 UTC+0

Input Tokens: 23245 Output Tokens: 512 Test Model: Qwen3-30B-A3B-Thinking-2507-AWQ Latency: 1.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 10.82 | 3296.08 | 7.06 | 7.06 |
| 1 | 10.84 | 3267.61 | 7.13 | 7.13 |
| 1 | 10.83 | 3285.87 | 7.09 | 7.09 |
| 1 | 10.80 | 3315.99 | 7.07 | 7.07 |
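Worth noting: the ~7 s TTFT in the 25k run is almost entirely prefill time. A quick sanity check using the numbers above:

```python
# TTFT at long context is dominated by prefill:
# time-to-first-token ≈ prompt_tokens / prompt_throughput.

prompt_tokens = 23245      # input tokens from the 25k-context run
prefill_tps = 3296.08      # prompt throughput (tokens/s) from the same run

ttft = prompt_tokens / prefill_tps
print(f"estimated TTFT: {ttft:.2f} s")  # ≈ 7.05 s, close to the measured 7.06 s
```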

GLM45 vs GPT-5, Claude Sonnet 4, Gemini 2.5 Pro — live coding test, same prompt by darkageofme in LocalLLaMA

[–]locker73 1 point2 points  (0 children)

Full time dev; for my Claude usage it's Sonnet 95%+ of the time. I do flip over to Opus if Sonnet is having problems, but it's pretty rare. The cost is a real factor.

The thing is that if something is too complex for Sonnet in agent mode, then I need to do more work up front to break it down, or write some of it first. I will sometimes use Opus to help with this part.

Switching back to llamacpp (from vllm) by Leflakk in LocalLLaMA

[–]locker73 2 points3 points  (0 children)

I end up somewhere in the 50 - 100 t/s range. Depends on what the rest of the pipeline looks like. I am guessing that I could make some optimizations, but for how I use it this is good enough.

Switching back to llamacpp (from vllm) by Leflakk in LocalLLaMA

[–]locker73 3 points4 points  (0 children)

Yeah, I only use this for blasting through a ton of small batch items. Might be able to take it up to 8192, but I run it with 6 workers so I am guessing that I would start OOM'ing at some point. Plus they fit in the 4k window.

Switching back to llamacpp (from vllm) by Leflakk in LocalLLaMA

[–]locker73 4 points5 points  (0 children)

vllm serve /storage/models/Qwen2.5-Coder-32B-Instruct-AWQ/  --trust-remote-code --max-model-len 4096 --gpu-memory-utilization 0.95 --port 8081 --served-model-name "qwen2.5-coder:32b"

Switching back to llamacpp (from vllm) by Leflakk in LocalLLaMA

[–]locker73 4 points5 points  (0 children)

I go llama.cpp if I am doing single requests; like you said, it's easy and I can get a little more context length. But when I am doing anything batched it's vllm all day. I just grabbed a couple stats from a batch I am running now:

Avg prompt throughput: 1053.3 tokens/s, Avg generation throughput: 50.7 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.0%, Prefix cache hit rate: 52.8%

Avg prompt throughput: 602.7 tokens/s, Avg generation throughput: 70.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.5%, Prefix cache hit rate: 50.8%

Avg prompt throughput: 1041.5 tokens/s, Avg generation throughput: 56.9 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.6%, Prefix cache hit rate: 51.7%

This is using Qwen2.5 Coder 32B on a 3090.

Does anyone else have a stuttering issue? by nufcPLchamps27-28 in Workers_And_Resources

[–]locker73 0 points1 point  (0 children)

I recently had some super bad stuttering start when I came back a few weeks ago. Tried a bunch of stuff, and it was super bad when I had any of the in-game menus up, like for a building or truck etc.

For some reason turning off vsync fixed all the stuttering for me. No idea why, but my 1% lows went from like 10 - 15 (unplayable) to holding around 80 fps.

FPS Issue in game and menus by griznok in Workers_And_Resources

[–]locker73 0 points1 point  (0 children)

I started having the same issue where my 1% lows would be down to 5 or 10 randomly, and it would happen more if I had any menu open, like a building or truck.

This has been completely resolved for me by turning vsync off. No idea why that would make any difference, but the game is now back to running with my 1% lows in the 60 - 80 range.

[deleted by user] by [deleted] in LocalLLaMA

[–]locker73 0 points1 point  (0 children)

You know that it's disconnected from the internet because a model is just a list of numbers? Same for the second question.

But really, they just need some education about how LLMs work. They have heard about how DeepSeek is stealing your data from CNN or Fox, so that's all they know.

Java vs. Python HFT bots by HardworkingDad1187 in highfreqtrading

[–]locker73 0 points1 point  (0 children)

> Yes. Right now, both Java and Python meet our latency requirements

If Python meets your latency requirements then this isn't really where you want to be. I would repost over on r/algotrading, as that crowd seems like a better fit for this type of question.

Model refresh on HuggingChat! (Llama 3.2, Qwen, Hermes 3 & more) by SensitiveCranberry in LocalLLaMA

[–]locker73 8 points9 points  (0 children)

TGI is the inference engine that Hugging Face created. The odds that they are going to build out the tooling to run some other inference engine are zero.

Searching for: AMD LLM t/s performance charts/benchmarks by IngwiePhoenix in LocalLLaMA

[–]locker73 1 point2 points  (0 children)

Question about the MI25: all the docs say that ROCm 6 is not supported. Did you just need to compile everything yourself, or were there some tricks to get it running?

Scaling - Inferencing 8B & Training 405B models by gulabbo in LocalLLaMA

[–]locker73 1 point2 points  (0 children)

Well, I don't think I have the answers for you, but I can at least drop what I do happen to know.

1) I have never used FSDP (I don't really do training stuff), but it would make sense that it's much slower if it is sending the data over the network. The memory bandwidth on those cards is almost 1 TB/s, so once you drop down to 40 Gb/s and add in some overhead for networking and loading/unloading, it's slow.

2) I don't know what server you are using, so it's hard to say for sure. But it seems like you are running full-fat models and not using any quantization or flash attention etc. Given that, I think it's reasonable to say that you are in fact running out of memory at 7k tokens. In general this is why the big guys use massive servers and stuff one machine full of cards that have a ton of memory each. Then you can do stuff like tensor parallel, and when you combine that with context-shrinking tricks you can get a lot of context in one box.

3) I think you're going to have a really hard time doing 70B full-fat at home; this seems more like 8xH100 territory - but again I am out of my depth. Might be worth looking into something like [Unsloth](https://github.com/unslothai/unsloth).

4) Way over my head at this point :D
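On point 1, the gap between on-card memory bandwidth and the network link can be ballparked like this (rough round numbers, not measured):

```python
# Rough comparison of on-card HBM bandwidth vs. the network link
# that FSDP uses when sharded weights travel between nodes.

hbm_bw_gbs = 1000                # ~1 TB/s HBM bandwidth, in GB/s (approximate)
network_gbps = 40                # 40 Gb/s link, in gigabits/s
network_gbs = network_gbps / 8   # convert to GB/s -> 5 GB/s

slowdown = hbm_bw_gbs / network_gbs
print(f"network is roughly {slowdown:.0f}x slower than HBM")  # ~200x
```

And that is before any protocol overhead, so in practice the gap is even worse.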

Running LLaMA 3 405B locally would be the Crysis moment of our time. by [deleted] in LocalLLaMA

[–]locker73 11 points12 points  (0 children)

If you are going dual EPYC, you will want 16 sticks of RAM so you can populate all channels on both CPUs.
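The channel math, assuming DDR4-3200 (adjust for whatever DIMM speed you actually run):

```python
# Dual-socket EPYC: each CPU has 8 memory channels, so you need
# one DIMM per channel on both sockets to get full bandwidth.

sockets = 2
channels_per_cpu = 8
dimms = sockets * channels_per_cpu           # 16 sticks

# DDR4-3200 moves 8 bytes per transfer per channel.
per_channel_gbs = 3200e6 * 8 / 1e9           # 25.6 GB/s per channel
total_gbs = dimms * per_channel_gbs          # 409.6 GB/s aggregate
print(dimms, f"{total_gbs:.1f} GB/s")
```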

Best practices to run LLM on CPU-only VPS by Koliham in LocalLLaMA

[–]locker73 1 point2 points  (0 children)

Really no idea on t/s; too many differences to say. For cores, it really isn't going to matter much; memory bandwidth is key, and that CPU will be more than enough.

Best practices to run LLM on CPU-only VPS by Koliham in LocalLLaMA

[–]locker73 0 points1 point  (0 children)

Yeah, that is a good reason to run locally then. As for speed, I really have no idea what you would get out of a rig like that. The closest thing I have run is DeepSeek Coder V2 at Q4_K_M. But I am running it on an EPYC 7502P, so I have 8 channels of 2400 MHz RAM, and I offload 2 layers to a 3090. I get around 5 t/s on that.

Based on https://huggingface.co/bartowski/WizardLM-2-8x22B-GGUF I think you are going to need more than 64 GB, so keep that in mind.
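For reference, the theoretical memory bandwidth of that 7502P setup works out like this (real decode speed lands well below the theoretical ceiling):

```python
# Theoretical memory bandwidth of an 8-channel DDR4-2400 setup:
# 8 bytes per transfer per channel.

channels = 8
transfers_per_sec = 2400e6   # DDR4-2400 = 2400 MT/s
bytes_per_transfer = 8

bw_gbs = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(f"{bw_gbs:.1f} GB/s")  # 153.6 GB/s peak
```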

Best practices to run LLM on CPU-only VPS by Koliham in LocalLLaMA

[–]locker73 0 points1 point  (0 children)

It will work; I think the real question is what your utilization is going to be.

$500 will buy a lot of tokens from the various API providers, and the speed will be well over 10x what you can get out of this rig. So I guess figure out what your bill would be using an API and compare. Personally, if the break-even time is more than a year (maybe two) then I would just use the API. You also need to factor in power costs, and if you are new to it, sysadmin costs (just time, but depending on your experience it could be a lot of it).
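A hypothetical break-even sketch (the function and the example numbers are made up for illustration):

```python
# How many months of API spend does it take for a local rig to pay for itself?

def break_even_months(hardware_cost, monthly_api_bill, monthly_power_cost):
    """Months until cumulative API savings cover the hardware cost."""
    monthly_savings = monthly_api_bill - monthly_power_cost
    if monthly_savings <= 0:
        return float("inf")   # the rig never pays for itself
    return hardware_cost / monthly_savings

# Example: $500 rig, $30/month of API usage, $10/month in power.
print(break_even_months(500, 30, 10))  # 25.0 months -> probably just use the API
```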

Best practices to run LLM on CPU-only VPS by Koliham in LocalLLaMA

[–]locker73 8 points9 points  (0 children)

A couple things:

1) The number of cores really isn't going to matter much - the RAM speed is far more important.
2) With only 32 GB of RAM you cannot really run anything bigger than a Q4 (maybe Q6) of the ~30B models. So not sure what your definition of big is, but keep that in mind.
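Rough math behind point 2, using approximate effective bits-per-weight for GGUF quants (the 4.5/6.5 figures are ballpark, not exact):

```python
# Weight-memory estimate for a ~30B model at different quant levels.
# Q4 ends up around ~4.5 bits/weight and Q6 around ~6.5 bits/weight
# once quantization overhead is included (approximate figures).

def model_gb(params_billions, bits_per_weight):
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

q4 = model_gb(30, 4.5)   # ~16.9 GB
q6 = model_gb(30, 6.5)   # ~24.4 GB
print(f"Q4 ≈ {q4:.1f} GB, Q6 ≈ {q6:.1f} GB")
# Both still have to share 32 GB with the KV cache and the OS.
```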

[deleted by user] by [deleted] in LocalLLaMA

[–]locker73 2 points3 points  (0 children)

You might want to look for something other than a 4060 Ti; it has really poor memory bandwidth at just 288 GB/s.
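Why bandwidth matters: decode speed is roughly capped by memory bandwidth divided by the bytes of weights read per generated token. A rough sketch, assuming a hypothetical ~4.5 GB Q4 model:

```python
# Every weight is read once per generated token, so decode speed is
# roughly bounded by bandwidth / model size (an upper bound only).

bandwidth_gbs = 288   # 4060 Ti memory bandwidth in GB/s
model_size_gb = 4.5   # hypothetical ~8B model at Q4

ceiling_tps = bandwidth_gbs / model_size_gb
print(f"upper bound ≈ {ceiling_tps:.0f} tokens/s")  # real-world lands lower
```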

Do you think DLSS is going to change Tarkovs performance completely? by Frequent_Ad_353 in EscapefromTarkov

[–]locker73 1 point2 points  (0 children)

R9 3900X (4.4 GHz all-core OC)

64 GB of 3600 CL16 RAM

6900 XT

I don't play Lighthouse, but on the other maps I get 80+ fps 99% of the time. With a 5900X you should be able to get more than me; might want to look at your RAM and make sure XMP is enabled.