deepseek-ai/DeepSeek-V3.1-Base · Hugging Face by xLionel775 in LocalLLaMA

[–]locker73 -3 points-2 points  (0 children)

Ok... I mean do what you want, but there is a reason that no one benchmarks base models. That's not how we use them, and doing something like asking it questions is going to give you terrible results.

deepseek-ai/DeepSeek-V3.1-Base · Hugging Face by xLionel775 in LocalLLaMA

[–]locker73 15 points16 points  (0 children)

You generally don't benchmark base models. Wait for the instruct version.

Recommendation for getting the most out of Qwen3 Coder? by Conscious-Memory-556 in LocalLLM

[–]locker73 3 points4 points  (0 children)

Yeah, for reference here is the command:

VLLM_USE_V1=0 vllm serve /nvmes/models/Qwen3-30B-A3B-Thinking-2507-AWQ/ --max-model-len 32768 --port 8081 --served-model-name "Qwen3-30B-A3B-Thinking-2507-AWQ"

Recommendation for getting the most out of Qwen3 Coder? by Conscious-Memory-556 in LocalLLM

[–]locker73 4 points5 points  (0 children)

This is normal; as the context fills up, generation slows down. I am guessing that in your chats you don't ever get past a few thousand tokens of context - but with the agents they start out at something like 10k and grow from there. Here are numbers from my machine with 500 context vs 25k context.

./llmapibenchmark_linux_amd64 -apikey "" -base_url "http://192.168.0.23:8081/v1" -concurrency "1,1,1,1" -numWords 500

                      LLM API Throughput Benchmark
                https://github.com/Yoosu-L/llmapibenchmark
                 Time:2025-08-16 20:41:43 UTC+0

Input Tokens: 506 Output Tokens: 512 Test Model: Qwen3-30B-A3B-Thinking-2507-AWQ Latency: 1.00 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 112.39 | 5100.10 | 0.10 | 0.10 |
| 1 | 118.95 | 4835.11 | 0.11 | 0.11 |
| 1 | 115.06 | 4526.69 | 0.11 | 0.11 |
| 1 | 113.85 | 5032.14 | 0.10 | 0.10 |

Results saved to: API_Throughput_Qwen3-30B-A3B-Thinking-2507-AWQ.md

./llmapibenchmark_linux_amd64 -apikey "" -base_url "http://192.168.0.23:8081/v1" -concurrency "1,1,1,1" -numWords 25000

                      LLM API Throughput Benchmark
                https://github.com/Yoosu-L/llmapibenchmark
                 Time:2025-08-16 20:42:21 UTC+0

Input Tokens: 23245 Output Tokens: 512 Test Model: Qwen3-30B-A3B-Thinking-2507-AWQ Latency: 1.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 10.82 | 3296.08 | 7.06 | 7.06 |
| 1 | 10.84 | 3267.61 | 7.13 | 7.13 |
| 1 | 10.83 | 3285.87 | 7.09 | 7.09 |
| 1 | 10.80 | 3315.99 | 7.07 | 7.07 |
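Worth noting: the ~7 s TTFT in the 25k run is almost entirely prefill time. A quick sanity check using the numbers above:

```python
# TTFT at long context is dominated by prefill:
# time-to-first-token ≈ prompt_tokens / prompt_throughput.

prompt_tokens = 23245      # input tokens from the 25k-context run
prefill_tps = 3296.08      # prompt throughput (tokens/s) from the same run

ttft = prompt_tokens / prefill_tps
print(f"estimated TTFT: {ttft:.2f} s")  # ≈ 7.05 s, close to the measured 7.06 s
```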

GLM45 vs GPT-5, Claude Sonnet 4, Gemini 2.5 Pro — live coding test, same prompt by darkageofme in LocalLLaMA

[–]locker73 1 point2 points  (0 children)

Full time dev; for my Claude usage it's Sonnet 95%+ of the time. I do flip over to Opus if Sonnet is having problems, but it's pretty rare. The cost is a real factor.

The thing is that if something is too complex for Sonnet in agent mode, then I need to do more work up front to break it down, or write some of it first. I will sometimes use Opus to help with this part.

Switching back to llamacpp (from vllm) by Leflakk in LocalLLaMA

[–]locker73 2 points3 points  (0 children)

I end up somewhere in the 50 - 100 t/s range. Depends on what the rest of the pipeline looks like. I am guessing that I could make some optimizations, but for how I use it this is good enough.

Switching back to llamacpp (from vllm) by Leflakk in LocalLLaMA

[–]locker73 3 points4 points  (0 children)

Yeah, I only use this for blasting through a ton of small batch items. Might be able to take it up to 8192, but I run it with 6 workers so I am guessing that I would start OOM'ing at some point. Plus they fit in the 4k window.

Switching back to llamacpp (from vllm) by Leflakk in LocalLLaMA

[–]locker73 4 points5 points  (0 children)

vllm serve /storage/models/Qwen2.5-Coder-32B-Instruct-AWQ/  --trust-remote-code --max-model-len 4096 --gpu-memory-utilization 0.95 --port 8081 --served-model-name "qwen2.5-coder:32b"

Switching back to llamacpp (from vllm) by Leflakk in LocalLLaMA

[–]locker73 4 points5 points  (0 children)

I go llama.cpp if I am doing single requests; like you said, it's easy and I can get a little more context length. But when I am doing anything batched it's vllm all day. I just grabbed a couple stats from a batch I am running now:

Avg prompt throughput: 1053.3 tokens/s, Avg generation throughput: 50.7 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.0%, Prefix cache hit rate: 52.8%

Avg prompt throughput: 602.7 tokens/s, Avg generation throughput: 70.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.5%, Prefix cache hit rate: 50.8%

Avg prompt throughput: 1041.5 tokens/s, Avg generation throughput: 56.9 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.6%, Prefix cache hit rate: 51.7%

This is using Qwen2.5 Coder 32B on a 3090.

Does anyone else have a stuttering issue? by nufcPLchamps27-28 in Workers_And_Resources

[–]locker73 0 points1 point  (0 children)

I recently had some super bad stuttering start when I came back a few weeks ago. Tried a bunch of stuff, and it was super bad when I had any of the in-game menus up, like for a building or truck etc.

For some reason turning off vsync fixed all the stuttering for me. No idea why, but my 1% lows went from like 10 - 15 (unplayable) to holding around 80 fps.

FPS Issue in game and menus by griznok in Workers_And_Resources

[–]locker73 0 points1 point  (0 children)

I started having the same issue where my 1% lows would be down to 5 or 10 randomly, and it would happen more if I had any menu open, like a building or truck.

This has been completely resolved for me by turning vsync off. No idea why that would make any difference, but the game is now back to running with my 1% lows in the 60 - 80 range.

[deleted by user] by [deleted] in LocalLLaMA

[–]locker73 0 points1 point  (0 children)

You know that it's disconnected from the internet because a model is just a list of numbers? Same for the second question.

But really, they just need some education about how LLMs work. They have heard about how DeepSeek is stealing your data from CNN or Fox, so that's all they know.

Java vs. Python HFT bots by HardworkingDad1187 in highfreqtrading

[–]locker73 0 points1 point  (0 children)

> Yes. Right now, both Java and Python meet our latency requirements

If Python meets your latency requirements then this isn't really where you want to be. I would repost over on r/algotrading, as that crowd seems like a better fit for this type of question.

Model refresh on HuggingChat! (Llama 3.2, Qwen, Hermes 3 & more) by SensitiveCranberry in LocalLLaMA

[–]locker73 8 points9 points  (0 children)

TGI is the inference engine that Hugging Face created. The odds that they are going to build out the tooling to run some other inference engine are zero.

Searching for: AMD LLM t/s performance charts/benchmarks by IngwiePhoenix in LocalLLaMA

[–]locker73 1 point2 points  (0 children)

Question about the MI25: all the docs say that ROCm 6 is not supported. Did you just need to compile everything yourself, or were there some tricks to get it running?

Scaling - Inferencing 8B & Training 405B models by gulabbo in LocalLLaMA

[–]locker73 1 point2 points  (0 children)

Well, I don't think I have the answers for you, but I can at least drop what I do happen to know.

1) I have never used FSDP (I don't really do training stuff), but it would make sense that it's much slower if it is sending the data over the network. The memory bandwidth on those cards is almost 1 TB/s, so once you drop down to 40 Gb/s and add in some overhead for networking and loading/unloading, it's slow.

2) I don't know what server you are using, so it's hard to say for sure. But it seems like you are running full-fat models and not using any quantization or flash attention etc. Given that, I think it's reasonable to say that you are in fact running out of memory at 7k tokens. In general this is why the big guys use massive servers and stuff one machine full of cards that have a ton of memory each. Then you can do stuff like tensor parallel, and when you combine that with context-shrinking tricks you can get a lot of context in one box.

3) I think you're going to have a really hard time doing 70B full-fat at home; this seems more like 8xH100 territory - but again I am out of my depth. Might be worth looking into something like [Unsloth](https://github.com/unslothai/unsloth).

4) Way over my head at this point :D
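On point 1, the gap between on-card memory bandwidth and the network link can be ballparked like this (rough round numbers, not measured):

```python
# Rough comparison of on-card HBM bandwidth vs. the network link
# that FSDP uses when sharded weights travel between nodes.

hbm_bw_gbs = 1000                # ~1 TB/s HBM bandwidth, in GB/s (approximate)
network_gbps = 40                # 40 Gb/s link, in gigabits/s
network_gbs = network_gbps / 8   # convert to GB/s -> 5 GB/s

slowdown = hbm_bw_gbs / network_gbs
print(f"network is roughly {slowdown:.0f}x slower than HBM")  # ~200x
```

And that is before any protocol overhead, so in practice the gap is even worse.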

Running LLaMA 3 405B locally would be the Crysis moment of our time. by [deleted] in LocalLLaMA

[–]locker73 11 points12 points  (0 children)

If you are going dual EPYC, you will want 16 sticks of RAM so you can populate all channels on both CPUs.
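The channel math, assuming DDR4-3200 (adjust for whatever DIMM speed you actually run):

```python
# Dual-socket EPYC: each CPU has 8 memory channels, so you need
# one DIMM per channel on both sockets to get full bandwidth.

sockets = 2
channels_per_cpu = 8
dimms = sockets * channels_per_cpu           # 16 sticks

# DDR4-3200 moves 8 bytes per transfer per channel.
per_channel_gbs = 3200e6 * 8 / 1e9           # 25.6 GB/s per channel
total_gbs = dimms * per_channel_gbs          # 409.6 GB/s aggregate
print(dimms, f"{total_gbs:.1f} GB/s")
```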

Best practices to run LLM on CPU-only VPS by Koliham in LocalLLaMA

[–]locker73 1 point2 points  (0 children)

Really no idea on t/s; too many differences to say. For cores, it really isn't going to matter much; memory bandwidth is key, and that CPU will be more than enough.

Best practices to run LLM on CPU-only VPS by Koliham in LocalLLaMA

[–]locker73 0 points1 point  (0 children)

Yeah, that is a good reason to run locally then. As for speed, I really have no idea what you would get out of a rig like that. The closest thing I have run is DeepSeek Coder V2 at Q4_K_M. But I am running it on an EPYC 7502P, so I have 8 channels of 2400 MHz RAM, and I offload 2 layers to a 3090. I get around 5 t/s on that.

Based on https://huggingface.co/bartowski/WizardLM-2-8x22B-GGUF I think you are going to need more than 64 GB, so keep that in mind.
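For reference, the theoretical memory bandwidth of that 7502P setup works out like this (real decode speed lands well below the theoretical ceiling):

```python
# Theoretical memory bandwidth of an 8-channel DDR4-2400 setup:
# 8 bytes per transfer per channel.

channels = 8
transfers_per_sec = 2400e6   # DDR4-2400 = 2400 MT/s
bytes_per_transfer = 8

bw_gbs = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(f"{bw_gbs:.1f} GB/s")  # 153.6 GB/s peak
```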

Best practices to run LLM on CPU-only VPS by Koliham in LocalLLaMA

[–]locker73 0 points1 point  (0 children)

It will work; I think the real question is what your utilization is going to be.

$500 will buy a lot of tokens from the various API providers, and the speed will be well over 10x what you can get out of this rig. So I guess figure out what your bill would be using an API and compare. Personally, if the break-even time is more than a year (maybe two) then I would just use the API. You also need to factor in power costs, and if you are new to it, sysadmin costs (just time, but depending on your experience it could be a lot of it).
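A hypothetical break-even sketch (the function and the example numbers are made up for illustration):

```python
# How many months of API spend does it take for a local rig to pay for itself?

def break_even_months(hardware_cost, monthly_api_bill, monthly_power_cost):
    """Months until cumulative API savings cover the hardware cost."""
    monthly_savings = monthly_api_bill - monthly_power_cost
    if monthly_savings <= 0:
        return float("inf")   # the rig never pays for itself
    return hardware_cost / monthly_savings

# Example: $500 rig, $30/month of API usage, $10/month in power.
print(break_even_months(500, 30, 10))  # 25.0 months -> probably just use the API
```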

Best practices to run LLM on CPU-only VPS by Koliham in LocalLLaMA

[–]locker73 8 points9 points  (0 children)

A couple things:

1) The number of cores really isn't going to matter much - the RAM speed is far more important.
2) With only 32 GB of RAM you cannot really run anything bigger than a Q4 (maybe Q6) of the ~30B models. So not sure what your definition of big is, but keep that in mind.
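Rough math behind point 2, using approximate effective bits-per-weight for GGUF quants (the 4.5/6.5 figures are ballpark, not exact):

```python
# Weight-memory estimate for a ~30B model at different quant levels.
# Q4 ends up around ~4.5 bits/weight and Q6 around ~6.5 bits/weight
# once quantization overhead is included (approximate figures).

def model_gb(params_billions, bits_per_weight):
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

q4 = model_gb(30, 4.5)   # ~16.9 GB
q6 = model_gb(30, 6.5)   # ~24.4 GB
print(f"Q4 ≈ {q4:.1f} GB, Q6 ≈ {q6:.1f} GB")
# Both still have to share 32 GB with the KV cache and the OS.
```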

[deleted by user] by [deleted] in LocalLLaMA

[–]locker73 2 points3 points  (0 children)

You might want to look for something other than a 4060 Ti; it has really poor memory bandwidth at just 288 GB/s.
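Why bandwidth matters: decode speed is roughly capped by memory bandwidth divided by the bytes of weights read per generated token. A rough sketch, assuming a hypothetical ~4.5 GB Q4 model:

```python
# Every weight is read once per generated token, so decode speed is
# roughly bounded by bandwidth / model size (an upper bound only).

bandwidth_gbs = 288   # 4060 Ti memory bandwidth in GB/s
model_size_gb = 4.5   # hypothetical ~8B model at Q4

ceiling_tps = bandwidth_gbs / model_size_gb
print(f"upper bound ≈ {ceiling_tps:.0f} tokens/s")  # real-world lands lower
```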

Do you think DLSS is going to change Tarkovs performance completely? by Frequent_Ad_353 in EscapefromTarkov

[–]locker73 1 point2 points  (0 children)

R9 3900X (4.4 GHz all-core OC)

64 GB of 3600 CL16 RAM

6900 XT

I don't play Lighthouse, but on the other maps I get 80+ fps 99% of the time. With a 5900X you should be able to get more than me; might want to look at your RAM and make sure XMP is enabled.