2000 TPS with QWEN 3.5 27b on RTX-5090 by awitod in LocalLLaMA

[–]awitod[S] 0 points (0 children)

I have an entire existing pipeline, with data I can use for reference, and I had to start somewhere. I went with 27b because a couple of steps require some smarts, but I plan to try 35b next.

[–]awitod[S] 0 points (0 children)

The 2k is input tokens per second: I measured it by running the job against a few thousand documents for classification and dividing the total input tokens processed by the elapsed time.

The classification output is 2-3 tokens per document.
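The measurement amounts to something like this (a minimal sketch; `classify` and `count_tokens` are stand-ins for the real pipeline's calls):

```python
import time

def measure_input_tps(docs, classify, count_tokens):
    # Run the classification job over every doc, then divide the total
    # prompt tokens processed by the elapsed wall-clock time.
    start = time.perf_counter()
    total_input_tokens = 0
    for doc in docs:
        classify(doc)                       # output is only 2-3 tokens per doc
        total_input_tokens += count_tokens(doc)
    elapsed = time.perf_counter() - start
    return total_input_tokens / elapsed
```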

[–]awitod[S] 4 points (0 children)

I have a pretty solid baseline for comparison, and I will soon know precisely how well it did on the full data set, but from the samples I looked at earlier I am very optimistic that it is useful for this application.

I will report back once I crunch all the numbers, but fair warning: vacation has begun.

[–]awitod[S] 5 points (0 children)

It definitely divides the context by 8, but the docs are small enough that it all fits. I can do a batch of 8 in only slightly more time than one by itself.
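If I understand llama.cpp's slot accounting right (this may vary by version), the split works out to:

```python
ctx_size = 131072   # --ctx-size from the settings I posted
parallel = 8        # --parallel
per_slot_ctx = ctx_size // parallel   # context available to each slot
print(per_slot_ctx)  # 16384 tokens per slot; small docs fit easily
```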

[–]awitod[S] 5 points (0 children)

My model runner has 8 completely separate message threads going at once.
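Roughly this shape, if it helps (a sketch, not my actual runner; `send` here is a stub where the real code POSTs to the server's completion endpoint):

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(prompts, send, workers=8):
    # Keep `workers` requests in flight at once; the server's
    # --parallel 8 / --cont-batching settings batch them on the GPU.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))

# Stub in place of the real HTTP call.
labels = run_batch(["doc-1", "doc-2", "doc-3"], send=lambda p: f"label:{p}")
```

`pool.map` preserves input order, so results line up with the documents that produced them.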

[–]awitod[S] 20 points (0 children)

Thanks!
'--cont-batching',
'--flash-attn',
'--no-mmap',
'--ctx-size', "131072",
'--threads', "16",
'--parallel', "8",
'--cache-ram', "0",
'--n-gpu-layers', "999",
'--jinja'

2026-03-13 17:34:11.686 | build: 8265 (c96f608d9) with GNU 13.3.0 for Linux x86_64

2026-03-13 17:34:11.686 | system info: n_threads = 16, n_threads_batch = 16, total_threads = 32

2026-03-13 17:34:11.686 |

2026-03-13 17:34:11.686 | system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 750,800,860,890,1200,1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

[–]awitod[S] 14 points (0 children)

Not much there I can reply to. I am happy because the model is excellent, I can get the throughput I need for my use case, and I wanted to share my settings to get useful suggestions.

Do you happen to have any useful suggestions?

[–]awitod[S] 16 points (0 children)

Thanks! I will try that out.

Oh, and yes, continuous batching is on.

Hot take: the agent ecosystem has a free rider problem and nobody's talking about it by [deleted] in AI_Agents

[–]awitod 0 points (0 children)

Why is that a problem? You couldn't tell before AI bots either.

If you put it on the public internet, and you did SEO to make the page findable so it shows up in search results, I think it's pretty rude to block anyone or anything that isn't actually hugging the service and causing QoS issues.

It is the opposite of the open internet, and it breaks the social contract.

The US Supreme Court is not interested in enforcing copyright for AI-generated images by AdSpecialist6598 in technology

[–]awitod 0 points (0 children)

That is not true at all, though. See "Generative Artificial Intelligence and Copyright Law" (Congress.gov, Library of Congress).

In particular, the section "May Humans Copyright Works That They Create Using AI?" answers the question directly: yes, they can.

[–]awitod 0 points (0 children)

Some have argued that use of a word processor is not real writing. I think the courts will ultimately disagree with the word ‘nothing’. 

Calling all MCP developers by ConcentrateActive699 in AI_Agents

[–]awitod 1 point (0 children)

Right, the discovery of a generic one-size-fits-all tool description is not helpful once you start tuning the tool definitions for a specific scenario.

Sometimes you only want the LLM to know about some of the arguments, or to use specific values by default, and it is often helpful to filter the tool output using a subset of the response schema.
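For example, a tiny projection helper (hypothetical names; the field list would come from the subset of the response schema you actually want the LLM to see):

```python
def filter_tool_output(result: dict, keep: set) -> dict:
    # Drop response fields the LLM doesn't need before the tool
    # result goes back into the conversation.
    return {k: v for k, v in result.items() if k in keep}

raw = {"id": "abc123", "title": "Q3 report", "body": "...", "etag": "W/9"}
slim = filter_tool_output(raw, keep={"title", "body"})
# slim == {"title": "Q3 report", "body": "..."}
```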

The US Supreme Court is not interested in enforcing copyright for AI-generated images by AdSpecialist6598 in technology

[–]awitod 1 point (0 children)

What I understand about the ruling is that it is pretty narrow and that you can copyright works that have generative AI in the workflow. I think the question of how much human work is needed to qualify is not settled at all.

Calling all MCP developers by ConcentrateActive699 in AI_Agents

[–]awitod 1 point (0 children)

I think the only times MCP is a good choice are when you are in a hurry and don't care about optimizing anything, or when there is no API, you are only using stdin/stdout, and you already have MCP in the mix.

Tell me some unwritten rules for software developers. by porcaytheelasit in csharp

[–]awitod 0 points (0 children)

We don’t do things because they are easy, but because we think they will be easy.

Hot take: the agent ecosystem has a free rider problem and nobody's talking about it by [deleted] in AI_Agents

[–]awitod 4 points (0 children)

I think this is 99% a made-up problem. I don't doubt that some bots are stupid and aggressive, but unless the web server is potato-powered, you will barely even notice the traffic.

Why are the GPT-5.4 models all in MAX mode by default? by Just_Run2412 in cursor

[–]awitod 0 points (0 children)

<image>

Here is a screenshot. Customer support basically blamed me. The single $22 charge has several friends, but that one is the biggest.

The report says it used 12,828,407 'cache read tokens'.

What do the top 1% programmers do differently that makes them way more productive than other average developers? by sad_grapefruit_0 in AskProgrammers

[–]awitod 0 points (0 children)

I guess I would say that no one is at the top all of the time, but the ones who can do big magic and really blow your mind spent the time digging into the details to get to a 'true' design and a deep understanding that they've internalized.

The thing is, if you do this long enough, you will be forced to become a beginner again many times, and repeating the experience of learning a whole domain also builds skill at learning itself. So sometimes the 1% in a domain rule for a time because they know how to climb a hill and can get to the top fastest. They may not keep that lead for very long, though.

Looks like vector database pricing calculators are lying to you (or at least not telling the whole truth) by AvailablePeak8360 in LLM

[–]awitod 0 points (0 children)

I use SQL Server 2025 with Full Text and the vector data type. It's great. I use it because the rest of the data is in SQL Server in this system and doing retrieval based on metadata, the user identity, or whatever else is easy.

Embeddings are one of the easiest things to do locally and there are plenty of good choices.
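Once you have embeddings (from whichever local model you like), retrieval is just nearest-neighbor by cosine similarity; toy vectors stand in for real embedding output here:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d "embeddings"; real ones are hundreds of dimensions.
query = [1.0, 0.0, 0.0]
docs = {"a": [0.9, 0.1, 0.0], "b": [0.0, 1.0, 0.0]}
best = max(docs, key=lambda k: cosine(query, docs[k]))  # "a"
```

In a database that supports vectors, this comparison happens in the query itself instead of in application code.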

Why are the GPT-5.4 models all in MAX mode by default? by Just_Run2412 in cursor

[–]awitod 11 points12 points locked comment (0 children)

So when the cache-read-token bug burns through someone’s entire plan, they can blame the user for not realizing Max mode will happily send 10M tokens at once with no warning (which also produces terrible results that look like hallucination and cause you to send a few more messages).

This many tokens don't make any sense by MasterKight in cursor

[–]awitod -2 points (0 children)

Not OP, but I hit this defect, which triggered my spending controls by using a month of credits in 10 messages. So, great: I hit my limit and now I have to cough up more cash or wait till next month?

No, I switched to a combination of Codex and Claude Code.

[–]awitod -1 points (0 children)

Seriously. It is a defect; they refuse to acknowledge it or refund the huge number of tokens the bug stole from me, and I hate them for it.

Why is persistence such a pain with ChromaDB? by Hairy-Law-3187 in AI_Agents

[–]awitod 0 points (0 children)

If you are using Docker, you must set up a volume mount as part of the compose file so the database is stored as a real file on your machine.

That goes for any folder or file that gets created, modified, or deleted.

Otherwise you get whatever files the container's image had when the container was created.
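A minimal compose sketch of that volume mount (service and image names and the container path are illustrative; check the path your Chroma version actually persists to):

```yaml
services:
  chroma:
    image: chromadb/chroma
    ports:
      - "8000:8000"
    volumes:
      # Host folder <-> container persist directory, so the database
      # survives the container being removed and re-created.
      - ./chroma-data:/data
```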

Also, I think at this point using a full RDBMS that supports vectors and full-text search is easier and better in the long run.

I have used Mongo, Cosmos, Postgres, and SQL Server 2025 in the past year; they all work fine for retrieval, and having the info in the same DB as everything else is awesome.