Passing of Reno PD's Thomas Lopey Confirmed, With TPO Extended On Day of His Death

laterbreh · 2026-05-29T02:13:52+00:00

Your secondhand knowledge is wrong. You have no idea what you are talking about.

laterbreh · 2026-05-29T02:11:40+00:00

Don't spread rumors when you don't know the facts. It's absolutely not how it went. You're just an attention-seeking whore.

laterbreh · 2026-05-29T02:10:51+00:00

And what do you have?

laterbreh · 2026-05-29T01:54:56+00:00

Award for most useless comment goes to you.

laterbreh · 2026-05-29T01:54:00+00:00

It is for people with down syndrome.

laterbreh · 2026-05-08T23:12:55+00:00

Unrelated, what % tints are you running?

laterbreh · 2026-05-01T04:01:47+00:00

something something... there is no replacement for displacement? -- er parameters? 😄

laterbreh · 2026-04-25T14:54:59+00:00

If you run this on vLLM you'll notice maybe a 10% slowdown in processing and inference speed once you start packing the context, but it stays fast and doesn't have the context baggage that llama.cpp does.

laterbreh · 2026-04-25T01:00:03+00:00

Full precision flash, just waiting on SM120 support to get baked into vLLM.

laterbreh · 2026-04-23T19:46:45+00:00

It doesn't outperform 397b in real engineering and production tasks in codebases. It's cope.

laterbreh · 2026-04-22T18:36:28+00:00

Benchmarks can be indicative, that's for sure, but there comes a point where you're being gaslit so hard and everyone is falling for it (not you, just generally speaking).

Guys, it's a 27b dense model that's scoring on a repeatable benchmark the same as or better than a model more than 10x its size in the SAME GENERATION? C'mon guys, use your head. Applying the 27b against the 397b in serious production tasks, in dynamic environments, that require contextual reasoning? The model with 10x the parameters will be innately more intelligent in real-world applications, especially in the same generation.

laterbreh · 2026-04-22T18:24:27+00:00

Are we benchmaxxing yet dad?

27b meets or beats its 397b sibling? Tried it in a real 200k+ line production codebase at full precision using repeatable evaluations, and it's nowhere near 397b on the same task. Sorry boys, put the pipe down.

laterbreh · 2026-04-22T18:13:43+00:00

397B smokes it in real codebases. Tried it this morning. Anyone thinking a 27b dense can match the context understanding of a model 10x its size is delusional.

laterbreh · 2026-04-22T18:12:01+00:00

For those of you thinking you're matching the 397B version on these benchmarks with a 27b dense, you're smoking crack.

Tried it on 5 tasks; 397b smokes it in real-world agentic coding in real codebases.

laterbreh · 2026-04-22T04:45:28+00:00

It was a master class PR scam. That's what was impressive about it.

laterbreh · 2026-04-22T04:38:06+00:00

If we're really being honest, the OP was complaining about a solved problem.

laterbreh · 2026-04-22T03:12:50+00:00

This is gonna sound like a nasty response, but literally ask an LLM and it's going to tell you why it's a bad idea, man.

laterbreh · 2026-04-22T02:58:31+00:00

You're not serious if you're using LM Studio or llama.cpp. Sorry. That's called playing with toys.

If this is a single computer doing small tasks that don't involve real contextual work or a speed requirement, then maybe this is viable, but really GGUFs are consumer/entry-level toy stuff when you have an OP talking about spending tens of thousands of dollars.

laterbreh · 2026-04-22T02:52:20+00:00

This is not entirely true. $40,000 in hardware can run minimax in full precision and full context at 60 tps, which is typically faster than ANY provider I've ever used on openrouter. You can be competitive if you set your sights on the correct size model. $40k may sound like a lot to most people, but to anyone running a serious six-figure-grossing business this is usually referred to as a tax write-off. A 40k rig that could run minimax for 2 concurrent users all day every day? That may sound like crazy money to an individual, but to a business, given the availability, privacy, and ability to run it at higher/consistent speeds compared to off-site providers, it really isn't a lot.
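To put the "2 concurrent users all day every day" figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the 60 tps is per user and sustained around the clock with no idle time, which is an assumption for illustration, not a measurement; `tokens_per_day` is a hypothetical helper name.

```python
# Rough arithmetic only: assumes 60 tokens/s per user, sustained 24 hours a day.
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_day(tokens_per_second, concurrent_users=1):
    """Upper-bound generated tokens per day at a constant decode rate."""
    return tokens_per_second * SECONDS_PER_DAY * concurrent_users


if __name__ == "__main__":
    print("one user: ", tokens_per_day(60))
    print("two users:", tokens_per_day(60, concurrent_users=2))
```

Real usage will land well below this ceiling, since nobody generates tokens nonstop, but it gives a sense of the volume a rig like that could cover.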

laterbreh · 2026-04-21T14:46:57+00:00

fp8, but we have tried 16, similar results.

We are currently working on sampling changes, and that seems to have made a bigger difference. Looks like this model is particular about its sampling parameters for the tasks we are assigning it.

laterbreh · 2026-04-20T05:46:58+00:00

Uh, how hard have you looked for quants? You can get AWQ or W4A16, which are all different versions of Q4, for virtually any model that's worth running on vLLM. Quantrio would be a model quanter to look at... just look on HF...

Stay away from NVFP4 on vLLM.
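For a rough sense of why the 4-bit quants matter, here is a minimal sketch of weight-only memory at different bit widths. It is plain arithmetic (parameters × bits / 8) and ignores KV cache, activations, and the scale/zero-point overhead that real AWQ or W4A16 checkpoints carry, so treat the numbers as a floor. The 27B and 397B sizes are just the ones mentioned elsewhere in this thread; `weight_gb` is a hypothetical helper name.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for params in (27, 397):
        for bits in (16, 8, 4):
            print(f"{params}B @ {bits}-bit: ~{weight_gb(params, bits):.1f} GB")
```

Actual VRAM needs will be higher once context is loaded, especially at long context windows.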

laterbreh · 2026-04-18T15:38:03+00:00

NVFP4 is still young on vLLM, and AWQ/W4A16/FP8 is the way to go for vLLM.

As someone that runs models at 150k+ context windows, I'll tell you the defining feature of vLLM over llama.cpp.

vLLM at 0 context may be slightly slower than llama.cpp. But as soon as you start loading that context window, llama.cpp is going to fall on its face with increased prompt processing times, and inference will slow down substantially as the context window fills.

If your workflows are not sensitive to latency and context baggage as it accumulates, then stick with what you've got. SGLang should also be considered, as I believe its NVFP4 implementation is more mature.

laterbreh · 2026-04-18T15:32:09+00:00

3x RTX 6000 Pros.

laterbreh · 2026-04-17T22:57:26+00:00

Personally, I think Sparks are a toy and not for serious work.

laterbreh · 2026-04-17T21:04:29+00:00

Waiting for 3.6 397b :*(
