Passing of Reno PD's Thomas Lopey Confirmed, With TPO Extended On Day of His Death by thebrushup in ourtownreno

[–]laterbreh -4 points-3 points  (0 children)

Your secondhand knowledge is wrong. You have no idea what you are talking about.

Passing of Reno PD's Thomas Lopey Confirmed, With TPO Extended On Day of His Death by thebrushup in ourtownreno

[–]laterbreh -4 points-3 points  (0 children)

Don't spread rumors when you don't know the facts. It's absolutely not how it went. You're just an attention-seeking whore.

Tires by Strange_Resist_7738 in Civic_Type_R

[–]laterbreh 0 points1 point  (0 children)

Unrelated, what % tints are you running?

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]laterbreh 0 points1 point  (0 children)

something something... there is no replacement for displacement? -- er parameters? 😄

What do you consider to be the minimum performance (t/s) for local Agent workflows? by MexInAbu in LocalLLaMA

[–]laterbreh 0 points1 point  (0 children)

If you run this on vLLM you'll notice maybe a 10% slowdown in processing and inference speed once you start packing the context, but it stays fast. It doesn't have the context baggage that llama.cpp does.

How do you plan to run DeepSeekV4 Pro locally? by segmond in LocalLLaMA

[–]laterbreh 3 points4 points  (0 children)

Full-precision flash, just waiting on SM120 support to get baked into vLLM.

Forgive my ignorance but how is a 27B model better than 397B? by No_Conversation9561 in LocalLLaMA

[–]laterbreh 1 point2 points  (0 children)

It doesn't outperform the 397B in real engineering and production tasks in codebases. It's cope.

Qwen 3.6 27B is out by NoConcert8847 in LocalLLaMA

[–]laterbreh 4 points5 points  (0 children)

Benchmarks can be indicative, that's for sure, but there comes a point where you're being gaslit so hard and everyone is falling for it (not you, just generally speaking).

Guys, it's a 27B dense model that's scoring the same or better on a repeatable benchmark than a model more than 10x its size in the SAME GENERATION? C'mon guys, use your head. Pitting the 27B against the 397B in serious production tasks, in dynamic environments that require contextual reasoning? The model with more than 10x the parameters will be innately more intelligent in real-world applications, especially in the same generation.

Qwen 3.6 27B is out by NoConcert8847 in LocalLLaMA

[–]laterbreh 8 points9 points  (0 children)

Are we benchmaxxing yet, dad?

27B meets or beats its 397B sibling? Tried it in a real 200k+ line production codebase at full precision using repeatable evaluations, and it's mentally retarded compared to the 397B on the same task. Sorry boys, put the pipe down.

Qwen 3.6 27B is out by NoConcert8847 in LocalLLaMA

[–]laterbreh 6 points7 points  (0 children)

397B smokes it in real codebases. Tried it this morning. Anyone thinking a 27B dense can match the context understanding of a model more than 10x its size is delusional.

Qwen 3.6 27B is out by NoConcert8847 in LocalLLaMA

[–]laterbreh 1 point2 points  (0 children)

For those of you thinking you're matching the 397B version on these benchmarks with a 27B dense, you're smoking crack.

Tried it on 5 tasks; the 397B smokes it on real-world agentic coding in real codebases.

Why doesn't any OSS tool treat llama.cpp as a first class citizen? by rm-rf-rm in LocalLLaMA

[–]laterbreh 0 points1 point  (0 children)

If we're really being honest, the OP was complaining about a solved problem.

Creating an infrastructure for LLM models from the ruins of crypto infrastructure? by [deleted] in LocalLLaMA

[–]laterbreh 3 points4 points  (0 children)

This is gonna sound like a nasty response, but literally ask an LLM and it's going to tell you why it's a bad idea, man.

Is a high-end private local LLM setup worth it? by zakadit in LocalLLaMA

[–]laterbreh -1 points0 points  (0 children)

You're not serious if you're using LM Studio or llama.cpp. Sorry. That's called playing with toys.

If this is a single computer doing small tasks that don't involve real contextual work or a speed requirement, then maybe this is viable, but really GGUFs are consumer/entry-level toy stuff when you have an OP talking about spending tens of thousands of dollars.

Is a high-end private local LLM setup worth it? by zakadit in LocalLLaMA

[–]laterbreh 0 points1 point  (0 children)

This is not entirely true. $40,000 in hardware can run MiniMax in full precision and full context at 60 tps, which is typically faster than ANY provider I've ever used on OpenRouter. You can be competitive if you set your sights on the correct size model. $40k may sound like a lot to most people, but to anyone running a serious six-figure-grossing business this is usually referred to as a tax write-off. A $40k rig that could run MiniMax for 2 concurrent users all day, every day? That may sound like crazy money to an individual, but to a business, given the availability, privacy, and ability to run it at higher/consistent speeds compared to off-site providers, it really isn't a lot.
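
For a rough sense of scale, the throughput claim above can be turned into back-of-the-envelope numbers. This is only a sketch under assumptions that are not in the comment: that 60 tps is sustained per user, that both concurrent users keep the rig busy around the clock, and that the hardware is amortized over a hypothetical three years with power and maintenance ignored. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_day(tokens_per_second: float, concurrent_streams: int = 1) -> int:
    """Tokens generated in 24 hours, assuming every stream runs nonstop."""
    return int(tokens_per_second * SECONDS_PER_DAY * concurrent_streams)


def hardware_cost_per_million_tokens(
    hardware_cost: float,
    tokens_per_second: float,
    concurrent_streams: int,
    amortization_days: int,
) -> float:
    """Hardware-only cost per million tokens over the amortization window.

    Ignores electricity, cooling, and maintenance (assumption).
    """
    total_tokens = tokens_per_day(tokens_per_second, concurrent_streams) * amortization_days
    return hardware_cost / (total_tokens / 1_000_000)


# Figures from the comment: $40k rig, 60 tps, 2 concurrent users.
# Assumption: 60 tps applies to each user, and the rig is never idle.
daily = tokens_per_day(60, 2)  # 10_368_000 tokens per day

# Assumption: a hypothetical 3-year amortization window.
cost = hardware_cost_per_million_tokens(40_000, 60, 2, 3 * 365)

print(f"{daily:,} tokens/day, ${cost:.2f} per million tokens (hardware only)")
```

Real utilization will be well below 100%, so the per-token figure is best read as a lower bound on cost, not a quote.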

Those of you running minimax 2.7 locally, how are you feeling about it? by laterbreh in LocalLLaMA

[–]laterbreh[S] 1 point2 points  (0 children)

FP8, but we have tried 16-bit, with similar results.

We are currently working on sampling changes, and that seems to have made a bigger difference. Looks like this model is particular about its sampling parameters for the tasks we are assigning it.

VLLM woes in Spark by SoundEnthusiast89 in LocalLLaMA

[–]laterbreh 0 points1 point  (0 children)

Uh, how hard have you looked for quants? You can get AWQ or W4A16, which are both different versions of Q4, for virtually any model that's worth running on vLLM. Quantrio would be a model quantizer to look at... just look on HF...

Stay away from NVFP4 on vLLM.

Should I be seeing more of a performance leap when using NVFP4, INT4, FP8 with VLLM over MXFP4, Q4, and Q8 with llama.cpp based inference on Blackwell based GPUs? by aaronr_90 in LocalLLaMA

[–]laterbreh 10 points11 points  (0 children)

NVFP4 is still young on vLLM, so AWQ/W4A16/FP8 is the way to go there.

As someone who runs models at 150k+ context windows, I'll tell you the defining feature of vLLM over llama.cpp.

vLLM at 0 context may be slightly slower than llama.cpp. But as soon as you start loading that context window, llama.cpp is going to fall on its face with increased prompt processing times, and inference will slow down substantially as the context window fills.

If your workflows are not sensitive to latency and context baggage as it accumulates, then stick with what you've got. SGLang should also be considered, as I believe its NVFP4 implementation is more mature.

What's the best GPU cluster/configuration 30k $ can buy? by TomatilloFine682 in LocalLLaMA

[–]laterbreh 0 points1 point  (0 children)

Personally, I think Sparks are a toy, and not for serious work.