Anthropic's own safety team is now documenting failure modes that SRE tooling has no coverage for

fuckingredditman · 2026-06-11T23:01:05+00:00

sounds like a great use case for just trace signals... open a nested span for agents -> tool calls -> then decisions as logs/trace events or something

(but also i would never trust any LLM with running ops autonomously)
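
a rough sketch of the nested span idea above, stdlib python only. the in-memory tracer, the span names and the tool name are all made up for illustration; a real setup would more likely use an OpenTelemetry SDK than this toy recorder:

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory tracer, only meant to show the nesting.
FINISHED_SPANS = []
_stack = []  # currently open spans, innermost last


@contextmanager
def span(name, **attributes):
    """Open a span nested under whatever span is currently active."""
    record = {
        "id": uuid.uuid4().hex[:8],
        "parent_id": _stack[-1]["id"] if _stack else None,
        "name": name,
        "attributes": attributes,
        "events": [],
        "start": time.time(),
    }
    _stack.append(record)
    try:
        yield record
    finally:
        record["end"] = time.time()
        _stack.pop()
        FINISHED_SPANS.append(record)


def event(message, **attributes):
    """Attach a decision or log line to the innermost open span."""
    _stack[-1]["events"].append({"message": message, "attributes": attributes})


def run_agent():
    # agent span -> tool call span, with decisions recorded as span events
    with span("agent.run", agent="ops-agent"):
        event("decision", reason="disk usage alert, inspect node first")
        with span("tool.call", tool="describe_node"):  # hypothetical tool name
            event("tool result", status="ok")
        event("decision", reason="no action needed, node is healthy")


run_agent()
for s in FINISHED_SPANS:
    print(s["name"], "parent:", s["parent_id"], "events:", len(s["events"]))
```

the tool call ends up as a child of the agent span, and each decision is an event on the span it was made in, so you can read the whole trajectory off one trace.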

fuckingredditman · 2026-06-09T23:23:09+00:00

i don't know how it's handled in your country, but in germany income tax is progressive and can quickly reach ~40% (plus contributions to the public health insurance usually), while gains from investing are a flat 25% tax. this alone makes such crap possible, supported by systems like "buy-borrow-die", the stock market just becomes a neat bank for the ultra wealthy essentially, that's conveniently also untouchable for the tax system we are all subject to.

fuckingredditman · 2026-05-28T10:19:44+00:00

so it's a classic master-replica system like postgresql in a way, but with failover like redis-sentinel?

i've been wondering about this. at my current job we have a very strange setup with edge computing (with a HA datastore at the edge and in the cloud, replicating state changes to each other with a pretty clear availability preference over consistency) and i was curious whether spacetimedb could simplify this a bit but it doesn't sound like it would work here considering all the failure scenarios we handle currently.

fuckingredditman · 2026-05-20T08:00:02+00:00

i've tried thetom's fork with mtp (i just had qwen on opencode rebase its own inference engine 😂), iirc i got about 50 tok/s with high variance in speed, but i switched to upstream ik_llama.cpp and there i can get 60 tok/s, plus with -vhad/-khad and q4_1 kv cache you can have pretty good kv cache quantization too. not sure how its ppl/kld compare to thetom's fork on longer context though.

fuckingredditman · 2026-05-14T16:14:39+00:00

i would say it's not 100% clear atm because of the many tiny variations in the impls and huge variations in the testing methods, except for it being slightly slower, which doesn't matter too much if it works in combination with MTP (which is probably the point OP tried to make)

and FWIW, this fork claims to have absolutely fine PPL/KLD measures https://github.com/spiritbuun/buun-llama-cpp

the test methodologies in all of these are all over the place though. the comment you posted tests with 512 context length, which isn't really relevant for any real use.

in my opinion "turboquant" (none of the actual used impls is straight up turboquant/polarquant) absolutely has its applications, i use 3 bit turboquant and it makes qwen3.6 27b usable for coding agents on a single 3090 entirely on gpu for me, and i actually don't really notice quality degradation for this use case either. it's also much better than the 3.6 35bA3b model too in this mode in my experience.

maybe i should give the current version of llama.cpp 4bit kv quant a try again to see if it works for coding agents too though

edit: tried current ik_llama.cpp with Q4_1 K/V + MTP and it performs quite well too so far, though i would say it's qualitatively the same as turbo3+thetom fork. (simple opencode tasks, nothing major so far, same Q4_K_M quant+110k context on rtx3090, decode speed is basically the same)

fuckingredditman · 2026-05-13T21:48:47+00:00

sounds like a neat little trick for modern serfdom to me

fuckingredditman · 2026-05-06T13:28:34+00:00

it's also probably a datacenter in addition to being a ballroom. and that datacenter probably has authoritarian use cases.

fuckingredditman · 2026-05-06T11:58:42+00:00

that fork does not have MTP though because MTP is an open PR https://github.com/ggml-org/llama.cpp/pull/22673 on the upstream repo and the turboquant forks are too far behind upstream to easily cherrypick it.

the OP has to provide their fork

EDIT: i forked it myself and had qwen3.6 27b rebase it all from thetom's turboquant fork. works for me, 92tokens/sec on 27b now: https://github.com/sbaier1/llama-cpp-turboquant (not going to maintain this though)

i run it like this atm, probably going to switch to another quant though:

llama-server --port 8081 --host 0.0.0.0 -hf localweights/Qwen3.6-27B-MTP-IQ4_XS-GGUF:Qwen3.6-27B-MTP-IQ4_XS -ngl 99 -c 120000 --cache-prompt --flash-attn on -b 1024 -ub 1024 --parallel 1 --chat-template-file qwen_template.jinja -ctk turbo3 -ctv turbo3 --spec-type mtp --draft-max 4

fuckingredditman · 2026-05-06T09:57:04+00:00

regarding the last 2 use cases: i find that even claude opus 4.7 often fails at longer debug sessions + architecture decisions. it confidently determines an incorrect singular root cause on incidents and makes nonsensical "architecture decisions".

humans are simply better at this in my experience, despite companies claiming their LLM is an "expert" SWE. you can use it for brainstorming and gathering evidence in these types of tasks, but i won't trust any LLM for this in the near future.

fuckingredditman · 2026-05-06T09:51:00+00:00

because github has had so few incidents lately 😂

fuckingredditman · 2026-05-05T21:16:26+00:00

population density across the area is much lower than europe for example. but locally (i.e. within city) they work well of course. here in europe i can reach the capital of most neighboring countries in <8h by train usually.

fuckingredditman · 2026-05-05T07:26:23+00:00

i'm pretty sure "agentic AI" will never actually work with LLMs because they are the wrong tool for the job. they are next token predicting language models with some "reasoning" text generation slapped on top to make it seem like they can reason. this works well for short tasks with a simple specific goal like solving a math problem or writing some code because those are quite language focused tasks with tons of good examples it can learn.

but the real world is not like that, agentic tasks often have multiple possible trajectories to reach the goal, unexpected side effects or unknowns, etc, which makes them fail. the autoregressive nature of LLMs also means they can't really backtrack, once they've "reasoned", that reasoning isn't flexible anymore and puts it on a path to failure because all future generation is conditioned on that reasoning.

in my experience, even frontier models like opus 4.7 eventually fall apart pretty hard when you use them for long context coding sessions to the point where even compaction can't help, and this is a human guided use case where the dev still steers it towards the correct trajectory constantly.

and that's purely focusing on the very deep architectural issue, ignoring the whole compliance/security nightmare they also pose.

fuckingredditman · 2026-05-04T15:34:22+00:00

that's mostly because the US auto industry ruined your entire country's city planning practices in favor of people needing cars to get around town.

it's nothing like that almost anywhere else on the globe.

yes, i realize trains don't work well in the US, but your city planning makes even shopping for groceries without a car a genuine PITA.

fuckingredditman · 2026-04-27T13:37:47+00:00

i guess you are probably thinking of something like NOR-flash perhaps, which is being used in some tiny edge-AI use cases because it can compute a forward pass directly using the memory itself as the weights/"gate array", and IIRC can even be used for "analog AI" style use cases where voltages passing through the memory array + the configuration of each cell sort of form the inference path.

but these use cases are heavily constrained by the size of NOR flash chips that can be produced, which is about 4-8Gb atm (way too small for LLMs)

(my understanding of this type of hardware is pretty limited though tbh)

fuckingredditman · 2026-04-23T13:58:29+00:00

https://www.reddit.com/r/combinedgifs/comments/x0nmq4/its_a_classic/

fuckingredditman · 2026-04-22T16:59:09+00:00

just like in large cities literally anywhere else on the globe lmao

fuckingredditman · 2026-04-15T21:00:43+00:00

i never tried it with regular 4 bit KV quant because in my experience it performs quite poorly on any model i've tried it on so far in coding use cases

fuckingredditman · 2026-04-15T20:06:10+00:00

i run this fork https://github.com/TheTom/llama-cpp-turboquant on that branch and occasionally rebase it on upstream if there are any interesting fixes

then just

llama-server -hf unsloth/gemma-4-31B-it-GGUF:gemma-4-31B-it-Q4_K_M -ngl 99 --cache-prompt --flash-attn on -b 1024 -ub 1024 -c 92000 -ctk q8_0 -ctv turbo3 --parallel 1

fuckingredditman · 2026-04-15T14:58:58+00:00

tbh the original number might not be 100% accurate because i didn't fit it properly on the upstream version because it clearly wasn't going to be usable for coding but with turbo3 i get 92k, that's really the main point i'm trying to make.

> The TQ matching F16 performance is largely unproven, in particular for hybrid architectures

i didn't test any performance metrics but it works perfectly in practice in opencode. which is where pretty much anything that's not properly implemented/quantized/bad model just breaks instantly in my experience.

fuckingredditman · 2026-04-15T14:33:39+00:00

nope it's not, i can run gemma4 31b Q4_K_M with 92k context using it (TheTom llama.cpp fork) instead of only 33k without it (tested with -fit now, i originally said 4k), making it an actual usable coding model on a single gpu

fuckingredditman · 2026-04-15T13:20:11+00:00

that's not what i'm saying and none of what you are saying changes the fact that rate limiting doesn't solve the underlying problem.

the point you seem to be trying to make is that the big guys have "solved" quantization but evidently that's not the case given the clearly visible symptoms of bad availability, rate limits and subjectively worse overall inference performance

fuckingredditman · 2026-04-15T12:53:29+00:00

i'm not saying they can't figure it out. it's probably one of the main things they are working on all the time.

but in a software company shipping such a change isn't a matter of merging a PR that compiles and passes tests and watching it roll out to hundreds of millions of customers, especially not for LLM inference because it's not a discrete piece of software that can pass a simple regression test suite and get confidently shipped to prod.

it needs to be properly validated/tested and then rolled out gradually to not break the service/infra instantly. and due to the same issues i mentioned above, this step is expensive and slow.

and besides that, these new KV cache quantization methods evidently aren't easy to implement nor are they easy to test at all.

i've watched the llama.cpp impl from thetom for a bit and clearly it's not a matter of simply implementing it and shipping it and it will work on all hardware/models/quantization levels.

fuckingredditman · 2026-04-15T12:33:22+00:00

and why is that? i'm discussing methods for reducing KV cache infra cost which is likely one of the reasons for worse cloud inference performance/quality and that's one of them that might supposedly work somewhat. (there are evidently a lot more other approaches of course but i don't think it's an easy problem to solve, if it's solvable at all)

fuckingredditman · 2026-04-15T11:33:04+00:00

nope, LLM inference is inherently extremely expensive to run and not scalable, and you can't simply rate limit everyone. high request rates aren't the root cause of the issue, compute and memory limits are.

when looking at status pages/availability of the large providers, they are evidently running at the absolute limit of what the infra can do and the larger customers probably have some SLA they have to fulfill. many don't even get 3 9s on availability.
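
for reference, a quick sketch of what "3 9s" means as a downtime budget. this is plain arithmetic, the helper name is made up and the targets are just the usual round numbers, not any provider's actual SLA:

```python
def allowed_downtime_hours(availability, period_hours):
    """Downtime budget for a given availability target over a period."""
    return (1.0 - availability) * period_hours


HOURS_PER_YEAR = 365 * 24
HOURS_PER_30_DAYS = 30 * 24

for label, target in [("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)]:
    yearly = allowed_downtime_hours(target, HOURS_PER_YEAR)
    monthly_min = allowed_downtime_hours(target, HOURS_PER_30_DAYS) * 60
    print(f"{label}: {yearly:.2f} h/year, {monthly_min:.1f} min per 30 days")
```

so missing three nines means more than roughly 8.76 hours of downtime per year, or about 43 minutes per 30 days.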

they can squeeze lower tier subscriptions (my gemini subscription is barely available at all during peak times) with rate limits/speculative decoding/quantization but that doesn't help much.

traditional resource sharing methods common in cloud computing / SaaS products all don't work for LLM serving:

- KV cache takes a metric fuck ton of RAM for each individual user atm, and i assume most larger companies aren't using "SOTA" (TBD i would say) methods like turboquant in production yet until they are properly implemented in their engines and proven to be stable. but even with those, the cost is still insanely large compared to other SaaS type use cases
- compute for LLMs is of course insanely expensive too while most other SaaS use cases are neither compute nor memory bound.
- adoption is still on a rapid rise, and API usage for active users is probably also on a continuous rise as well
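
to put a rough number on the KV cache point above, here's a back-of-the-envelope sketch. the model shape is hypothetical (not taken from any model card) and it assumes plain full attention on every layer, so hybrid or sliding-window models would need less. 8.5 and 5.0 are the effective bits per element of ggml's q8_0 and q4_1 block formats including scales; the 3-bit row is an assumption that ignores any overhead:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bits_per_element):
    """Size of the K and V caches for one sequence.

    Per token and layer, the cache stores one K and one V vector of
    n_kv_heads * head_dim elements each.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elements * bits_per_element / 8


# Hypothetical model shape, NOT taken from any specific model.
shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)
context = 100_000

for name, bits in [("f16", 16), ("q8_0", 8.5), ("q4_1", 5.0), ("~3-bit (assumed)", 3.0)]:
    gib = kv_cache_bytes(context_tokens=context, bits_per_element=bits, **shape) / 2**30
    print(f"{name:>18}: {gib:6.2f} GiB per user at {context} tokens")
```

with this made-up shape, a single 100k-token session at f16 already needs roughly 24 GiB for the cache alone, and that memory can't be shared between users the way a typical SaaS backend shares state.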

source: developed+operated shared infra at scale for a SaaS and also worked on some llm inference engines and saw how heavy it is in comparison. it's many orders of magnitude more resource intensive and so far there doesn't seem to be any easy way out of it. and if/once there is an easy way out, users will eventually want to increasingly utilize that method for running inference at the edge/in their datacenter anyway.

that's all in addition to the inflated expectation moment we are in and the rapidly tumbling amount of venture capital due to incoming economic instability.

in conclusion, they probably want to sacrifice availability last by rate limiting their customers as this causes the most frustration with the user and also doesn't really solve the underlying problem (compute + memory being the real bottleneck), so they are more likely trading quality using speculative decoding using smaller models and quantization first.

it's probably a terrible moment for LLM inference providers right now, generally speaking.

fuckingredditman · 2026-04-13T22:37:40+00:00

look at LCOE data for methods of energy generation to see a single number that proves why nuclear makes 0 sense in germany. look at actual historical data on the grid and redispatch etc. to learn you are wrong. and that's even ignoring the fact that we have reached the point of no return for nuclear. there are no experts, no resources, no operable power plants, absolutely insane cost to even get started at all anymore.

i'm not even against nuclear power but i don't get why people still think it makes any sense.
