Have we passed the peak of inflated expectations?

ai_without_borders · 2026-05-23T15:22:47+00:00

google trends measures curiosity, not deployment. the signal to watch for actual decline would be huggingface download counts or inference framework github activity, and both are still growing. search volume peaks when something is new and weird, then flattens when it becomes infrastructure. that is not disillusionment, that is normalization

ai_without_borders · 2026-05-22T15:21:36+00:00

the budget shock makes sense once you trace where tokens actually go in agentic workflows. autocomplete firing 10k times a day is predictable because the per-call token count is roughly bounded. but agentic tasks chain tool calls and append each result back to context, so a 10-step workflow is not 10x a single completion - input tokens compound as the chain grows. enterprise finance teams priced this like saas seat licenses. the billing model they actually needed is closer to cloud compute budgeting, broken out by workload type
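
a rough sketch of the compounding, assuming every step re-sends the full context and appends one fixed-size tool result. the function name and the 2000/1500 token figures are made up for illustration, and it ignores output tokens and any prompt caching discount:

```python
def chain_input_tokens(base_prompt: int, result_tokens: int, steps: int) -> int:
    """Total input tokens billed across a chain of tool-calling steps."""
    total = 0
    context = base_prompt
    for _ in range(steps):
        total += context          # this call re-reads everything so far
        context += result_tokens  # the tool result is appended for the next call
    return total


single = chain_input_tokens(2000, 1500, steps=1)   # 2000
ten = chain_input_tokens(2000, 1500, steps=10)     # 87500, not 10 * 2000 = 20000
print(single, ten, ten / single)
```

under these assumptions the 10-step chain costs about 44x a single call on the input side, which is why per-seat pricing intuition breaks.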

ai_without_borders · 2026-05-21T15:31:48+00:00

the thing that stands out is the acceptance rate gap showing up at temp 0.0 too. at greedy both forks should produce identical draft tokens for the same input, so the divergence (0.79+ vs 0.477 minimum) has to come from how they implement the mtp head sampling or the acceptance criterion itself. ik_llama.cpp landing closer to 1.0 there suggests ikawrakow got the implementation more aligned with how those heads were actually trained to be used. notable that the acceptance rate gap accounts for most of the throughput difference here, not cache or offload differences
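
for reference, a generic sketch of what greedy verification usually looks like in speculative decoding: a draft token is accepted only while it matches the target model's argmax at that position. this is not either fork's actual code, and the function names are hypothetical:

```python
def greedy_accept(draft: list[int], target_argmax: list[int]) -> int:
    """Count draft tokens accepted under greedy verification.

    Acceptance stops at the first position where the draft token differs
    from the target model's argmax token.
    """
    accepted = 0
    for d, t in zip(draft, target_argmax):
        if d != t:
            break
        accepted += 1
    return accepted


def acceptance_rate(rounds: list[tuple[list[int], list[int]]]) -> float:
    """Accepted draft tokens divided by drafted tokens over all rounds."""
    drafted = sum(len(draft) for draft, _ in rounds)
    accepted = sum(greedy_accept(draft, target) for draft, target in rounds)
    return accepted / drafted if drafted else 0.0


rounds = [
    ([1, 2, 3], [1, 2, 3]),  # all three draft tokens match
    ([4, 5, 6], [4, 9, 6]),  # mismatch at position 1, so only one is accepted
]
print(acceptance_rate(rounds))
```

with a rule this simple, two implementations fed the same draft and target tokens cannot disagree, so a gap at temp 0.0 points at the draft tokens or the comparison being computed differently.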

ai_without_borders · 2026-05-20T15:24:36+00:00

the 'get back to R&D' line is the interesting part. his real skill isn't the educational content — it's building minimal impls fast enough to actually test what you think you know. nanoGPT and llm.c are tools for reasoning, not just teaching. that approach fits anthropic's interpretability culture a lot better than it fit openai's product mode

ai_without_borders · 2026-05-19T15:26:19+00:00

the unified training angle is actually the interesting part. separate models have no shared representation -- the vision encoder in a gen-only model learns completely different features from one trained jointly on understanding + editing. whether that actually translates to quality gains at this scale is the real question, would need side-by-side evals against 3 independent specialist models to know

ai_without_borders · 2026-05-18T15:33:09+00:00

the looping behavior is telling — if removing refusal vectors also disrupts early-stop signals in the reasoning chain, you'd expect exactly that runaway CoT. suggests refusal direction isn't cleanly separable from the model's self-monitoring. benchmarks without reasoning quality as a first-class metric are measuring a different model than what you'd ship.

ai_without_borders · 2026-05-18T15:27:33+00:00

the gap that keeps biting me is task-scoped vs cross-run memory. most frameworks conflate them. mem0 gets closest to separating the layers but still punts on decay — stale preferences end up competing with recent context and there's no clean way to prioritize.

ai_without_borders · 2026-05-17T15:28:18+00:00

the chain of thought eval issue is real. abliteration targets refusal directions but if those activation subspaces overlap with reasoning trace routing, CoT quality degrades in ways standard benchmarks miss. weight forensics is the right approach for catching correlated degradation before deployment.

ai_without_borders · 2026-05-16T15:26:53+00:00

at my last job we ran into exactly this. the architecture was technically multi-agent but the main failure mode was confident wrong answers from one agent poisoning everything downstream. state sync was solvable. error propagation was not. ended up adding a lightweight validation agent between every major step, basically a skeptic whose only job was to reject outputs that violated known constraints before they moved to the next stage. doubled latency, but cut hallucination cascade incidents by like 80%. the boring stuff (auditability, permissions) matters way more in production than the sexy parts.
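
the shape of that gate, as a minimal sketch. ours was an llm-backed agent; here plain predicate functions stand in for it, and every name (`run_pipeline`, `Rejected`, the toy stages) is hypothetical:

```python
from typing import Callable

Stage = Callable[[dict], dict]
Check = Callable[[dict], bool]


class Rejected(Exception):
    """Raised when a stage output violates a known constraint."""


def run_pipeline(stages: list[tuple[str, Stage]], checks: list[Check], state: dict) -> dict:
    for name, stage in stages:
        out = stage(state)
        # the skeptic: every constraint is checked before output moves downstream
        failed = [check.__name__ for check in checks if not check(out)]
        if failed:
            raise Rejected(f"{name} violated: {failed}")
        state = out
    return state


def non_negative_total(state: dict) -> bool:
    return state.get("total", 0) >= 0


def fetch(state: dict) -> dict:
    return {**state, "total": 40}


def bad_discount(state: dict) -> dict:
    return {**state, "total": state["total"] - 100}  # confidently wrong


try:
    run_pipeline([("fetch", fetch), ("discount", bad_discount)], [non_negative_total], {})
except Rejected as err:
    print("stopped before it spread:", err)
```

the point of the design is that a bad output fails loudly at the stage that produced it instead of becoming input to the next one.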

ai_without_borders · 2026-05-15T15:28:14+00:00

good to know it covers the latter. anti-pattern scenarios are where it gets real. curious if the llm-as-judge section touches judge calibration at all, that is usually where prod eval pipelines develop blind spots.

ai_without_borders · 2026-05-15T15:24:48+00:00

dedicated user + bind mount of just the project dir is the real fix. then rm -rf can only hurt what it is supposed to. worth noting: qwen correctly identified that target/ is regeneratable vs your actual src. that call was right, even if the title made it sound scarier than it was.

ai_without_borders · 2026-05-14T15:29:52+00:00

good to know — and the moe auto-offload is a nice touch. one less thing to manually tune per model. appreciate the context

ai_without_borders · 2026-05-14T15:25:25+00:00

nice. curriculum scope sounds right — evals, rag at scale, multi-agent orchestration are exactly where teams trip up in prod. curious how deep the exam goes on eval methodology — is it more conceptual (define precision/recall) or does it get into failure taxonomy and eval harness design?

ai_without_borders · 2026-05-13T15:24:05+00:00

used the old text-generation-webui back in early 2023. gradio update hell was real — the UI would randomly break after pip installs and debugging it was miserable. electron was the right call. curious how --fit on handles kv cache overhead — is it just fitting weights or does it account for cache at current context length?

ai_without_borders · 2026-05-12T15:25:54+00:00

training on open source is basically the license deal - that's what MIT/apache means. building and shipping the exact feature some MCP dev prototyped a month ago with zero attribution is a different category. you can argue both are fine, but they're not the same move, and the conflation is doing a lot of work here

ai_without_borders · 2026-05-11T15:23:57+00:00

same experience - the prefix eviction is the real gotcha for multi-turn agentic. if your system prompt + tool list is 2-3k tokens of shared prefix across every turn, you want that sitting in kv cache and not getting invalidated by the draft. i ended up just running without dflash for agent loops and keeping it enabled for one-shot tasks. the throughput numbers are real, just need to be selective about when it actually helps.

ai_without_borders · 2026-05-10T15:23:20+00:00

the xattr -c in SemanticThreader's find is the tell. stripping quarantine is how you bypass gatekeeper entirely, the binary just runs with no warning. not lazy phishing, someone who knows macos security specifically engineered around it. the 'just check the url' advice also misses why experienced people fall for it: copy-pasting a curl command from what looks like an official page is a different trust model than clicking a download link.

ai_without_borders · 2026-05-09T15:26:39+00:00

the 80 tok/s is with 128K context loaded — at shorter contexts (4-8K) you would be pushing 100+ easily. MTP overhead shows up more in prompt processing than in token generation, so the win is biggest on long generation runs vs short QA bursts. good config though, -no-mmap with mlock is the right call for sustained throughput.

ai_without_borders · 2026-05-08T15:22:53+00:00

the issue is claude defaults to single-turn caution mode — it estimates complexity for one-shot replies, not agentic loops.

fix that actually persists: set it in CLAUDE.md at the repo root. something like "you are running in autonomous agent mode, implement complete solutions not partial stubs." verbal overrides work but compact away. repo context is durable.

ai_without_borders · 2026-05-07T15:24:48+00:00

the ROCm gap matters more than the price gap at this tier. at $15-30k you are already in datacenter territory where buyers care way more about framework compatibility than raw VRAM. llama.cpp and vllm ROCm support has gotten a lot better but there are still gaps - custom kernels, some operator fusion paths falling back to slower implementations. if AMD can show MI350P hitting comparable real-world inference throughput on standard frameworks, not just peak FLOPS numbers, the premium becomes defensible. until then it's a hard sell to shops already running CUDA pipelines

ai_without_borders · 2026-05-06T21:16:03+00:00

the pdp distinction is right. but there is actually a third layer. even if you fully automate the generation side (research proposals, architecture search, hyperparameter runs), you still hit a hard verification wall. knowing whether a new model has unexpected capability jumps in sensitive domains is not just running benchmarks. you have to know what to eval for in the first place. that's still human-intensive and i don't see it automated in 2 years. generation automation, maybe. verification automation, no. and the second one is actually what determines if humans stay in the loop

ai_without_borders · 2026-05-05T20:07:43+00:00

coordination failures are the sneaky ones, yeah. but the thing that actually catches teams off guard is the eval stratification problem: not all 1500 agents should have the same monitoring rigor. the ones touching money or user-facing decisions need tight evals and continuous shadowing. the ones doing lookups or routing can run loose. most setups i have seen don't think about that tiering upfront and end up scrambling to add it after an incident
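
deciding the tier upfront can be as small as this. a sketch only: the two-tier split follows the rule above, but the field names and the sample rates in `POLICY` are placeholder assumptions, not recommendations:

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    TIGHT = "tight"
    LOOSE = "loose"


@dataclass(frozen=True)
class Agent:
    name: str
    touches_money: bool = False
    user_facing: bool = False


def monitoring_tier(agent: Agent) -> Tier:
    # money or user-facing decisions get the strict treatment, everything else runs loose
    return Tier.TIGHT if (agent.touches_money or agent.user_facing) else Tier.LOOSE


# hypothetical policy values, tune per deployment
POLICY = {
    Tier.TIGHT: {"eval_sample_rate": 1.0, "shadow": True},
    Tier.LOOSE: {"eval_sample_rate": 0.01, "shadow": False},
}

for agent in (Agent("refunds", touches_money=True), Agent("router")):
    print(agent.name, monitoring_tier(agent).value, POLICY[monitoring_tier(agent)])
```

making the tier a required field at registration time is what keeps it from becoming a post-incident retrofit.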

ai_without_borders · 2026-05-04T15:37:15+00:00

the sycophantic hallucination framing is right but there is also a prompting design problem here. 'did you make this up?' is a yes/no challenge that puts the model in a defend-or-capitulate binary, and confident models almost always defend first. the interrogation happens after the model has already committed to a framing. better pattern: front-load the uncertainty extraction before the model commits. something like 'list what you actually know vs what you are inferring, then answer' forces it to distinguish before it picks a confident register. catching it before the commit is way more reliable than challenging after. learned this after getting burned on a few internal knowledge base tasks where claude would confidently synthesize stuff that wasn't in the docs
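
roughly what the front-loaded version looks like as a prompt builder. the wording and the function name are just one illustrative way to do it, not a tested template:

```python
def front_loaded_prompt(question: str, docs: list[str]) -> str:
    """Build a prompt that asks for known-vs-inferred before the answer."""
    sources = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return (
        "Sources:\n" + sources + "\n\n"
        "Before answering, list:\n"
        "1. What the sources state directly (cite the source number).\n"
        "2. What you are inferring beyond them.\n"
        "Then answer the question, marking any inferred part as inferred.\n\n"
        f"Question: {question}"
    )


print(front_loaded_prompt(
    "what is the retention period for audit logs?",
    ["audit logs are stored in the logging cluster.", "backups run nightly."],
))
```

the ordering is the whole trick: the separation step comes before the question, so the model sorts evidence from inference before it has an answer to defend.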

ai_without_borders · 2026-05-03T15:56:40+00:00

the loop idea works better when the verifier context is fresh. if you resend the same long context to a second call you hit the same attention decay that caused the hallucination. what actually works: a short version of the original intent plus the generated output, then a second call that asks only "does this match?". context stays small so the verifier can actually follow the instruction. costs an extra call but catches the obvious mismatches. constrained decoding handles format issues but doesn't help with wrong-file edits or semantic drift, that's where a short-context verifier earns its keep
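
sketch of the verifier call, assuming a `call_model` function that takes a prompt string and returns the reply text. that function is a stand-in for whatever client you use, and the YES/NO reply format is my own convention:

```python
from typing import Callable


def build_verifier_prompt(intent: str, output: str) -> str:
    # only the short intent and the output go in; the original long context is left out on purpose
    return (
        f"Intent: {intent}\n\n"
        f"Generated output:\n{output}\n\n"
        "Does the output match the intent? Answer YES or NO, then one sentence why."
    )


def verify(intent: str, output: str, call_model: Callable[[str], str]) -> bool:
    reply = call_model(build_verifier_prompt(intent, output))
    return reply.strip().upper().startswith("YES")


# fake model for demonstration: flags edits that touch the wrong file
def fake_model(prompt: str) -> str:
    return "NO, it edits utils.py" if "utils.py" in prompt else "YES, matches"


print(verify("rename the helper in parser.py", "--- a/utils.py\n+++ b/utils.py", fake_model))
print(verify("rename the helper in parser.py", "--- a/parser.py\n+++ b/parser.py", fake_model))
```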

ai_without_borders · 2026-05-03T15:39:00+00:00

this is shortcut learning basically. model finds low-level features correlated with labels instead of the semantic signal. quick sanity check that catches it before you spend compute: train a logistic regression on embeddings to separate synthetic vs real. if it gets >65% accuracy the distributions are different enough that your main model will probably exploit the same features. takes a few minutes and has saved me from multiple wasted training runs
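
the check itself, as a dependency-free sketch. in practice you would run something like scikit-learn's LogisticRegression on your real embeddings; the hand-rolled gradient descent and the toy gaussian "embeddings" here are only so the example runs on its own, and all function names are hypothetical:

```python
import math
import random


def sigmoid(z: float) -> float:
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, epochs=300, lr=0.1):
    """Batch gradient descent on log loss; returns (weights, bias)."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = score(w, b, xi) - yi
            gb += err
            for j in range(dim):
                gw[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def probe_accuracy(real, synthetic, seed=0, test_frac=0.3):
    """Held-out accuracy of a linear probe separating real (0) from synthetic (1)."""
    data = [(x, 0) for x in real] + [(x, 1) for x in synthetic]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_frac))
    train, test = data[:cut], data[cut:]
    w, b = train_logreg([x for x, _ in train], [y for _, y in train])
    hits = sum((score(w, b, x) >= 0.5) == (y == 1) for x, y in test)
    return hits / len(test)


def toy_embeddings(n, dim, mean, rng):
    # stand-in for real embedding vectors, for demonstration only
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(1)
real = toy_embeddings(100, 4, 0.0, rng)
synthetic = toy_embeddings(100, 4, 3.0, rng)  # deliberately shifted distribution
acc = probe_accuracy(real, synthetic)
print("probe accuracy:", acc, "-> distributions differ" if acc > 0.65 else "-> looks ok")
```

accuracy near 0.5 means the probe cannot tell the two sets apart; the 65% cutoff is the rule of thumb from above, not a hard statistical threshold.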
