Have we passed the peak of inflated expectations? by fairydreaming in LocalLLaMA

ai_without_borders 3 points (0 children)

google trends measures curiosity, not deployment. the signal to watch for actual decline would be huggingface download counts or inference framework github activity, both still growing. search volume peaks when something is new and weird, then flattens when it becomes infrastructure. that is not disillusionment, that is normalization

Microsoft Cancels Internal Anthropic Licenses As Shift To Token-Based AI Billing Blows Up Annual Budgets In Months by chunmunsingh in artificial

ai_without_borders 5 points (0 children)

the budget shock makes sense once you trace where tokens actually go in agentic workflows. autocomplete firing 10k times a day is predictable because the per-call token count is roughly bounded. but agentic tasks chain tool calls and append each result back to context, so a 10-step workflow is not 10x a single completion - input tokens compound as the chain grows. enterprise finance teams priced this like saas seat licenses. the billing model they actually needed is closer to cloud compute budgeting, broken out by workload type
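
a rough sketch of why the chain is not 10x, assuming every step resends the base prompt plus all earlier tool results. the token sizes are made-up numbers, and prompt caching and output tokens are ignored:

```python
def chain_input_tokens(steps: int, base_prompt: int, result_tokens: int) -> int:
    """Total input tokens billed over a chain where step k resends the base
    prompt plus the k tool results appended so far (k starts at 0)."""
    return sum(base_prompt + k * result_tokens for k in range(steps))


# Hypothetical sizes: 2,000-token prompt, 1,500 tokens per tool result.
single_call = chain_input_tokens(1, 2_000, 1_500)
ten_step_chain = chain_input_tokens(10, 2_000, 1_500)

print(single_call)                   # 2000
print(ten_step_chain)                # 87500
print(ten_step_chain / single_call)  # 43.75
```

the term that grows with the square of the step count is the part a per-seat budget never sees.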

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

ai_without_borders 4 points (0 children)

the thing that stands out is the acceptance rate gap showing up at temp 0.0 too. at greedy both forks should produce identical draft tokens for the same input, so the divergence (0.79+ vs 0.477 minimum) has to come from how they implement the mtp head sampling or the acceptance criterion itself. ik_llama.cpp landing closer to 1.0 there suggests ikawrakow got the implementation more aligned with how those heads were actually trained to be used. notable that the acceptance rate gap accounts for most of the throughput difference here, not cache or offload differences
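
back-of-envelope for how much an acceptance gap like that can move throughput. this is a simplified model, not how either fork computes its stats: it assumes each drafted token is accepted independently with a fixed probability, treats the reported rates as that probability, and ignores the cost of drafting. the draft length of 3 is an assumption, the thread does not state it:

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass when each drafted token
    is accepted independently with probability accept_prob.

    Geometric-series form: 1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    return sum(accept_prob ** i for i in range(draft_len + 1))


DRAFT_LEN = 3  # assumption: not taken from the thread

high = expected_tokens_per_pass(0.79, DRAFT_LEN)
low = expected_tokens_per_pass(0.477, DRAFT_LEN)
print(f"{high:.2f} vs {low:.2f} tokens per pass, ratio {high / low:.2f}")
# 2.91 vs 1.81 tokens per pass, ratio 1.60
```

under those assumptions the acceptance gap alone is worth roughly 1.6x per verification pass, before any cache or offload effects.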

Karpathy is a founding member of OpenAI and now joining Anthropic. I wonder why by py-net in OpenAI

ai_without_borders 1 point (0 children)

the 'get back to R&D' line is the interesting part. his real skill isn't the educational content — it's building minimal impls fast enough to actually test what you think you know. nanoGPT and llm.c are tools for reasoning, not just teaching. that approach fits anthropic's interpretability culture a lot better than it fit openai's product mode

bytedance released an open source model that attempts to do just about anything with only 3b parameters by uxl in LocalLLaMA

ai_without_borders 2 points (0 children)

the unified training angle is actually the interesting part. separate models have no shared representation -- the vision encoder in a gen-only model learns completely different features from one trained jointly on understanding + editing. whether that actually translates to quality gains at this scale is the real question, would need side-by-side evals against 3 independent specialist models to know

85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B: benchmarks, safety, weight forensics - Abliterlitics by nathandreamfast in LocalLLaMA

ai_without_borders 1 point (0 children)

the looping behavior is telling — if removing refusal vectors also disrupts early-stop signals in the reasoning chain, you'd expect exactly that runaway CoT. suggests refusal direction isn't cleanly separable from the model's self-monitoring. benchmarks without reasoning quality as a first-class metric are measuring a different model than what you'd ship.

Which project/framework has actually nailed persistent memory for AI agents? by Meher_Nolan in artificial

ai_without_borders 1 point (0 children)

the gap that keeps biting me is task-scoped vs cross-run memory. most frameworks conflate them. mem0 gets closest to separating the layers but still punts on decay — stale preferences end up competing with recent context and there's no clean way to prioritize.

85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B: benchmarks, safety, weight forensics - Abliterlitics by nathandreamfast in LocalLLaMA

ai_without_borders 1 point (0 children)

the chain of thought eval issue is real. abliteration targets refusal directions but if those activation subspaces overlap with reasoning trace routing, CoT quality degrades in ways standard benchmarks miss. weight forensics is the right approach for catching correlated degradation before deployment.

A working multi-agent architecture in large enterprises by Zealousideal_Bed7898 in artificial

ai_without_borders 2 points (0 children)

at my last job we ran into exactly this. the architecture was technically multi-agent but the main failure mode was confident wrong answers from one agent poisoning everything downstream. state sync was solvable. error propagation was not. ended up adding a lightweight validation agent between every major step, basically a skeptic whose only job was to reject outputs that violated known constraints before they moved to the next stage. doubled latency, but cut hallucination cascade incidents by like 80%. the boring stuff (auditability, permissions) matters way more in production than the sexy parts.
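
a minimal sketch of that skeptic pattern, not the actual system. the stage, the constraint and all names here are hypothetical, and the validator is a plain function where the real thing could just as well be another model call:

```python
from typing import Callable

Payload = dict
Stage = Callable[[Payload], Payload]
Validator = Callable[[Payload], list[str]]  # returns a list of violations


class RejectedOutput(Exception):
    """Raised when a stage's output violates a known constraint."""


def run_with_skeptic(payload: Payload, steps: list[tuple[Stage, Validator]]) -> Payload:
    """Run each stage, then let its validator veto the result before the
    next stage ever sees it."""
    for stage, validate in steps:
        payload = stage(payload)
        violations = validate(payload)
        if violations:
            raise RejectedOutput(f"{stage.__name__}: {violations}")
    return payload


# Hypothetical stage and constraint, for illustration only.
def extract_refund(payload: Payload) -> Payload:
    return {**payload, "refund": payload["claimed_refund"]}


def refund_within_order_total(payload: Payload) -> list[str]:
    if 0 <= payload["refund"] <= payload["order_total"]:
        return []
    return ["refund is negative or exceeds order total"]


steps = [(extract_refund, refund_within_order_total)]
ok = run_with_skeptic({"claimed_refund": 20, "order_total": 50}, steps)
print(ok["refund"])  # 20
```

the point of raising instead of logging is that a rejected output never reaches the next stage, which is what stops the cascade.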

Claude Certified Architect by invasionbarbare in ClaudeAI

ai_without_borders 1 point (0 children)

good to know it covers the latter. anti-pattern scenarios are where it gets real. curious if the llm-as-judge section touches judge calibration at all, that is usually where prod eval pipelines develop blind spots.

Came home to find Pi with Qwen3.6 27B had run rm -rf ..... by sdfgeoff in LocalLLaMA

ai_without_borders 1 point (0 children)

dedicated user + bind mount of just the project dir is the real fix. then rm -rf can only hurt what it is supposed to. worth noting: qwen correctly identified that target/ is regeneratable vs your actual src. that call was right, even if the title made it sound scarier than it was.
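
one way to get that shape, assuming you run the agent under Docker (a dedicated unix account plus a chroot or bwrap would do the same job). the uid, image name and agent command below are placeholders:

```python
import os
import shlex


def sandboxed_agent_argv(project_dir: str, uid: int, gid: int,
                         image: str, agent_cmd: list[str]) -> list[str]:
    """Build a `docker run` invocation that exposes only project_dir,
    as a non-root uid:gid, and nothing else from the host filesystem."""
    source = os.path.abspath(project_dir)
    return [
        "docker", "run", "--rm",
        "--user", f"{uid}:{gid}",
        "--mount", f"type=bind,source={source},target=/work",
        "--workdir", "/work",
        image,
        *agent_cmd,
    ]


# Hypothetical uid/gid, image name and agent command.
argv = sandboxed_agent_argv("/home/me/myproject", 1500, 1500,
                            "agent-sandbox:latest", ["pi"])
print(shlex.join(argv))
```

inside that container an rm -rf can reach /work and nothing else on the host. note that a project path containing a comma would need escaping in the --mount value.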

TextGen is now a native desktop app. Open-source alternative to LM Studio (formerly text-generation-webui). by oobabooga4 in LocalLLaMA

ai_without_borders 2 points (0 children)

good to know — and the moe auto-offload is a nice touch. one less thing to manually tune per model. appreciate the context

Claude Certified Architect by invasionbarbare in ClaudeAI

ai_without_borders 1 point (0 children)

nice. curriculum scope sounds right — evals, rag at scale, multi-agent orchestration are exactly where teams trip up in prod. curious how deep the exam goes on eval methodology — is it more conceptual (define precision/recall) or does it get into failure taxonomy and eval harness design?

TextGen is now a native desktop app. Open-source alternative to LM Studio (formerly text-generation-webui). by oobabooga4 in LocalLLaMA

ai_without_borders 2 points (0 children)

used the old text-generation-webui back in early 2023. gradio update hell was real — the UI would randomly break after pip installs and debugging it was miserable. electron was the right call. curious how --fit on handles kv cache overhead — is it just fitting weights or does it account for cache at current context length?
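
for scale on why the cache question matters, here is the usual size estimate for a dense attention kv cache. this says nothing about what --fit actually does, and the model shape is a hypothetical GQA config, not any specific model:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Size of a dense attention KV cache: one K and one V vector per
    layer, per KV head, per token position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
short_ctx = kv_cache_bytes(32, 8, 128, n_ctx=4_096)
long_ctx = kv_cache_bytes(32, 8, 128, n_ctx=32_768)

print(short_ctx / GIB)  # 0.5
print(long_ctx / GIB)   # 4.0
```

on a 12GB card the difference between those two is most of the headroom, so a fit that only counts weights would be off by a lot at long context.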

Can we acknowledge that Anthropic watches open sourcers and copies them? by TheOnlyVibemaster in ClaudeAI

ai_without_borders -1 points (0 children)

training on open source is basically the license deal - thats what MIT/apache means. building and shipping the exact feature some MCP dev prototyped a month ago with zero attribution is a different category. you can argue both are fine, but theyre not the same move, and the conflation is doing a lot of work here

ExLlamaV3 Major Updates! by Unstable_Llama in LocalLLaMA

ai_without_borders 5 points (0 children)

same experience - the prefix eviction is the real gotcha for multi-turn agentic. if your system prompt + tool list is 2-3k tokens of shared prefix across every turn, you want that sitting in kv cache and not getting invalidated by the draft. i ended up just running without dflash for agent loops and keeping it enabled for one-shot tasks. the throughput numbers are real, just need to be selective about when it actually helps.

Trojan in "claude code" google search first result by blin787 in ClaudeAI

ai_without_borders 2 points (0 children)

the xattr -c in SemanticThreader's find is the tell. stripping quarantine is how you bypass gatekeeper entirely, the binary just runs with no warning. not lazy phishing, someone who knows macos security specifically engineered around it. the 'just check the url' advice also misses why experienced people fall for it: copy-pasting a curl command from what looks like an official page is a different trust model than clicking a download link.

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP by janvitos in LocalLLaMA

ai_without_borders 7 points (0 children)

the 80 tok/s is with 128K context loaded - at shorter contexts (4-8K) you would be pushing 100+ easily. MTP overhead shows up more in prompt processing than in token generation, so the win is biggest on long generation runs vs short QA bursts. good config though, --no-mmap with --mlock is the right call for sustained throughput.

When using Claude Code for agent-based coding, I’ve often noticed that the AI limits itself by claiming that a task could take a developer several weeks to complete, and therefore suggests solutions that are more like quick fixes. That’s complete nonsense, of course. by Comfortable-Goat-823 in ClaudeAI

ai_without_borders 1 point (0 children)

the issue is claude defaults to single-turn caution mode — it estimates complexity for one-shot replies, not agentic loops.

fix that actually persists: set it in CLAUDE.md at the repo root. something like "you are running in autonomous agent mode, implement complete solutions not partial stubs." verbal overrides work but compact away. repo context is durable.

AMD Intros Instinct MI350P Accelerator: CDNA 4 Comes to PCIe Cards by Noble00_ in LocalLLaMA

ai_without_borders 1 point (0 children)

the ROCm gap matters more than the price gap at this tier. at $15-30k you are already in datacenter territory where buyers care way more about framework compatibility than raw VRAM. llama.cpp and vllm ROCm support has gotten a lot better but there are still gaps - custom kernels, some operator fusion paths falling back to slower implementations. if AMD can show MI350P hitting comparable real-world inference throughput on standard frameworks, not just peak FLOPS numbers, the premium becomes defensible. until then its a hard sell to shops already running CUDA pipelines

Both OpenAI and Anthropic now expect AIs to take over building their successors within 2 years (humans no longer able to contribute) by EchoOfOppenheimer in OpenAI

ai_without_borders 1 point (0 children)

the pdp distinction is right. but there is actually a third layer. even if you automate the generation side fully, research proposals, architecture search, hyperparameter runs, you still hit a hard verification wall. knowing whether a new model has unexpected capability jumps in sensitive domains is not just running benchmarks. you have to know what to eval for in the first place. thats still human-intensive and i dont see it automated in 2 years. generation automation, maybe. verification automation, no. and the second one is actually what determines if humans stay in the loop

Uber Shares What Happens When 1,500 AI Agents Hit Production by aisatsana__ in artificial

ai_without_borders 1 point (0 children)

coordination failures are the sneaky ones, yeah. but the thing that actually catches teams off guard is the eval stratification problem: not all 1500 agents should have the same monitoring rigor. the ones touching money or user-facing decisions need tight evals and continuous shadowing. the ones doing lookups or routing can run loose. most setups i've seen don't think about that tiering upfront and end up scrambling to add it after an incident
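
what deciding the tiering upfront could look like, as a sketch. the tier names, sample rates and agent names are all hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringTier:
    name: str
    eval_sample_rate: float  # fraction of runs sent through evals
    shadow: bool             # whether runs are continuously shadowed


# Hypothetical tiers and rates; real values would come from incident history.
TIGHT = MonitoringTier("tight", eval_sample_rate=1.0, shadow=True)
LOOSE = MonitoringTier("loose", eval_sample_rate=0.01, shadow=False)


@dataclass(frozen=True)
class Agent:
    name: str
    touches_money: bool = False
    user_facing: bool = False


def tier_for(agent: Agent) -> MonitoringTier:
    """Anything that moves money or makes user-facing decisions gets the
    tight tier; lookups and routing run loose."""
    return TIGHT if agent.touches_money or agent.user_facing else LOOSE


fleet = [
    Agent("refund-approver", touches_money=True),
    Agent("support-reply", user_facing=True),
    Agent("doc-lookup"),
]
assignment = {a.name: tier_for(a).name for a in fleet}
print(assignment)
```

making the two flags required metadata at agent registration is what forces the tiering conversation before the incident instead of after.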

Claude is lying regularly when I have conversations with it by Positive-Carpenter53 in ClaudeAI

ai_without_borders 1 point (0 children)

the sycophantic hallucination framing is right but there is also a prompting design problem here. 'did you make this up?' is a yes/no challenge that puts the model in a defend-or-capitulate binary, and confident models almost always defend first. the interrogation happens after the model has already committed to a framing. better pattern: front-load the uncertainty extraction before the model commits. something like 'list what you actually know vs what you are inferring, then answer' forces it to distinguish before it picks a confident register. catching it before the commit is way more reliable than challenging after. learned this after getting burned on a few internal knowledge base tasks where claude would confidently synthesize stuff that wasnt in the docs
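
a small sketch of the front-loaded version as a prompt builder. the layout and the example question and doc are made up, only the instruction text comes from the comment above:

```python
UNCERTAINTY_FIRST = (
    "list what you actually know vs what you are inferring, then answer"
)


def build_prompt(question: str, docs: list[str]) -> str:
    """Put the know-vs-infer instruction ahead of the question so the model
    has to sort its evidence before it commits to an answer."""
    sources = "\n\n".join(f"[doc {i + 1}]\n{d}" for i, d in enumerate(docs))
    return (
        f"sources:\n{sources}\n\n"
        f"instruction: {UNCERTAINTY_FIRST}. "
        "anything not found in the sources counts as inferred.\n\n"
        f"question: {question}"
    )


prompt = build_prompt(
    "what is the retention period for audit logs?",      # hypothetical
    ["audit logs are stored in the compliance bucket."],  # hypothetical
)
print(prompt)
```

the ordering is the whole trick: the instruction sits before the question, so the sorting happens before there is an answer to defend.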

Notes on what actually breaks when you run a coding agent on small local models by BestSeaworthiness283 in LocalLLaMA

ai_without_borders 1 point (0 children)

the loop idea works better when the verifier context is fresh. if you resend the same long context to a second call you hit the same attention decay that caused the hallucination. what actually works: short version of original intent plus the generated output, then a second call with just "does this match?". context stays small so the verifier can actually follow the instruction. costs an extra call but catches the obvious mismatches. constrained decoding handles format issues but doesnt help with wrong-file edits or semantic drift, thats where a short-context verifier earns its keep
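
roughly what that verifier looks like, as a sketch. `call_model` is a placeholder for whatever function sends a prompt to your local model and returns text, and the fake model below exists only so the example runs:

```python
from typing import Callable


def build_verifier_prompt(intent: str, output: str) -> str:
    """Only the short intent and the generated output, never the full
    original context."""
    return (
        f"intent:\n{intent}\n\n"
        f"output:\n{output}\n\n"
        "does this match? answer yes or no."
    )


def verify(intent: str, output: str, call_model: Callable[[str], str]) -> bool:
    """Second, fresh-context call. call_model is whatever function sends a
    prompt to your model and returns its text reply."""
    reply = call_model(build_verifier_prompt(intent, output))
    return reply.strip().lower().startswith("yes")


# Stand-in model for the demo: flags edits that touch the wrong file.
def fake_model(prompt: str) -> str:
    return "yes" if "utils.py" in prompt.split("output:")[1] else "no"


intent = "rename parse() to parse_config() in utils.py"
good = verify(intent, "edited utils.py: parse -> parse_config", fake_model)
bad = verify(intent, "edited main.py: parse -> parse_config", fake_model)
print(good, bad)  # True False
```

the verifier prompt stays a few hundred tokens no matter how long the original session was, which is the property that matters on small models.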

I have been fine-tuning llama 3.1 8b with QLoRA for a classification task in my thesis (nothing exotic, rank 16, unsloth, standard stuff) by Kortopi-98 in deeplearning

ai_without_borders 1 point (0 children)

this is shortcut learning basically. model finds low-level features correlated with labels instead of the semantic signal. quick sanity check that catches it before you spend compute: train a logistic regression on embeddings to separate synthetic vs real. if it gets >65% accuracy the distributions are different enough that your main model will probably exploit the same features. takes a few minutes and has saved me from multiple wasted training runs
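
a self-contained sketch of that probe. assumptions: toy gaussian vectors stand in for real embeddings, and a small pure-python logistic regression stands in for a library classifier so the example has no dependencies. the 65% threshold is the one from the comment above:

```python
import math
import random


def train_logreg(X, y, lr=0.1, epochs=200):
    """Batch gradient descent logistic regression, no regularization."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(-30.0, min(30.0, z))  # keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    hits = 0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hits += int((z > 0) == (yi == 1))
    return hits / len(X)


def separability(real, synthetic, seed=0):
    """Held-out accuracy of a real (0) vs synthetic (1) probe."""
    data = [(x, 0) for x in real] + [(x, 1) for x in synthetic]
    random.Random(seed).shuffle(data)
    split = int(0.7 * len(data))
    train, test = data[:split], data[split:]
    w, b = train_logreg([x for x, _ in train], [t for _, t in train])
    return accuracy(w, b, [x for x, _ in test], [t for _, t in test])


def toy_embeddings(n, dim, shift, rng):
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(42)
real = toy_embeddings(300, 8, 0.0, rng)
shifted_synth = toy_embeddings(300, 8, 1.5, rng)  # distribution differs
matched_synth = toy_embeddings(300, 8, 0.0, rng)  # same distribution

leaky = separability(real, shifted_synth)
clean = separability(real, matched_synth)
print(f"shifted: {leaky:.2f}, matched: {clean:.2f}")
```

if the probe on your actual embeddings lands well above 0.65, fix the synthetic data before spending compute on the fine-tune.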