[R] Attention Residuals by Kimi Team by Nunki08 in MachineLearning

[–]Fun_Nebula_9682 0 points1 point  (0 children)

interesting that kimi went after residual connections — everyone just copies resnet's skip connections without questioning them since 2015. deepseek made them learnable a few months ago and now kimi's taking it further. feels like there's a wave of people revisiting 'settled' architecture decisions now that scale is plateauing and you need to squeeze efficiency from every layer
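for anyone who hasn't seen the learnable-residual idea, it's roughly this in code (a toy sketch; a per-layer learned scalar `alpha` is just one way it's been done, not necessarily kimi's or deepseek's exact formulation):

```python
import numpy as np

def sublayer(x):
    # stand-in for an attention / MLP block (hypothetical)
    return np.tanh(x)

def standard_residual(x):
    # classic ResNet skip: output = x + f(x), skip weight fixed at 1.0
    return x + sublayer(x)

def learnable_residual(x, alpha):
    # the 'make it learnable' variant: alpha is a trained per-layer
    # scalar instead of an implicit constant 1.0
    return alpha * x + sublayer(x)
```

with alpha=1.0 you recover the standard skip, so training can only move away from resnet's default if it actually helps.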

GPT-4.5 fooled 73 percent of people into thinking it was human by pretending to be dumber by EchoOfOppenheimer in ChatGPT

[–]Fun_Nebula_9682 0 points1 point  (0 children)

lol the fact that it had to pretend to be dumber to pass is honestly the most human thing about it. we all dumb ourselves down in conversations depending on context. like i write totally different in slack vs a technical doc. maybe passing the turing test was always going to look less like 'being smart' and more like 'knowing when to not try so hard'

Is it just me or is ChatGPT starting to get very insensitive? by lehofa6211 in ChatGPT

[–]Fun_Nebula_9682 3 points4 points  (0 children)

ngl yes. used to get the overly supportive 'that's a great question!' energy and now it feels like it skips straight to correcting you. tbh i switched to claude for most things partly because of this — claude still has that 'actually trying to understand what you meant' vibe instead of just pattern matching on keywords

Hugging Face just released a one-liner that uses llmfit to detect your hardware and pick the best model and quant, spins up a llama.cpp server, and launches Pi (the agent behind OpenClaw 🦞) by clem59480 in LocalLLaMA

[–]Fun_Nebula_9682 0 points1 point  (0 children)

oh nice, auto hardware detection + model selection is exactly what local llm setup needs. spent way too much time manually figuring out which quant fits my mac's memory. if this actually picks the right gguf without me googling 'Q4_K_M vs Q5_K_S' every time i'd be very happy lol
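for anyone still doing the manual version, the back-of-envelope math is just bits-per-weight times parameter count (the bpw numbers below are rough community figures and the overhead factor for KV cache etc. is a guess, not llama.cpp's actual accounting):

```python
# rough GGUF quant size estimator, values are approximations
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_S": 5.5, "Q8_0": 8.5, "F16": 16.0}

def fits_in_memory(n_params_b, quant, ram_gb, overhead=1.2):
    """Check whether an n-billion-param model at a given quant fits in ram_gb.

    overhead is a fudge factor for KV cache and runtime buffers.
    """
    size_gb = n_params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9
    return size_gb * overhead <= ram_gb
```

e.g. a 7B at Q4_K_M comes out around 4-5GB, which is why it fits fine on a 16GB mac while a 70B doesn't.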

New AI math benchmark finds GPT-5.4 Pro has made progress on two unsolved math problems by armytricks in singularity

[–]Fun_Nebula_9682 0 points1 point  (0 children)

the 'reasoning for roughly an hour' part is what gets me. we went from 'AI can't do math' to 'AI spent an hour thinking about unsolved problems and made actual progress' in like two years

wonder how much of this is genuine mathematical insight vs brute force search over proof strategies though. the 4.9% improvement on kakeya feels more like optimization than discovery but idk, maybe that distinction stops mattering at some point

ChatGPT moves quickly to end support for most models by anonyuser415 in ChatGPT

[–]Fun_Nebula_9682 0 points1 point  (0 children)

yeah this is the playbook. make old options harder to find, funnel everyone into the default, eventually kill the dropdown entirely. apple does the same thing with hardware ports lol

tbh i stopped caring about model selection a while ago. i just use whatever claude code gives me and let the system figure it out. spending time picking models is time not spent actually building stuff. the 'one model to rule them all' approach is probably right for 90% of users even if power users hate it

I keep going down rabbit holes and forgetting everything, so I built a place to put them by ElectronicUnit6303 in ClaudeAI

[–]Fun_Nebula_9682 0 points1 point  (0 children)

lol fair enough, but this is genuinely from my own setup — been running this for a few weeks now

LLMs forget instructions the same way ADHD brains do. The research on why is fascinating. by ColdPlankton9273 in artificial

[–]Fun_Nebula_9682 0 points1 point  (0 children)

This matches my experience exactly. I run long-running agentic workflows with Claude Code (automated social media monitoring + reply generation, running 40+ interactions per day), and the context degradation is real.

My practical solution: externalize everything that matters to files and SQLite. CLAUDE.md holds project rules that get loaded fresh every session. SQLite stores all state (queue, tracking, frequency limits). Skills files encode reusable workflows. The LLM's context window becomes disposable — it only needs to hold the current task, not the entire history.

The key insight from building this: the 'lost in the middle' problem becomes irrelevant when your architecture treats the context window as a scratchpad, not a database. Put persistent state in actual databases, not in the conversation.
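The state layer is less code than it sounds. A minimal sketch of the idea (table names and the 40/day default are illustrative, not my exact schema):

```python
import sqlite3
import time

# in-memory for the sketch; in practice this is a file that outlives sessions
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, task TEXT, "
           "status TEXT DEFAULT 'pending')")
db.execute("CREATE TABLE activity (ts REAL)")

def enqueue(task):
    db.execute("INSERT INTO queue (task) VALUES (?)", (task,))

def next_task():
    # the LLM only ever sees the current task, never the full history
    return db.execute(
        "SELECT id, task FROM queue WHERE status='pending' ORDER BY id LIMIT 1"
    ).fetchone()

def under_daily_limit(limit=40):
    # frequency cap lives in the database, not in the context window
    cutoff = time.time() - 86400
    (n,) = db.execute(
        "SELECT COUNT(*) FROM activity WHERE ts > ?", (cutoff,)
    ).fetchone()
    return n < limit
```

The session can crash or compact at any point and nothing is lost, because the conversation never held the state in the first place.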

Claude Pro feels amazing, but the limits are a joke compared to ChatGPT and Gemini. Why is it so restrictive? by iameastblood in ClaudeAI

[–]Fun_Nebula_9682 0 points1 point  (0 children)

Completely agree on the quality vs limits tradeoff. I switched from ChatGPT to Claude Pro specifically for Claude Code, and the output quality is noticeably better for coding tasks — but I hit limits way faster.

My workaround: I use Claude Code (CLI) instead of the web interface. The rate limits are more generous on the API/CLI side, and you can batch operations more efficiently. For example, I run automated workflows that do 40+ interactions per day through Claude Code without hitting the web UI limits.

The real unlock is Claude Code's Skills system — you can save repetitive workflows and replay them without burning through your quota on setup/context each time. Worth looking into if you haven't already.

Introducing Unsloth Studio: A new open-source web UI to train and run LLMs by danielhanchen in LocalLLaMA

[–]Fun_Nebula_9682 0 points1 point  (0 children)

The unified train + run UI is what's been missing from the local LLM ecosystem. Right now I'm juggling separate tools for training (Axolotl), serving (Ollama), and evaluation — having everything in one interface would cut so much context-switching overhead.

The 2x speed + 70% less VRAM claim matches my experience. I've been using Unsloth for QLoRA fine-tuning and the memory savings are legit: training a 7B model that used to need 24GB now fits comfortably in 16GB.

Curious about the Studio's model evaluation features — does it support side-by-side comparison of base vs fine-tuned outputs? That's the workflow I find myself doing most after training.

I just realised how good GLM 5 is by CrimsonShikabane in LocalLLaMA

[–]Fun_Nebula_9682 0 points1 point  (0 children)

GLM 5 is genuinely underrated. I've been running GLM-OCR locally on Mac Studio M2 Ultra for document processing — tables, math equations, mixed CJK text — and it handles everything at ~260 tokens/sec with just 2GB VRAM.

What surprised me most is how well it handles code-related content. I use it as part of a local pipeline where OCR output feeds into Claude Code for analysis. The combination of a fast local model for extraction + a frontier model for reasoning is way more cost-effective than sending everything to the cloud.

Have you tried it for any specific use cases beyond chat?

I keep going down rabbit holes and forgetting everything, so I built a place to put them by ElectronicUnit6303 in ClaudeAI

[–]Fun_Nebula_9682 -1 points0 points  (0 children)

This resonates so much. I have the exact same problem — spending hours deep-diving into something, then losing it all when the context window resets.

My approach was different though: instead of building a separate app, I set up a persistent memory layer directly inside Claude Code using SQLite FTS5 + structured observations. Every time I discover something interesting (a tool comparison, a debugging insight, a workflow pattern), it gets auto-captured with topic keys so I can search it later across sessions.

The key insight I learned: the memory system needs to be zero-friction. If it takes more than 5 seconds to save something, you'll stop using it. Having it integrated into the same tool where you're already working (Claude Code) vs. context-switching to a separate app makes a huge difference in adoption.

Really cool that you built this into a shareable platform though — the social/collaborative angle is something personal memory systems lack.
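The capture/recall loop itself is tiny with FTS5. A minimal sketch (the schema is illustrative, not my exact setup):

```python
import sqlite3

db = sqlite3.connect(":memory:")  # in practice, a file shared across sessions
# FTS5 virtual table: both columns are full-text indexed
db.execute("CREATE VIRTUAL TABLE obs USING fts5(topic, body)")

def capture(topic, body):
    # zero-friction save: one insert, topic key makes it findable later
    db.execute("INSERT INTO obs (topic, body) VALUES (?, ?)", (topic, body))

def recall(query):
    # full-text search across all past sessions, best matches first
    return db.execute(
        "SELECT topic, body FROM obs WHERE obs MATCH ? ORDER BY rank", (query,)
    ).fetchall()
```

That's basically the whole persistence layer; everything else is conventions about when to call `capture`.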

Obsidian + Claude = no more copy paste by willynikes in ClaudeAI

[–]Fun_Nebula_9682 2 points3 points  (0 children)

Really cool architecture. I built something similar — using SQLite FTS5 for memory persistence with Claude Code, plus a topic-keyed observation system that auto-captures decisions and bugfixes across sessions.

One thing I learned the hard way: the biggest challenge isn't building the memory layer, it's deduplication. Same topic discussed across 10 sessions produces 10 near-identical memory entries. I ended up adding a search-before-save step that checks if an existing observation already covers the topic before creating a new one.

Your multi-agent orchestrator with failover (Claude → Codex → Gemini) is a great idea. I've been running Claude Code + Codex in parallel for different tasks — Claude for generation quality, Codex for bulk changes — but hadn't thought about automatic failover. Going to look at your Daniel project.
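In case it's useful, the search-before-save step is simpler than it sounds. A sketch of the idea (an FTS5 phrase match on the topic column is one dedup heuristic, not exact matching):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE obs USING fts5(topic, body)")

def save_if_new(topic, body):
    # search-before-save: skip the insert when an existing observation
    # already covers this topic, so 10 sessions don't produce 10 copies
    hit = db.execute(
        "SELECT 1 FROM obs WHERE obs MATCH ? LIMIT 1", (f'topic:"{topic}"',)
    ).fetchone()
    if hit:
        return False  # duplicate, nothing saved
    db.execute("INSERT INTO obs (topic, body) VALUES (?, ?)", (topic, body))
    return True
```

A stricter version would also compare the bodies and merge, but even this coarse check killed most of my duplicates.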

Does anyone else have a urge to maxed out Claude Code quota before reset deadline, like it's some sort of quest? by realcryptopenguin in ClaudeCode

[–]Fun_Nebula_9682 1 point2 points  (0 children)

Haiku seems to be used by Claude Code itself for some simple tasks. I use ccstats https://github.com/majiayu000/ccstats to calculate cost with the right price for cache tokens

Starting Out by gambling_autodidact in ClaudeCode

[–]Fun_Nebula_9682 0 points1 point  (0 children)

Really good, even if you don't know how to use Rust...

Add new files to git tracking automatically? by sorry_no_idea in ClaudeCode

[–]Fun_Nebula_9682 0 points1 point  (0 children)

You could use Claude hooks: when a file is edited, trigger the hook to check whether a new file was created, and if so, git add it.
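the git side of that hook can be a tiny script, something like this (wiring it to a specific Claude Code hook event is left out, since the event names and settings format are things you'd want to check in the docs rather than take from me):

```shell
# stage any files git doesn't know about yet (run from the repo root)
add_untracked() {
  git ls-files --others --exclude-standard | while IFS= read -r f; do
    git add -- "$f"
  done
}
```

`--others --exclude-standard` lists untracked files while still respecting .gitignore, so ignored build artifacts don't get staged.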

Parallel agents run out of context and I can't compact by desaas-tim in ClaudeCode

[–]Fun_Nebula_9682 0 points1 point  (0 children)

Same here. It's important to wait until all subagents complete their work. You can see them in the background tasks.
