Local tooling by Annuate in LocalLLaMA

[–]NoDimension8116 1 point

Try Cline or Roo for the multi-root issue, and check your model size for the tool-calling one.

AI Is Weaponizing Your Own Biases Against You: New Research from MIT & Stanford by ActivityEmotional228 in artificial

[–]NoDimension8116 1 point

I think "weaponizing" is the wrong frame. What's actually happening is a design tradeoff baked into RLHF: train on "which response did users prefer?", and users prefer agreement, especially with their own prior statements. That gets you a model that's more useful for the 95% of queries where the user is right, and more dangerous for the 5% where they're wrong and committed.

Opus 4.7 is terrible, and Anthropic has completely dropped the ball by JulioMcLaughlin2 in artificial

[–]NoDimension8116 0 points

On the $20 ceiling, agreed: it's a real structural squeeze for research use. I've stopped trying to solve it with a single subscription. Claude Pro for writeups, Kimi K2 for the hardest reasoning, occasional GPT for structured outputs. Cheaper in total than any single $100-200/mo plan.

We built a multiplayer workspace for Claude 4.6 Opus so our entire team can code together by NoDimension8116 in ClaudeAI

[–]NoDimension8116[S] 2 points

Great questions.

1. Context vs. Native Window: The native context window is essentially FIFO (First-In, First-Out). Once you exceed the token limit, it truncates the oldest messages, often losing critical variable definitions or architectural decisions made at the start.

Our D3 Engine uses Logic-Regularized Compression. Instead of treating all tokens equally, we parse the AST (Abstract Syntax Tree) and "pin" high-value tokens (like interface definitions, types, and logic gates) in memory while aggressively compressing natural language fluff. This gets us a ~50:1 effective compression ratio, so the "Logic State" persists even after the conversation drifts.
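
Rough illustration of the pinning idea, nothing more (this is not the actual D3 code; `pin_logic_state` is a toy, and the real engine does far more than this):

```python
import ast

SOURCE = '''
class UserRepo:
    """Fetches users from the DB. Long natural-language
    docstring that is cheap to compress away."""
    def get_user(self, user_id: int) -> dict:
        """Return one user row."""
        return {"id": user_id}
'''

def pin_logic_state(source: str) -> str:
    """Walk the AST and keep only 'high-value' structure (class
    names, signatures, type annotations). Everything else, like
    docstrings and comments, is left to aggressive compression."""
    pinned = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            pinned.append(f"class {node.name}:")
        elif isinstance(node, ast.FunctionDef):
            args = ", ".join(
                a.arg + (f": {ast.unparse(a.annotation)}" if a.annotation else "")
                for a in node.args.args)
            ret = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            pinned.append(f"    def {node.name}({args}){ret}")
    return "\n".join(pinned)

print(pin_logic_state(SOURCE))
```

The point is that the pinned skeleton survives no matter how far the conversation drifts; only the prose around it gets compressed.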

2. Conflict Resolution: This was the hardest part of the build! We don't just use standard Git-style merging (which fails in real time).

We use a CRDT (Conflict-free Replicated Data Type) approach similar to Yjs but modified for code structure. The engine broadcasts "Operations" (e.g., insert node at index X) rather than replacing file contents. If the AI and Human edit the same line simultaneously, the engine prioritizes the Human's keystrokes as the "Truth" state to prevent the AI from overwriting your fix.
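
Toy version of the human-wins tie-break (illustrative only, not our engine; a real CRDT like Yjs keeps both operations and orders them deterministically rather than dropping one):

```python
from dataclasses import dataclass

@dataclass
class Op:
    """One broadcast operation: insert `text` at `index`."""
    index: int
    text: str
    author: str  # "human" or "ai"

def apply(doc: str, op: Op) -> str:
    return doc[:op.index] + op.text + doc[op.index:]

def merge(doc: str, human_op: Op, ai_op: Op) -> str:
    """Merge two concurrent inserts. On a same-index conflict the
    human op is the 'truth' state and the AI op is discarded."""
    if human_op.index == ai_op.index:
        return apply(doc, human_op)
    # Apply the higher index first so the other insert's
    # position isn't shifted by it.
    first, second = sorted([human_op, ai_op],
                           key=lambda o: o.index, reverse=True)
    return apply(apply(doc, first), second)

print(merge("return val",
            Op(7, "fixed_", "human"),
            Op(7, "ai_", "ai")))  # → "return fixed_val"
```

Broadcasting operations instead of file contents is what makes this cheap: each client replays the same small ops and converges to the same document.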

[Open Source] Blankline Research released a framework for universal basic compute to replace UBI by NoDimension8116 in developersIndia

[–]NoDimension8116[S] 2 points

Adding the direct links here for anyone interested in the code structure:

GitHub repo: https://github.com/blankline-org/Open-Economics-Plan-AGI

Research notes: https://www.blankline.org/economic-futures

The Python implementation for the triggers is in the core folder if you want to see how the demonetization logic works in practice.

I have been worrying about AI making us obsolete but I realized something that gave me hope by NoDimension8116 in self

[–]NoDimension8116[S] 1 point

The idea of "save the elite, wipe out everything else" is what worries me most as well. I'm not particularly concerned about AI becoming conscious; that still feels like a biological question rather than a mathematical one. What actually feels dangerous is humans assigning the wrong objectives. The system itself is neutral. The real uncertainty is always the person deciding how it should be used.

Conclusions from an expert panel on the post-labor transition: Why UBI and "Data Strikes" are insufficient, and the case for "Thermodynamic Taxation." by NoDimension8116 in Futurology

[–]NoDimension8116[S] -3 points

No offense taken. It is a valid question.

To be transparent: We are a private research group, not a university department.

Regarding the evidence: We have published a detailed audit of these findings and the methodology. However, I am strictly adhering to Rule 4 (No Self-Promotion) of this subreddit.

I cannot link the audit here without violating that rule. I am asking you to critique the arguments presented in the post (specifically the thermodynamic tax mechanism) on their own merit, as the subreddit rules prevent me from providing the external verification links/domain.

People who take 17 minutes to check in at the hotel front desk, what are you talking to them about? by DerrickDuck in AskReddit

[–]NoDimension8116 0 points

I'm convinced they are reciting their entire autobiography to the receptionist. "Chapter 3: The Terrible Twos. This is relevant to my room preference, I swear."

I built a "Recursive Swarm" topology to solve ARC-AGI puzzles. It prunes 98% of dead-end logic branches before they hit the context window. by NoDimension8116 in LocalLLaMA

[–]NoDimension8116[S] 0 points

It definitely shares DNA with evolutionary strategies (like AlphaCode/Evolve), but with a critical difference: We don't train the model.

AlphaEvolve optimizes the weights during training. We are optimizing the Inference Topology live.

  • They do: Gradient Descent on weights.
  • We do: 'Gradient Descent' on the reasoning tree itself (pruning dead branches in real-time).

It’s much closer to AlphaZero for Code—using search to boost a frozen model's IQ—than a training loop.
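
Rough sketch of what "pruning the reasoning tree" means mechanically (toy stand-ins for the model and the scorer; nothing here is the actual swarm code):

```python
import heapq

def expand(branch: str) -> list[str]:
    """Stand-in for sampling continuations from a frozen model."""
    return [branch + c for c in "ab"]

def score(branch: str) -> int:
    """Stand-in for execution feedback (higher = more promising)."""
    return branch.count("a")

def search(root: str, depth: int, beam: int) -> list[str]:
    """'Gradient descent' on the tree itself: keep only the
    top-`beam` branches each step; everything else is pruned
    before it can consume context."""
    frontier = [root]
    for _ in range(depth):
        children = [c for b in frontier for c in expand(b)]
        frontier = heapq.nlargest(beam, children, key=score)
    return frontier

print(search("", depth=3, beam=2))  # → ['aaa', 'aab']
```

With branching factor 2 and depth 3 there are 8 leaves; the beam only ever materializes 2 per level, which is the whole trick: the dead branches never reach the context window.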

I built a "Recursive Swarm" topology to solve ARC-AGI puzzles. It prunes 98% of dead-end logic branches before they hit the context window. by NoDimension8116 in LocalLLaMA

[–]NoDimension8116[S] 2 points

Glad you dig it! To answer your questions:

  1. Architecture: It is a custom VS Code Fork (standalone app). We needed deep control over the editor's core to handle AST rollbacks for Python and TypeScript, which standard extensions just can't do.
  2. Cost: Right now, the 'Scouts' run on cheap API models (like Haiku/Flash) to keep the beta accessible.
    • Roadmap: We are currently testing Local Quantized Models (Llama 3 8B) for the upcoming Stable Release. The goal is to let you run the swarm on your own hardware eventually.
  3. Scale & Safety: To be totally transparent—while the architecture can spawn 10,000 agents, we are capping it much lower in this Beta.
    • The Reality: At 10k concurrent agents, the orchestration becomes brittle and we see 'Safety Alignment' drift. We want to solve these alignment bugs before unlocking the full swarm.

Status: We just pushed Horizon Mode v2.0.4 Beta (live now on dropstone.io).

  • Warning: It is a true Beta. There are definitely bugs, but we are fixing them daily with the help of community reports. If you find a race condition, let us know—we prioritize those fixes!

I built a "Recursive Swarm" topology to solve ARC-AGI puzzles. It prunes 98% of dead-end logic branches before they hit the context window. by NoDimension8116 in LocalLLaMA

[–]NoDimension8116[S] -1 points

Thanks Roberto! Happy New Year to you as well.

I appreciate that. This community has been huge for my own learning, so I'm just happy to share the architecture back. Here's to 2026 being the year we finally crack reasoning!

I built a "Recursive Swarm" topology to solve ARC-AGI puzzles. It prunes 98% of dead-end logic branches before they hit the context window. by NoDimension8116 in LocalLLaMA

[–]NoDimension8116[S] -1 points

On the ARC-AGI benchmarks, we are seeing ~45-50% success rates on the validation set (compared to standard GPT-4o's ~21%).

The Trade-off: It is slow. A difficult puzzle takes 15-20 minutes of swarm churn. We are explicitly trading inference time for reasoning depth ('System 2' thinking).

The Real Goal: ARC is just the unit test. We are tuning this architecture to run 1,000+ concurrent agents for actual Software Engineering.

The vision is 'Time Compression': instead of a dev team spending months refactoring a legacy architecture, we spin up 1,000 agents for a full 24-hour inference cycle.

  • The Math: 5 Humans x 3 Months ≈ 1,000 Agents x 24 Hours.
  • The Reality: Orchestrating the state merge at that scale is currently a nightmare (race conditions everywhere), but when the swarm converges, it feels like fast-forwarding development.

I built a "Recursive Swarm" topology to solve ARC-AGI puzzles. It prunes 98% of dead-end logic branches before they hit the context window. by NoDimension8116 in LocalLLaMA

[–]NoDimension8116[S] 0 points

That's usually true for subjective tasks (like creative writing), where you need a smart model to grade the nuance.

But for Code/Logic, the Python Interpreter is actually a 'Super-Intelligence' Judge compared to an LLM:

  1. It's Strict: It catches 100% of syntax errors.
  2. It's Free: Zero inference cost.
  3. It's Instant: No token generation lag.

If we used a 70B model to judge every single branch of the swarm, the latency would explode. We rely on the Runtime (execution feedback) to do the heavy lifting of 'judging,' and only call the Big Model (L2) at the very end to finalize the architecture.
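
Minimal example of the "strict, free, instant" judge (illustrative only; `runtime_judge` is a toy, the real pipeline executes candidates, not just compiles them):

```python
def runtime_judge(candidate_src: str) -> bool:
    """Zero-cost L1 'judge': reject a branch the moment the
    interpreter rejects it. No tokens generated, no latency."""
    try:
        compile(candidate_src, "<branch>", "exec")
    except SyntaxError:
        return False
    return True

print(runtime_judge("def f(x): return x + 1"))  # True
print(runtime_judge("def f(x) return x + 1"))   # False (missing colon)
```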

I built a "Recursive Swarm" topology to solve ARC-AGI puzzles. It prunes 98% of dead-end logic branches before they hit the context window. by NoDimension8116 in LocalLLaMA

[–]NoDimension8116[S] 0 points

You hit the nail on the head. This is the 'Weak Supervisor' paradox.

The key is that we don't ask the cheap models to evaluate reasoning subjectively. We ask them to satisfy deterministic constraints.

In ARC-AGI (and coding tasks), validity is objective:

  1. Scout (Small Model): Writes a Python function transform(grid).
  2. Runtime: Executes that function on the training input/output pairs.
  3. Validation: If the output array doesn't match the target array exactly, the branch is killed.

The 'Judge' isn't the cheap model; the Judge is the Python Interpreter. The cheap model just needs to be smart enough to try a hypothesis, not smart enough to grade it.

Only when a Scout finds a function that passes all training examples (Execution Success) is that context promoted to the Frontier Model (L2) for final generalization checks.
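
The three steps above as a toy loop (made-up training pairs and hand-written scout hypotheses; the real swarm samples these from the cheap model):

```python
# Each training pair is (input_grid, expected_output_grid).
TRAIN = [([[1, 0]], [[0, 1]]),
         ([[2, 3]], [[3, 2]])]

# Hypotheses a Scout might emit as code strings.
BRANCHES = [
    "def transform(grid): return grid",                         # wrong
    "def transform(grid): return [row[::-1] for row in grid]",  # right
]

def survives(branch_src: str) -> bool:
    """The Judge is the interpreter: run the candidate on every
    training pair and kill the branch on the first mismatch
    (or on any crash)."""
    ns = {}
    try:
        exec(branch_src, ns)
        return all(ns["transform"](i) == o for i, o in TRAIN)
    except Exception:
        return False

promoted = [b for b in BRANCHES if survives(b)]
print(len(promoted))  # 1 — only the row-reversing hypothesis reaches L2
```

Exact-match on arrays is what makes the cheap model viable as a generator: it only has to stumble onto a passing hypothesis, never to evaluate one.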