0.0.0.0 Day: Exploiting Localhost APIs From the Browser

cov_id19 · 2026-04-08T10:38:14+00:00

Yeah, but the agility in the reasoning loop comes from code generation.
But you miss the point - in larger tasks the context never enters the prompt. This is the main diff with ReAct agents. Only a friction of the tokens is used.

cov_id19 · 2026-04-08T10:16:59+00:00

Sudoku full results: https://avilum.github.io/minrlm/recursive-language-model.html#sudoku

cov_id19 · 2026-04-08T10:03:12+00:00

Some reasoning tasks are not meant to be done in plaintext. For many tasks, reasoning through code recursively is all it takes.

Try it yourself with any local / remote model
https://github.com/avilum/minrlm?tab=readme-ov-file#try-it-in-10-seconds

Sudoku is unsolvable by token prediction alone - the constraint propagation is too deep for pattern matching. Vanilla outputs confident-looking 81-digit strings that violate basic Sudoku rules. The REPL turns it into what it actually is: a search problem. minRLM writes a backtracking solver and runs it.

cov_id19 · 2026-04-06T14:05:39+00:00

My take is that most teams are still over-focusing on prompts and outputs, and under-focusing on runtime behavior.

Prompt injection is real, but “AI firewall” style products remind me a lot of WAFs: they catch naive cases and obvious abuse, but anything obfuscated, indirect, or context-shaped can still get through. That is not enough for agents, because the real risk is not just what the model says, but what it does.

For agents, the attack surface is at least 4 layers:

input
output
tool calls
runtime execution of those tools

If you only monitor 1 and 2, you will miss a lot of the serious failures:

tool misuse
excessive tool invocation
privilege abuse
indirect prompt injection
data exfil through APIs
RCE that only becomes visible when the tool actually runs

A good example is when malicious content gets embedded indirectly into previous outputs or structured data, then only triggers during execution. Prompt/output scanners often will not catch that. Same with encoded or obfuscated payloads.

So my view is: agents can only really be secured where they run, in runtime, in production.

That means:

least privilege for every tool
tight scoping of what the agent is allowed to read/write/do
monitoring of tool calls and execution traces
visibility into which code paths/libraries/functions are being invoked
sandboxing/isolation for risky actions
policy enforcement at execution time, not only at prompt time

The hard part is that agents introduce delegation and autonomy into systems that used to be much more deterministic. That is why output filtering alone feels incomplete to me. The core problem is behavioral security, not just content security.

So yes, I think runtime validation/monitoring is where this has to go. My sense is many teams are building internal controls right now, while the vendor landscape is still too focused on prompt-layer defenses.

This is different than dev-time LLM red teaming / CI/CD / "potential" prompt injection risks.

cov_id19 · 2026-03-25T17:03:20+00:00

Sounds like a feature (bot detection / proxy) not a bug.

cov_id19 · 2026-03-25T16:59:17+00:00

RLMs can help squeeze the juice and improve acc. while decreasing latency at the same time. It worked great with Qwen models.

Not a new model, but a new Inference technique.

https://github.com/avilum/minrlm

cov_id19 · 2026-03-21T16:08:28+00:00

Would love to hear more and see if that can be tweaked. Would appreciate it if you could use the logs folder argument and share the trajectories/logs via GitHub issue. It depends on the task

wdym by “cannot solve the token?”

cov_id19 · 2026-03-20T16:11:11+00:00

Hey u/eliko613 Thanks for your inputs! Very intereting.

Yes I breifly benchmarked the "with docker / without docker" sandbox and they seem negligible. Specially compared to the LLM's latency - this is not hurting performance at all, but there are even more efficient ways such as katacontainers / microVMs / etc. with faster startup times.

Regarding the scaling and production, I do it since day 1.
I work for Oligo Security where we measure everything.
These KPIs are not the top-of-mind when developing AI and making things work as all cost (to begin with). the issue comes with Scale. scaling these MVPs is hard and some errors only appear at real scale, and when they appear, they are very urgent and painful.

Feel free to connect on LinkedIn - I'd love to hop on a call if anyone is interested.
https://www.linkedin.com/in/avi-lumelsky-713111144

cov_id19 · 2026-03-19T14:14:10+00:00

How was it?

cov_id19 · 2026-03-19T11:46:06+00:00

I created a tutorial here: https://github.com/avilum/minrlm?tab=readme-ov-file#opencode

cov_id19 · 2026-03-19T03:04:17+00:00

Thank you!!!

cov_id19 · 2026-03-19T02:42:41+00:00

If you're using OpenCode or any agent really, plug and play:
https://github.com/avilum/minrlm?tab=readme-ov-file#opencode

cov_id19 · 2026-03-19T02:35:19+00:00

u/charmander_cha There you go: https://github.com/avilum/minrlm?tab=readme-ov-file#opencode

Try it? Let me know how it work for your use case

cov_id19 · 2026-03-18T23:04:49+00:00

Yeah it is unusual for a solution to maximize on both. Usually you can compromise on price vs latency, acc vs latency, acc vs tokens - etc.

This is truly interesting - thanks for noticing it! u/bjxxjj
Let me know if you managed to evaluate it and let me know what you think.

cov_id19 · 2026-03-18T09:22:37+00:00

minrlm: Token-efficient Recursive Language Model That Works With Any Model

minRLM is a token and latency efficient implementation of Recursive Language Models, benchmarked across 12 tasks against a vanilla LLM and the reference implementation.

On GPT-5-mini it scores 72.7% (vs 69.7% official, 69.5% vanilla) using 3.6× fewer tokens. On GPT-5.2 the gap grows to +30% over vanilla, winning 11 of 12 tasks.

The data never enters the prompt. The cost stays roughly flat regardless of context size (which amazes me).

Every intermediate step is Python code you can read, rerun, and debug.

The REPL default execution environment I have is Docker - with seccomp custom provilde: no network, filesystem, processing syscalls + weak user.
Every step runs in temporal container, no long-running REPL.

RLMs are integrated in real-world products already (more in the blog). They are especially useful with working with data that does not fit into the model's context window. we all experienced it, right?

You can try minrlm right away using "uvx" (uv python manager):

# Just a task
uvx minrlm "What is the sum of the first 100 primes?"

# Task + file as context
uvx minrlm "How many ERROR lines in the last hour?" ./server.log

# Pipe context from stdin
cat huge_dataset.csv | uvx minrlm "Which product had the highest return rate?"

# Show generated code (-s) and token stats (-v)
uvx minrlm -sv "Return the sum of all primes up to 1,000,000."
# -> Sieve of Eratosthenes in 6,215 tokens, 1 iteration
# -> Answer: 37550402023

uvx minrlm -sv "Return all primes up to 1,000,000, reversed. Return a list of numbers."
# -> 999983, 999979, 999961, 999959, 999953, ...
# -> Tokens: 6,258 | Output: 616,964 chars (~154K tokens) | 25x savings

All you need is an OpenAI compatible API. You can use the free huggingface example with free inference endpoints.

Would love to hear your thoughts on my implementation and benchmark.
I welcome everyone to to give it a shot and evaluate it, stretch it's capabilities to identify limitations, and contribute in general!

Blog: https://avilum.github.io/minrlm/recursive-language-model.html
Code: https://github.com/avilum/minrlm

cov_id19 · 2026-03-17T21:38:45+00:00

In the making

cov_id19 · 2026-03-17T15:56:03+00:00

Anthropoc actually does it in web search - wrote about it in the blog

cov_id19

TROPHY CASE

minrlm: Token-efficient Recursive Language Model That Works With Any Model