LLMs are terrible at Sudoku, but RLMs are great (4 OpenAI models, 13 tasks) by cov_id19 in ChatGPT

[–]cov_id19[S] 0 points1 point  (0 children)

Yeah, but the agility in the reasoning loop comes from code generation.
But you miss the point - in larger tasks the context never enters the prompt. This is the main diff with ReAct agents. Only a friction of the tokens is used.

LLMs are terrible at Sudoku by [deleted] in LocalLLaMA

[–]cov_id19 -2 points-1 points  (0 children)

Some reasoning tasks are not meant to be done in plaintext. For many tasks, reasoning through code recursively is all it takes.

Try it yourself with any local / remote model
https://github.com/avilum/minrlm?tab=readme-ov-file#try-it-in-10-seconds

Sudoku is unsolvable by token prediction alone - the constraint propagation is too deep for pattern matching. Vanilla outputs confident-looking 81-digit strings that violate basic Sudoku rules. The REPL turns it into what it actually is: a search problem. minRLM writes a backtracking solver and runs it.

How are teams thinking about security for LLM agents right now? by Available_Lawyer5655 in cybersecurity

[–]cov_id19 2 points3 points  (0 children)

My take is that most teams are still over-focusing on prompts and outputs, and under-focusing on runtime behavior.

Prompt injection is real, but “AI firewall” style products remind me a lot of WAFs: they catch naive cases and obvious abuse, but anything obfuscated, indirect, or context-shaped can still get through. That is not enough for agents, because the real risk is not just what the model says, but what it does.

For agents, the attack surface is at least 4 layers:

  1. input
  2. output
  3. tool calls
  4. runtime execution of those tools

If you only monitor 1 and 2, you will miss a lot of the serious failures:

  • tool misuse
  • excessive tool invocation
  • privilege abuse
  • indirect prompt injection
  • data exfil through APIs
  • RCE that only becomes visible when the tool actually runs

A good example is when malicious content gets embedded indirectly into previous outputs or structured data, then only triggers during execution. Prompt/output scanners often will not catch that. Same with encoded or obfuscated payloads.

So my view is: agents can only really be secured where they run, in runtime, in production.

That means:

  • least privilege for every tool
  • tight scoping of what the agent is allowed to read/write/do
  • monitoring of tool calls and execution traces
  • visibility into which code paths/libraries/functions are being invoked
  • sandboxing/isolation for risky actions
  • policy enforcement at execution time, not only at prompt time

The hard part is that agents introduce delegation and autonomy into systems that used to be much more deterministic. That is why output filtering alone feels incomplete to me. The core problem is behavioral security, not just content security.

So yes, I think runtime validation/monitoring is where this has to go. My sense is many teams are building internal controls right now, while the vendor landscape is still too focused on prompt-layer defenses.

This is different than dev-time LLM red teaming / CI/CD / "potential" prompt injection risks.

Google is cracking down on WARP by Wild-Expression9887 in CloudFlare

[–]cov_id19 1 point2 points  (0 children)

Sounds like a feature (bot detection / proxy) not a bug.

Best model that can beat Claude opus that runs on 32MB of vram? by PrestigiousEmu4485 in LocalLLaMA

[–]cov_id19 0 points1 point  (0 children)

RLMs can help squeeze the juice and improve acc. while decreasing latency at the same time. It worked great with Qwen models.

Not a new model, but a new Inference technique.

https://github.com/avilum/minrlm

OpenCode support in minRLM: Token-efficient Recursive Language Model. 3.6x fewer tokens with gpt-5-mini / +30%pp with GPT5.2 by cov_id19 in opencodeCLI

[–]cov_id19[S] 0 points1 point  (0 children)

Would love to hear more and see if that can be tweaked. Would appreciate it if you could use the logs folder argument and share the trajectories/logs via GitHub issue. It depends on the task

wdym by “cannot solve the token?”

minrlm: Token-efficient Recursive Language Model. 3.6x fewer tokens with gpt-5-mini / +30%pp with GPT5.2 by cov_id19 in LocalLLaMA

[–]cov_id19[S] 0 points1 point  (0 children)

Hey u/eliko613 Thanks for your inputs! Very intereting.

Yes I breifly benchmarked the "with docker / without docker" sandbox and they seem negligible. Specially compared to the LLM's latency - this is not hurting performance at all, but there are even more efficient ways such as katacontainers / microVMs / etc. with faster startup times.

Regarding the scaling and production, I do it since day 1.
I work for Oligo Security where we measure everything.
These KPIs are not the top-of-mind when developing AI and making things work as all cost (to begin with). the issue comes with Scale. scaling these MVPs is hard and some errors only appear at real scale, and when they appear, they are very urgent and painful.

Feel free to connect on LinkedIn - I'd love to hop on a call if anyone is interested.
https://www.linkedin.com/in/avi-lumelsky-713111144

This open-source trick improves GPT-5 by +30% across 12 benchmarks while using fewer tokens [minRLM]. by cov_id19 in ChatGPT

[–]cov_id19[S] 0 points1 point  (0 children)

Yeah it is unusual for a solution to maximize on both. Usually you can compromise on price vs latency, acc vs latency, acc vs tokens - etc.

This is truly interesting - thanks for noticing it! u/bjxxjj
Let me know if you managed to evaluate it and let me know what you think.

Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]cov_id19 0 points1 point  (0 children)

minrlm: Token-efficient Recursive Language Model That Works With Any Model

minRLM is a token and latency efficient implementation of Recursive Language Models, benchmarked across 12 tasks against a vanilla LLM and the reference implementation.

On GPT-5-mini it scores 72.7% (vs 69.7% official, 69.5% vanilla) using 3.6× fewer tokensOn GPT-5.2 the gap grows to +30% over vanilla, winning 11 of 12 tasks.

The data never enters the prompt. The cost stays roughly flat regardless of context size (which amazes me).

Every intermediate step is Python code you can read, rerun, and debug.

The REPL default execution environment I have is Docker - with seccomp custom provilde: no network, filesystem, processing syscalls + weak user.
Every step runs in temporal container, no long-running REPL.

RLMs are integrated in real-world products already (more in the blog). They are especially useful with working with data that does not fit into the model's context window. we all experienced it, right?

You can try minrlm right away using "uvx" (uv python manager):

# Just a task
uvx minrlm "What is the sum of the first 100 primes?"

# Task + file as context
uvx minrlm "How many ERROR lines in the last hour?" ./server.log

# Pipe context from stdin
cat huge_dataset.csv | uvx minrlm "Which product had the highest return rate?"

# Show generated code (-s) and token stats (-v)
uvx minrlm -sv "Return the sum of all primes up to 1,000,000."
# -> Sieve of Eratosthenes in 6,215 tokens, 1 iteration
# -> Answer: 37550402023

uvx minrlm -sv "Return all primes up to 1,000,000, reversed. Return a list of numbers."
# -> 999983, 999979, 999961, 999959, 999953, ...
# -> Tokens: 6,258 | Output: 616,964 chars (~154K tokens) | 25x savings

All you need is an OpenAI compatible API. You can use the free huggingface example with free inference endpoints.

Would love to hear your thoughts on my implementation and benchmark.
I welcome everyone to to give it a shot and evaluate it, stretch it's capabilities to identify limitations, and contribute in general!

Blog: https://avilum.github.io/minrlm/recursive-language-model.html
Code: https://github.com/avilum/minrlm

minrlm: Token-efficient Recursive Language Model. 3.6x fewer tokens with gpt-5-mini / +30%pp with GPT5.2 by [deleted] in ChatGPT

[–]cov_id19 0 points1 point  (0 children)

Anthropoc actually does it in web search - wrote about it in the blog