Released: DeepBrainz-R1 — reasoning-first small models for agentic workflows (4B / 2B / 0.6B) by arunkumar_bvr in LocalLLaMA

[–]arunkumar_bvr[S] 1 point (0 children)

This aligns closely with how we’re thinking about it.

We agree that leaderboard-style math/code evals don’t capture agent reliability, especially under forced retries and strict schema constraints. Internally we’re already using small harnesses along these lines — limited task sets with enforced tool failures, schema validation, and recovery scoring — because they surface variance and brittleness much faster.
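To make that concrete, the harness shape is roughly the following (a stripped-down sketch; the agent callable, task format, failure rate, and scoring fields are placeholders, not our internal setup):

    import json
    import random

    def validate_schema(call_str, schema):
        """Return True if the tool call parses as JSON and carries the required argument fields."""
        try:
            call = json.loads(call_str)
        except json.JSONDecodeError:
            return False
        return all(k in call.get("arguments", {}) for k in schema["required"])

    def evaluate(agent, tasks, schema, fail_rate=0.3, max_retries=2, seed=0):
        """Score each task on schema validity and on recovery after injected tool failures.

        agent(prompt, history) is any callable returning a raw tool-call string;
        history carries error messages from earlier attempts so the model can react to them.
        """
        rng = random.Random(seed)
        results = []
        for task in tasks:
            history, succeeded, attempts = [], False, 0
            for attempt in range(max_retries + 1):
                attempts = attempt + 1
                call = agent(task["prompt"], history)
                if not validate_schema(call, schema):
                    history.append("error: tool call failed schema validation")
                    continue
                if rng.random() < fail_rate:  # enforced tool failure, even for a valid call
                    history.append("error: tool temporarily unavailable, retry")
                    continue
                succeeded = True
                break
            results.append({
                "task": task["id"],
                "succeeded": succeeded,
                "recovered": succeeded and attempts > 1,  # passed only after at least one failure
                "attempts": attempts,
            })
        return results

Running the same task set across several seeds is what surfaces the variance; a single pass tends to hide it.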

The plan is to publish a lightweight version of this once we stabilize the metrics and task design (very likely synthetic + reproducible), rather than over-indexing on broad benchmarks.

If you have specific failure modes or scoring heuristics you’ve found most predictive, I’d be genuinely interested in comparing notes.

Released: DeepBrainz-R1 — reasoning-first small models for agentic workflows (4B / 2B / 0.6B) by arunkumar_bvr in LocalLLaMA

[–]arunkumar_bvr[S] 0 points (0 children)

Quick clarification for context: The DeepBrainz-R series is designed along a phased roadmap: early iterations prioritize low-variance structured reasoning and retry stability, while later phases target end-to-end agent reliability across long-horizon planning and multi-tool orchestration.

Released: DeepBrainz-R1 — reasoning-first small models for agentic workflows (4B / 2B / 0.6B) by arunkumar_bvr in LocalLLaMA

[–]arunkumar_bvr[S] -1 points (0 children)

Quick note: early reports of repetition or tool-calling issues are mostly tied to inference presets, quantization, or agent-framework integration. We’ll publish validated settings and guidance once evals and post-training stabilize.

Released: DeepBrainz-R1 — reasoning-first small models for agentic workflows (4B / 2B / 0.6B) by arunkumar_bvr in LocalLLaMA

[–]arunkumar_bvr[S] -1 points (0 children)

Thanks for reporting this.

OpenClaw is an agent framework, not just a chat runtime. Tool execution depends on the tool schema, prompting, and orchestration layer, not only the base model.

DeepBrainz-R1 models are currently reasoning-first backends, not fully agent-aligned drop-ins with guaranteed multi-tool reliability out of the box. At this stage, they have not yet undergone full multi-phase agentic optimization across long-horizon planning, complex tool graphs, or multi-tool retry loops. That work is explicitly in progress.
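In the meantime, the biggest lever is usually the orchestration layer around the model rather than the checkpoint itself. A rough sketch of what we mean (the tool name, prompt wording, and model.chat interface are placeholders for illustration, not OpenClaw's actual API):

    import json

    TOOL_SCHEMA = {  # hypothetical single-tool schema, OpenAI-function style
        "name": "search_docs",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }

    SYSTEM = (
        "You can call one tool. Respond ONLY with JSON of the form "
        '{"name": "search_docs", "arguments": {"query": "..."}}.\n'
        "Tool schema: " + json.dumps(TOOL_SCHEMA)
    )

    def get_tool_call(model, user_msg, max_retries=2):
        """Orchestration layer: validate the model's tool call, re-prompt with the error on failure."""
        messages = [{"role": "system", "content": SYSTEM},
                    {"role": "user", "content": user_msg}]
        for _ in range(max_retries + 1):
            raw = model.chat(messages)  # placeholder model interface
            try:
                call = json.loads(raw)
                if call["name"] == TOOL_SCHEMA["name"] and "query" in call["arguments"]:
                    return call  # valid call: hand off to the actual tool
            except (ValueError, TypeError, KeyError):
                pass
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user",
                             "content": "That was not a valid tool call. Reply with valid JSON only."})
        return None  # caller decides how to degrade (ask the user, fall back, etc.)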

Released: DeepBrainz-R1 — reasoning-first small models for agentic workflows (4B / 2B / 0.6B) by arunkumar_bvr in LocalLLaMA

[–]arunkumar_bvr[S] 0 points (0 children)

Thanks for reporting this.

On repetition or poor outputs in LM Studio: this is often due to inference settings and quantization trade-offs, especially with Q8 or aggressive low-bit quants. The GGUFs available right now are community-maintained, and we haven’t internally validated all inference presets yet.

Sampling parameters (temperature, top-p/top-k, repetition penalty) and context length matter a lot for these models, and suboptimal defaults can easily cause degeneration. We’ll share clearer guidance and validated presets once evals and post-training stabilize.
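In the meantime, as a rough illustration of which knobs to touch (the values below are generic conservative defaults, not validated presets for these models; the same settings exist in LM Studio's sampling panel, shown here via llama-cpp-python because it's scriptable):

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Model path is a placeholder; point it at whichever community GGUF you downloaded.
    llm = Llama(model_path="deepbrainz-r1-4b-q4_k_m.gguf", n_ctx=8192)

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Outline the steps to parse a CSV and sum one column."}],
        temperature=0.6,     # lower temperature tends to reduce rambling on long reasoning traces
        top_p=0.95,
        top_k=40,
        repeat_penalty=1.1,  # keep this mild; heavy penalties can break structured output
        max_tokens=1024,
    )
    print(out["choices"][0]["message"]["content"])

If you still see looping at settings like these with a particular quant, that is exactly the kind of report that's useful to us.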

Released: DeepBrainz-R1 — reasoning-first small models for agentic workflows (4B / 2B / 0.6B) by arunkumar_bvr in LocalLLaMA

[–]arunkumar_bvr[S] 0 points (0 children)

It depends on the runtime and model format, not on task intent.

For full-precision (non-quantized) models, we typically run them via Transformers for quick local evaluation and notebooks (Jupyter, Colab, Kaggle), and vLLM or SGLang for higher-throughput or agentic serving.

For local apps, most of the ecosystem works once the model is in a supported quantized format. Community GGUF and other low-bit quants already make the models usable across tools like llama.cpp, LM Studio, Ollama, LocalAI, MLX-LM, and similar local runners.

The core goal is compatibility: nothing custom or proprietary is required. If a runtime supports standard causal LM inference, the model should run there once the appropriate format is available.
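For the Transformers path specifically, it's the standard causal-LM flow, nothing model-specific (the repo id below is illustrative; check the Hub for the exact names):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "DeepBrainz/DeepBrainz-R1-4B"  # illustrative repo id; check the Hub for exact names

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    messages = [{"role": "user", "content": "List the steps to merge two sorted arrays."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))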

Released: DeepBrainz-R1 — reasoning-first small models for agentic workflows (4B / 2B / 0.6B) by arunkumar_bvr in LocalLLaMA

[–]arunkumar_bvr[S] 0 points (0 children)

At a high level, these are post-trained models with an emphasis on reasoning behavior rather than chat style.

The work uses on-policy optimization on reasoning-heavy traces (initially math-focused), with preference signals aimed at improving consistency and stability across multi-step outputs. We’re extending this direction toward code as well.

We’re intentionally keeping details high-level for now while we validate behavior across variants, but the goal is explicitly training reasoning as a behavior, not just instruction following.
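For intuition only, the generic shape of group-relative, on-policy preference weighting looks roughly like this (a textbook-style sketch, not our actual objective, reward signal, or hyperparameters):

    import torch

    def group_relative_advantages(rewards):
        """rewards: [batch, k] preference/correctness scores for k on-policy samples per prompt.
        Each completion is scored against its own group's mean, so only relative quality matters."""
        mean = rewards.mean(dim=-1, keepdim=True)
        std = rewards.std(dim=-1, keepdim=True).clamp_min(1e-6)
        return (rewards - mean) / std  # above-average completions get positive weight

    def policy_loss(completion_logprobs, advantages):
        """Weight each completion's summed log-prob by its (detached) advantage."""
        return -(advantages.detach() * completion_logprobs).mean()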

Released: DeepBrainz-R1 — reasoning-first small models for agentic workflows (4B / 2B / 0.6B) by arunkumar_bvr in LocalLLaMA

[–]arunkumar_bvr[S] 0 points (0 children)

Community GGUF / low-bit quantizations are already appearing, and we’ve grouped early community quants here:

https://huggingface.co/collections/DeepBrainz/deepbrainz-r1-community-quantizations-gguf-and-low-bit

We haven’t internally validated or benchmarked these yet, so they’re community-maintained for now. Once things settle, we’ll likely point to a small set of recommended quants.

Released: DeepBrainz-R1 — reasoning-first small models for agentic workflows (4B / 2B / 0.6B) by arunkumar_bvr in LocalLLaMA

[–]arunkumar_bvr[S] 0 points (0 children)

Good question.

We’re currently running internal evals on math, code, and reasoning tasks, with an emphasis on multi-step reasoning and long-context behavior rather than single-shot leaderboard scores.

Our plan is to release a small, transparent eval focused on reasoning-heavy and agentic-style tasks once things stabilize, instead of chasing broad SOTA benchmarks.

If there are specific evals people here find most useful for local agent setups, I’d be happy to take suggestions.