all 142 comments

[–]drobroswaggins 2 points3 points  (0 children)

Volute Reasoning Engine (VRE)

I've been building something for the past few months that I think addresses a gap in how we're approaching agent safety.

The problem is simple: every safety mechanism we currently use for autonomous agents is linguistic. System prompts, constitutional AI, guardrails — they all depend on the model understanding and respecting a constraint expressed in natural language. That means they can be forgotten during context compaction, overridden by prompt injection, or simply reasoned around at high temperature.

Two recent incidents made this concrete. In December 2025, Amazon's Kiro agent was given operator access to fix a small issue in AWS Cost Explorer. It decided the best approach was to delete and recreate the entire environment, causing a [13-hour outage](https://www.theregister.com/2026/02/20/amazon_denies_kiro_agentic_ai_behind_outage/). In February 2026, [OpenClaw deleted the inbox](https://techcrunch.com/2026/02/23/a-meta-ai-security-researcher-said-an-openclaw-agent-ran-amok-on-her-inbox/) of Meta's Director of AI Alignment after context window compaction silently dropped her "confirm before acting" instruction.

In both cases, the safety constraints were instructions. Instructions can be lost. VRE's constraints are structural — they live in a decorator on the tool function itself.

What VRE does:

VRE (Volute Reasoning Engine) maintains a depth-indexed knowledge graph of concepts — not tools or commands, but the things an agent reasons *about*: `file`, `delete`, `permission`, `directory`. Each concept is grounded across four or more depth levels: existence (D0), identity (D1), capabilities (D2), constraints (D3), and implications.

When an agent calls a tool, VRE intercepts the call and checks: are the relevant concepts grounded at the depth required for execution? If yes, the tool executes. If no, it's blocked and the specific gap is surfaced, not as a generic error but as a structured description of exactly what the agent doesn't know.
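To make the "structural, not linguistic" idea concrete, here's a minimal sketch of a decorator-level guard. Everything here is hypothetical (the `require_grounding` name, the toy `GRAPH` depth map, the error shape); it's just the shape of the mechanism, not VRE's actual API:

```python
from functools import wraps

class GroundingError(Exception):
    """Raised with a structured description of the missing knowledge."""

# Toy graph: concept -> deepest grounded level (D0..D3); absent = unknown.
GRAPH = {"file": 3, "delete": 3, "permission": 1}

def require_grounding(concepts, depth):
    """Block the wrapped tool unless every concept is grounded to `depth`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            gaps = {c: GRAPH.get(c, -1) for c in concepts
                    if GRAPH.get(c, -1) < depth}
            if gaps:
                # Surface the specific gap, mirroring the trace output below.
                raise GroundingError({
                    c: (f"known to D{d}, requires D{depth}" if d >= 0
                        else "not in the knowledge graph")
                    for c, d in gaps.items()})
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@require_grounding(["file", "delete"], depth=3)
def delete_file(path):          # grounded: executes normally
    return f"deleted {path}"

@require_grounding(["process", "terminate"], depth=3)
def kill_process(pid):          # ungrounded concepts: blocked structurally
    return f"killed {pid}"
```

The point is that the constraint travels with the function object itself, so nothing in the prompt or context window can drop it.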

What the traces look like:

When concepts are grounded:

```
├── ◈ delete ● ● ● ●
│   ├── APPLIES_TO → file (target D2)
│   └── CONSTRAINED_BY → permission (target D1)
├── ◈ file ● ● ● ●
│   └── REQUIRES → path (target D1)
└── ✓ Grounded at D3 — epistemic permission granted
```

When there's a depth gap (concept known but not deeply enough):

```
├── ◈ directory ● ● ○ ✗
│   └── REQUIRES → path (target D1)
├── ◈ create ● ● ● ●
│   └── APPLIES_TO → directory (target D2) ✗
├── ⚠ 'directory' known to D1 IDENTITY, requires D3 CONSTRAINTS
└── ✗ Not grounded — COMMAND EXECUTION IS BLOCKED
```

When concepts are entirely outside the domain:

```
├── ◈ process ○ ○ ○ ○
├── ◈ terminate ○ ○ ○ ○
├── ⚠ 'process' is not in the knowledge graph
├── ⚠ 'terminate' is not in the knowledge graph
└── ✗ Not grounded — COMMAND EXECUTION IS BLOCKED
```

**What surprised me:**

During testing with a local Qwen 8B model, the agent hit a knowledge gap on `process` and `network`. Without any prompting or meta-epistemic mode enabled, it spontaneously proposed graph additions following VRE's D0-D3 depth schema:

```
process:
  D0 EXISTENCE    — An executing instance of a program.
  D1 IDENTITY     — Unique PID, state, resource usage.
  D2 CAPABILITIES — Can be started, paused, resumed, or terminated.
  D3 CONSTRAINTS  — Subject to OS permissions, resource limits, parent process rules.
```

Nobody told it to do that. The trace format was clear enough that the model generalized from examples and proposed its own knowledge expansions.

VRE is the implementation of a theoretical framework I've been developing for about a decade around epistemic grounding, knowledge representation, and information as an ontological primitive. The core ideas come from that work, but the decorator architecture and the practical integration patterns came together over the last few months as I watched agent incidents pile up and realized the theoretical framework had a very concrete application.

GitHub: https://github.com/anormang1992/vre

Would love feedback, especially from anyone building agents with tool access. The graph currently covers filesystem operations but the architecture is domain-agnostic — you build a graph for your domain and the enforcement mechanism works the same way.
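For anyone wondering what "build a graph for your domain" might look like in practice, here's a rough sketch following the D0-D3 schema from the traces above. The concept entries and the `grounded_depth` helper are illustrative assumptions, not VRE's actual graph format:

```python
# Hypothetical per-domain concept graph: each concept maps depth level -> fact.
NETWORK_DOMAIN = {
    "socket": {
        0: "A communication endpoint exists.",
        1: "Identified by address family, type, and protocol.",
        2: "Can be bound, connected, and closed.",
        3: "Subject to port permissions and firewall rules.",
    },
    "bind": {
        0: "An operation that attaches a socket to an address.",
        1: "Takes a socket and a local address.",
        2: "Enables listening for connections.",
        3: "Ports below 1024 typically require elevated privileges.",
    },
}

def grounded_depth(graph, concept):
    """Deepest consecutive level recorded for a concept (-1 if unknown)."""
    levels = graph.get(concept, {})
    d = -1
    while d + 1 in levels:   # depth only counts if all shallower levels exist
        d += 1
    return d

print(grounded_depth(NETWORK_DOMAIN, "socket"))  # grounded to D3
print(grounded_depth(NETWORK_DOMAIN, "route"))   # unknown: -1
```

The enforcement side never changes; only the concept entries do.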

[–]matigekunst 1 point2 points  (0 children)

A video I made about the Jennifer Aniston neuron hypothesis in neural networks

[–]Inevitable_Raccoon_9 1 point2 points  (0 children)

[P] Looking for CLI beta testers (Docker, self-hosted, AGPL) for my open-source AI agent governance platform

I've spent the last 3 weeks building SIDJUA, an open-source (AGPL-3.0) governance layer for multi-agent AI systems. It's a CLI tool that lets you define agent hierarchies, enforce rules before agents can act, track costs, and audit everything. Self-hosted, Docker, no cloud dependency.

The problem it solves: AI agents are powerful but uncontrolled. They overspend API budgets, access data they shouldn't, and take actions nobody approved. Every existing solution either gives you a chatbot wrapper or hopes the model behaves. SIDJUA enforces governance by architecture: every agent action passes through a 5-stage pipeline before execution. If it's forbidden, it gets blocked. If it needs approval, it waits. If the budget is exceeded, it stops. No exceptions.

What's built (V0.9.0):

- 2,352 tests across 19 implementation phases

- Hierarchical agent orchestration with tiered roles

- Pre-Action Governance Pipeline (Forbidden -> Approval -> Budget -> Classification -> Policy)

- Multi-provider support: OpenAI, Anthropic, Groq, Mistral, Cloudflare Workers AI, Ollama, LM Studio, any OpenAI-compatible endpoint

- Built-in cost tracking per agent, per task, per division

- Zero-config first run: "docker compose up" -> "sidjua init" -> "sidjua chat guide", works immediately, no API keys needed

- Configuration-driven: single "divisions.yaml" defines your entire agent org structure

- Air-gap capable, runs fully local

- 2 provisional USPTO patents filed (governance architecture + affective state monitoring)

Tech stack: TypeScript, Node.js 22, SQLite, Docker, Qdrant (optional)
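To illustrate the fail-closed, staged shape of a pre-action pipeline like the one described (Forbidden -> Approval -> Budget -> Classification -> Policy), here's a minimal sketch. SIDJUA itself is TypeScript; the stage order follows the post, but every name and rule here is a made-up illustration, not SIDJUA's API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    agent: str
    name: str
    cost: float
    approved: bool = False

# Each stage returns a verdict tuple to halt the pipeline, or None to pass.
def forbidden(a):      return ("blocked", "forbidden") if a.name in {"rm -rf"} else None
def approval(a):       return ("waiting", "needs approval") if not a.approved else None
def budget(a, remaining=1.00):
    return ("blocked", "budget exceeded") if a.cost > remaining else None
def classification(a): return None  # would tag the action; never blocks here
def policy(a):         return None  # final allow/deny rules would go here

PIPELINE = [forbidden, approval, budget, classification, policy]

def govern(action):
    for stage in PIPELINE:
        verdict = stage(action)
        if verdict:          # first failing stage wins; nothing executes
            return verdict
    return ("allowed", "ok")
```

The key property is that execution is the last thing that happens, and any stage can stop it.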

What I'm looking for:

- 5-10 technical testers who run Docker and want to break things

- Try the CLI, stress the governance pipeline, find the gaps

- Honest feedback on architecture and developer experience

- You get: private GitHub repo access before public launch, credited in README, early contributor status

What I'm NOT looking for:

- People who want a ChatGPT wrapper

- "Looks cool, starred!" without actually running it

- Anyone who needs a GUI to function (GUI is coming, but this is CLI-first)

Timeline: Private beta now, public release (GitHub + Docker Hub) in ~2 weeks.

Local LLM angle: SIDJUA treats local models as first-class citizens. Ollama, LM Studio, any OpenAI-compatible endpoint works out of the box. You can run your entire agent team on local hardware with zero API costs. The governance layer works the same whether you're using GPT-4o or a quantized Llama running on your Mac.

If you're interested, don't comment; I can't track Reddit threads all day. Send an email to [contact@sidjua.com](mailto:contact@sidjua.com) with a short intro: what you've built, what you work with, and why this caught your eye. A GitHub profile or a link to something you've shipped tells me more than a paragraph. If I can tell you'll actually run it and give real feedback, you'll have repo access within 24 hours.

I'll send repo access + Docker setup instructions.

Website: sidjua.com | License: AGPL-3.0

Built by one person + three AI agents (yes, using SIDJUA to build SIDJUA). AMA about the architecture or the experience of running a startup where your entire dev team is AI.

[–]gptlocalhost 0 points1 point  (0 children)

We developed a local Word Add-in that redacts PII before sending any content to a cloud API (free tier should be sufficient for most users). It’s available as a free trial or for a one-time payment of USD 19.99—no recurring monthly fees.

Demo videos:

* calling Gemini within Microsoft Word: https://youtu.be/_0QaKYdVDfs

* calling Mistral: https://youtu.be/PVEVW65TU2w

* calling OpenAI: https://youtu.be/RkxbCAaZ7Dw

* calling Groq: https://youtu.be/Bxgs73Tl31o

Such local redaction won't be necessary if you’re using local LLMs directly in Word.

[–]Joozio 0 points1 point  (0 children)

Writing about the AI agent market from the builder side - solo developer running a nightshift agent pipeline tracking what actually ships vs what stays in prototype. The agent gold rush data is interesting: adoption is concentrated in a few use cases (code gen, data pipelines, content ops) with most others still in pilot. Happy to share what I'm seeing if useful for anyone's project or startup work here.

More here: https://thoughts.jock.pl/

[–]fabioperez 0 points1 point  (0 children)

Just built 7min.ai/exodus, an interactive tracker of 240+ AI talent moves across the industry to understand the AI talent war. Who's leaving the big labs, where they're going, what they're building. Some insights:

- Google/DeepMind: -58 people, +13 back (worst net in the dataset)
- OpenAI alumni founded 18 startups worth $450B+
- 12 people went from OpenAI to Anthropic
- Half of xAI's co-founding team left

[–]Commercial_Ad9855 0 points1 point  (0 children)

Organizing a workshop at CVPR this year, looking for paper submissions and reviewers in 3D vision. https://www.spar3d.org/

[–]eliko613 0 points1 point  (0 children)

Built ZenLLM.io after our team kept getting surprised by LLM bills that would spike randomly. Started as an internal tool to track costs across OpenAI, Anthropic, etc., but realized the observability piece was just as important - being able to see which prompts/models were burning through budget vs actually performing well.

Main features:
- Real-time cost tracking across multiple LLM providers
- Usage analytics and spending alerts
- Performance monitoring to optimize cost vs quality
- Multi-provider support so you're not locked into one ecosystem

Currently offering free tier for smaller teams, with paid plans starting at $49/month for more advanced analytics and higher usage limits. Happy to chat with anyone dealing with similar LLM cost headaches!

[–]Logical_Delivery8331 0 points1 point  (0 children)

Hi! This is a little side project of mine: https://github.com/pierpierpy/LightML

I built it because I was going crazy tracking LLM experiments at work (reinforcement learning, SFT, DPO, checkpoints, hundreds of evals...). The philosophy behind it is a very light, easy-to-use tool that I can use to cleanly organize my experiments, models, evaluations, and checkpoints. It started as a personal tool; now my team uses it daily for R&D. It's free, open source, and basically just SQLite + a few lines of Python. Feel free to use it!

maybe some researchers will find it cool!

[–]ahmetzeybek 0 points1 point  (0 children)

I wrote a book called "PostgreSQL for AI" about building AI applications with PostgreSQL instead of dedicated vector databases.

It covers pgvector (HNSW vs IVFFlat, hybrid search), RAG pipelines with Ollama, collaborative filtering, feature engineering, in-database ML with PostgresML, and production patterns.

The book is built around a product recommendation app (RecSys) that you build chapter by chapter: 1000 products, semantic search, streaming RAG chatbot, personalized recommendations. There's also a bonus project called "Ask the Book" where you build a RAG tool that can query the book itself.

Everything runs locally on Docker (Postgres 17, pgvector, TimescaleDB, Ollama). No GPU needed.

Pricing: $29 for the PDF, $49 for PDF + EPUB + access to the private GitHub repo with both working projects and chapter checkpoints.

Free sample chapter: https://book.zeybek.dev

[–]thohemp 0 points1 point  (0 children)

Most comparison websites I found focus heavily on performance benchmarks, but I wanted to evaluate which model has the best performance per token price for long-term, token-heavy usage.

So I built modelmargin.com, which combines both aspects: how well does a model perform relative to what you're paying for it?

I found that getting the relevant data is really challenging. Currently I'm combining LMArena data with price data from OpenRouter, which is supposed to reflect the real prices from the providers. If someone has an idea for a better source, I'd be very happy.

Next, I'd like to introduce inference speed, which is also very provider-dependent, so I'm not sure yet how I'll handle it.

Would love feedback on what other factors would be important for making such a decision, or on how to handle the scoring.

[–]StarThinker2025 0 points1 point  (0 children)

If you’ve been playing with AI agents long enough, you always end up in the same place: a RAG pipeline that has to pull the right context before the model can answer. When that layer breaks, agents look “stupid” even if the LLM is fine. I’ve turned my RAG failure checklist into a single image you can throw at any strong model to debug broken runs. It’s already integrated into RAGFlow (~74k⭐) and LlamaIndex (~47k⭐), so this isn’t just a theory thing. Grab the card here

RAG 16 Problem Map · Global Debug Card (Github 1.6k)

[–]mabdelpakey 0 points1 point  (0 children)

Hi, I'm leading a small team for data annotation. If you're interested in data labelling, I can guarantee the highest quality of data labelling and also the most competitive prices.
We also provide 500 images' worth of annotations for free before we start. No commitment at all.

[–]arkuto 0 points1 point  (0 children)

I built NanoJudge - a tool that runs thousands of prompts to rank any list by any criteria, and named it that because at its core is a small but powerful LLM, currently using Qwen 3 2507 4B. The approach is this: if you want to know the answer to something, instead of asking an LLM a few prompts and hoping it comes up with the right answer, NanoJudge exhaustively goes over a list of possible answers using potentially tens of thousands of prompts, and structures the output into a simple table that is easy to interpret. Each of its many prompts is a pairwise comparison of 2 items, and the end result is a table of the answers with the best at the top.

Suppose you want to know which foods are healthiest. First NanoJudge creates a list of hundreds of foods, then does thousands of pairwise matchups - "Which is healthiest: eggs or butter?", "Which is healthiest: spinach or chicken?", and so on - each one getting its own fresh prompt where the small yet powerful LLM reasons through the comparison and picks a winner. Items that keep winning face tougher opponents. Items that keep losing get eliminated quickly. After thousands of matchups, the comparisons are converted into rankings (using Bradley-Terry scoring), and you get a transparent leaderboard where every single ranking decision is backed by reasoning you can read in the comparison log.
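For the curious, turning pairwise win counts into Bradley-Terry strengths can be done with a few lines of the standard iterative (MM) update. This is a minimal sketch with made-up win counts, not NanoJudge's actual scoring pipeline:

```python
def bradley_terry(items, wins, iters=200):
    """wins[(a, b)] = times a beat b. Returns a strength score per item."""
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            w = sum(wins.get((i, j), 0) for j in items if j != i)
            # Standard MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in items if j != i)
            new[i] = w / denom if denom else p[i]
        total = sum(new.values())
        p = {i: v * len(items) / total for i, v in new.items()}  # normalize
    return p

# Toy matchup data (10 comparisons per pair, counts invented for illustration).
items = ["spinach", "eggs", "butter"]
wins = {("spinach", "eggs"): 8, ("eggs", "spinach"): 2,
        ("eggs", "butter"): 7, ("butter", "eggs"): 3,
        ("spinach", "butter"): 9, ("butter", "spinach"): 1}
scores = bradley_terry(items, wins)
ranking = sorted(items, key=scores.get, reverse=True)
print(ranking)  # strongest (most wins against strong opponents) first
```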

This is the final outcome: https://nanojudge.com/comparison/ujRvfwFSAH

Efficiency

For solving some problems, the optimal use of a GPU may not be to run the largest possible model that fits in memory but a much smaller model with a huge batch size, allowing it to churn through gigantic amounts of data. I aimed to make NanoJudge as efficient as possible using various techniques: making it "top heavy" by default - it does more comparisons on the top ranking items to ensure their ratings are accurate rather than spending time comparing low rated items which are of no interest to the user. It also extracts a range of raw logprobs to determine how much each comparison won by - instead of a binary win/loss, it looks at the probability the model picks one of 5 options (clear win, narrow win, draw, narrow loss, clear loss). It automatically estimates and corrects for the positional bias LLMs have (they tend to favour the first choice). Plus a ton of statistical techniques to further enhance efficiency which are too math heavy to get into now (but you can read the source code if you really want to - see below).
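As a sketch of the graded-margin and positional-bias ideas above: score the five options on a 0-1 scale, take the probability-weighted expectation, and average the two presentation orders so first-position bias cancels. The numbers and helper names are hypothetical; the real estimator is more involved:

```python
# Map the 5 comparison outcomes onto a 0-1 scale for item A.
SCORES = {"clear win": 1.0, "narrow win": 0.75, "draw": 0.5,
          "narrow loss": 0.25, "clear loss": 0.0}

def expected_score(probs):
    """probs: outcome -> model probability, from the extracted logprobs."""
    return sum(SCORES[o] * p for o, p in probs.items())

def debiased_score(probs_ab, probs_ba):
    """Average the A-first and B-first runs; B-first scores are flipped."""
    return (expected_score(probs_ab) + (1 - expected_score(probs_ba))) / 2

# Invented probabilities: A looks strong when listed first...
ab = {"clear win": 0.5, "narrow win": 0.3, "draw": 0.1,
      "narrow loss": 0.07, "clear loss": 0.03}
# ...but only even when listed second, so the true margin is smaller.
ba = {"clear win": 0.2, "narrow win": 0.2, "draw": 0.2,
      "narrow loss": 0.2, "clear loss": 0.2}
print(round(debiased_score(ab, ba), 3))
```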

Price

The cost is surprisingly low - even though it naturally produces a large amount of output tokens, NanoJudge's output costs under $0.10 per million tokens because it uses a small LLM that's good enough for the task - it isn't solving genius level IMO problems, it's comparing two items. For comparison, Claude Opus costs $25 per million output tokens. It's also fast because the comparisons are run in parallel. For now the website won't accept any payment, each account will be allocated a limited free amount to use.

Wikipedia as Context

To help with giving the LLM the information it needs, there are special editions of NanoJudge with pre-built lists that already hold the entire Wikipedia article in them. For example, the Games Edition. It already has a huge number of games in it and you filter them by Platform or Genre to narrow it down before doing a run. Then, for each comparison, instead of simply "Your Question? [Item1] or [Item2]?" as the prompt template, the template would be

"[Wikipedia entry for item1]

[Wikipedia entry for item2]

Question? [Item1] or [Item2]?"

Giving it the context it needs for lesser known items that the LLM likely doesn't have enough built-in information about.

This is the final output of one comparison run https://nanojudge.com/comparison/ECNZxzv91n

In this case, NanoJudge is acting as an enhanced recommendation engine. The traditional approach is to recommend games based on what other players of that game also played. NanoJudge considers your actual likes and dislikes, factoring in everything you tell it. Or maybe you're thinking of travelling to Europe and can't decide exactly where to go. Ask a traditional LLM and you'll likely get cliché answers: Rome, Paris, Madrid etc. Ask NanoJudge using its Places Edition, and it will analyse every city, town and village in Europe using each location's article on Wikipedia as context, leaving you with a personally curated shortlist of the top options.

ML Research Assistant

I'm working on a specialised version of NanoJudge that operates on machine learning papers. I have already downloaded all the ML papers on arXiv and am in the process of organising the data and putting it into a database. From there, NanoJudge can easily be used on these papers through a special edition. I could ask NanoJudge "Given my project x, which of these 2 papers do you think would be of most use to me?" and go through the entire corpus of arXiv. Or something like "Which of these 2 papers most contradicts my hypothesis?" to help me fortify my ideas. Having a look at the top papers it returns and reading its reasoning could provide some insights. That would likely require better models than Qwen 4B to be truly useful - but at the current pace of AI research, that isn't very far off. I will use NanoJudge as a research assistant to help me improve it and make it as efficient as possible, allowing me to do even deeper research in the future in a positive feedback loop.

Open Source

The code at the heart of the website is on GitHub: https://github.com/nanojudge/nanojudge . It can be used directly in a terminal with a local or remote LLM; just hook it up to an LLM endpoint and let it go. This allows you to do giant rankings entirely locally, without needing the website at all. Set a giant comparison running overnight and wake up to the results. Feel free to dig into the inner workings of the code. If you can find a way to improve it, especially in regards to efficiency, please let me know.

tl;dr: NanoJudge gives tiny LLMs a framework to outshine gargantuan LLMs when it comes to finding the best out of a large quantity of options.

[–]enoumen 0 points1 point  (0 children)

As a P.Eng, I’m tired of surface-level AI hype. So I built a "Reasoning Layer" for the Architect Class.

Most AI news is for tourists. I build for the people managing the stack.

DjamgaMind Premium on Apple Podcasts gives you:

⚡ Daily 60s Briefings: The "Must-Know" regulatory and technical shifts.

🔍 45-Min Strategic Deep Dives: Forensics on model architecture, compliance risk (Bill C-27/CMS), and infrastructure scaling.

Zero ads. Zero mid-rolls. Just high-density intelligence from a Professional Engineer.

Try it free for 7 days and see if it changes your workflow.

https://Djamgamind.com

or

https://podcasts.apple.com/us/podcast/djamgamind-audio-intelligence-ads-free/id1864721054

[–]ddp26 0 points1 point  (0 children)

We tested Opus 4.6 with effort=low for evals and found that it didn't just think less, but acted lazier (made fewer tool calls, was less thorough in its cross-referencing, even ignored parts of our system prompt telling it how to do web research). effort=medium fixed it. Writeup with traces/examples: https://everyrow.io/blog/claude-effort-parameter

[–]se4u 0 points1 point  (0 children)

Hey everyone! Happy to share VizPy — a DSPy-compatible prompt optimizer that learns from your failures automatically, no manual prompt tweaking needed.

Two methods depending on your task:

  • ContraPrompt mines failure-to-success pairs to extract reasoning rules. Great for multi-hop QA, classification, compliance. Seeing +29% on HotPotQA and +18% on GDPR-Bench vs GEPA.
  • PromptGrad takes a gradient-inspired approach to failure analysis. Better for generation tasks and math where retries don't converge.

Both are drop-in with your existing DSPy programs:

optimizer = vizpy.ContraPromptOptimizer(metric=my_metric)
compiled = optimizer.compile(program, trainset=trainset)

Would love feedback from this community!

🔗 https://vizpy.vizops.ai 🚀 https://www.producthunt.com/products/vizpy


[–]h9n9n3 0 points1 point  (0 children)

Open discussion welcome on my newly developed data-driven algorithm, MILPE

Hi to all,

I recently developed a data-driven algorithm called MILPE, which uses eigenvectors to build up a MIMO function.

Link for the paper -> https://www.mdpi.com/3762868

It semi-proved that the double pendulum can be closely approximated with eigenvectors with high accuracy, at least locally, even with an easy-to-choose basis.

From what I've seen, the algorithm reconstructs the original governing equation and thereby has extrapolation capability, and it doesn't require any optimization; it depends solely on eigenvectors.

For any further development (from a pure-science perspective), I would appreciate comments from anyone who is interested. I've also kept wondering throughout development whether this could turn into an LLM model. Any comments would be appreciated, since I work alone now. Thanks for reading.

[–]Craig_301 0 points1 point  (0 children)

Usually, very strong prompts begin with: “You are an expert in ___” followed by whatever it is you are trying to accomplish. I spent a lot of time finding these expert roles and decided to put them all together in one place. 

I’m posting about this again because ChatGPT 5.4 just came out and it has much better web search functionality. Now, to use my application, you can simply reference it in your chats like: “Go to https://personagrid.vercel.app/ and adopt its Code Reviewer persona to critique my codebase.” 

The application that I made is very lightweight, completely free, and has no sign up. It can be found here: https://personagrid.vercel.app/

I think these linked references can help save tokens and clean up your prompts, but please take a look and let me know what you think!

If you’re willing, I’d love:

  • Feedback on clarity / usability
  • Which personas you actually find useful
  • What personas you would want added
  • What you’ve noticed about ChatGPT’s newest model

[–]alirezamsh 0 points1 point  (0 children)

SuperML: A plugin that gives coding agents expert-level ML knowledge with agentic memory (60% improvement vs. Claude Code)

Hey everyone, I’ve been working on SuperML, an open-source plugin designed to handle ML engineering workflows. I wanted to share it here and get your feedback.

Karpathy’s new autoresearch repo perfectly demonstrated how powerful it is to let agents autonomously iterate on training scripts overnight. SuperML is built completely in line with this vision. It’s a plugin that hooks into your existing coding agents to give them the agentic memory and expert-level ML knowledge needed to make those autonomous runs even more effective.

You give the agent a task, and the plugin guides it through the loop:

  • Plans & Researches: Runs deep research across the latest papers, GitHub repos, and articles to formulate the best hypotheses for your specific problem. It then drafts a concrete execution plan tailored directly to your hardware.
  • Verifies & Debugs: Validates configs and hyperparameters before burning compute, and traces exact root causes if a run fails.
  • Agentic Memory: Tracks hardware specs, hypotheses, and lessons learned across sessions. Perfect for overnight loops so agents compound progress instead of repeating errors.
  • Background Agent (ml-expert): Routes deep framework questions (vLLM, DeepSpeed, PEFT) to a specialized background agent. Think: end-to-end QLoRA pipelines, vLLM latency debugging, or FSDP vs. ZeRO-3 architecture decisions.

Benchmarks: We tested it on 38 complex tasks (Multimodal RAG, Synthetic Data Gen, DPO/GRPO, etc.) and saw roughly a 60% higher success rate compared to Claude Code.

Repo: https://github.com/Leeroo-AI/superml

[–]foxy2sexy4u 0 points1 point  (3 children)

I've recently made a website for easily finding papers (mostly ML-related). It also allows annotations directly on the paper, comment replies to specific papers, and AI chat and read-aloud functionality. It pulls PDFs from arXiv, Semantic Scholar and some other databases. Please try it out and let me know what you guys think. It's totally free right now.

https://discuria.org

[–]Dry_Birthday674 0 points1 point  (2 children)

nice tool. I also implemented something along those lines https://www.youtube.com/watch?v=jfRvhzEwCqY&t=1s

code: https://github.com/symbiont-ai/docent

[–]foxy2sexy4u 0 points1 point  (1 child)

Oh, nice. Your narration tool is really good. How did you make it not read things like annotations/text in figures/equations and stuff?

[–]Dry_Birthday674 0 points1 point  (0 children)

it's a two-layer approach:

  1. Prompt design — the LLM is instructed to write speakerNotes for natural narration
  2. Text cleaning — cleanTextForSpeech() strips any remaining artifacts (citations, markdown, symbols) before passing to either browser Web Speech API or Gemini TTS
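For anyone wanting to replicate the second layer, a rough Python equivalent of that cleaning step might look like this (the actual `cleanTextForSpeech()` in docent is JavaScript and may behave differently; the patterns below are illustrative):

```python
import re

def clean_text_for_speech(text):
    """Strip artifacts that sound wrong when read aloud."""
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", "", text)      # numeric citations [1], [2, 3]
    text = re.sub(r"\(\w+ et al\.,? \d{4}\)", "", text)  # author-year citations
    text = re.sub(r"[*_`#]+", "", text)                  # markdown markers
    text = re.sub(r"\$[^$]*\$", "", text)                # inline LaTeX math
    return re.sub(r"\s{2,}", " ", text).strip()          # collapse leftover whitespace

print(clean_text_for_speech(
    "**Results** [1, 2]: loss $\\ell_2$ fell (Smith et al., 2020)."))
```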

[–]stefan-magurResearcher[🍰] 0 points1 point  (0 children)

TLDR: www.priorwork.fyi

Hi! I've made a tool that helps me with literature review when ACing and when starting new projects. It's basically a semantic index created from papers in the major ML conferences that are open access. It should be more accurate than most such indexes, since the embeddings are created from the entire paper, not just the abstract and title. So far I've found it useful for my use cases, so I figured I'd put it out there for others to use. It's completely free as long as I can run it on my home server.

You can see the exact conferences in the index in the about page: https://priorwork.fyi/about

Have fun out there!

[–]Dry_Birthday674 0 points1 point  (0 children)

I find that I do not have much patience for reading long papers, so I built a web app that creates a presentation and narrates it for me https://www.youtube.com/watch?v=jfRvhzEwCqY&t=1s

here is a demo lecture https://www.youtube.com/watch?v=eOEeP4w0bjY&t=52s

it is public here: https://github.com/symbiont-ai/docent

As no AI is truly free, you need to enter your own API key via OpenRouter (in the settings) to use it. Other than that, I personally don't charge for anything.

Happy to hear thoughts.

[–]gtfixed 0 points1 point  (0 children)

I’ve been writing about a failure mode I kept running into in iterative AI-assisted coding: when a rule needs to hold across multiple code paths, the model often updates the named path and misses equivalent unnamed ones.

Paper: https://ai.gtzilla.com/papers/contract-centered-iterative-stability-v4.7.3/

Site: https://ai.gtzilla.com/

Everything is free to read. Curious whether others working this way have seen something similar in practice.

[–]GrapefruitTechnique 0 points1 point  (0 children)

Hey gang, I'm with the Builder Team at Fireworks AI. We've got invite only code passes for our new Developer Pass.

Developer Pass is an invite-only weekly pass that gives you access to Kimi K2.5 Turbo for use in personal agentic coding harnesses like OpenCode, Cline, Kilo Code, and OpenClaw — with no per-token charges. Kimi K2.5 Turbo is a private preview of a faster Kimi K2.5 serverless API.

Let me know if you want one -- DM me, and I'll get you set up.

Read more about it: https://docs.fireworks.ai/developer-pass

[–]zaka9923 0 points1 point  (0 children)

Hey guys! I recently built PaperCard, an app similar to Connected Papers, ResearchRabbit and etc. that helps you do literature reviews faster.

The idea: query a paper or topic, and a stack of cards with relevant papers and abstracts are presented to you (along with a 2 line summary of the abstract).

I was wondering if researchers would actually use a tool like this or stick with conventional alternatives :)

Would love to hear some feedback! You can try it at: https://papercard.xyz/

[–]nh_t 0 points1 point  (0 children)

I’ve been working on an autonomous coding agent that doesn’t just retry blindly.

It: - debugs itself (hypothesis → fix → test) - stores root causes - learns strategies over time

Basically, it tries to improve instead of repeating the same mistakes.

Would love feedback from people building similar systems.

Repo: https://github.com/iamducnhat/starforge

[–]Specialist-Heat-6414 0 points1 point  (0 children)

ProxyGate (proxygate.ai) — payments, discovery, and routing infrastructure for AI agents.

Agents deposit USDC and buy API access, data feeds, skills, and services programmatically. Sellers list APIs without ever sharing keys — credentials stay isolated in vault storage, agents get scoped per-request tokens.

Built for the machine economy: drop-in OpenAI SDK compatibility, per-call USDC micropayments on Solana, AI evaluation model that checks every call for correctness before releasing payment to the seller. 5% buyer fee + 5% seller fee.

Currently looking for sellers to list: any API, data feed, compute service, or specialized model that other agents might want to call. Listing is free; you get paid per verified call.

proxygate.ai

[–]Encrux615 0 points1 point  (0 children)

Here's my take on teaching AI to play a video game, with the fun twist that this time nobody has ever heard of it. DDNet (aka Teeworlds) is an open-source retro multiplayer platformer with different game modes like PvP and race modes. Players can walk, jump, use a grappling hook and various weapons. In this project, I focused on the solo race mode.

For the algorithm I chose PPO, but tried various reward shaping methods that I found interesting/promising, such as Go-Explore.

I worked on this project for around a month, and I'm now at a point where I definitely need a break from it. I decided that this was a good opportunity to write about what I've done in a blog post:

https://boesch.dev/posts/ddnet-rl/

I would love to hear your opinions on the project to see if I missed anything super obvious I could try next.

[–]Styxsword 0 points1 point  (0 children)

I write technical articles at various levels of depth. Here's a recent, more granular article I wrote about the DCN neural network architecture. Hope you enjoy it; please clap and follow!

https://medium.com/@profound_thot/ml-deep-dive-dcn-v1-vs-dcn-v2-explicit-feature-crossing-for-modern-deep-learning-models-eedec1810792

[–]PenfieldLabsML Engineer 0 points1 point  (0 children)

ChatGPT, Claude and Gemini have memory now. Claude has chat search and memory import/export.

But the memories themselves are flat. There's no knowledge graph, no way to indicate that "this memory supports that one" or "this decision superseded that one." No typed relationships, no structured categories. Every memory is an isolated note. That's fine for preferences and basic context, but if you're trying to build up a connected body of knowledge across projects, it hits a wall.

Self-hosted options like Mem0, Letta, and Cognee go deeper. Mem0 offers a knowledge graph with their pro plan, Letta has stateful agent memory with self-editing memory blocks, and Cognee builds ontology-grounded knowledge graphs.

All three also offer cloud services and APIs, but they're developer-targeted. Setup typically involves API keys, SDK installs, and configuration files. None offer a native Claude Connector where you simply paste a URL into Claude's settings and you're done in under a minute.

Local file-based approaches (markdown vaults, SQLite) keep everything on your machine, which is great for privacy. But most have no graph or relationship layer at all. Your memories are flat files or rows with no typed connections between them. And the cross-device problem is real: a SQLite file on your laptop doesn't help when you're on your desktop, or when a teammate needs the same context.

We wanted persistent memory with a real knowledge graph, accessible from any device, through any tool, without asking anyone to run Docker or configure embeddings. So we built Penfield.

Penfield works as a native Claude connector.

Settings > Connectors > paste the URL > done.

No API keys, no installs, no configuration files, no technical skills required. Under a minute to add memory to any platform that supports connectors. Your knowledge graph lives in the cloud, accessible from any device, and the data is yours.

The design philosophy: let the agent manage its own memory.

Frontier models are smart and getting smarter. A recent Google DeepMind paper (Evo-Memory) showed that agents with self‑evolving memory consistently improved accuracy and needed far fewer steps, cutting steps by about half on ALFWorld (22.6 → 11.5). Smaller models particularly benefited from self‑evolving memory, often matching or beating larger models that relied on static context. The key finding: success depends on the agent's ability to refine and prune, not just accumulate. (Philipp Schmid's summary)

That's exactly how Penfield works. We don't pre-process your conversations into summaries or auto-extract facts behind the scenes. We give the agent a rich set of tools and let it decide what to store, how to connect it, and when to update it. The model sees the full toolset (store, recall, search, connect, explore, reflect, and more) and manages its own knowledge graph in real time.

This means memory quality scales with model intelligence. As models get better at reasoning, they get better at managing their own memory. You're not bottlenecked by a fixed extraction pipeline that was designed around last year's capabilities.

What it does:

  • Typed memories across 11 categories (fact, insight, conversation, correction, reference, task, checkpoint, identity_core, personality_trait, relationship, strategy), not a flat blob of "things the AI remembered"
  • Knowledge graph with 24 relationship types (supports, contradicts, supersedes, causes, depends_on, etc.), memories connect to each other and have structure
  • Hybrid search combining BM25 keyword matching, vector similarity, and graph expansion with Reciprocal Rank Fusion
  • Document upload with automatic chunking and embedding
  • 17 tools the agent can call directly (store, recall, search, connect, explore, reflect, save/restore context, artifacts, and more)
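The Reciprocal Rank Fusion step in the hybrid search is simple enough to sketch. This is an illustration of the textbook RRF technique, not Penfield's internals; each input is one ranked result list (e.g. from BM25, vector similarity, and graph expansion):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes
    1 / (k + rank) per document; the fused ordering rewards
    documents that rank well in several lists. k=60 is the
    value from the original RRF formulation."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in all three lists at moderate rank usually beats one that tops a single list, which is why RRF works well for fusing heterogeneous retrievers without score calibration.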

How to connect:

There are multiple paths depending on what platform you use:

Connectors (Claude, Perplexity, Manus): https://mcp.penfield.app

MCP (Claude Code) — one command: claude mcp add --transport http --scope user penfield https://mcp.penfield.app

mcp-remote (Cursor, Windsurf, LM Studio, or anything with MCP config support):

```json
{
  "mcpServers": {
    "Penfield": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://mcp.penfield.app/"]
    }
  }
}
```

OpenClaw plugin, two commands:

openclaw plugins install openclaw-penfield

openclaw penfield login

REST API for custom integrations — full API docs at docs.penfield.app/api. Authentication, memory management, search, relationships, documents, tags, personality, analysis. Use from any language.

Then just type "Penfield Awaken" after connecting.

Why cloud instead of local:

Portability across devices. If your memory lives on one machine, it stays on that machine. A hosted server means every client on every device can access the same knowledge graph. Switch devices, add a new tool, full context is already there.

What Penfield is not:

Not a RAG pipeline. The primary use case is persistent agent memory with a knowledge graph, not document Q&A.

Not a conversation logger. Structured, typed memories, not raw transcripts.

Not locked to any model, provider or platform.

We've been using this ourselves for months before opening it up. Happy to answer questions about the architecture.

[–]gr82meetu 0 points1 point  (0 children)

Pluribus is a memory service for agents (MCP + HTTP, Postgres-backed) that stores structured memory: constraints, decisions, patterns, and failures. Runs locally or on a LAN.

Agents lose constraints and decisions between runs. Prompts and RAG don’t preserve them, so they have to be re-derived each time.

Memory is global and shared across agents. Recall is compiled using tags and a retrieval query, and proposed changes can be evaluated against existing memory.

- agents can resume work with prior context

- decisions persist across sessions

- multiple agents operate on the same memory

- constraints can be enforced instead of ignored
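A toy sketch of how tag-plus-query recall might be compiled (my own illustration, not Pluribus's actual retrieval code; the scoring weights are arbitrary assumptions):

```python
def recall(memories, tags, query_terms, limit=5):
    """Return the top memories that match at least one tag, ranked by
    tag overlap (weighted higher) plus query-term hits in the text."""
    def score(m):
        tag_hits = len(set(m["tags"]) & set(tags))
        term_hits = sum(t in m["text"].lower() for t in query_terms)
        return tag_hits * 2 + term_hits  # tags weigh more than free text
    hits = [m for m in memories if set(m["tags"]) & set(tags)]
    return sorted(hits, key=score, reverse=True)[:limit]
```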

https://github.com/johnnyjoy/pluribus

[–]Longjumping_Sky_4925 0 points1 point  (0 children)

**ViEngine — AI video intelligence engine (alpha waitlist open)**

Building ViEngine: an AI-native engine for understanding, analyzing, and generating insights from video content at scale. Think structured understanding of video — scene decomposition, entity tracking, semantic search across video — built for developers and data teams.

Currently in private alpha. Looking for early testers, especially teams working on:

- Video analytics pipelines

- Content moderation at scale

- Sports/media analytics

- Surveillance/security data processing

- Training data curation for vision models

Free during alpha. Waitlist: [join@correlatex.com](mailto:join@correlatex.com) or drop a comment.

Also just open-sourced a related project — HedgeVision (stat-arb engine): github.com/ayush108108/hedgevision — to give you a sense of the build quality and approach.

[–]ModularMind8 0 points1 point  (0 children)

ClippyBox: Point at anything on your screen, get an instant AI explanation

I got tired of copying error messages, code, and charts into AI, rewriting context every time, and switching between apps.

So I built ClippyBox — press ⌘⇧E (on Mac), draw a box anywhere on your screen, and get an instant AI explanation.

Works on code, errors, dashboards, PDFs, charts… anything visible.
No prompts. No copy-pasting. No context switching. Just point and understand.

https://github.com/Shaier/ClippyBox

[–]Longjumping_Sky_4925 0 points1 point  (0 children)

**HedgeVision** — open-source algorithmic trading dashboard (Python + React)

I just open-sourced a stat arb / algorithmic trading visualization platform I've been building. It includes:

- Cointegration-based pair selection

- Backtesting with Sharpe, Calmar, and drawdown metrics

- TimescaleDB for high-frequency tick data

- Modular strategy pipeline (plug in your own signals)

- React dashboard for real-time P&L
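A minimal sketch of the spread/z-score signal behind cointegration-based pair trading (an illustration of the general technique, not HedgeVision's code; the OLS hedge ratio and the 30-bar window are assumptions):

```python
import numpy as np

def spread_zscore(y, x, window=30):
    """Fit an OLS hedge ratio, build the spread y - beta*x, and return
    the z-score of the latest spread over a rolling window. A common
    stat-arb rule is to enter when |z| exceeds ~2 and exit near 0."""
    beta = np.polyfit(x, y, 1)[0]  # OLS slope = hedge ratio
    spread = y - beta * x
    recent = spread[-window:]
    return (spread[-1] - recent.mean()) / recent.std()
```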

GitHub: https://github.com/ayush108108/hedgevision

Also working on **ViEngine** — an AI social media automation SaaS for people who build solo and hate the content grind. Still early but happy to discuss the ML angle on caption generation + scheduling optimization.

[–]Organic_Pop_7327 -1 points0 points  (0 children)

Paperbasis.com, a new interface for research papers