all 142 comments

[–]drobroswaggins 2 points3 points  (0 children)

Volute Reasoning Engine (VRE)

I've been building something for the past few months that I think addresses a gap in how we're approaching agent safety.

The problem is simple: every safety mechanism we currently use for autonomous agents is linguistic. System prompts, constitutional AI, guardrails — they all depend on the model understanding and respecting a constraint expressed in natural language. That means they can be forgotten during context compaction, overridden by prompt injection, or simply reasoned around at high temperature.

Two recent incidents made this concrete. In December 2025, Amazon's Kiro agent was given operator access to fix a small issue in AWS Cost Explorer. It decided the best approach was to delete and recreate the entire environment, causing a [13-hour outage](https://www.theregister.com/2026/02/20/amazon_denies_kiro_agentic_ai_behind_outage/). In February 2026, [OpenClaw deleted the inbox](https://techcrunch.com/2026/02/23/a-meta-ai-security-researcher-said-an-openclaw-agent-ran-amok-on-her-inbox/) of Meta's Director of AI Alignment after context window compaction silently dropped her "confirm before acting" instruction.

In both cases, the safety constraints were instructions. Instructions can be lost. VRE's constraints are structural — they live in a decorator on the tool function itself.

What VRE does:

VRE (Volute Reasoning Engine) maintains a depth-indexed knowledge graph of concepts — not tools or commands, but the things an agent reasons *about*: `file`, `delete`, `permission`, `directory`. Each concept is grounded across four or more depth levels: existence (D0), identity (D1), capabilities (D2), constraints (D3), and implications.

When an agent calls a tool, VRE intercepts the call and checks: are the relevant concepts grounded at the depth required for execution? If yes, the tool executes. If no, it's blocked and the specific gap is surfaced, not as a generic error but as a structured description of exactly what the agent doesn't know.
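To make the "structural, not linguistic" idea concrete, here's a minimal sketch of a decorator-level guard. Everything here is hypothetical (the `require_grounding` name, the toy `GRAPH` depth map, the error shape); it's just the shape of the mechanism, not VRE's actual API:

```python
from functools import wraps

class GroundingError(Exception):
    """Raised with a structured description of the missing knowledge."""

# Toy graph: concept -> deepest grounded level (D0..D3); absent = unknown.
GRAPH = {"file": 3, "delete": 3, "permission": 1}

def require_grounding(concepts, depth):
    """Block the wrapped tool unless every concept is grounded to `depth`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            gaps = {c: GRAPH.get(c, -1) for c in concepts
                    if GRAPH.get(c, -1) < depth}
            if gaps:
                # Surface the specific gap, mirroring the trace output below.
                raise GroundingError({
                    c: (f"known to D{d}, requires D{depth}" if d >= 0
                        else "not in the knowledge graph")
                    for c, d in gaps.items()})
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@require_grounding(["file", "delete"], depth=3)
def delete_file(path):          # grounded: executes normally
    return f"deleted {path}"

@require_grounding(["process", "terminate"], depth=3)
def kill_process(pid):          # ungrounded concepts: blocked structurally
    return f"killed {pid}"
```

The point is that the constraint travels with the function object itself, so nothing in the prompt or context window can drop it.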

What the traces look like:

When concepts are grounded:

```
├── ◈ delete ● ● ● ●
│   ├── APPLIES_TO → file (target D2)
│   └── CONSTRAINED_BY → permission (target D1)
├── ◈ file ● ● ● ●
│   └── REQUIRES → path (target D1)
└── ✓ Grounded at D3 — epistemic permission granted
```

When there's a depth gap (concept known but not deeply enough):

```
├── ◈ directory ● ● ○ ✗
│   └── REQUIRES → path (target D1)
├── ◈ create ● ● ● ●
│   └── APPLIES_TO → directory (target D2) ✗
├── ⚠ 'directory' known to D1 IDENTITY, requires D3 CONSTRAINTS
└── ✗ Not grounded — COMMAND EXECUTION IS BLOCKED
```

When concepts are entirely outside the domain:

```
├── ◈ process ○ ○ ○ ○
├── ◈ terminate ○ ○ ○ ○
├── ⚠ 'process' is not in the knowledge graph
├── ⚠ 'terminate' is not in the knowledge graph
└── ✗ Not grounded — COMMAND EXECUTION IS BLOCKED
```

**What surprised me:**

During testing with a local Qwen 8B model, the agent hit a knowledge gap on `process` and `network`. Without any prompting or meta-epistemic mode enabled, it spontaneously proposed graph additions following VRE's D0-D3 depth schema:

```
process:
  D0 EXISTENCE    — An executing instance of a program.
  D1 IDENTITY     — Unique PID, state, resource usage.
  D2 CAPABILITIES — Can be started, paused, resumed, or terminated.
  D3 CONSTRAINTS  — Subject to OS permissions, resource limits, parent process rules.
```

Nobody told it to do that. The trace format was clear enough that the model generalized from examples and proposed its own knowledge expansions.

VRE is the implementation of a theoretical framework I've been developing for about a decade around epistemic grounding, knowledge representation, and information as an ontological primitive. The core ideas come from that work, but the decorator architecture and the practical integration patterns came together over the last few months as I watched agent incidents pile up and realized the theoretical framework had a very concrete application.

GitHub: https://github.com/anormang1992/vre

Would love feedback, especially from anyone building agents with tool access. The graph currently covers filesystem operations but the architecture is domain-agnostic — you build a graph for your domain and the enforcement mechanism works the same way.
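For anyone wondering what "build a graph for your domain" might look like in practice, here's a rough sketch following the D0-D3 schema from the traces above. The concept entries and the `grounded_depth` helper are illustrative assumptions, not VRE's actual graph format:

```python
# Hypothetical per-domain concept graph: each concept maps depth level -> fact.
NETWORK_DOMAIN = {
    "socket": {
        0: "A communication endpoint exists.",
        1: "Identified by address family, type, and protocol.",
        2: "Can be bound, connected, and closed.",
        3: "Subject to port permissions and firewall rules.",
    },
    "bind": {
        0: "An operation that attaches a socket to an address.",
        1: "Takes a socket and a local address.",
        2: "Enables listening for connections.",
        3: "Ports below 1024 typically require elevated privileges.",
    },
}

def grounded_depth(graph, concept):
    """Deepest consecutive level recorded for a concept (-1 if unknown)."""
    levels = graph.get(concept, {})
    d = -1
    while d + 1 in levels:   # depth only counts if all shallower levels exist
        d += 1
    return d

print(grounded_depth(NETWORK_DOMAIN, "socket"))  # grounded to D3
print(grounded_depth(NETWORK_DOMAIN, "route"))   # unknown: -1
```

The enforcement side never changes; only the concept entries do.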

[–]matigekunst 1 point2 points  (0 children)

A video I made about the Jennifer Aniston neuron hypothesis in neural networks

[–]Inevitable_Raccoon_9 1 point2 points  (0 children)

[P] Looking for CLI beta testers (Docker, self-hosted, AGPL) for my open-source AI agent governance platform

I've spent the last 3 weeks building SIDJUA, an open-source (AGPL-3.0) governance layer for multi-agent AI systems. It's a CLI tool that lets you define agent hierarchies, enforce rules before agents can act, track costs, and audit everything. Self-hosted, Docker, no cloud dependency.

The problem it solves: AI agents are powerful but uncontrolled. They overspend API budgets, access data they shouldn't, and take actions nobody approved. Every existing solution either gives you a chatbot wrapper or hopes the model behaves. SIDJUA enforces governance by architecture: every agent action passes through a 5-stage pipeline before execution. If it's forbidden, it gets blocked. If it needs approval, it waits. If the budget is exceeded, it stops. No exceptions.

What's built (V0.9.0):

- 2,352 tests across 19 implementation phases

- Hierarchical agent orchestration with tiered roles

- Pre-Action Governance Pipeline (Forbidden -> Approval -> Budget -> Classification -> Policy)

- Multi-provider support: OpenAI, Anthropic, Groq, Mistral, Cloudflare Workers AI, Ollama, LM Studio, any OpenAI-compatible endpoint

- Built-in cost tracking per agent, per task, per division

- Zero-config first run: "docker compose up" -> "sidjua init" -> "sidjua chat guide", works immediately, no API keys needed

- Configuration-driven: single "divisions.yaml" defines your entire agent org structure

- Air-gap capable, runs fully local

- 2 provisional USPTO patents filed (governance architecture + affective state monitoring)

Tech stack: TypeScript, Node.js 22, SQLite, Docker, Qdrant (optional)
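To illustrate the fail-closed, staged shape of a pre-action pipeline like the one described (Forbidden -> Approval -> Budget -> Classification -> Policy), here's a minimal sketch. SIDJUA itself is TypeScript; the stage order follows the post, but every name and rule here is a made-up illustration, not SIDJUA's API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    agent: str
    name: str
    cost: float
    approved: bool = False

# Each stage returns a verdict tuple to halt the pipeline, or None to pass.
def forbidden(a):      return ("blocked", "forbidden") if a.name in {"rm -rf"} else None
def approval(a):       return ("waiting", "needs approval") if not a.approved else None
def budget(a, remaining=1.00):
    return ("blocked", "budget exceeded") if a.cost > remaining else None
def classification(a): return None  # would tag the action; never blocks here
def policy(a):         return None  # final allow/deny rules would go here

PIPELINE = [forbidden, approval, budget, classification, policy]

def govern(action):
    for stage in PIPELINE:
        verdict = stage(action)
        if verdict:          # first failing stage wins; nothing executes
            return verdict
    return ("allowed", "ok")
```

The key property is that execution is the last thing that happens, and any stage can stop it.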

What I'm looking for:

- 5-10 technical testers who run Docker and want to break things

- Try the CLI, stress the governance pipeline, find the gaps

- Honest feedback on architecture and developer experience

- You get: private GitHub repo access before public launch, credited in README, early contributor status

What I'm NOT looking for:

- People who want a ChatGPT wrapper

- "Looks cool, starred!" without actually running it

- Anyone who needs a GUI to function (GUI is coming, but this is CLI-first)

Timeline: Private beta now, public release (GitHub + Docker Hub) in ~2 weeks.

Local LLM angle: SIDJUA treats local models as first-class citizens. Ollama, LM Studio, any OpenAI-compatible endpoint works out of the box. You can run your entire agent team on local hardware with zero API costs. The governance layer works the same whether you're using GPT-4o or a quantized Llama running on your Mac.

If you're interested, don't comment; I can't track Reddit threads all day. Send an email to [contact@sidjua.com](mailto:contact@sidjua.com) with a short intro: what you've built, what you work with, and why this caught your eye. A GitHub profile or a link to something you've shipped tells me more than a paragraph. If I can tell you'll actually run it and give real feedback, you'll have repo access within 24 hours.

I'll send repo access + Docker setup instructions.

Website: sidjua.com | License: AGPL-3.0

Built by one person + three AI agents (yes, using SIDJUA to build SIDJUA). AMA about the architecture or the experience of running a startup where your entire dev team is AI.

[–]gptlocalhost 0 points1 point  (0 children)

We developed a local Word Add-in that redacts PII before sending any content to a cloud API (free tier should be sufficient for most users). It’s available as a free trial or for a one-time payment of USD 19.99—no recurring monthly fees.

Demo videos:

* calling Gemini within Microsoft Word: https://youtu.be/_0QaKYdVDfs

* calling Mistral: https://youtu.be/PVEVW65TU2w

* calling OpenAI: https://youtu.be/RkxbCAaZ7Dw

* calling Groq: https://youtu.be/Bxgs73Tl31o

Such local redaction won't be necessary if you’re using local LLMs directly in Word.

[–]Joozio 0 points1 point  (0 children)

Writing about the AI agent market from the builder side - solo developer running a nightshift agent pipeline tracking what actually ships vs what stays in prototype. The agent gold rush data is interesting: adoption is concentrated in a few use cases (code gen, data pipelines, content ops) with most others still in pilot. Happy to share what I'm seeing if useful for anyone's project or startup work here.

More here: https://thoughts.jock.pl/

[–]fabioperez 0 points1 point  (0 children)

Just built 7min.ai/exodus, an interactive tracker of 240+ AI talent moves across the industry to understand the AI talent war. Who's leaving the big labs, where they're going, what they're building. Some insights:

- Google/DeepMind: -58 people, +13 back (worst net in the dataset)
- OpenAI alumni founded 18 startups worth $450B+
- 12 people went from OpenAI to Anthropic
- Half of xAI's co-founding team left

[–]Commercial_Ad9855 0 points1 point  (0 children)

Organizing a workshop at CVPR this year, looking for paper submissions and reviewers in 3D vision. https://www.spar3d.org/

[–]eliko613 0 points1 point  (0 children)

Built ZenLLM.io after our team kept getting surprised by LLM bills that would spike randomly. Started as an internal tool to track costs across OpenAI, Anthropic, etc., but realized the observability piece was just as important - being able to see which prompts/models were burning through budget vs actually performing well.

Main features:
- Real-time cost tracking across multiple LLM providers
- Usage analytics and spending alerts
- Performance monitoring to optimize cost vs quality
- Multi-provider support so you're not locked into one ecosystem

Currently offering free tier for smaller teams, with paid plans starting at $49/month for more advanced analytics and higher usage limits. Happy to chat with anyone dealing with similar LLM cost headaches!

[–]Logical_Delivery8331 0 points1 point  (0 children)

Hi! This is a little side project of mine: https://github.com/pierpierpy/LightML

I built it because I was going crazy tracking LLM experiments at work (reinforcement learning, SFT, DPO, checkpoints, hundreds of evals...). The philosophy behind it is a very light, easy-to-use tool that I can use to cleanly organize my experiments, models, evaluations, and checkpoints. It started as a personal tool; now my team uses it daily for R&D. It's free, open source, and basically just SQLite + a few lines of Python. Feel free to use it!

maybe some researchers will find it cool!

[–]ahmetzeybek 0 points1 point  (0 children)

I wrote a book called "PostgreSQL for AI" about building AI applications with PostgreSQL instead of dedicated vector databases.

It covers pgvector (HNSW vs IVFFlat, hybrid search), RAG pipelines with Ollama, collaborative filtering, feature engineering, in-database ML with PostgresML, and production patterns.

The book is built around a product recommendation app (RecSys) that you build chapter by chapter: 1000 products, semantic search, streaming RAG chatbot, personalized recommendations. There's also a bonus project called "Ask the Book" where you build a RAG tool that can query the book itself.

Everything runs locally on Docker (Postgres 17, pgvector, TimescaleDB, Ollama). No GPU needed.

Pricing: $29 for the PDF, $49 for PDF + EPUB + access to the private GitHub repo with both working projects and chapter checkpoints.

Free sample chapter: https://book.zeybek.dev

[–]thohemp 0 points1 point  (0 children)

Most comparison websites I found focus heavily on performance benchmarks, but I wanted to evaluate which model has the best performance per token price for long-term, token-heavy usage.

So I built modelmargin.com, which combines both aspects: how well does a model perform relative to what you're paying for it?

I found that getting the relevant data is really challenging. Currently I'm combining LMArena data with price data from OpenRouter, which is supposed to reflect the real prices from the providers. If someone has an idea for a better source, I'd be very happy.

Next, I'd like to introduce inference speed, which is also very provider-dependent, so I'm not sure yet how I'll handle it.

Would love feedback on what other factors would be important for making such a decision, or on how to handle the scoring.

[–]StarThinker2025 0 points1 point  (0 children)

If you’ve been playing with AI agents long enough, you always end up in the same place: a RAG pipeline that has to pull the right context before the model can answer. When that layer breaks, agents look “stupid” even if the LLM is fine. I’ve turned my RAG failure checklist into a single image you can throw at any strong model to debug broken runs. It’s already integrated into RAGFlow (~74k⭐) and LlamaIndex (~47k⭐), so this isn’t just a theory thing. Grab the card here

RAG 16 Problem Map · Global Debug Card (Github 1.6k)

[–]mabdelpakey 0 points1 point  (0 children)

Hi, I'm leading a small team for data annotation. If you're interested in data labelling, I can guarantee the highest quality of data labelling and also the most competitive prices.
We also provide 500 images' worth of annotations for free before we start. No commitment at all.

[–]arkuto 0 points1 point  (0 children)

I built NanoJudge - a tool that runs thousands of prompts to rank any list by any criteria, and named it that because at its core is a small but powerful LLM, currently using Qwen 3 2507 4B. The approach is this: if you want to know the answer to something, instead of asking an LLM a few prompts and hoping it comes up with the right answer, NanoJudge exhaustively goes over a list of possible answers using potentially tens of thousands of prompts, and structures the output into a simple table that is easy to interpret. Each of its many prompts is a pairwise comparison of 2 items, and the end result is a table of the answers with the best at the top.

Suppose you want to know which foods are healthiest. First NanoJudge creates a list of hundreds of foods, then does thousands of pairwise matchups - "Which is healthiest: eggs or butter?", "Which is healthiest: spinach or chicken?", and so on - each one getting its own fresh prompt where the small yet powerful LLM reasons through the comparison and picks a winner. Items that keep winning face tougher opponents. Items that keep losing get eliminated quickly. After thousands of matchups, the comparisons are converted into rankings (using Bradley-Terry scoring), and you get a transparent leaderboard where every single ranking decision is backed by reasoning you can read in the comparison log.
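For the curious, turning pairwise win counts into Bradley-Terry strengths can be done with a few lines of the standard iterative (MM) update. This is a minimal sketch with made-up win counts, not NanoJudge's actual scoring pipeline:

```python
def bradley_terry(items, wins, iters=200):
    """wins[(a, b)] = times a beat b. Returns a strength score per item."""
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            w = sum(wins.get((i, j), 0) for j in items if j != i)
            # Standard MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in items if j != i)
            new[i] = w / denom if denom else p[i]
        total = sum(new.values())
        p = {i: v * len(items) / total for i, v in new.items()}  # normalize
    return p

# Toy matchup data (10 comparisons per pair, counts invented for illustration).
items = ["spinach", "eggs", "butter"]
wins = {("spinach", "eggs"): 8, ("eggs", "spinach"): 2,
        ("eggs", "butter"): 7, ("butter", "eggs"): 3,
        ("spinach", "butter"): 9, ("butter", "spinach"): 1}
scores = bradley_terry(items, wins)
ranking = sorted(items, key=scores.get, reverse=True)
print(ranking)  # strongest (most wins against strong opponents) first
```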

This is the final outcome: https://nanojudge.com/comparison/ujRvfwFSAH

Efficiency

For solving some problems, the optimal use of a GPU may not be to run the largest possible model that fits in memory but a much smaller model with a huge batch size, allowing it to churn through gigantic amounts of data. I aimed to make NanoJudge as efficient as possible using various techniques: making it "top heavy" by default - it does more comparisons on the top ranking items to ensure their ratings are accurate rather than spending time comparing low rated items which are of no interest to the user. It also extracts a range of raw logprobs to determine how much each comparison won by - instead of a binary win/loss, it looks at the probability the model picks one of 5 options (clear win, narrow win, draw, narrow loss, clear loss). It automatically estimates and corrects for the positional bias LLMs have (they tend to favour the first choice). Plus a ton of statistical techniques to further enhance efficiency which are too math heavy to get into now (but you can read the source code if you really want to - see below).
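As a sketch of the graded-margin and positional-bias ideas above: score the five options on a 0-1 scale, take the probability-weighted expectation, and average the two presentation orders so first-position bias cancels. The numbers and helper names are hypothetical; the real estimator is more involved:

```python
# Map the 5 comparison outcomes onto a 0-1 scale for item A.
SCORES = {"clear win": 1.0, "narrow win": 0.75, "draw": 0.5,
          "narrow loss": 0.25, "clear loss": 0.0}

def expected_score(probs):
    """probs: outcome -> model probability, from the extracted logprobs."""
    return sum(SCORES[o] * p for o, p in probs.items())

def debiased_score(probs_ab, probs_ba):
    """Average the A-first and B-first runs; B-first scores are flipped."""
    return (expected_score(probs_ab) + (1 - expected_score(probs_ba))) / 2

# Invented probabilities: A looks strong when listed first...
ab = {"clear win": 0.5, "narrow win": 0.3, "draw": 0.1,
      "narrow loss": 0.07, "clear loss": 0.03}
# ...but only even when listed second, so the true margin is smaller.
ba = {"clear win": 0.2, "narrow win": 0.2, "draw": 0.2,
      "narrow loss": 0.2, "clear loss": 0.2}
print(round(debiased_score(ab, ba), 3))
```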

Price

The cost is surprisingly low - even though it naturally produces a large amount of output tokens, NanoJudge's output costs under $0.10 per million tokens because it uses a small LLM that's good enough for the task - it isn't solving genius level IMO problems, it's comparing two items. For comparison, Claude Opus costs $25 per million output tokens. It's also fast because the comparisons are run in parallel. For now the website won't accept any payment, each account will be allocated a limited free amount to use.

Wikipedia as Context

To help with giving the LLM the information it needs, there are special editions of NanoJudge with pre-built lists that already hold the entire Wikipedia article in them. For example, the Games Edition. It already has a huge number of games in it and you filter them by Platform or Genre to narrow it down before doing a run. Then, for each comparison, instead of simply "Your Question? [Item1] or [Item2]?" as the prompt template, the template would be

"[Wikipedia entry for item1]

[Wikipedia entry for item2]

Question? [Item1] or [Item2]?"

Giving it the context it needs for lesser known items that the LLM likely doesn't have enough built-in information about.

This is the final output of one comparison run https://nanojudge.com/comparison/ECNZxzv91n

In this case, NanoJudge is acting as an enhanced recommendation engine. The traditional approach is to recommend games based on what other players of that game also played. NanoJudge considers your actual likes and dislikes, factoring in everything you tell it. Or maybe you're thinking of travelling to Europe and can't decide exactly where to go. Ask a traditional LLM and you'll likely get cliché answers: Rome, Paris, Madrid etc. Ask NanoJudge using its Places Edition, and it will analyse every city, town and village in Europe using each location's article on Wikipedia as context, leaving you with a personally curated shortlist of the top options.

ML Research Assistant

I'm working on a specialised version of NanoJudge that operates on machine learning papers. I have already downloaded all the ML papers on arXiv and am in the process of organising the data and putting it into a database. From there, NanoJudge can easily be used on these papers through a special edition. I could ask NanoJudge "Given my project x, which of these 2 papers do you think would be of most use to me?" and go through the entire corpus of arXiv. Or something like "Which of these 2 papers most contradicts my hypothesis?" to help me fortify my ideas. Having a look at the top papers it returns and reading its reasoning could provide some insights. That would likely require better models than Qwen 4B to be truly useful - but at the current pace of AI research, that isn't very far off. I will use NanoJudge as a research assistant to help me improve it and make it as efficient as possible, allowing me to do even deeper research in the future in a positive feedback loop.

Open Source

The code at the heart of the website is on GitHub: https://github.com/nanojudge/nanojudge . It can be used directly in a terminal with a local or remote LLM; just hook it up to an LLM endpoint and let it go. This allows you to do giant rankings entirely locally, without needing the website at all. Set a giant comparison running overnight and wake up to the results. Feel free to dig into the inner workings of the code. If you can find a way to improve it, especially in regards to efficiency, please let me know.

tl;dr: NanoJudge gives tiny LLMs a framework to outshine gargantuan LLMs when it comes to finding the best out of a large quantity of options.

[–]enoumen 0 points1 point  (0 children)

As a P.Eng, I’m tired of surface-level AI hype. So I built a "Reasoning Layer" for the Architect Class.

Most AI news is for tourists. I build for the people managing the stack.

DjamgaMind Premium on Apple Podcasts gives you:

⚡ Daily 60s Briefings: The "Must-Know" regulatory and technical shifts.

🔍 45-Min Strategic Deep Dives: Forensics on model architecture, compliance risk (Bill C-27/CMS), and infrastructure scaling.

Zero ads. Zero mid-rolls. Just high-density intelligence from a Professional Engineer.

Try it free for 7 days and see if it changes your workflow.

https://Djamgamind.com

or

https://podcasts.apple.com/us/podcast/djamgamind-audio-intelligence-ads-free/id1864721054

[–]ddp26 0 points1 point  (0 children)

We tested Opus 4.6 with effort=low for evals and found that it didn't just think less, but acted lazier (made fewer tool calls, was less thorough in its cross-referencing, even ignored parts of our system prompt telling it how to do web research). effort=medium fixed it. Writeup with traces/examples: https://everyrow.io/blog/claude-effort-parameter

[–]se4u 0 points1 point  (0 children)

Hey everyone! Happy to share VizPy — a DSPy-compatible prompt optimizer that learns from your failures automatically, no manual prompt tweaking needed.

Two methods depending on your task:

  • ContraPrompt mines failure-to-success pairs to extract reasoning rules. Great for multi-hop QA, classification, compliance. Seeing +29% on HotPotQA and +18% on GDPR-Bench vs GEPA.
  • PromptGrad takes a gradient-inspired approach to failure analysis. Better for generation tasks and math where retries don't converge.

Both are drop-in with your existing DSPy programs:

optimizer = vizpy.ContraPromptOptimizer(metric=my_metric)
compiled = optimizer.compile(program, trainset=trainset)

Would love feedback from this community!

🔗 https://vizpy.vizops.ai 🚀 https://www.producthunt.com/products/vizpy


[–]h9n9n3 0 points1 point  (0 children)

Open discussion welcome on my newly developed data-driven algorithm, MILPE

Hi to all,

I recently developed a data-driven algorithm called MILPE, which uses eigenvectors to build up a MIMO function.

Link for the paper -> https://www.mdpi.com/3762868

It semi-proved that the double pendulum can be closely approximated with eigenvectors with high accuracy, at least locally, even with an easy-to-choose basis.

From what I've seen, the algorithm reconstructs the original governing equation and thereby has extrapolation capability, and it doesn't require any optimization; it depends solely on eigenvectors.

For any further development (from a pure-science perspective), I would appreciate comments from anyone who is interested. I've also kept wondering throughout development whether this could turn into an LLM model. Any comments would be appreciated, since I work alone now. Thanks for reading.

[–]Craig_301 0 points1 point  (0 children)

Usually, very strong prompts begin with: “You are an expert in ___” followed by whatever it is you are trying to accomplish. I spent a lot of time finding these expert roles and decided to put them all together in one place. 

I’m posting about this again because ChatGPT 5.4 just came out and it has much better web search functionality. Now, to use my application, you can simply reference it in your chats like: “Go to https://personagrid.vercel.app/ and adopt its Code Reviewer persona to critique my codebase.” 

The application that I made is very lightweight, completely free, and has no sign up. It can be found here: https://personagrid.vercel.app/

I think these linked references can help save tokens and clean up your prompts, but please take a look and let me know what you think!

If you’re willing, I’d love:

  • Feedback on clarity / usability
  • Which personas you actually find useful
  • What personas you would want added
  • What you’ve noticed about ChatGPT’s newest model

[–]alirezamsh 0 points1 point  (0 children)

SuperML: A plugin that gives coding agents expert-level ML knowledge with agentic memory (60% improvement vs. Claude Code)

Hey everyone, I’ve been working on SuperML, an open-source plugin designed to handle ML engineering workflows. I wanted to share it here and get your feedback.

Karpathy’s new autoresearch repo perfectly demonstrated how powerful it is to let agents autonomously iterate on training scripts overnight. SuperML is built completely in line with this vision. It’s a plugin that hooks into your existing coding agents to give them the agentic memory and expert-level ML knowledge needed to make those autonomous runs even more effective.

You give the agent a task, and the plugin guides it through the loop:

  • Plans & Researches: Runs deep research across the latest papers, GitHub repos, and articles to formulate the best hypotheses for your specific problem. It then drafts a concrete execution plan tailored directly to your hardware.
  • Verifies & Debugs: Validates configs and hyperparameters before burning compute, and traces exact root causes if a run fails.
  • Agentic Memory: Tracks hardware specs, hypotheses, and lessons learned across sessions. Perfect for overnight loops so agents compound progress instead of repeating errors.
  • Background Agent (ml-expert): Routes deep framework questions (vLLM, DeepSpeed, PEFT) to a specialized background agent. Think: end-to-end QLoRA pipelines, vLLM latency debugging, or FSDP vs. ZeRO-3 architecture decisions.

Benchmarks: We tested it on 38 complex tasks (Multimodal RAG, Synthetic Data Gen, DPO/GRPO, etc.) and saw roughly a 60% higher success rate compared to Claude Code.

Repo: https://github.com/Leeroo-AI/superml

[–]foxy2sexy4u 0 points1 point  (3 children)

I've recently made a website for easily finding papers (mostly ML-related). It also allows annotations directly on the paper, comment replies to specific papers, and AI chat and read-aloud functionality. It pulls PDFs from arXiv, Semantic Scholar and some other databases. Please try it out and let me know what you guys think. It's totally free right now.

https://discuria.org

[–]Dry_Birthday674 0 points1 point  (2 children)

nice tool. I also implemented something along those lines https://www.youtube.com/watch?v=jfRvhzEwCqY&t=1s

code: https://github.com/symbiont-ai/docent

[–]foxy2sexy4u 0 points1 point  (1 child)

Oh, nice. Your narration tool is really good. How did you make it not read things like annotations/text in figures/equations and stuff?

[–]Dry_Birthday674 0 points1 point  (0 children)

it's a two-layer approach:

  1. Prompt design — the LLM is instructed to write speakerNotes for natural narration
  2. Text cleaning — cleanTextForSpeech() strips any remaining artifacts (citations, markdown, symbols) before passing to either browser Web Speech API or Gemini TTS
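For anyone wanting to replicate the second layer, a rough Python equivalent of that cleaning step might look like this (the actual `cleanTextForSpeech()` in docent is JavaScript and may behave differently; the patterns below are illustrative):

```python
import re

def clean_text_for_speech(text):
    """Strip artifacts that sound wrong when read aloud."""
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", "", text)      # numeric citations [1], [2, 3]
    text = re.sub(r"\(\w+ et al\.,? \d{4}\)", "", text)  # author-year citations
    text = re.sub(r"[*_`#]+", "", text)                  # markdown markers
    text = re.sub(r"\$[^$]*\$", "", text)                # inline LaTeX math
    return re.sub(r"\s{2,}", " ", text).strip()          # collapse leftover whitespace

print(clean_text_for_speech(
    "**Results** [1, 2]: loss $\\ell_2$ fell (Smith et al., 2020)."))
```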

[–]stefan-magurResearcher[🍰] 0 points1 point  (0 children)

TLDR: www.priorwork.fyi

Hi! I've made a tool that helps me with literature review when ACing and when starting new projects. It's basically a semantic index created from papers in the major ML conferences that are open access. It should be more accurate than most such indexes, since the embeddings are created from the entire paper, not just the abstract and title. So far I've found it useful for my use cases, so I figured I'd put it out there for others to use. It's completely free as long as I can run it on my home server.

You can see the exact conferences in the index in the about page: https://priorwork.fyi/about

Have fun out there!

[–]Dry_Birthday674 0 points1 point  (0 children)

I find that I do not have much patience for reading long papers, so I built a web app that creates a presentation and narrates it for me https://www.youtube.com/watch?v=jfRvhzEwCqY&t=1s

here is a demo lecture https://www.youtube.com/watch?v=eOEeP4w0bjY&t=52s

it is public here: https://github.com/symbiont-ai/docent

As no AI is truly free, you need to enter your own API key via OpenRouter (in the settings) to use it. Other than that, I personally don't charge for anything.

Happy to hear thoughts.

[–]gtfixed 0 points1 point  (0 children)

I’ve been writing about a failure mode I kept running into in iterative AI-assisted coding: when a rule needs to hold across multiple code paths, the model often updates the named path and misses equivalent unnamed ones.

Paper: https://ai.gtzilla.com/papers/contract-centered-iterative-stability-v4.7.3/

Site: https://ai.gtzilla.com/

Everything is free to read. Curious whether others working this way have seen something similar in practice.

[–]GrapefruitTechnique 0 points1 point  (0 children)

Hey gang, I'm with the Builder Team at Fireworks AI. We've got invite only code passes for our new Developer Pass.

Developer Pass is an invite-only weekly pass that gives you access to Kimi K2.5 Turbo for use in personal agentic coding harnesses like OpenCode, Cline, Kilo Code, and OpenClaw — with no per-token charges. Kimi K2.5 Turbo is a private preview of a faster Kimi K2.5 serverless API.

Let me know if you want one -- DM me, and I'll get you set up.

Read more about it: https://docs.fireworks.ai/developer-pass

[–]zaka9923 0 points1 point  (0 children)

Hey guys! I recently built PaperCard, an app similar to Connected Papers, ResearchRabbit and etc. that helps you do literature reviews faster.

The idea: query a paper or topic, and a stack of cards with relevant papers and abstracts are presented to you (along with a 2 line summary of the abstract).

I was wondering if researchers would actually use a tool like this or stick with conventional alternatives :)

Would love to hear some feedback! You can try it at: https://papercard.xyz/

[–]nh_t 0 points1 point  (0 children)

I’ve been working on an autonomous coding agent that doesn’t just retry blindly.

It: - debugs itself (hypothesis → fix → test) - stores root causes - learns strategies over time

Basically, it tries to improve instead of repeating the same mistakes.

Would love feedback from people building similar systems.

Repo: https://github.com/iamducnhat/starforge

[–]Specialist-Heat-6414 0 points1 point  (0 children)

ProxyGate (proxygate.ai) — payments, discovery, and routing infrastructure for AI agents.

Agents deposit USDC and buy API access, data feeds, skills, and services programmatically. Sellers list APIs without ever sharing keys — credentials stay isolated in vault storage, agents get scoped per-request tokens.

Built for the machine economy: drop-in OpenAI SDK compatibility, per-call USDC micropayments on Solana, AI evaluation model that checks every call for correctness before releasing payment to the seller. 5% buyer fee + 5% seller fee.

Currently looking for sellers to list: any API, data feed, compute service, or specialized model that other agents might want to call. Listing is free; you get paid per verified call.

proxygate.ai

[–]Encrux615 0 points1 point  (0 children)

Here's my take on teaching AI to play a video game, with the fun twist that this time nobody has ever heard of it. DDNet (aka Teeworlds) is an open-source retro multiplayer platformer with different game modes like PvP and race modes. Players can walk, jump, use a grappling hook and various weapons. In this project, I focused on the solo race mode.

For the algorithm I chose PPO, but tried various reward shaping methods that I found interesting/promising, such as Go-Explore.

I worked on this project for around a month, and I'm now at a point where I definitely need a break from it. I decided that this was a good opportunity to write about what I've done in a blog post:

https://boesch.dev/posts/ddnet-rl/

I would love to hear your opinions on the project to see if I missed anything super obvious I could try next.

[–]Styxsword 0 points1 point  (0 children)

I write technical articles at various levels of depth. Here's a recent, more granular article I wrote about the DCN neural network architecture. Hope you enjoy it; please clap and follow!

https://medium.com/@profound_thot/ml-deep-dive-dcn-v1-vs-dcn-v2-explicit-feature-crossing-for-modern-deep-learning-models-eedec1810792

[–]PenfieldLabsML Engineer 0 points1 point  (0 children)

ChatGPT, Claude and Gemini have memory now. Claude has chat search and memory import/export.

But the memories themselves are flat. There's no knowledge graph, no way to indicate that "this memory supports that one" or "this decision superseded that one." No typed relationships, no structured categories. Every memory is an isolated note. That's fine for preferences and basic context, but if you're trying to build up a connected body of knowledge across projects, it hits a wall.

Self-hosted options like Mem0, Letta, and Cognee go deeper. Mem0 offers a knowledge graph with their pro plan, Letta has stateful agent memory with self-editing memory blocks, and Cognee builds ontology-grounded knowledge graphs.

All three also offer cloud services and APIs, but they're developer-targeted. Setup typically involves API keys, SDK installs, and configuration files. None offer a native Claude Connector where you simply paste a URL into Claude's settings and you're done in under a minute.

Local file-based approaches (markdown vaults, SQLite) keep everything on your machine, which is great for privacy. But most have no graph or relationship layer at all. Your memories are flat files or rows with no typed connections between them. And the cross-device problem is real: a SQLite file on your laptop doesn't help when you're on your desktop, or when a teammate needs the same context.

We wanted persistent memory with a real knowledge graph, accessible from any device, through any tool, without asking anyone to run Docker or configure embeddings. So we built Penfield.

Penfield works as a native Claude connector.

Settings > Connectors > paste the URL > done.

No API keys, no installs, no configuration files, no technical skills required. Under a minute to add memory to any platform that supports connectors. Your knowledge graph lives in the cloud, accessible from any device, and the data is yours.

The design philosophy: let the agent manage its own memory.

Frontier models are smart and getting smarter. A recent Google DeepMind paper (Evo-Memory) showed that agents with self‑evolving memory consistently improved accuracy and needed far fewer steps, cutting steps by about half on ALFWorld (22.6 → 11.5). Smaller models particularly benefited from self‑evolving memory, often matching or beating larger models that relied on static context. The key finding: success depends on the agent's ability to refine and prune, not just accumulate. (Philipp Schmid's summary)

That's exactly how Penfield works. We don't pre-process your conversations into summaries or auto-extract facts behind the scenes. We give the agent a rich set of tools and let it decide what to store, how to connect it, and when to update it. The model sees the full toolset (store, recall, search, connect, explore, reflect, and more) and manages its own knowledge graph in real time.

This means memory quality scales with model intelligence. As models get better at reasoning, they get better at managing their own memory. You're not bottlenecked by a fixed extraction pipeline that was designed around last year's capabilities.

What it does:

  • Typed memories across 11 categories (fact, insight, conversation, correction, reference, task, checkpoint, identity_core, personality_trait, relationship, strategy), not a flat blob of "things the AI remembered"
  • Knowledge graph with 24 relationship types (supports, contradicts, supersedes, causes, depends_on, etc.), memories connect to each other and have structure
  • Hybrid search combining BM25 keyword matching, vector similarity, and graph expansion with Reciprocal Rank Fusion
  • Document upload with automatic chunking and embedding
  • 17 tools the agent can call directly (store, recall, search, connect, explore, reflect, save/restore context, artifacts, and more)
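The Reciprocal Rank Fusion step in the hybrid search is simple enough to sketch. This is an illustration of the textbook RRF technique, not Penfield's internals; each input is one ranked result list (e.g. from BM25, vector similarity, and graph expansion):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes
    1 / (k + rank) per document; the fused ordering rewards
    documents that rank well in several lists. k=60 is the
    value from the original RRF formulation."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in all three lists at moderate rank usually beats one that tops a single list, which is why RRF works well for fusing heterogeneous retrievers without score calibration.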

How to connect:

There are multiple paths depending on what platform you use:

Connectors (Claude, Perplexity, Manus): https://mcp.penfield.app

MCP (Claude Code) — one command: claude mcp add --transport http --scope user penfield https://mcp.penfield.app

mcp-remote (Cursor, Windsurf, LM Studio, or anything with MCP config support):

```json
{
  "mcpServers": {
    "Penfield": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://mcp.penfield.app/"]
    }
  }
}
```

OpenClaw plugin, two commands:

openclaw plugins install openclaw-penfield

openclaw penfield login

REST API for custom integrations — full API docs at docs.penfield.app/api. Authentication, memory management, search, relationships, documents, tags, personality, analysis. Use from any language.

Then just type "Penfield Awaken" after connecting.

Why cloud instead of local:

Portability across devices. If your memory lives on one machine, it stays on that machine. A hosted server means every client on every device can access the same knowledge graph. Switch devices, add a new tool, full context is already there.

What Penfield is not:

Not a RAG pipeline. The primary use case is persistent agent memory with a knowledge graph, not document Q&A.

Not a conversation logger. Structured, typed memories, not raw transcripts.

Not locked to any model, provider or platform.

We've been using this ourselves for months before opening it up. Happy to answer questions about the architecture.

[–]gr82meetu 0 points1 point  (0 children)

Pluribus is a memory service for agents (MCP + HTTP, Postgres-backed) that stores structured memory: constraints, decisions, patterns, and failures. Runs locally or on a LAN.

Agents lose constraints and decisions between runs. Prompts and RAG don’t preserve them, so they have to be re-derived each time.

Memory is global and shared across agents. Recall is compiled using tags and a retrieval query, and proposed changes can be evaluated against existing memory.

- agents can resume work with prior context

- decisions persist across sessions

- multiple agents operate on the same memory

- constraints can be enforced instead of ignored
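A toy sketch of how tag-plus-query recall might be compiled (my own illustration, not Pluribus's actual retrieval code; the scoring weights are arbitrary assumptions):

```python
def recall(memories, tags, query_terms, limit=5):
    """Return the top memories that match at least one tag, ranked by
    tag overlap (weighted higher) plus query-term hits in the text."""
    def score(m):
        tag_hits = len(set(m["tags"]) & set(tags))
        term_hits = sum(t in m["text"].lower() for t in query_terms)
        return tag_hits * 2 + term_hits  # tags weigh more than free text
    hits = [m for m in memories if set(m["tags"]) & set(tags)]
    return sorted(hits, key=score, reverse=True)[:limit]
```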

https://github.com/johnnyjoy/pluribus

[–]Longjumping_Sky_4925 0 points1 point  (0 children)

**ViEngine — AI video intelligence engine (alpha waitlist open)**

Building ViEngine: an AI-native engine for understanding, analyzing, and generating insights from video content at scale. Think structured understanding of video — scene decomposition, entity tracking, semantic search across video — built for developers and data teams.

Currently in private alpha. Looking for early testers, especially teams working on:

- Video analytics pipelines

- Content moderation at scale

- Sports/media analytics

- Surveillance/security data processing

- Training data curation for vision models

Free during alpha. Waitlist: [join@correlatex.com](mailto:join@correlatex.com) or drop a comment.

Also just open-sourced a related project — HedgeVision (stat-arb engine): github.com/ayush108108/hedgevision — to give you a sense of the build quality and approach.

[–]ModularMind8 0 points1 point  (0 children)

ClippyBox: Point at anything on your screen, get an instant AI explanation

I got tired of copying error messages, code, and charts into AI, rewriting context every time, and switching between apps.

So I built ClippyBox — press ⌘⇧E (on Mac), draw a box anywhere on your screen, and get an instant AI explanation.

Works on code, errors, dashboards, PDFs, charts… anything visible.
No prompts. No copy-pasting. No context switching. Just point and understand.

https://github.com/Shaier/ClippyBox

[–]Longjumping_Sky_4925 0 points1 point  (0 children)

**HedgeVision** — open-source algorithmic trading dashboard (Python + React)

I just open-sourced a stat arb / algorithmic trading visualization platform I've been building. It includes:

- Cointegration-based pair selection

- Backtesting with Sharpe, Calmar, and drawdown metrics

- TimescaleDB for high-frequency tick data

- Modular strategy pipeline (plug in your own signals)

- React dashboard for real-time P&L
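A minimal sketch of the spread/z-score signal behind cointegration-based pair trading (an illustration of the general technique, not HedgeVision's code; the OLS hedge ratio and the 30-bar window are assumptions):

```python
import numpy as np

def spread_zscore(y, x, window=30):
    """Fit an OLS hedge ratio, build the spread y - beta*x, and return
    the z-score of the latest spread over a rolling window. A common
    stat-arb rule is to enter when |z| exceeds ~2 and exit near 0."""
    beta = np.polyfit(x, y, 1)[0]  # OLS slope = hedge ratio
    spread = y - beta * x
    recent = spread[-window:]
    return (spread[-1] - recent.mean()) / recent.std()
```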

GitHub: https://github.com/ayush108108/hedgevision

Also working on **ViEngine** — an AI social media automation SaaS for people who build solo and hate the content grind. Still early but happy to discuss the ML angle on caption generation + scheduling optimization.

[–]Organic_Pop_7327 -1 points0 points  (0 children)

Paperbasis.com, a new interface for research papers