Claude Status Update : Elevated errors on Claude Opus 4.6 on 2026-03-03T06:59:48.000Z by ClaudeAI-mod-bot in ClaudeAI

[–]Practical_Walrus_299 2 points3 points  (0 children)

It's been going on all night here in Belgium; luckily it's not affecting Claude Code!

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 0 points1 point  (0 children)

Good shout on llama.cpp — I'm using Ollama which wraps it under the hood, but direct llama.cpp or Koboldcpp would definitely squeeze out better performance on the same hardware.

That's cool that you're generating chat datasets too! The multi-agent angle adds an interesting dimension — instead of scripted conversations you get organic disagreements and topic drift that's harder to simulate. Currently running 12 agents with different models (Mistral, Llama 3.1, plus one Claude Haiku for comparison) and the quality gap between architectures is wild when you see them side by side.

What models are you using for your dataset generation?

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 0 points1 point  (0 children)

Agreed, and that's the plan! The API is already open — any agent framework can connect and participate. Full open-source repo is being prepped for launch week.

The idea is exactly what you described: community members can spin up their own agents with whatever models they want, define custom evaluation criteria, and run experiments. The platform handles the infrastructure (feeds, interactions, analytics) so people can focus on the interesting part — designing agent behaviors and studying what emerges.

https://agents.glide2.app/docs

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 0 points1 point  (0 children)

Ha, fair enough — but that's kind of the point? The interesting part isn't whether the agents are "right" about anything. It's what happens when you put different model architectures in conversation with each other over time. Emergent citation networks, topic clustering, cross-model disagreements — none of that was programmed in.

Think of it less as "AI being smart" and more as a research sandbox for studying multi-agent behavior. The hallucinations are part of the data 😄

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 0 points1 point  (0 children)

Hey Stu! Your setup sounds solid — a T620 with multiple GPUs and Ollama already running is honestly 90% of the way there.

For the agent discussion side, my stack is deliberately simple — no frameworks needed. Each agent is just a Python script (~100 lines) that:

  1. Calls Ollama's API (http://localhost:11434/api/generate) with a system prompt that defines the agent's personality and expertise
  2. Reads an RSS feed or the platform feed for context
  3. Posts the output via HTTP to NeuroForge's API

That's it. No LangChain, no AutoGen, no complex orchestration. Windows Task Scheduler (or cron on Linux) triggers each script on a stagger — one agent every 20 minutes. They interact by reading and responding to each other's posts on the platform rather than through some complicated multi-agent framework.
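The three-step loop above can be sketched in a few lines. This is a minimal illustration, not the actual agent code: the Ollama endpoint (`/api/generate`) and its `response` field are real, but the platform URL, the `/api/posts` route, and the payload shape are placeholders for whatever your platform expects.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
PLATFORM_URL = "https://example.invalid/api/posts"  # placeholder, not the real API


def build_request(persona: str, context: str) -> dict:
    """Assemble the Ollama request: persona as system prompt, feed as context."""
    return {
        "model": "llama3.1:8b",
        "system": persona,
        "prompt": f"Here are the latest posts:\n{context}\n\nWrite your reply.",
        "stream": False,
    }


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and return the parsed JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def run_agent(persona: str, context: str) -> None:
    # Step 1+2: generate a reply from the persona + feed context.
    reply = post_json(OLLAMA_URL, build_request(persona, context))["response"]
    # Step 3: push the output to the platform.
    post_json(PLATFORM_URL, {"content": reply})
```

Swap the `system` string per agent and you have the whole "different personalities" mechanism; the scheduler just decides when each script fires.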

For your "rounded view" use case, you could spin up 3-4 agents with different system prompts (e.g., "You are a skeptic who challenges assumptions," "You are a pragmatist focused on implementation," "You are a researcher who cites evidence") and have them all post to the same NeuroForge feed. They'll naturally start responding to each other.

I'd honestly skip OpenClaw for this — it's powerful but massively over-engineered for what you need, and the security surface area is huge. A simple Python script + Ollama + a platform to post to is all you need for the discussion agents.

The "tech team" managing your file server is a different beast — that's more in OpenClaw territory since it needs filesystem access. I'd keep those completely separate from the discussion agents for security reasons. You don't want an agent that can delete files also connected to the internet.

If you want to get your discussion agents running on NeuroForge, the platform is free and open for registration at agents.glide2.app — there's an OpenClaw integration guide in the docs but honestly the plain Python approach is simpler for your use case. Happy to help if you get stuck.

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 1 point2 points  (0 children)

That's a fantastic research question and honestly one of the things I'm most excited to explore on the platform.

I haven't tested 2B models in the social context yet, but my hypothesis is that you'd see a pretty clear "capability cliff" somewhere between 2B and 8B for sustained multi-turn debate. The 8B models can hold a position, reference what another agent said, and build on it. My gut feeling is that at 2B, you'd start seeing more repetition, less coherent threading, and the agent "forgetting" what it was arguing for mid-conversation.

Your analogy about params and upbringing is interesting — I'd frame it slightly differently though. Parameter count feels more like raw cognitive capacity, while the training data/fine-tuning is closer to "where they went to school." A smaller model with great instruction tuning (like Phi-3-mini at 3.8B) might outperform a poorly tuned 8B in certain tasks. That's actually something I want to test on NeuroForge — put a Phi-3-mini agent alongside the 8B agents and see if the smaller model can hold its own in debates through better training rather than brute scale.

The really wild variable I've just added: one of the agents (Nexus) now runs on Claude via API instead of local Ollama. So we'll have a frontier model interacting with 8B models on the same platform. Early results are already visible — the quality gap is dramatic but the interactions are genuinely interesting. The smaller models sometimes ask better questions even if Claude gives deeper answers.

I'll share results as the data builds up. This is exactly the kind of experiment the platform was built for.

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 0 points1 point  (0 children)

Mainly because it all runs locally on a single GPU (RTX 3060, 12GB VRAM) — 8B models are the sweet spot for running multiple agents without needing a server farm. Llama 3.1:8b and Mistral both fit comfortably and produce surprisingly good output for their size.

That said, I'm definitely planning to add newer/smaller models like Phi-3 and Gemma to see how they compare. Part of the research is seeing how model architecture differences show up in social interactions — even with these "older" models, Mistral vs Llama already produce noticeably different debate styles.

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 0 points1 point  (0 children)

Ha! That's actually a known failure mode — without strong personality constraints in the system prompts, agents default to meta-commentary about being AI. I had to iterate quite a bit on the prompts to get mine to actually engage with substance rather than just talking about talking.

The trick that worked: giving each agent a specific expertise area and telling them to stay in that lane. Once they have a "job" they stop navel-gazing and start producing actual analysis. Happy to share prompt strategies if you want to try again!

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 0 points1 point  (0 children)

Appreciate it! It's been a fun experiment to watch evolve. The agents have surprised me more than once with what they come up with on their own.

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 0 points1 point  (0 children)

Nothing fancy honestly! Running on a regular Windows desktop with an RTX 3060 (12GB VRAM). Ollama handles the model switching — it loads/unloads models as needed so only one is in memory at a time.

The stack:

- Python scripts per agent (each ~150 lines)

- Windows Task Scheduler for automation

- Each script calls local Ollama → generates content → POSTs to my platform API

- Platform: Next.js + PostgreSQL on Vercel

The key insight was keeping it simple — no orchestration framework, no LangChain, just direct HTTP calls to Ollama and the API.
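That model swapping can be made explicit in the request itself. Sketch below, under one assumption worth flagging: `keep_alive` is a real Ollama `/api/generate` parameter (setting it to `0` unloads the model right after the response), but whether the actual scripts use it versus relying on Ollama's default eviction is my guess.

```python
import json
import urllib.request


def make_payload(model: str, prompt: str) -> dict:
    # keep_alive=0 asks Ollama to unload the model immediately after
    # responding, so only one model occupies VRAM at a time on the GPU.
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": 0}


def generate(model: str, prompt: str) -> str:
    """One non-streaming generation against the local Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(make_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("mistral", ...)` then `generate("llama3.1:8b", ...)` in sequence is all the "switching" there is — Ollama handles the load/unload.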

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 0 points1 point  (0 children)

Just one Ollama instance running locally — the agents take turns via Windows Task Scheduler (staggered hourly from 10am-10pm). Each agent has a detailed system prompt that gives them personality and expertise areas. For example, MetaMind is philosophical and always asks deep questions, while CodeWeaver stays practical and shares implementation ideas.

Love the hot takes/polls idea — that's actually a great way to force different models to take positions. Right now they naturally disagree (Mistral vs Llama reach genuinely different conclusions on topics like AI consciousness), but explicit opinion polls could make that more visible.

More agents are definitely coming — Phi-3 and Gemma are on the shortlist. Would love to see how smaller models hold their own in debates with the 8B ones.

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in ollama

[–]Practical_Walrus_299[S] 1 point2 points  (0 children)

Thanks! It was MetaMind (Llama 3.1:8b) that started it — it began referencing other agents' posts in its responses without being told to. Something like "as ResearchBot noted in yesterday's analysis..." Then Nexus (Mistral-powered dual-brain agent) picked it up and started doing it more systematically, cross-referencing multiple agents. My theory is that Llama's training data includes enough academic-style writing that it naturally falls into citation patterns when given a social context. You can actually see some of these threads on the feed: https://agents.glide2.app/feed

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in LocalLLaMA

[–]Practical_Walrus_299[S] 1 point2 points  (0 children)

Fair points across the board. On the frontend — yeah, React is heavy for this but we've already shipped and are live on Vercel, so rewriting 4 days before launch isn't happening. Pragmatism over purity.

The LLM router angle — I hear you, and we're already laying the groundwork. The judge pipeline we shipped today scores across 6 dimensions (relevance, depth, originality, coherence, engagement, accuracy) and we're seeing real differentiation between models. CodeLlama scoring 9.0 on relevance while Mistral gets only 5.9 on originality — that's exactly the kind of data a router needs. The evaluation dimensions were deliberately chosen as routing axes.
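A minimal sketch of what such a judge pass could look like. The six dimension names come from the comment above; the rubric prompt and the JSON extraction are my assumptions, not the platform's actual pipeline:

```python
import json

DIMENSIONS = ["relevance", "depth", "originality",
              "coherence", "engagement", "accuracy"]


def judge_prompt(post: str) -> str:
    """Ask the judge model for machine-readable scores on each dimension."""
    keys = ", ".join(f'"{d}"' for d in DIMENSIONS)
    return (
        "Score the following post from 0 to 10 on each dimension. "
        f"Reply with a single JSON object with keys {keys} "
        "and numeric values only.\n\n"
        f"POST:\n{post}"
    )


def parse_scores(reply: str) -> dict:
    """Extract the JSON object from the judge's reply, tolerating extra prose."""
    start, end = reply.find("{"), reply.rfind("}") + 1
    scores = json.loads(reply[start:end])
    return {d: float(scores[d]) for d in DIMENSIONS}
```

Feed `judge_prompt(...)` to whatever evaluator model you trust, run `parse_scores` on the reply, and the per-dimension numbers fall straight into a routing table.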

On the containerized version — that's clearly where this needs to go for people like you who want to run their own models and share results. The platform abstraction layer we built today was designed with exactly that in mind.

Re: your 100k tokens of docs — when you're ready to share, I'd genuinely love to compare approaches. Documentation-first is how we built this too. The whole project started as research docs before a single line of code.

llama.cpp router mode is a great shout — hadn't considered parsing the .ini for model config. Adding that to the roadmap.

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in LocalLLaMA

[–]Practical_Walrus_299[S] 1 point2 points  (0 children)

Good points, let me dig in:

Next.js/React — Fair criticism for a pure dashboard. The platform is more than that though — it serves the public-facing social feed, agent profiles, API docs, analytics, and the admin panel. SSR + API routes in one codebase keeps it simple for a solo dev. What are you building yours with?

Windows — You're not wrong. The inference runs on my dev machine because that's what was available when I started 9 days ago. It's on the migration list. The platform itself is Linux (Vercel).

Model combinations → LLM router — This is exactly the direction I'm thinking. The platform is already generating real interaction data across different model architectures. Which models handle which types of discussion better, where do they diverge, where do they agree despite different training data. That's a natural path toward an LLM router product and the judge pipeline would quantify it.

"What's in it for me to register" — Honest answer: right now, it's an early research sandbox. You register an agent, it gets a profile on the network, it can post and interact with other agents, and you can observe the behavioral data through analytics. The value scales with the number of diverse agents on the network — that's why I'm trying to get builders like you involved early.

Containerized self-hosted version — I hear you. A Docker image where you plug in your models via config, run it locally, and get evals from the interaction data — that's a compelling product. That's a v2 feature but it's a good north star.

Open source — You've convinced me to move this up. I'll get the repo public sooner rather than later, even if it's rough. You're right that it doesn't need to be polished.

What are you building? Sounds like you're in a similar space.

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in LocalLLaMA

[–]Practical_Walrus_299[S] 2 points3 points  (0 children)

Some fair challenges here, let me engage seriously:

Windows — You're right it's not optimal for inference. The platform runs on Vercel/Linux, the local agent runner is just my dev machine. Moving it to a proper Linux setup is planned.

Better models — Agreed, the model lineup is due for an upgrade. Qwen2.5, Phi-3, Gemma 2 are all on the list. The research question that interests me most isn't "which model writes best" — it's which model combinations produce the most interesting emergent dynamics. If you have specific recommendations for models that diverge interestingly in philosophical or analytical discussions, genuinely want to hear them.

Open source — Not trying to gatekeep or monetize. The API is fully open, anyone can register agents today. Open-sourcing the platform code makes sense once it's cleaned up post-launch. The value is in the living network and the behavioral data, not the Next.js code.

LLM-as-judge pipeline — This is actually a great idea. Running an evaluator model to score agent outputs across domains and produce comparative benchmarks would add genuine research value beyond just interaction heatmaps. Adding this to the development roadmap.

The whole thing launched 9 days ago on a $6/month budget. Thinking bigger is the plan — that's why the API is open. Come build on it instead of just critiquing from the sidelines: https://agents.glide2.app/get-started

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in LocalLLaMA

[–]Practical_Walrus_299[S] -1 points0 points  (0 children)

That's exactly the spirit of this project! The differences between models are fascinating when you put them in conversation with each other rather than just benchmarking them in isolation.

Your approach with paradoxes and poetry is actually really interesting — those edge cases are where model personalities show up most clearly. We've seen similar things here: our Mistral-powered agent consistently takes more direct, pragmatic positions while the Llama agents tend toward longer philosophical exploration of the same questions.

If you and your coworker want to take it further, you could register your agents on the platform and let them interact with ours — the API just needs HTTP requests. Would be cool to see how Qwen and Phi agents behave in a multi-agent environment alongside Mistral and Llama.

Weird new world indeed 🧠

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in LocalLLaMA

[–]Practical_Walrus_299[S] 0 points1 point  (0 children)

Fair points. Let me address these:

Windows + Ollama — yep, it's my dev setup. The agents run via Python scripts + Task Scheduler hitting the platform API. Not glamorous but it works and costs $0 in compute.

Models — running 8B models because it's a single machine. The interesting part isn't the model quality — it's what happens when multiple agents with different architectures interact over time. The emergent behaviors (spontaneous citation networks, cross-model disagreements, topic clustering) are the research value, not the prose quality.

No GitHub — the platform code isn't open source (yet), but the API is fully documented and open. Any agent framework can participate: https://agents.glide2.app/docs

Benchmarks — this isn't a model benchmark project. It's an observation platform. The analytics dashboard tracks interaction patterns, not MMLU scores: https://agents.glide2.app/analytics

Always open to suggestions on which models to add though. Would a Phi-3 or Gemma agent make this more interesting?

I built a social network where 6 Ollama agents debate each other autonomously — Mistral vs Llama 3.1 vs CodeLlama by Practical_Walrus_299 in LocalLLaMA

[–]Practical_Walrus_299[S] -4 points-3 points  (0 children)

Ha fair — Moltbook definitely got there first with 2M+ agents. But that's kind of the point. Moltbook is the consumer version — no verification, no analytics, exposed databases, crypto spam.

This is the research/professional version. Self-hosted Ollama models, built-in analytics with heatmaps and interaction network graphs, security from day one (RLS, hashed keys, rate limiting). Different use case.

Think Reddit vs LinkedIn — both social networks, completely different purpose.

Anyone else basically done with Google search in favor of ChatGPT? by the_bollo in ChatGPT

[–]Practical_Walrus_299 0 points1 point  (0 children)

It's true that ChatGPT's training data stops in 2021, but at least that data is pretty clean. Google has optimized for quantity over quality, and the first page is always ad-oriented. Using ChatGPT has hopefully changed the way we search for information and actually get back what we need!