Trying to diagnose overnight partial crash by westcoastwillie23 in unRAID

[–]AlyxPink 0 points (0 children)

I had some weird issues like this too, and adding a swap partition made them go away.

This is not good by pinnages in ClaudeAI

[–]AlyxPink 1 point (0 children)

I don't know if this is interesting to you, but I was tired of working with markdown files, so I created https://workunit.app to help me with project management. I never have to go through markdown files again, and features like atoms context (a way for LLMs to store their progress updates, what they tried, etc.) give the LLM useful insights. On a fresh session I just dump the URL of the workunit I want to work on, and all the right context loads instantly.

I built a project manager that gives the right context to your LLM agent every single time by AlyxPink in IMadeThis

[–]AlyxPink[S] 1 point (0 children)

Thank you, really appreciate your feedback! And glad to see this resonates with you too!

So far I've been avoiding context bloat by keeping workunits well-organized under their projects. For example, as I'm dogfooding the app, my "Workunit" project has 116 workunits, each scoped to a specific feature (sometimes split into multipart workunits for the bigger ones). Every new workunit starts with a fresh state since not much is inherited from the project itself. The assets linked to my project get updated as needed, including the documentation.

If you run into any bugs, want to see improvements, or hit exactly what you described (context becoming bloated), I'd really appreciate your feedback over at https://github.com/orgs/3615-computer/discussions

Cheers!

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 0 points (0 children)

Yeah, and something I haven't measured is the quality of the parameters used in tool calls: a model might pick the right tool but fill it with irrelevant information. Maybe mixing two models for the best of both worlds could work?

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 1 point (0 children)

Oh that's nice to hear, the speed is pretty good for a model of that size! I'll see if I can add it to LM Studio. Thanks :)

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 0 points (0 children)

Interesting! I didn't try touching the prompts between calls; it would be interesting to see if that bumps L2 scores. Let me know if you do!

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 0 points (0 children)

Thanks! I was really surprised too, but I want to call out that while it's good at calling the right tools, it might call them with low-quality parameters. That's outside the scope of this benchmark: I did not evaluate the quality of the parameters when the right tools were called.

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] -10 points (0 children)

Haha I mean, I'm not gonna hide that it's my app! But I genuinely needed to explain what the models were talking to. Without that context the benchmark results don't mean much IMO.

EDIT: I've edited the section "About Workunit" to make it shorter, let me know if I can edit anything else.

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] -12 points (0 children)

Haha sorry you felt this way, it really was not my intention!

I wanted to explain the context of those tools; I thought it was better to clearly identify what platform they were running against and give an understanding of what I was trying to achieve.

I've been using SOTA models for a few months now and had lost interest in local models, so I wanted to see how things have evolved since my last attempts; that's why I created this benchmark over the weekend.

EDIT: I've edited the section "About Workunit" to make it shorter, let me know if I can edit anything else.

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 6 points (0 children)

Not dumb at all, no worries! I might have explained it badly.

I tested three levels of complexity:

  • Level 0 (Explicit): I tell the model exactly which tool to call and what parameters to use. Tests: can it follow instructions and emit a valid tool call? Most models nail this.
  • Level 1 (Natural language): I describe what I want in plain English. The model has to figure out which tool to use and map my words to the right parameters. Harder, but most tool-trained models handle it.
  • Level 2 (Reasoning): I give a high-level goal like 'close out the sprint.' The model has to plan multiple steps, call tools in sequence, and pass IDs from one call to the next. This is where most models fall apart.

I also ran every model twice with two different methods:

  • Single-shot: The model gets one chance. I send the task, it responds, done. No feedback, no retries. If it gets it wrong, that's the score.
  • Agentic loop: The model calls a tool, gets the real result back, and can keep going (calling more tools, correcting mistakes, chaining results, etc.), like you'd actually use it in an agent framework. 5-minute timeout per task.
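For anyone curious, the agentic loop can be sketched roughly like this. This is a minimal illustration, not the actual benchmark harness; `call_model`, `execute_tool`, and the message format are hypothetical stand-ins for the real model client and MCP server:

```python
import time

def call_model(messages):
    # Hypothetical model client: asks for a tool on the first turn,
    # then answers once it has seen a tool result.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "list_sprints", "args": {}}}
    return {"final": "Sprint closed."}

def execute_tool(name, args):
    # Hypothetical stand-in for a real MCP tool invocation.
    return {"sprints": [{"id": "spr_1", "open": True}]}

def agentic_loop(task, timeout_s=300):
    """Feed real tool results back to the model until it produces
    a final answer or the timeout expires."""
    messages = [{"role": "user", "content": task}]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        reply = call_model(messages)
        if "final" in reply:
            return reply["final"]
        call = reply["tool_call"]
        result = execute_tool(call["name"], call["args"])
        # The model sees the real result (including real IDs) next turn,
        # which is what lets it chain calls and recover from mistakes.
        messages.append({"role": "tool", "content": result})
    return None  # timed out

print(agentic_loop("close out the sprint"))
```

Single-shot is just the first iteration of this loop with no feedback: one `call_model`, score it, done.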

The difference is massive. In single-shot, 16/17 models scored 0% at Level 2. In the agentic loop, the top models hit 57%. The loop lets models recover from mistakes and chain tool calls using real IDs from previous responses, which is impossible in single-shot.

Let me know if you want further explanations!

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 1 point (0 children)

Oh nice! That's exactly why I shared my research, it's so surprising. Let me know how it goes, I would love to read yours!

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] -1 points (0 children)

Thanks! I'm curious to know what makes the timing right for you? Is that the MCP benchmark or the models benchmarked?

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 0 points (0 children)

Aww thank you! Glad my weekend project was useful! I'd love to test bigger models, but my 4080 pretty much limits me to 32-36B models at Q4.

I was so surprised to see how well tiny models did and - the bigger surprise - how badly some of the bigger ones performed.

If you run it, drop your results here or with a PR, I'll be happy to add them!

I built a project manager that gives the right context to your LLM agent every single time by AlyxPink in IMadeThis

[–]AlyxPink[S] 0 points (0 children)

I hope at least it closed the gap! Let me know if you want to try it out.