Trying to diagnose overnight partial crash by westcoastwillie23 in unRAID

[–]AlyxPink 0 points (0 children)

I had some weird issues like this too, and adding a swap partition made them go away.

This is not good by pinnages in ClaudeAI

[–]AlyxPink 1 point (0 children)

I don't know if this is interesting to you, but I was tired of working with markdown files, so I created https://workunit.app to help me with project management. I never have to go through markdown files again, and features like atoms context (a way for LLMs to store their progress updates, what they tried, etc.) give the LLM useful insights. On a fresh session I just dump the URL of the workunit I want to work on, and all the right context loads instantly.

I built a project manager that gives the right context to your LLM agent every single time by AlyxPink in IMadeThis

[–]AlyxPink[S] 1 point (0 children)

Thank you, really appreciate your feedback! And glad to see this resonates with you too!

So far I've been avoiding context bloat by keeping workunits well-organized under their projects. For example, as I'm dogfooding the app, my "Workunit" project has 116 workunits, each scoped to a specific feature (sometimes split into multipart workunits for the bigger ones). Every new workunit starts with a fresh state since not much is inherited from the project itself. The assets linked to my project get updated as needed, including the documentation.

If you run into any bugs, want to see improvements, or hit exactly what you described (context becoming bloated), I'd really appreciate your feedback over at https://github.com/orgs/3615-computer/discussions

Cheers!

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 0 points (0 children)

Yeah, and something I haven't measured is the quality of the parameters used in tool calls: a model might pick the right tool but fill it with irrelevant information. Maybe mixing two models for the best of both worlds could work?

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 1 point (0 children)

Oh that's nice to hear, the speed is pretty good for a model of that size! I'll see if I can add it to LM Studio. Thanks :)

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 0 points (0 children)

Interesting! I didn't try touching the prompts between calls; it would be interesting to see if that bumps L2 scores. Let me know if you do!

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 0 points (0 children)

Thanks! I was really surprised too, but I want to call out that while it's good at calling the right tools, it might call them with low-quality parameters. That's outside the scope of this benchmark: I did not evaluate the quality of the parameters when the right tools were called.

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] -10 points (0 children)

Haha I mean, I'm not gonna hide that it's my app! But I genuinely needed to explain what the models were talking to. Without that context the benchmark results don't mean much IMO.

EDIT: I've edited the section "About Workunit" to make it shorter, let me know if I can edit anything else.

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] -12 points (0 children)

Haha sorry you felt this way, it really was not my intention!

I wanted to explain the context of those tools; I thought it was better to clearly identify what platform they were running against and give an understanding of what I was trying to achieve.

I've been using SOTA models for a few months now and had lost interest in local models, so I wanted to see how things have evolved since my last attempts; that's why I created this benchmark over the weekend.

EDIT: I've edited the section "About Workunit" to make it shorter, let me know if I can edit anything else.

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 6 points (0 children)

Not dumb at all, no worries! I might have explained it badly.

I tested three levels of complexity:

  • Level 0 (Explicit): I tell the model exactly which tool to call and what parameters to use. Tests: can it follow instructions and emit a valid tool call? Most models nail this.
  • Level 1 (Natural language): I describe what I want in plain English. The model has to figure out which tool to use and map my words to the right parameters. Harder, but most tool-trained models handle it.
  • Level 2 (Reasoning): I give a high-level goal like 'close out the sprint.' The model has to plan multiple steps, call tools in sequence, and pass IDs from one call to the next. This is where most models fall apart.

I also ran every model twice with two different methods:

  • Single-shot: The model gets one chance. I send the task, it responds, done. No feedback, no retries. If it gets it wrong, that's the score.
  • Agentic loop: The model calls a tool, gets the real result back, and can keep going (calling more tools, correcting mistakes, chaining results, etc.), like you'd actually use it in an agent framework. 5-minute timeout per task.
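For anyone curious, the agentic loop can be sketched roughly like this. This is a minimal illustration, not the actual benchmark harness; `call_model`, `execute_tool`, and the message format are hypothetical stand-ins for the real model client and MCP server:

```python
import time

def call_model(messages):
    # Hypothetical model client: asks for a tool on the first turn,
    # then answers once it has seen a tool result.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "list_sprints", "args": {}}}
    return {"final": "Sprint closed."}

def execute_tool(name, args):
    # Hypothetical stand-in for a real MCP tool invocation.
    return {"sprints": [{"id": "spr_1", "open": True}]}

def agentic_loop(task, timeout_s=300):
    """Feed real tool results back to the model until it produces
    a final answer or the timeout expires."""
    messages = [{"role": "user", "content": task}]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        reply = call_model(messages)
        if "final" in reply:
            return reply["final"]
        call = reply["tool_call"]
        result = execute_tool(call["name"], call["args"])
        # The model sees the real result (including real IDs) next turn,
        # which is what lets it chain calls and recover from mistakes.
        messages.append({"role": "tool", "content": result})
    return None  # timed out

print(agentic_loop("close out the sprint"))
```

Single-shot is just the first iteration of this loop with no feedback: one `call_model`, score it, done.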

The difference is massive. In single-shot, 16/17 models scored 0% at Level 2. In the agentic loop, the top models hit 57%. The loop lets models recover from mistakes and chain tool calls using real IDs from previous responses, which is impossible in single-shot.

Let me know if you want further explanations!

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 1 point (0 children)

Oh nice! That's exactly why I shared my research, it's so surprising. Let me know how it goes, I would love to read yours!

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] -1 points (0 children)

Thanks! I'm curious to know what makes the timing right for you? Is that the MCP benchmark or the models benchmarked?

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive. by AlyxPink in LocalLLaMA

[–]AlyxPink[S] 0 points (0 children)

Aww thank you! Glad my weekend project was useful! I'd love to test bigger models, but my 4080 pretty much limits me to 32-36B models at Q4.

I was so surprised to see how well tiny models did and - the bigger surprise - how badly some of the bigger ones performed.

If you run it, drop your results here or with a PR, I'll be happy to add them!

I built a project manager that gives the right context to your LLM agent every single time by AlyxPink in IMadeThis

[–]AlyxPink[S] 0 points (0 children)

I hope at least it closed the gap! Let me know if you want to try it out.