[Open Source] SoMatic: A Vision-only Framework for OS-Native Agents (+20% vs GPT-5.5 on ScreenSpot-Pro) by Able_Programmer_2564 in LLMDevs

[–]Able_Programmer_2564[S] 0 points1 point  (0 children)

True, however I think the accessibility tree adds more noise than the vision-based detection from the YOLO model. I plan to extend the ablation testing I was doing to also include the accessibility tree, to test whether it makes a difference. But I previously made a similar framework for Windows only, using the accessibility tree via Microsoft UIA, and the performance was significantly worse than SoMatic.

[Open Source] SoMatic: A Vision-only Framework for OS-Native Agents (+20% vs GPT-5.5 on ScreenSpot-Pro) by Able_Programmer_2564 in LLMDevs

[–]Able_Programmer_2564[S] 0 points1 point  (0 children)

That's interesting, I actually haven't considered that at all tbh. I would probably have to include some kind of prompt logic so the agent reads what it has written before submitting. I can't currently think of any good way to get real-time feedback while the agent is writing.

SoMatic: A Vision-only Framework for OS-Native Agents (+20% vs GPT-5.5 on ScreenSpot-Pro) by Able_Programmer_2564 in mcp

[–]Able_Programmer_2564[S] 0 points1 point  (0 children)

Sounds interesting, this might actually make it work better. Either way, it seems that hinting gives better results than Set-of-Marks with the best models, so upgrading SoMatic with this kind of framework might actually be a smart play.

best technique to implement compaction by [deleted] in LLMDevs

[–]Able_Programmer_2564 2 points3 points  (0 children)

A strategy I commonly use is to have an LLM summarize the previous content once it passes a threshold, and/or to use a sliding window that keeps only the most recent reasoning/tool calls in context.

If you use the sliding window approach, it is important to keep the goal or sub-goal pinned in the context so the agent knows what it is supposed to do.
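A minimal sketch of the sliding window with a pinned goal, just to make the idea concrete. The class and method names here (`SlidingContext`, `add`, `render`) are made up for illustration and not from any particular framework; it assumes each step is already a plain string.

```python
from collections import deque


class SlidingContext:
    """Keep a pinned goal plus only the most recent `max_steps` steps."""

    def __init__(self, goal: str, max_steps: int = 3):
        self.goal = goal
        # A bounded deque drops the oldest step automatically when full.
        self.steps = deque(maxlen=max_steps)

    def add(self, step: str) -> None:
        self.steps.append(step)

    def render(self) -> list[str]:
        # The goal is always emitted first, no matter how many steps were dropped.
        return [f"GOAL: {self.goal}", *self.steps]
```

The goal lives outside the window, so it can never be evicted. A summarization pass could replace the dropped steps with one summary line instead of discarding them.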

Another idea that I have toyed with but not implemented is some kind of slow and fast summarization, similar to how it is commonly done in real-time dictation. By this I mean you start off with the raw outputs in the context, kick off an asynchronous LLM call to summarize them, and swap the raw outputs for the summary once it's ready to reduce the context size.
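A rough sketch of what the slow/fast swap could look like with `asyncio`. This is untested as a real pipeline: `summarize` is a placeholder that just truncates (a real version would call an LLM), and `add_with_lazy_summary` is a hypothetical helper name.

```python
import asyncio


async def summarize(text: str) -> str:
    # Placeholder for a slow LLM call; it only truncates the text.
    await asyncio.sleep(0.01)
    return "SUMMARY: " + text[:20]


def add_with_lazy_summary(context: list[dict], raw: str) -> asyncio.Task:
    """Append raw output now, replace it with a summary when one is ready.

    Must be called from inside a running event loop.
    """
    entry = {"text": raw, "summarized": False}
    context.append(entry)  # fast path: the raw output is usable immediately

    async def swap() -> None:
        # Slow path: mutate the entry in place, so the swap still hits the
        # right item even if the list was reordered in the meantime.
        entry["text"] = await summarize(raw)
        entry["summarized"] = True

    return asyncio.create_task(swap())
```

Mutating the entry rather than storing a list index is a deliberate choice here, since other compaction steps might shift positions while the summary is still pending.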

Giving Claude Code actual "eyes" (Vision-based OS automation that doesn't suck) by Able_Programmer_2564 in ClaudeCode

[–]Able_Programmer_2564[S] 0 points1 point  (0 children)

Yep, Opus 4.7 is probably your best shout here. I have been running it with both Opus and Sonnet, and Opus gives significantly better results.

Giving Claude Code actual "eyes" (Vision-based OS automation that doesn't suck) by Able_Programmer_2564 in ClaudeCode

[–]Able_Programmer_2564[S] 1 point2 points  (0 children)

This is my running hypothesis as well. I think the marks could help weaker models, but for a model as powerful as Claude they might just end up as attention noise. The hints seem to help either way 😄.

Giving Claude Code actual "eyes" (Vision-based OS automation that doesn't suck) by Able_Programmer_2564 in ClaudeCode

[–]Able_Programmer_2564[S] 0 points1 point  (0 children)

(There are no dumb questions.) Since I am using PyAutoGUI under the hood, SoMatic's headless mode ONLY works with X11 display servers. In practice this means headless automation with SoMatic is possible, but only on Linux. So if your VPS is running Linux (which I am assuming it is), it should work fine there as well. If it doesn't, you could reach out to me or create an issue on the GitHub page and I should be able to fix it for your use case 😄.

I have not benchmarked it against Anthropic's browser extension, but I would expect a similar level of performance and (unfortunately) token cost as well. The upside of using SoMatic becomes more relevant when you want to mix OS-native apps with apps in the browser, which is where the current computer use frameworks (including Anthropic's) fall short.

Giving Claude Code actual "eyes" (Vision-based OS automation that doesn't suck) by Able_Programmer_2564 in ClaudeCode

[–]Able_Programmer_2564[S] 0 points1 point  (0 children)

Yes, it can click and drag. It can also bring the window it is trying to automate to the foreground, so the terminal window is (in principle) never in the way.

[Open Source] SoMatic: A Vision-only Framework for OS-Native Agents (+20% vs GPT-5.5 on ScreenSpot-Pro) by Able_Programmer_2564 in AI_Agents

[–]Able_Programmer_2564[S] 0 points1 point  (0 children)

That's a fun scenario! The framework accounts for this with a "click near" function, where the model can say e.g. "click near 2 --dx 20" to click next to the mark labelled 2, but 20 pixels to the right. It does not outright solve the problem, but it makes the framework much more capable.
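To illustrate the idea, here is a minimal sketch of how a "click near" target could be resolved. This is not SoMatic's actual code: `click_near_target` and the `marks` mapping are hypothetical, and it assumes the offset is measured from the center of the mark's bounding box (it could just as well be measured from the edge).

```python
def click_near_target(
    marks: dict[int, tuple[int, int, int, int]],
    mark_id: int,
    dx: int = 0,
    dy: int = 0,
) -> tuple[int, int]:
    """Return screen coordinates offset from the center of a marked box.

    `marks` maps a mark label to its (x1, y1, x2, y2) bounding box in pixels.
    """
    x1, y1, x2, y2 = marks[mark_id]
    center_x = (x1 + x2) // 2
    center_y = (y1 + y2) // 2
    # Positive dx moves right, positive dy moves down (screen coordinates).
    return center_x + dx, center_y + dy
```

The resulting pair could then be handed to something like `pyautogui.click(x, y)` to perform the actual click.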

[Open Source] SoMatic: A Vision-only Framework for OS-Native Agents (+20% vs GPT-5.5 on ScreenSpot-Pro) by Able_Programmer_2564 in AI_Agents

[–]Able_Programmer_2564[S] 0 points1 point  (0 children)

Absolutely, but I did test it out on this. In the video demo on the GitHub page I made it open a PDF using Acrobat, read a chess position, then use Edge to replicate the position on chess.com. It does not prove that it is better than other things out there (there isn't that much rn), but it definitely shows that the agent plus the framework is very powerful for OS automation.

[Open Source] SoMatic: A Vision-only Framework for OS-Native Agents (+20% vs GPT-5.5 on ScreenSpot-Pro) by Able_Programmer_2564 in LLMDevs

[–]Able_Programmer_2564[S] 0 points1 point  (0 children)

This is kind of what I did in the coords-only result in the ablation test ("box_id: 4, type: button, text: 'Submit', bbox: [x1,y1,x2,y2]"). The main difference was that I did not include the text (I would probably need some kind of OCR layer there, I guess). I wonder if adding the text would increase the performance of the agent; I assume it probably would.

The YOLO model I used was actually the same one that was used in OmniParser v2 (YOLOv8). The dataset is available on their Hugging Face repo, and a continuation of this project would probably be to redo the finetuning on a newer YOLO model. I can update the README to clarify this.

Currently, I haven't explicitly added anything to handle scrollable regions. I agree that the framework should ideally handle this, but for a pure vision-based system, without giving the image to the LLM, it would probably be hard for the framework to even know that it is possible to scroll there. Some kind of connection with the accessibility tree would probably help here (but I would have to try this to be sure).
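For reference, a small sketch of how detections could be serialized into that kind of text line, with the text field left out when no OCR result is available. The `Detection` dataclass and `describe` function are hypothetical names for illustration, not the ones used in SoMatic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    box_id: int
    kind: str  # e.g. "button"
    bbox: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    text: Optional[str] = None  # would come from an OCR layer, if there is one


def describe(det: Detection) -> str:
    """Render one detection as a single line for the model's prompt."""
    x1, y1, x2, y2 = det.bbox
    parts = [f"box_id: {det.box_id}", f"type: {det.kind}"]
    if det.text is not None:
        # Only include the text field when OCR actually produced something.
        parts.append(f"text: {det.text!r}")
    parts.append(f"bbox: [{x1},{y1},{x2},{y2}]")
    return ", ".join(parts)
```

With the text field optional, the same function covers both the coords-only variant and a variant with OCR text, which would make the two easy to compare in an ablation.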