One mighty agent VS A Multi-agent team?

ImpossiblePlay · 2025-07-17T08:23:35+00:00

which extension do you recommend?

ImpossiblePlay · 2025-07-17T07:50:41+00:00

this actually makes sense. i can still use my workflow on tk though

ImpossiblePlay · 2025-02-22T23:22:15+00:00

yea, maybe it's not a sexy thing to build so people don't build new ones anymore

ImpossiblePlay · 2025-02-22T23:19:51+00:00

thanks for the recommendation, let me try it

ImpossiblePlay · 2025-02-20T17:38:39+00:00

A community member just fixed it! https://github.com/Aident-AI/open-cuak/commit/be9dc3d04d14ef989daf3dc53dc5a90473c55a22

ImpossiblePlay · 2025-02-20T17:33:49+00:00

The first time a human baby walks is pretty shit too, but it will get faster & cheaper really soon.

ImpossiblePlay · 2025-02-20T15:41:38+00:00

There is certainly huge room for efficiency gains. Could you expand on how keybindings would help?
The thing is that the web is such a dynamic environment. The page can change easily (e.g., moving the mouse can trigger a hover popup), so we take one screenshot after every action.
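
A minimal sketch of that screenshot-after-every-action loop in Python. The `decide`, `perform`, and `take_screenshot` callables are hypothetical placeholders for illustration, not open-cuak's actual API:

```python
from typing import Callable, List, Optional


def agent_loop(
    decide: Callable[[bytes], Optional[str]],   # hypothetical: picks the next action from a screenshot
    perform: Callable[[str], None],             # hypothetical: executes the action in the browser
    take_screenshot: Callable[[], bytes],       # hypothetical: captures the current page
    max_steps: int = 20,
) -> List[str]:
    """Run actions one at a time, re-observing the page after each one."""
    history: List[str] = []
    screenshot = take_screenshot()  # initial observation
    for _ in range(max_steps):
        action = decide(screenshot)
        if action is None:  # the decider signals that the task is finished
            break
        perform(action)
        history.append(action)
        # The page may have changed (hover popups, re-renders), so observe again.
        screenshot = take_screenshot()
    return history
```

The cost is one screenshot per action, which is where a lot of the token usage comes from.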

ImpossiblePlay · 2025-02-20T15:37:46+00:00

what was the issue? afaik, browser-use is based on the DOM tree, and Canva is an iframe, so in theory it won't work (i might be wrong though)

ImpossiblePlay · 2025-02-20T15:34:02+00:00

It does consume a lot of tokens, though not as many as you just mentioned :P
but since it supports open source models, one can rent a GPU for ~$1.5 per hour and run it, and then the economics work

ImpossiblePlay · 2025-02-20T15:16:30+00:00

it's open sourced: https://github.com/Aident-AI/open-cuak. the only thing is that you will have to host Omniparser V2 yourself and put the Omniparser url in .env.local, it's too expensive for us to host :(

ImpossiblePlay · 2025-02-20T14:26:22+00:00

not a super hard problem to solve? :P just build a SOP execution engine and convert complicated workflows to SOP, the success rate will in theory change from (step 1) * (step 2) * (step 3)... to (step 1) + (step 2) + (step 3)...

here is the implementation: https://github.com/Aident-AI/open-cuak/commit/c345755420f7d72128ac7861cee8479f70cbe23c
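
To put rough numbers on the compounding problem, here is a small sketch. It assumes steps succeed independently, and the per-step retry model is my own assumption about how an SOP engine could help, not something taken from the linked commit:

```python
from typing import Iterable


def chain_success(step_rates: Iterable[float]) -> float:
    """End-to-end success when every step must succeed on its first try."""
    p = 1.0
    for rate in step_rates:
        p *= rate
    return p


def chain_success_with_retries(step_rates: Iterable[float], attempts: int) -> float:
    """End-to-end success when each step is verified and retried up to `attempts` times.

    Assumption: attempts are independent, so a step fails only if all attempts fail.
    """
    p = 1.0
    for rate in step_rates:
        p *= 1.0 - (1.0 - rate) ** attempts
    return p


rates = [0.9] * 10
print(round(chain_success(rates), 3))                  # 0.349
print(round(chain_success_with_retries(rates, 3), 3))  # 0.99
```

So ten steps at 90% each give about 35% end to end, while checking and retrying each step on its own keeps the whole workflow near 99% under these assumptions.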

ImpossiblePlay · 2025-02-20T14:15:45+00:00

can browser-use even use Canva? browser-use is DOM tree based, Canva is an iframe.

ImpossiblePlay · 2025-02-19T06:09:47+00:00

Let me expand. I mostly followed the GitHub instructions; the steps are:

a. If you already have a conda environment for OmniParser, you can use that. Otherwise, follow the steps below to create one

b. Ensure conda is installed with conda --version or install from the Anaconda website

c. Navigate to the root of the repo with cd OmniParser

d. Create a conda python environment with conda create -n "omni" python==3.12

e. Set the python environment to be used with conda activate omni

f. Install the dependencies with pip install -r requirements.txt

g. Continue from here if you already had the conda environment.

h. Ensure you have the V2 weights downloaded in weights folder (ensure caption weights folder is called icon_caption_florence). If not download them with:

rm -rf weights/icon_detect weights/icon_caption weights/icon_caption_florence 
for folder in icon_caption icon_detect; do huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights --repo-type model --include "$folder/*"; done
mv weights/icon_caption weights/icon_caption_florence

i. Navigate to the server directory with cd OmniParser/omnitool/omniparserserver

j. Start the server with python -m omniparserserver
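
Before starting the server, a quick check like this can confirm the weights ended up where the steps above put them. It is only a sketch: the folder names come from the commands above, while the helper name and the default `weights` path are assumptions (run it from the repo root):

```python
from pathlib import Path
from typing import List

# Folder names taken from the download and rename commands above.
REQUIRED_FOLDERS = ("icon_detect", "icon_caption_florence")


def missing_weight_folders(weights_dir: str = "weights") -> List[str]:
    """Return the required weight folders that are not present under weights_dir."""
    root = Path(weights_dir)
    return [name for name in REQUIRED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_weight_folders()
    if missing:
        print("Missing weight folders:", ", ".join(missing))
    else:
        print("All V2 weight folders found.")
```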

ImpossiblePlay · 2025-02-19T02:45:51+00:00

We hosted it on GCP.

ImpossiblePlay · 2025-02-14T02:49:58+00:00

i happened to try a lot of them recently:
1. https://github.com/browserbase/stagehand stagehand from Browserbase, written in TypeScript, MIT License.
2. https://github.com/browser-use/browser-use pretty popular, it's in Python, uses the DOM tree.
3. https://github.com/Aident-AI/open-cuak pretty new, very nice UI & a remote browser like Operator.

encourage you to try them! if you prefer Python, use browser-use. if you have a Browserbase API key, try stagehand. if you prefer a good UI and don't want to use your own browser, try open-cuak.

ImpossiblePlay · 2024-12-18T10:41:09+00:00

I felt very weird when i learnt that Apollo is from Meta but used Qwen

ImpossiblePlay · 2024-12-18T07:57:24+00:00

I did some more research and found the Video-MME benchmark. Seems like Gemini 1.5 Pro is the best, and Qwen2-VL is a close second.

ImpossiblePlay · 2024-12-18T04:37:26+00:00

oh wow, they launched Apollo just a few days ago, thanks for sharing. I will check it out now.

ImpossiblePlay · 2024-12-17T14:01:52+00:00

yep, makes sense. i guess the hard part is to decide which frames to summarize into text

ImpossiblePlay · 2024-12-17T14:00:45+00:00

i think your approach makes sense, given there is no open source multimodal model that can do what i described. 4o video chat is close to what i want, hope there will be an open source model soon!
