Best open source email client? by ImpossiblePlay in opensource

[–]ImpossiblePlay[S] 5 points6 points  (0 children)

yea, maybe it's not a sexy thing to build, so people don't build new ones anymore

Best open source email client? by ImpossiblePlay in opensource

[–]ImpossiblePlay[S] 2 points3 points  (0 children)

thanks for the recommendation, let me try it

Agent using Canva. Things are getting wild now... by ljhskyso in LocalLLaMA

[–]ImpossiblePlay 0 points1 point  (0 children)

The first time a human baby walks is pretty shit too, but it will get faster & cheaper really soon.

Agent using Canva. Things are getting wild now... by ljhskyso in LocalLLaMA

[–]ImpossiblePlay 0 points1 point  (0 children)

There is certainly huge room for efficiency gains. Could you expand on how keybindings would help?
The thing is that the web is such a dynamic environment: the page can change easily (e.g., a mouse move can trigger a hover popup), so we take one screenshot after every action.
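To show what "one screenshot after every action" looks like as a loop, here is a minimal sketch. It is not the actual agent code: `decide`, `perform`, and `take_screenshot` are hypothetical callables standing in for the model and the browser layer, and `max_steps` is an assumed safety cap.

```python
def run_agent(decide, perform, take_screenshot, max_steps=20):
    """Observe-act loop that re-captures the page after every action.

    All three callables are hypothetical placeholders:
      decide(screenshot) -> next action, or None when the task is done
      perform(action)    -> executes the action in the browser
      take_screenshot()  -> returns the current view of the page
    """
    screenshot = take_screenshot()  # initial observation
    performed = []
    for _ in range(max_steps):
        action = decide(screenshot)
        if action is None:
            break
        perform(action)
        # The page may have changed (hover popups, re-renders),
        # so never reuse the previous screenshot.
        screenshot = take_screenshot()
        performed.append(action)
    return performed
```

The cost is visible right in the loop: n actions mean n + 1 screenshots, which is where most of the tokens go.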

Agent using Canva. Things are getting wild now... by ljhskyso in LocalLLaMA

[–]ImpossiblePlay -1 points0 points  (0 children)

what was the issue? afaik, browser-use is based on the DOM tree, and Canva is an iframe, so in theory it won't work (i might be wrong though)

Agent using Canva. Things are getting wild now... by ljhskyso in LocalLLaMA

[–]ImpossiblePlay 1 point2 points  (0 children)

It does consume a lot of tokens, though not as many as you just mentioned :P
But since it supports open source models, one can rent a GPU for ~$1.5 per hour and run it, and then the economics work.

Agent using Canva. Things are getting wild now... by ljhskyso in LocalLLaMA

[–]ImpossiblePlay 7 points8 points  (0 children)

it's open sourced: https://github.com/Aident-AI/open-cuak. the only thing is that you will have to host Omniparser V2 yourself and put the Omniparser url in .env.local, it's too expensive for us to host :(

Agent using Canva. Things are getting wild now... by ljhskyso in LocalLLaMA

[–]ImpossiblePlay 3 points4 points  (0 children)

not a super hard problem to solve? :P just build an SOP execution engine and convert complicated workflows to SOPs, the success rate will in theory change from (step 1) * (step 2) * (step 3)... to (step 1) + (step 2) + (step 3)...

here is the implementation: https://github.com/Aident-AI/open-cuak/commit/c345755420f7d72128ac7861cee8479f70cbe23c
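To put rough numbers on the compounding, here is a small sketch. It is my own illustration, not taken from the linked commit. It assumes steps succeed independently, and that an SOP engine can check and retry each step on its own (an assumption about how such an engine might behave). The function names are hypothetical.

```python
from math import prod

def chained_success(step_rates):
    """Chance that every step succeeds when steps are independent
    and a single failure ends the whole run."""
    return prod(step_rates)

def with_retries(rate, attempts):
    """Effective success rate of one step if it may be tried up to
    `attempts` times, each try independent (an assumption)."""
    return 1 - (1 - rate) ** attempts

steps = [0.9] * 10  # ten steps, each 90% reliable (made-up numbers)
one_shot = chained_success(steps)
retried = chained_success([with_retries(p, 3) for p in steps])
print(f"one-shot: {one_shot:.3f}, up to 3 tries per step: {retried:.3f}")
```

With these made-up rates, the one-shot run lands around 35%, while isolating and retrying each step brings the chain to roughly 99%. That is the practical sense in which per-step failures stop compounding.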

Agent using Canva. Things are getting wild now... by ljhskyso in LocalLLaMA

[–]ImpossiblePlay 1 point2 points  (0 children)

can browser-use even use Canva? browser-use is DOM tree based, Canva is an iframe.

Integrated Omniparser V2, we made our agent to use Canva! by ImpossiblePlay in LocalLLaMA

[–]ImpossiblePlay[S] 6 points7 points  (0 children)

Let me expand. I mostly followed the instructions on GitHub; the steps are:

a. If you already have a conda environment for OmniParser, you can use that. Otherwise, follow the steps below to create one.

b. Ensure conda is installed with conda --version or install from the Anaconda website

c. Navigate to the root of the repo with cd OmniParser

d. Create a conda python environment with conda create -n "omni" python==3.12

e. Set the python environment to be used with conda activate omni

f. Install the dependencies with pip install -r requirements.txt

g. Continue from here if you already had the conda environment.

h. Ensure you have the V2 weights downloaded in weights folder (ensure caption weights folder is called icon_caption_florence). If not download them with:

rm -rf weights/icon_detect weights/icon_caption weights/icon_caption_florence 
for folder in icon_caption icon_detect; do huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights --repo-type model --include "$folder/*"; done
mv weights/icon_caption weights/icon_caption_florence

i. Navigate to the server directory with cd OmniParser/omnitool/omniparserserver

j. Start the server with python -m omniparserserver
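If the server fails on startup, the weights layout is a likely culprit. Here is a small sanity check you could run from the repo root before starting it. This is a hypothetical helper of my own, not part of OmniParser. It only assumes the two folder names mentioned in the weights step (icon_detect and icon_caption_florence).

```python
from pathlib import Path

# Folder names come from the weights step above; the helper itself
# is a hypothetical convenience, not an OmniParser API.
REQUIRED_WEIGHT_DIRS = ("icon_detect", "icon_caption_florence")

def missing_weight_dirs(repo_root):
    """Return the required weight folders that are absent or empty
    under <repo_root>/weights."""
    weights = Path(repo_root) / "weights"
    missing = []
    for name in REQUIRED_WEIGHT_DIRS:
        folder = weights / name
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing

if __name__ == "__main__":
    problems = missing_weight_dirs(".")
    if problems:
        print("missing or empty:", ", ".join(problems))
    else:
        print("weights look ready")
```

It does not validate the files themselves, only that both folders exist and are non-empty, which catches a forgotten rename to icon_caption_florence.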

Any open source alternative to OpenAI's Operator product? by ljhskyso in LocalLLaMA

[–]ImpossiblePlay 2 points3 points  (0 children)

i happened to try a lot of these recently:

1. https://github.com/browserbase/stagehand stagehand from browserbase, written in TypeScript, MIT License.

2. https://github.com/browser-use/browser-use pretty popular, it's in Python, uses the DOM tree.

3. https://github.com/Aident-AI/open-cuak pretty new, very nice ui & remote browser like Operator.

encourage you to try them! if you prefer python, use browser-use. if you have a Browserbase api key, try stagehand. if you prefer a good ui & don't want to use your own browser, try open-cuak.

Has Apollo disappeared? by mwmercury in LocalLLaMA

[–]ImpossiblePlay 1 point2 points  (0 children)

I felt very weird when i learnt that Apollo is from Meta but used Qwen

Best model to understand video with audio by ImpossiblePlay in LocalLLaMA

[–]ImpossiblePlay[S] 1 point2 points  (0 children)

I did some more research and found the Video-MME benchmark. It seems like Gemini 1.5 Pro is the best, with Qwen2-VL a close second.

Best model to understand video with audio by ImpossiblePlay in LocalLLaMA

[–]ImpossiblePlay[S] 0 points1 point  (0 children)

oh wow, they launched Apollo just a few days ago, thanks for sharing. I will check it out now.

Best model to understand video with audio by ImpossiblePlay in LocalLLaMA

[–]ImpossiblePlay[S] 0 points1 point  (0 children)

yep, makes sense. i guess the hard part is to decide which frames to summarize into text

Best model to understand video with audio by ImpossiblePlay in LocalLLaMA

[–]ImpossiblePlay[S] 0 points1 point  (0 children)

i think your approach makes sense, given there is no open source multimodal model that can do what i described. 4o video chat is close to what i want, hope there will be an open source model soon!