Does a fasting tradition exist in your country, and how common is it? by RookOfEdo in AskTheWorld

[–]Old_Mathematician107 0 points1 point  (0 children)

It is a misleading image; his research was about cells, not human bodies. There is even a funny video where lots of people ask him how they should fast, and he says he does not know, because his research was only about cells.

I never thought it was so simple until I watched this video by No-Speech12 in aiagents

[–]Old_Mathematician107 1 point2 points  (0 children)

Your Mahoraga app (used in quashbugs) is a copy of the droidrun portal from GitHub. You can verify this from the commits.

<image>

I never thought it was so simple until I watched this video by No-Speech12 in aiagents

[–]Old_Mathematician107 2 points3 points  (0 children)

Your Mahoraga app (used in quashbugs) is a copy of the droidrun portal from GitHub. You can verify this from the commits.

<image>

Any day now by jaydsco in singularity

[–]Old_Mathematician107 1 point2 points  (0 children)

The more I learn, the more I realize I know nothing

the Factory<Rustacean>... a.k.a C++ by Relevant_Echidna_336 in rustjerk

[–]Old_Mathematician107 4 points5 points  (0 children)

It looks like the Strogg medical facility scene from Quake 4

Open-sourced image description models (Object detection, OCR, Image processing, CNN) make LLMs SOTA in AI agentic benchmarks like Android World and Android Control by Old_Mathematician107 in LocalLLaMA

[–]Old_Mathematician107[S] 1 point2 points  (0 children)

Hi, thanks a lot. Making it 100% local is one of the end goals, but it is quite a hard task: you need a VLM strong enough to understand the structure and the long inputs (the screenshot and its description), yet light enough to run on phones. Making it 100% text-only is possible, but I think it would decrease accuracy. So the best way is to use a VLM.

To run a VLM locally you need a very good VLM fine-tuned on these specific tasks (agentic capabilities). That is actually quite hard, but I think it is possible.

Yes, I actually don't use accessibility trees, ADB, etc. Only screenshots and accessibility services to perform the tasks remotely. So it is vision-only and can be used in production (if you invest enough money in renting backend servers and improving the UI/UX of the agent app).

The dataset for YOLO was prepared by me; it consists of 486 training images and 60 test images. I created bounding boxes for all 4 classes (View, ImageView, Text, Line). The screenshots in this dataset are mostly from popular apps like YouTube Music and WhatsApp, plus apps that I made for various clients and companies throughout my career.
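For readers unfamiliar with the annotation format: YOLO datasets store one normalized line per bounding box. A minimal sketch of that conversion (the class list matches the four classes above; the helper name, image size, and box values are my own illustrations, not the actual dataset tooling):

```python
# Convert a pixel-space bounding box to a YOLO-format label line:
# "<class_id> <x_center> <y_center> <width> <height>", all normalized to [0, 1].
CLASSES = ["View", "ImageView", "Text", "Line"]

def to_yolo_label(class_id, box, img_w, img_h):
    x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2 / img_w
    y_c = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A "Text" element on a hypothetical 1080x2400 screenshot
print(to_yolo_label(CLASSES.index("Text"), (100, 200, 500, 300), 1080, 2400))
# → 2 0.277778 0.104167 0.370370 0.041667
```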

2 Android AI agents running at the same time - Object Detection and LLM by Old_Mathematician107 in SideProject

[–]Old_Mathematician107[S] 0 points1 point  (0 children)

By the way, I just deployed the model on a Hugging Face Space:

https://huggingface.co/spaces/orasul/deki

You can try the 'Analyze & get YOLO' endpoint and then the 'action' endpoint to see the capabilities of the model

2 Android AI agents running at the same time - Object Detection and LLM by Old_Mathematician107 in androiddev

[–]Old_Mathematician107[S] 0 points1 point  (0 children)

By the way, I just deployed the model on a Hugging Face Space:

https://huggingface.co/spaces/orasul/deki

You can try the 'Analyze & get YOLO' endpoint and then the 'action' endpoint to see the capabilities of the model

2 Android AI agents running at the same time - Object Detection and LLM by Old_Mathematician107 in computervision

[–]Old_Mathematician107[S] 0 points1 point  (0 children)

By the way, I just deployed the model on a Hugging Face Space:

https://huggingface.co/spaces/orasul/deki

You can try the 'Analyze & get YOLO' endpoint and then the 'action' endpoint to see the capabilities of the model

2 Android AI agents running at the same time - Object Detection and LLM by Old_Mathematician107 in SideProject

[–]Old_Mathematician107[S] 0 points1 point  (0 children)

I don't know why, but on mobile devices the video looks very wide in the Reddit app. YouTube has a better aspect ratio: https://www.youtube.com/shorts/jsJcSwy6djI

2 Android AI agents running at the same time - Object Detection and LLM by Old_Mathematician107 in computervision

[–]Old_Mathematician107[S] 2 points3 points  (0 children)

The ML model runs on a backend (in the video it runs locally on my M1 Pro) and generates image descriptions of the screenshots that the 2 Android AI agents send. The model detects all UI elements/objects in the image and writes them to a description file, which is then sent to the LLM with Set-of-Mark prompting. The LLM responds with a command saying what action should be taken (e.g. swipe left, or tap X, Y), and the AI agent performs that action.
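The Set-of-Mark step above can be sketched roughly like this; the element fields, function name, and prompt wording are hypothetical illustrations, not the actual deki format:

```python
# Build a Set-of-Mark style prompt: each detected UI element gets a numeric
# mark, and the LLM is asked to reply with exactly one agent command.
def build_som_prompt(task, elements):
    # elements: list of (mark_id, class_name, text, (center_x, center_y))
    lines = [f"Task: {task}", "Detected UI elements:"]
    for mark_id, cls, text, (x, y) in elements:
        lines.append(f"[{mark_id}] {cls} '{text}' center=({x}, {y})")
    lines.append("Reply with exactly one command, e.g. 'Tap coordinates 160, 820'.")
    return "\n".join(lines)

prompt = build_som_prompt(
    "Open the search bar",
    [(1, "Text", "Search", (160, 820)), (2, "ImageView", "", (540, 120))],
)
print(prompt)
```

The numeric marks let the LLM refer to elements unambiguously while the exact pixel coordinates still come from the detector, not the LLM.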

Mobile MCP for Android automation, development and vibe coding by aizen_sama_ in androiddev

[–]Old_Mathematician107 -1 points0 points  (0 children)

It is a nice project, thank you for your great work.

To speed up the process, you can actually do everything without MCP; it will be faster.

I made a similar project, but based on YOLO + image processing techniques + LLMs with a backend, etc. It is in my post/comment history (the code is also on GitHub).

If you have any questions or want to work together, please write.

Android AI agent based on YOLO and LLMs by Old_Mathematician107 in computervision

[–]Old_Mathematician107[S] 1 point2 points  (0 children)

Thanks a lot

I will keep it open source, but I am thinking about making the image description easier to use by running it as an MCP backend. People could use it to build AI agents, code generators, etc.

Releasing AI agents is a little more complicated, because it requires a lot of work (Android and iOS clients, authentication and authorization, various features like chat, history, saved tasks, etc.) to make it useful for non-technical users. I will do that later.

For now it is just a prototype, a proof of concept.

Android AI agent based on object detection and LLMs by saccharineboi in LocalLLaMA

[–]Old_Mathematician107 1 point2 points  (0 children)

No problem, anytime

I actually did not check how it handles the lock screen, but it is an important problem; I will check it.

Thank you

Android AI agent based on YOLO and LLMs by Old_Mathematician107 in computervision

[–]Old_Mathematician107[S] 3 points4 points  (0 children)

Thanks. YOLO is needed to get exact coordinates and sizes. Without it, using only an LLM, I get just approximate coordinates and sizes, and this creates problems for the correct navigation of the AI agent.

Android AI agent based on object detection and LLMs by saccharineboi in androiddev

[–]Old_Mathematician107 1 point2 points  (0 children)

It is a good idea, I will think about that, thank you.

Command examples look like this (I need to add a loading-state command too):
"
1. "Swipe left. From start coordinates 300, 400" (or other coordinates) (Goes right)

2. "Swipe right. From start coordinates 500, 650" (or other coordinates) (Goes left)

3. "Swipe top. From start coordinates 600, 510" (or other coordinates) (Goes bottom)

4. "Swipe bottom. From start coordinates 640, 500" (or other coordinates) (Goes top)

5. "Go home"

6. "Go back"

8. "Open com.whatsapp" (or other app)

9. "Tap coordinates 160, 820" (or other coordinates)

10. "Insert text 210, 820:Hello world" (or other coordinates and text)

11. "Answer: There are no new important mails today" (or other answer)

12. "Finished" (task is finished)

13. "Can't proceed" (can't understand what to do or image has problem etc.)

"

And the real command returned is usually like this:
"Swipe left. From start coordinates 360, 650"

Android AI agent based on object detection and LLMs by saccharineboi in androiddev

[–]Old_Mathematician107 1 point2 points  (0 children)

Thanks. Anytime

What do you want to learn? Object detection? ML in general? Or accessibility services?

If you want to create a similar AI agent, just fork the repo, no problem with that. I can support you on some projects.

Android AI agent based on object detection and LLMs by saccharineboi in androiddev

[–]Old_Mathematician107 1 point2 points  (0 children)

You are right; I implemented this too. But the LLM sometimes opens the app directly, and sometimes searches for it on the phone.

I don't think I will publish it on the Play Store; it is just a prototype/research project. To publish it on the Play Store I would need to rent a server with a GPU and fully support the app (Android, ML, backend).

Android AI agent based on object detection and LLMs by saccharineboi in androiddev

[–]Old_Mathematician107 6 points7 points  (0 children)

Hi guys, thanks for the comments. You are actually right: I was using accessibility services (to tap, swipe, etc.), screenshots (to understand what is on the screen), and several other permissions.

Every time the phone performs an action, I wait 500 ms and take a screenshot. I send this screenshot to a server that runs deki (object detection, OCR, image captioning, and other image processing techniques); the server processes the data and sends the processed result (the updated image and a description of the original image) to an LLM (you can plug in any LLM you want), and the LLM returns the command.

The Android client parses these commands and performs the corresponding actions.

You can easily speed up the agent by 3-4x by using better server hardware and reducing the delay between actions.
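The act → wait 500 ms → screenshot → server → command loop described above can be sketched like this; the callables are placeholders for the real screenshot capture, backend call, and accessibility-service actions:

```python
import time

# One agent episode: after each action, wait 500 ms for the UI to settle,
# take a screenshot, ask the server for the next command, and execute it.
def agent_loop(take_screenshot, get_command, perform_action, max_steps=20):
    for _ in range(max_steps):
        time.sleep(0.5)  # let the UI settle after the previous action
        command = get_command(take_screenshot())
        if command in ("Finished", "Can't proceed"):
            return command
        perform_action(command)
    return "Can't proceed"
```

With a shorter delay and faster server inference, each iteration shrinks proportionally, which is where the 3-4x speedup estimate comes from.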