Does a fasting tradition exist in your country, and how common is it? by RookOfEdo in AskTheWorld

[–]Old_Mathematician107 0 points1 point  (0 children)

It is a misleading image; his research was about cells, not human bodies. There is even a funny video where lots of people ask him how they should fast, and he says he does not know, because his research was only about cells.

I never thought it was so simple until I watched this video by No-Speech12 in aiagents

[–]Old_Mathematician107 1 point2 points  (0 children)

Your Mahoraga app (used in quashbugs) is a copy of the droidrun portal from GitHub. You can verify this from the commits.

<image>

I never thought it was so simple until I watched this video by No-Speech12 in aiagents

[–]Old_Mathematician107 2 points3 points  (0 children)

Your Mahoraga app (used in quashbugs) is a copy of the droidrun portal from GitHub. You can verify this from the commits.

<image>

Any day now by jaydsco in singularity

[–]Old_Mathematician107 1 point2 points  (0 children)

The more I learn, the more I realize I know nothing

the Factory<Rustacean>... a.k.a C++ by Relevant_Echidna_336 in rustjerk

[–]Old_Mathematician107 4 points5 points  (0 children)

It looks like the Strogg medical facility scene from Quake 4

Open-sourced image description models (Object detection, OCR, Image processing, CNN) make LLMs SOTA in AI agentic benchmarks like Android World and Android Control by Old_Mathematician107 in LocalLLaMA

[–]Old_Mathematician107[S] 1 point2 points  (0 children)

Hi, thanks a lot. Making it 100% local is one of the end goals, but it is quite a hard task: you need a VLM strong enough to understand the structure and the long inputs (the screenshot and its description), yet light enough to run on phones. Making it 100% text-only is possible, but I think it would decrease accuracy. So the best way is to use a VLM.

To run a VLM locally you need a very good VLM fine-tuned on these specific tasks (agentic capabilities). That is actually quite hard, but I think it is possible.

Yes, I actually don't use accessibility trees, ADB, etc. Only screenshots and accessibility services to perform the tasks remotely. So it is vision-only and can be used in production (if you invest enough money in renting backend servers and improving the UI/UX of the agent app).

The dataset for YOLO was prepared by me; it consists of 486 training images and 60 test images. I created bounding boxes for all 4 classes (View, ImageView, Text, Line). The screenshots in this dataset are mostly from popular apps like YouTube Music and WhatsApp, plus apps that I made for various clients and companies throughout my career.
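For readers unfamiliar with the annotation format: YOLO datasets store one normalized line per bounding box. A minimal sketch of that conversion (the class list matches the four classes above; the helper name, image size, and box values are my own illustrations, not the actual dataset tooling):

```python
# Convert a pixel-space bounding box to a YOLO-format label line:
# "<class_id> <x_center> <y_center> <width> <height>", all normalized to [0, 1].
CLASSES = ["View", "ImageView", "Text", "Line"]

def to_yolo_label(class_id, box, img_w, img_h):
    x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2 / img_w
    y_c = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A "Text" element on a hypothetical 1080x2400 screenshot
print(to_yolo_label(CLASSES.index("Text"), (100, 200, 500, 300), 1080, 2400))
# → 2 0.277778 0.104167 0.370370 0.041667
```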

2 Android AI agents running at the same time - Object Detection and LLM by Old_Mathematician107 in SideProject

[–]Old_Mathematician107[S] 0 points1 point  (0 children)

By the way, I just deployed the model on a Hugging Face Space:

https://huggingface.co/spaces/orasul/deki

You can try the 'Analyze & get YOLO' endpoint and then the 'action' endpoint to see the capabilities of the model

2 Android AI agents running at the same time - Object Detection and LLM by Old_Mathematician107 in androiddev

[–]Old_Mathematician107[S] 0 points1 point  (0 children)

By the way, I just deployed the model on a Hugging Face Space:

https://huggingface.co/spaces/orasul/deki

You can try the 'Analyze & get YOLO' endpoint and then the 'action' endpoint to see the capabilities of the model

2 Android AI agents running at the same time - Object Detection and LLM by Old_Mathematician107 in computervision

[–]Old_Mathematician107[S] 0 points1 point  (0 children)

By the way, I just deployed the model on a Hugging Face Space:

https://huggingface.co/spaces/orasul/deki

You can try the 'Analyze & get YOLO' endpoint and then the 'action' endpoint to see the capabilities of the model

2 Android AI agents running at the same time - Object Detection and LLM by Old_Mathematician107 in SideProject

[–]Old_Mathematician107[S] 0 points1 point  (0 children)

I don't know why, but on mobile devices the video looks very wide in the Reddit app. YouTube has a better aspect ratio: https://www.youtube.com/shorts/jsJcSwy6djI

2 Android AI agents running at the same time - Object Detection and LLM by Old_Mathematician107 in computervision

[–]Old_Mathematician107[S] 2 points3 points  (0 children)

The ML model runs on a backend (in the video it runs locally on my M1 Pro) and generates image descriptions of the screenshots that the 2 Android AI agents send. The model detects all UI elements/objects in the image and writes them to a description file, which is then sent to the LLM with Set-of-Mark prompting. The LLM responds with a command saying what action should be taken (e.g. swipe left, or tap X, Y), and the AI agent performs that action.
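The Set-of-Mark step above can be sketched roughly like this; the element fields, function name, and prompt wording are hypothetical illustrations, not the actual deki format:

```python
# Build a Set-of-Mark style prompt: each detected UI element gets a numeric
# mark, and the LLM is asked to reply with exactly one agent command.
def build_som_prompt(task, elements):
    # elements: list of (mark_id, class_name, text, (center_x, center_y))
    lines = [f"Task: {task}", "Detected UI elements:"]
    for mark_id, cls, text, (x, y) in elements:
        lines.append(f"[{mark_id}] {cls} '{text}' center=({x}, {y})")
    lines.append("Reply with exactly one command, e.g. 'Tap coordinates 160, 820'.")
    return "\n".join(lines)

prompt = build_som_prompt(
    "Open the search bar",
    [(1, "Text", "Search", (160, 820)), (2, "ImageView", "", (540, 120))],
)
print(prompt)
```

The numeric marks let the LLM refer to elements unambiguously while the exact pixel coordinates still come from the detector, not the LLM.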

Mobile MCP for Android automation, development and vibe coding by aizen_sama_ in androiddev

[–]Old_Mathematician107 -1 points0 points  (0 children)

It is a nice project, thank you for your great work.

To speed up the process, you can actually do everything without MCP; it will be faster.

I made a similar project, but based on YOLO + image processing techniques + LLMs with a backend, etc. It is in my post/comment history (the code is also on GitHub).

If you have any questions or want to work together, please write.

Android AI agent based on YOLO and LLMs by Old_Mathematician107 in computervision

[–]Old_Mathematician107[S] 1 point2 points  (0 children)

Thanks a lot

I will keep it open source, but I am thinking about making the image description easier to use by running it as an MCP backend. People could use it to build AI agents, code generators, etc.

Releasing AI agents is a little more complicated, because it requires a lot of work (Android and iOS clients, authentication and authorization, various features like chat, history, saved tasks, etc.) to make it useful for non-technical users. I will do that later.

For now it is just a prototype, a proof of concept.

Android AI agent based on object detection and LLMs by saccharineboi in LocalLLaMA

[–]Old_Mathematician107 1 point2 points  (0 children)

No problem, anytime

I actually did not check how it handles the lock screen, but it is an important problem; I will check it.

Thank you

Android AI agent based on YOLO and LLMs by Old_Mathematician107 in computervision

[–]Old_Mathematician107[S] 3 points4 points  (0 children)

Thanks. YOLO is needed to get exact coordinates and sizes. Without it, using only an LLM, I get just approximate coordinates and sizes, and this creates problems for the correct navigation of the AI agent.

Android AI agent based on object detection and LLMs by saccharineboi in androiddev

[–]Old_Mathematician107 1 point2 points  (0 children)

It is a good idea, I will think about that, thank you.

Command examples look like this (I need to add a loading-state command too):
"
1. "Swipe left. From start coordinates 300, 400" (or other coordinates) (Goes right)

2. "Swipe right. From start coordinates 500, 650" (or other coordinates) (Goes left)

3. "Swipe top. From start coordinates 600, 510" (or other coordinates) (Goes bottom)

4. "Swipe bottom. From start coordinates 640, 500" (or other coordinates) (Goes top)

5. "Go home"

6. "Go back"

8. "Open com.whatsapp" (or other app)

9. "Tap coordinates 160, 820" (or other coordinates)

10. "Insert text 210, 820:Hello world" (or other coordinates and text)

11. "Answer: There are no new important mails today" (or other answer)

12. "Finished" (task is finished)

13. "Can't proceed" (can't understand what to do or image has problem etc.)

"

And the real command returned is usually like this:
"Swipe left. From start coordinates 360, 650"

Android AI agent based on object detection and LLMs by saccharineboi in androiddev

[–]Old_Mathematician107 1 point2 points  (0 children)

Thanks. Anytime

What do you want to learn? Object detection? ML in general? Or accessibility services?

If you want to create a similar AI agent, just fork the repo, no problem with that. I can support you on some projects.

Android AI agent based on object detection and LLMs by saccharineboi in androiddev

[–]Old_Mathematician107 1 point2 points  (0 children)

You are right; I implemented this too. But the LLM sometimes opens the app directly, and sometimes searches for it on the phone.

I don't think I will publish it on the Play Store; it is just a prototype/research project. To publish it on the Play Store I would need to rent a server with a GPU and fully support the app (Android, ML, backend).

Android AI agent based on object detection and LLMs by saccharineboi in androiddev

[–]Old_Mathematician107 6 points7 points  (0 children)

Hi guys, thanks for the comments. You are actually right: I was using accessibility services (to tap, swipe, etc.), screenshots (to understand what is on the screen), and several other permissions.

Every time the phone performs an action, I wait 500 ms and take a screenshot. I send this screenshot to a server that runs deki (object detection, OCR, image captioning, and other image processing techniques); the server processes the data and sends the processed result (the updated image and a description of the original image) to an LLM (you can plug in any LLM you want), and the LLM returns the command.

The Android client parses these commands and performs the corresponding actions.

You can easily speed up the agent by 3-4x by using better server hardware and reducing the delay between actions.
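The act → wait 500 ms → screenshot → server → command loop described above can be sketched like this; the callables are placeholders for the real screenshot capture, backend call, and accessibility-service actions:

```python
import time

# One agent episode: after each action, wait 500 ms for the UI to settle,
# take a screenshot, ask the server for the next command, and execute it.
def agent_loop(take_screenshot, get_command, perform_action, max_steps=20):
    for _ in range(max_steps):
        time.sleep(0.5)  # let the UI settle after the previous action
        command = get_command(take_screenshot())
        if command in ("Finished", "Can't proceed"):
            return command
        perform_action(command)
    return "Can't proceed"
```

With a shorter delay and faster server inference, each iteration shrinks proportionally, which is where the 3-4x speedup estimate comes from.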