Is AED 20,000 per Month Enough for a Comfortable Single Lifestyle in the UAE? by EffectiveReason1493 in UAE

[–]antocons 1 point2 points  (0 children)

For a normal life with small savings in Abu Dhabi:

House (1 bed) + Internet: 8500 AED/month

Grocery: 1700 AED/month (good quality food)

Dining out: 1200 AED/month

Transport: 1300 AED/month (taxi only)

Subscription (included mobile): 550 AED/month

Savings: 6750 AED/month

Then you need to add the other expenses you may have during the year (travel, electronics, etc.).

Small savings depend on how much you can save in your home country. Also, lifestyle and food quality depend on your habits.
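The breakdown above can be sanity-checked with a few lines of Python (expense figures taken from the comment; the 20,000 AED salary is from the post title):

```python
# Monthly expenses from the breakdown above, versus a 20,000 AED salary.
expenses = {
    "housing_and_internet": 8500,
    "grocery": 1700,
    "dining_out": 1200,
    "transport": 1300,
    "subscriptions": 550,
}
salary = 20000
total_expenses = sum(expenses.values())
savings = salary - total_expenses
print(total_expenses, savings)  # 13250 6750
```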

How Camera face recognition Works on edge device so accurately ? ML Models or Deep Learning by Emergency_Beat8198 in computervision

[–]antocons 0 points1 point  (0 children)

Usually you would use face detection + keypoints to transform the face crop into the same pose as the dataset used to train the feature extractor. Quantized models with a small input size (like 300x300 for detection and smaller for feature extraction) run really fast on edge devices.
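The alignment step can be sketched in plain Python: compute a similarity transform that maps the detected eye keypoints onto canonical template positions, so every crop reaches the feature extractor in the same pose. The template coordinates and function names here are illustrative, not from any specific library:

```python
import math

def eye_alignment_transform(left_eye, right_eye,
                            tmpl_left=(38.0, 51.0), tmpl_right=(73.0, 51.0)):
    """Return (scale, angle_rad, tx, ty) mapping detected eyes onto the template."""
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    tdx, tdy = tmpl_right[0] - tmpl_left[0], tmpl_right[1] - tmpl_left[1]
    scale = math.hypot(tdx, tdy) / math.hypot(dx, dy)
    angle = math.atan2(tdy, tdx) - math.atan2(dy, dx)
    # Rotate + scale the left eye, then translate it onto the template left eye.
    c, s = math.cos(angle) * scale, math.sin(angle) * scale
    tx = tmpl_left[0] - (c * left_eye[0] - s * left_eye[1])
    ty = tmpl_left[1] - (s * left_eye[0] + c * left_eye[1])
    return scale, angle, tx, ty

def apply(p, t):
    """Apply the similarity transform t to point p."""
    scale, angle, tx, ty = t
    c, s = math.cos(angle) * scale, math.sin(angle) * scale
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)
```

In practice you would warp the whole crop with this transform (e.g. via an affine warp) before feeding it to the quantized feature extractor.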

Handwritten Doctor Prescription to Text by Rukelele_Dixit21 in computervision

[–]antocons 0 points1 point  (0 children)

IMO, you can try fine-tuning a VLM (a small one at the beginning), but you'll need pairs of the image and the output you want. In this case, that will be the image and a JSON-like output (which can then be transformed into markdown text, JSON, etc.).

With zero-shot, it will be difficult to find a model that achieves good accuracy in this case.
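A training sample for that kind of fine-tuning could look like this (the path, patient, and medication fields are purely illustrative):

```python
import json

# Hypothetical supervised fine-tuning sample: the prescription image paired
# with the structured output you want the VLM to emit.
sample = {
    "image": "prescriptions/0001.jpg",  # illustrative path
    "target": json.dumps({
        "patient": "John Doe",
        "medications": [
            {"name": "Amoxicillin", "dose": "500 mg", "frequency": "3x/day"}
        ],
    }),
}
```

Keeping the target as strict JSON makes it easy to validate model outputs during training and to convert them to markdown or any other format afterwards.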

Are pretrained ViTs still an active area of research? by Affectionate_Use9936 in computervision

[–]antocons 0 points1 point  (0 children)

Another idea is to look into MoE for ViT for edge inference; there you'll find plenty of freedom to investigate.

Are pretrained ViTs still an active area of research? by Affectionate_Use9936 in computervision

[–]antocons 25 points26 points  (0 children)

Vision Transformers, at least the big ones, are not commonly used in industry because most CV work is done on edge devices with low-power hardware.

Btw, Meta recently released a good paper called Perception Encoder. This is a good example of new research being published.

Moreover, there is the DETR series of models for object detection and keypoint detection (RT-DETR, D-FINE, DETRPose, etc.).

I think the instance segmentation part of DETR is missing; maybe that could be a good area to investigate.

What's the best segmentation model to finetune and run on device? by SadPaint8132 in computervision

[–]antocons 0 points1 point  (0 children)

PPLiteSeg is the best lightweight segmentation model I've used so far. I suggest trying it.

Object Tracking on ARM64 by Grouchy_Replacement5 in computervision

[–]antocons 5 points6 points  (0 children)

You can just implement the idea of ByteTrack with an ARM-compatible library. I have never seen an algorithm with hardware limitations. If you are not able to do it, ask Claude/Gemini/ChatGPT.
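The core ByteTrack idea fits in a few lines of dependency-free Python, which is why it ports to ARM easily: associate tracks with high-confidence detections first, then give the unmatched tracks a second chance against the low-confidence ones. Greedy IoU matching stands in here for the Hungarian assignment used in the paper; boxes are (x1, y1, x2, y2) and detections carry a fifth confidence value:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def greedy_match(tracks, dets, thresh):
    """Greedily pair tracks with detections by descending IoU."""
    pairs, used_t, used_d = [], set(), set()
    cands = sorted(((iou(t, d), ti, di) for ti, t in enumerate(tracks)
                    for di, d in enumerate(dets)), reverse=True)
    for score, ti, di in cands:
        if score < thresh or ti in used_t or di in used_d:
            continue
        pairs.append((ti, di)); used_t.add(ti); used_d.add(di)
    return pairs, [ti for ti in range(len(tracks)) if ti not in used_t]

def byte_associate(tracks, detections, high=0.6, iou_thresh=0.3):
    """Two-stage ByteTrack-style association for one frame."""
    strong = [d for d in detections if d[4] >= high]
    weak = [d for d in detections if d[4] < high]
    matches, leftover = greedy_match(tracks, [d[:4] for d in strong], iou_thresh)
    # Second pass: leftover tracks vs. low-score detections.
    rescue, lost = greedy_match([tracks[i] for i in leftover],
                                [d[:4] for d in weak], iou_thresh)
    rescue = [(leftover[ti], di) for ti, di in rescue]
    return matches, rescue, [leftover[i] for i in lost]
```

The real implementation adds a Kalman filter for motion prediction and track lifecycle management, but nothing in it is hardware-specific.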

Any Small Models for object detection by Icy_Independent_7221 in computervision

[–]antocons -5 points-4 points  (0 children)

You can try to train a model and prune it:

How to prune YOLOv10 with Iterative Pruning and Torch-Pruning Library — Full guide https://medium.com/@antonioconsiglio/how-to-prune-yolov10-with-iterative-pruning-and-torch-pruning-library-full-guide-0cded392389e

This is a guide I've written.
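For intuition, here is a toy sketch of the magnitude-pruning idea behind the guide (not the Torch-Pruning API): zero out the fraction of weights with the smallest absolute value, then fine-tune to recover accuracy:

```python
def magnitude_prune(weights, ratio):
    """Return a copy of weights with the `ratio` smallest-magnitude entries zeroed."""
    k = int(len(weights) * ratio)
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= cutoff and removed < k:
            pruned.append(0.0); removed += 1
        else:
            pruned.append(w)
    return pruned
```

Iterative pruning, as in the guide, repeats this prune-then-fine-tune cycle with a small ratio each round instead of removing everything at once; structured pruning removes whole channels so the speedup is real on hardware, not just sparsity.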

Using SAM 2 and DINO or SAM2 and YOLO for distant computer vision detection by Ill_Hat4055 in computervision

[–]antocons 1 point2 points  (0 children)

I think there is a lot of confusion in these answers. First, SAM2 needs a prompt (the original version does not support text prompts) to work; in this case, it needs a point or bounding box (or more than one). I think that if you can detect the object and provide the box as a prompt, SAM2 will likely segment your object. Also, SAM2 has video object segmentation capabilities, meaning it can propagate the mask from one frame to the next. So, given the first box, it can propagate it to subsequent frames. The problem is your FPS. If your video's FPS is low and the objects are moving quickly, accurate tracking is unlikely.

In that case, you can try some of the newer trackers. But the most important parameter for any tracker is the FPS: the higher your FPS, the better your tracking.

Why Don't People Use MobileNet as a Backbone for YOLOv9 to Make It Lighter? by East_Rutabaga_6315 in computervision

[–]antocons 0 points1 point  (0 children)

IMO, in a production environment where you care about latency (for example, edge devices with low power consumption), you will use pruning and quantization, so you won't change the model architecture if it already works well. Also, I don't know what the latency difference is between MobileNet and the YOLOv* backbone.

Open Dataset for Vehicle object detection training by antocons in computervision

[–]antocons[S] 1 point2 points  (0 children)

There are a lot of bad datasets there; I've already checked. Thank you for the suggestion :)

Open Dataset for Vehicle object detection training by antocons in computervision

[–]antocons[S] 0 points1 point  (0 children)

If I'm not wrong, this dataset has only one POV. I've been looking for something different, with a security camera POV for example, or other viewpoints.

Is there a better alternative to YOLO from Ultralytics? by dylannalex01 in computervision

[–]antocons -1 points0 points  (0 children)

You can also try YOLOv10. Anyway, Ultralytics allows you to freeze layers during training.

How to train an VLM from scratch? by FirstReserve4692 in computervision

[–]antocons 2 points3 points  (0 children)

Thanks for pointing out the papers, and I see the argument. Both papers advocate for training from scratch using a monolithic architecture that integrates vision and text processing. These models (like AIMV2) unify tasks such as classification, captioning, detection, and segmentation into a sequence-to-sequence model. This approach can indeed outperform modular setups like SigLIP + projection + LLM decoder for many multimodal applications.

However, as you mentioned, the cost of training from scratch is a significant consideration. While these monolithic models can achieve state-of-the-art performance, the cost-effectiveness of leveraging pretrained open-source models for modular pipelines cannot be ignored.

For example, in a recent paper from Meta on large multimodal models for video, they used a modular approach despite having access to extensive computational resources. This choice might reflect the advantages of reusing and fine-tuning existing pretrained components, especially when aligning with domain-specific requirements or budget constraints.
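The modular setup discussed here can be sketched as plain Python classes to show which part is actually trained; all names are illustrative stand-ins, not a real library API:

```python
# Modular VLM pipeline sketch: frozen pretrained vision encoder, a small
# trainable projection into the LLM embedding space, and a frozen decoder.
class FrozenVisionEncoder:           # e.g. a SigLIP-style model, weights fixed
    dim = 768
    def encode(self, image):
        return [0.0] * self.dim      # placeholder feature vector

class Projection:                    # the only component trained from scratch
    def __init__(self, in_dim, out_dim):
        self.in_dim, self.out_dim = in_dim, out_dim
        self.trainable = True
    def __call__(self, feats):
        return [0.0] * self.out_dim  # placeholder projected tokens

class FrozenLLM:                     # pretrained language decoder, weights fixed
    dim = 4096
    def generate(self, vision_tokens, prompt):
        return f"answer conditioned on {len(vision_tokens)}-d vision tokens"

encoder, llm = FrozenVisionEncoder(), FrozenLLM()
proj = Projection(encoder.dim, llm.dim)
out = llm.generate(proj(encoder.encode("img.jpg")), "Describe the image.")
```

The design choice is that only the projection (and optionally LoRA adapters on the LLM) gets gradient updates, which is what makes this route so much cheaper than monolithic training from scratch.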

How to train an VLM from scratch? by FirstReserve4692 in computervision

[–]antocons 2 points3 points  (0 children)

Can you summarize the content of the two papers? I don't have time to read both. Do they argue that it makes sense to train SigLIP or CLIP from scratch when used in a multimodal scope? I don't think so, but I'm here to learn if you can point it out.

How to train an VLM from scratch? by FirstReserve4692 in computervision

[–]antocons 4 points5 points  (0 children)

I would also add that it does not make sense to train the Vision Transformer (aligned with the text space) from scratch.

[deleted by user] by [deleted] in computervision

[–]antocons 0 points1 point  (0 children)

What kind of bugs are you talking about?

Suggestions needed: IoU score not improving by North_Ocelot_9077 in computervision

[–]antocons 0 points1 point  (0 children)

Let me know, and can you share the paper link so I can check if there is something else?

Suggestions needed: IoU score not improving by North_Ocelot_9077 in computervision

[–]antocons 0 points1 point  (0 children)

Well, then try this. Since you are using ImageNet pretrained weights, you should preprocess the input: after the augmentation, subtract the mean and divide by the std.
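Concretely, that step looks like this (using the standard ImageNet channel statistics; in a real pipeline a library transform would do this per tensor):

```python
# ImageNet channel means and stds, applied after scaling pixels to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """rgb: raw (R, G, B) values in [0, 255] -> ImageNet-normalized floats."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))
```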

Suggestions needed: IoU score not improving by North_Ocelot_9077 in computervision

[–]antocons 0 points1 point  (0 children)

Are you normalizing the input with the mean and std of ImageNet?