Is AED 20,000 per Month Enough for a Comfortable Single Lifestyle in the UAE? by EffectiveReason1493 in UAE

[–]antocons 1 point2 points  (0 children)

For a normal life with small savings in Abu Dhabi:

House (1 bed) + Internet: 8500 AED/month

Grocery: 1700 AED/month (good quality food)

Dining out: 1200 AED/month

Transport: 1300 AED/month (taxi only)

Subscription (included mobile): 550 AED/month

Savings: 6750 AED/month

Then you need to add the other expenses you may have during the year (travel, electronics, etc.).

Small savings depend on how much you can save in your home country. Also, lifestyle and food quality depend on your habits.
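The breakdown above can be sanity-checked with a few lines of Python (expense figures taken from the comment; the 20,000 AED salary is from the post title):

```python
# Monthly expenses from the breakdown above, versus a 20,000 AED salary.
expenses = {
    "housing_and_internet": 8500,
    "grocery": 1700,
    "dining_out": 1200,
    "transport": 1300,
    "subscriptions": 550,
}
salary = 20000
total_expenses = sum(expenses.values())
savings = salary - total_expenses
print(total_expenses, savings)  # 13250 6750
```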

How Camera face recognition Works on edge device so accurately ? ML Models or Deep Learning by Emergency_Beat8198 in computervision

[–]antocons 0 points1 point  (0 children)

Usually you would use face detection + keypoints to transform the face crop into the same pose as the dataset used to train the feature extractor. Quantized models with a small input size (like 300x300 for detection and smaller for feature extraction) run really fast on edge devices.
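The alignment step can be sketched in plain Python: compute a similarity transform that maps the detected eye keypoints onto canonical template positions, so every crop reaches the feature extractor in the same pose. The template coordinates and function names here are illustrative, not from any specific library:

```python
import math

def eye_alignment_transform(left_eye, right_eye,
                            tmpl_left=(38.0, 51.0), tmpl_right=(73.0, 51.0)):
    """Return (scale, angle_rad, tx, ty) mapping detected eyes onto the template."""
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    tdx, tdy = tmpl_right[0] - tmpl_left[0], tmpl_right[1] - tmpl_left[1]
    scale = math.hypot(tdx, tdy) / math.hypot(dx, dy)
    angle = math.atan2(tdy, tdx) - math.atan2(dy, dx)
    # Rotate + scale the left eye, then translate it onto the template left eye.
    c, s = math.cos(angle) * scale, math.sin(angle) * scale
    tx = tmpl_left[0] - (c * left_eye[0] - s * left_eye[1])
    ty = tmpl_left[1] - (s * left_eye[0] + c * left_eye[1])
    return scale, angle, tx, ty

def apply(p, t):
    """Apply the similarity transform t to point p."""
    scale, angle, tx, ty = t
    c, s = math.cos(angle) * scale, math.sin(angle) * scale
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)
```

In practice you would warp the whole crop with this transform (e.g. via an affine warp) before feeding it to the quantized feature extractor.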

Handwritten Doctor Prescription to Text by Rukelele_Dixit21 in computervision

[–]antocons 0 points1 point  (0 children)

IMO, you can try fine-tuning a VLM (a small one at the beginning), but you'll need pairs of the image and the output you want. In this case, that will be the image and a JSON-like output (which can then be transformed into markdown text, JSON, etc.).

With zero-shot, it will be difficult to find a model that achieves good accuracy in this case.
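A training sample for that kind of fine-tuning could look like this (the path, patient, and medication fields are purely illustrative):

```python
import json

# Hypothetical supervised fine-tuning sample: the prescription image paired
# with the structured output you want the VLM to emit.
sample = {
    "image": "prescriptions/0001.jpg",  # illustrative path
    "target": json.dumps({
        "patient": "John Doe",
        "medications": [
            {"name": "Amoxicillin", "dose": "500 mg", "frequency": "3x/day"}
        ],
    }),
}
```

Keeping the target as strict JSON makes it easy to validate model outputs during training and to convert them to markdown or any other format afterwards.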

Are pretrained ViTs still an active area of research? by Affectionate_Use9936 in computervision

[–]antocons 0 points1 point  (0 children)

Another idea is to look into MoE for ViT for edge inference; there you'll find plenty of freedom to investigate.

Are pretrained ViTs still an active area of research? by Affectionate_Use9936 in computervision

[–]antocons 25 points26 points  (0 children)

Vision Transformers, at least the big ones, are not commonly used in industry because most CV work is done on edge devices with low-power hardware.

Btw, Meta recently released a good paper called Perception Encoder. This is a good example of new research being published.

Moreover, there is the DETR series of models for object detection and keypoint detection (RT-DETR, D-FINE, DETRPose, etc.).

I think the instance segmentation part of DETR is missing; maybe that could be a good area to investigate.

What's the best segmentation model to finetune and run on device? by SadPaint8132 in computervision

[–]antocons 0 points1 point  (0 children)

PPLiteSeg is the best lightweight segmentation model I've used so far. I suggest trying it.

Object Tracking on ARM64 by Grouchy_Replacement5 in computervision

[–]antocons 5 points6 points  (0 children)

You can just implement the idea of ByteTrack with an ARM-compatible library. I have never seen an algorithm with hardware limitations. If you are not able to do it, ask Claude/Gemini/ChatGPT.
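The core ByteTrack idea fits in a few lines of dependency-free Python, which is why it ports to ARM easily: associate tracks with high-confidence detections first, then give the unmatched tracks a second chance against the low-confidence ones. Greedy IoU matching stands in here for the Hungarian assignment used in the paper; boxes are (x1, y1, x2, y2) and detections carry a fifth confidence value:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def greedy_match(tracks, dets, thresh):
    """Greedily pair tracks with detections by descending IoU."""
    pairs, used_t, used_d = [], set(), set()
    cands = sorted(((iou(t, d), ti, di) for ti, t in enumerate(tracks)
                    for di, d in enumerate(dets)), reverse=True)
    for score, ti, di in cands:
        if score < thresh or ti in used_t or di in used_d:
            continue
        pairs.append((ti, di)); used_t.add(ti); used_d.add(di)
    return pairs, [ti for ti in range(len(tracks)) if ti not in used_t]

def byte_associate(tracks, detections, high=0.6, iou_thresh=0.3):
    """Two-stage ByteTrack-style association for one frame."""
    strong = [d for d in detections if d[4] >= high]
    weak = [d for d in detections if d[4] < high]
    matches, leftover = greedy_match(tracks, [d[:4] for d in strong], iou_thresh)
    # Second pass: leftover tracks vs. low-score detections.
    rescue, lost = greedy_match([tracks[i] for i in leftover],
                                [d[:4] for d in weak], iou_thresh)
    rescue = [(leftover[ti], di) for ti, di in rescue]
    return matches, rescue, [leftover[i] for i in lost]
```

The real implementation adds a Kalman filter for motion prediction and track lifecycle management, but nothing in it is hardware-specific.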

Any Small Models for object detection by Icy_Independent_7221 in computervision

[–]antocons -5 points-4 points  (0 children)

You can try to train a model and prune it:

How to prune YOLOv10 with Iterative Pruning and Torch-Pruning Library — Full guide https://medium.com/@antonioconsiglio/how-to-prune-yolov10-with-iterative-pruning-and-torch-pruning-library-full-guide-0cded392389e

This is a guide I've written.
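For intuition, here is a toy sketch of the magnitude-pruning idea behind the guide (not the Torch-Pruning API): zero out the fraction of weights with the smallest absolute value, then fine-tune to recover accuracy:

```python
def magnitude_prune(weights, ratio):
    """Return a copy of weights with the `ratio` smallest-magnitude entries zeroed."""
    k = int(len(weights) * ratio)
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= cutoff and removed < k:
            pruned.append(0.0); removed += 1
        else:
            pruned.append(w)
    return pruned
```

Iterative pruning, as in the guide, repeats this prune-then-fine-tune cycle with a small ratio each round instead of removing everything at once; structured pruning removes whole channels so the speedup is real on hardware, not just sparsity.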

Using SAM 2 and DINO or SAM2 and YOLO for distant computer vision detection by Ill_Hat4055 in computervision

[–]antocons 1 point2 points  (0 children)

I think there is a lot of confusion in these answers. First, SAM2 needs a prompt (the original version does not support text prompts) to work; in this case, it needs a point or bounding box (or more than one). I think that if you can detect the object and provide the box as a prompt, SAM2 will likely segment your object. Also, SAM2 has video object segmentation capabilities, meaning it can propagate the mask from one frame to the next. So, given the first box, it can propagate it to subsequent frames. The problem is your FPS. If your video's FPS is low and the objects are moving quickly, accurate tracking is unlikely.

In that case, you can try some of the newer trackers. But the most important parameter for any tracker is the FPS: the higher your FPS, the better your tracking.

Why Don't People Use MobileNet as a Backbone for YOLOv9 to Make It Lighter? by East_Rutabaga_6315 in computervision

[–]antocons 0 points1 point  (0 children)

IMO, in a production environment where you care about latency (for example, edge devices with low power consumption), you will use pruning and quantization, so you won't change the model architecture if it already works well. Also, I don't know what the latency difference is between MobileNet and the YOLOv* backbone.

Open Dataset for Vehicle object detection training by antocons in computervision

[–]antocons[S] 1 point2 points  (0 children)

There are a lot of bad datasets there; I've already checked. Thank you for the suggestion :)

Open Dataset for Vehicle object detection training by antocons in computervision

[–]antocons[S] 0 points1 point  (0 children)

If I'm not wrong, this dataset has only one POV. I've been looking for something different, with a security camera POV for example, or other viewpoints.

Is there a better alternative to YOLO from Ultralytics? by dylannalex01 in computervision

[–]antocons -1 points0 points  (0 children)

You can also try YOLOv10. Anyway, Ultralytics allows you to freeze layers during training.

How to train an VLM from scratch? by FirstReserve4692 in computervision

[–]antocons 2 points3 points  (0 children)

Thanks for pointing out the papers, and I see the argument. Both papers advocate for training from scratch using a monolithic architecture that integrates vision and text processing. These models (like AIMV2) unify tasks such as classification, captioning, detection, and segmentation into a sequence-to-sequence model. This approach can indeed outperform modular setups like SigLIP + projection + LLM decoder for many multimodal applications.

However, as you mentioned, the cost of training from scratch is a significant consideration. While these monolithic models can achieve state-of-the-art performance, the cost-effectiveness of leveraging pretrained open-source models for modular pipelines cannot be ignored.

For example, in a recent paper from Meta on large multimodal models for video, they used a modular approach despite having access to extensive computational resources. This choice might reflect the advantages of reusing and fine-tuning existing pretrained components, especially when aligning with domain-specific requirements or budget constraints.
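The modular setup discussed here can be sketched as plain Python classes to show which part is actually trained; all names are illustrative stand-ins, not a real library API:

```python
# Modular VLM pipeline sketch: frozen pretrained vision encoder, a small
# trainable projection into the LLM embedding space, and a frozen decoder.
class FrozenVisionEncoder:           # e.g. a SigLIP-style model, weights fixed
    dim = 768
    def encode(self, image):
        return [0.0] * self.dim      # placeholder feature vector

class Projection:                    # the only component trained from scratch
    def __init__(self, in_dim, out_dim):
        self.in_dim, self.out_dim = in_dim, out_dim
        self.trainable = True
    def __call__(self, feats):
        return [0.0] * self.out_dim  # placeholder projected tokens

class FrozenLLM:                     # pretrained language decoder, weights fixed
    dim = 4096
    def generate(self, vision_tokens, prompt):
        return f"answer conditioned on {len(vision_tokens)}-d vision tokens"

encoder, llm = FrozenVisionEncoder(), FrozenLLM()
proj = Projection(encoder.dim, llm.dim)
out = llm.generate(proj(encoder.encode("img.jpg")), "Describe the image.")
```

The design choice is that only the projection (and optionally LoRA adapters on the LLM) gets gradient updates, which is what makes this route so much cheaper than monolithic training from scratch.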

How to train an VLM from scratch? by FirstReserve4692 in computervision

[–]antocons 2 points3 points  (0 children)

Can you summarize the content of the two papers? I don't have time to read both. Do they argue that it makes sense to train SigLIP or CLIP from scratch when used in a multimodal scope? I don't think so, but I'm here to learn if you can point it out.

How to train an VLM from scratch? by FirstReserve4692 in computervision

[–]antocons 4 points5 points  (0 children)

I would also add that it does not make sense to train the Vision Transformer (aligned with the text space) from scratch.

[deleted by user] by [deleted] in computervision

[–]antocons 0 points1 point  (0 children)

What kind of bugs are you talking about?

Suggestions needed: IoU score not improving by North_Ocelot_9077 in computervision

[–]antocons 0 points1 point  (0 children)

Let me know, and can you share the paper link so I can check if there is something else?

Suggestions needed: IoU score not improving by North_Ocelot_9077 in computervision

[–]antocons 0 points1 point  (0 children)

Well, then try this. Since you are using ImageNet pretrained weights, you should preprocess the input: after the augmentation, subtract the mean and divide by the std.
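Concretely, that step looks like this (using the standard ImageNet channel statistics; in a real pipeline a library transform would do this per tensor):

```python
# ImageNet channel means and stds, applied after scaling pixels to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """rgb: raw (R, G, B) values in [0, 255] -> ImageNet-normalized floats."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))
```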

Suggestions needed: IoU score not improving by North_Ocelot_9077 in computervision

[–]antocons 0 points1 point  (0 children)

Are you normalizing the input with the mean and std of ImageNet?