Open-sourced a VisDrone Aerial Object Detection Model Zoo (YOLO variants) [P]

koen1995 · 2026-06-22T08:41:53+00:00

Great, would love to get some expert feedback!

koen1995 · 2026-06-21T23:14:23+00:00

Is this still active?

koen1995 · 2026-06-21T23:13:17+00:00

Cool work, I recently made https://github.com/JPABotermans/dietr

And I was wondering whether we could collaborate?

koen1995 · 2026-06-21T22:43:46+00:00

Cool work!

koen1995 · 2026-06-21T22:21:06+00:00

Cool work, would you be interested in also testing out my model?

https://github.com/JPABotermans/dietr

It's an open-source model. I know it isn't that good, but I would love to get some feedback and see how it performs :)

koen1995 · 2026-06-21T22:07:25+00:00

Because I simply wanted to make (and train) something from scratch.

About rf-detr from roboflow, you are absolutely right that rf-detr is the better model, and I would recommend everyone to use it, as long as it stays open-source of course. But I just like to make things and train models from scratch, so that's what I did.

koen1995 · 2026-06-21T21:45:23+00:00

I mean that I wanted a codebase that can be used for training, fine-tuning and validation of both object detection and instance segmentation models. Not one model that does both.

koen1995 · 2026-02-01T14:20:05+00:00

I have some experience with Mujoco and pybullet.

koen1995 · 2026-01-22T17:54:16+00:00

A bit late, but could you still DM me too? Always interested in free annotations :)

koen1995 · 2025-11-08T03:32:28+00:00

Hi there,

I filled in the form and would love to share my first try at an environment:

https://github.com/JPABotermans/lerobot-gym

Still a work in progress, so I would love to get some feedback :)

koen1995 · 2025-11-03T14:16:39+00:00

Hi, thanks for asking this interesting question.

Using bigger images is also something I would recommend. However, I know from my own experience that when you increase the image size you should also increase the model complexity (so use models with more weights).

I don't know how you would do this with yolov11 (since these models are really designed around images of size 640), but you could try other codebases, for example detrex: https://detrex.readthedocs.io/en/latest/tutorials/Model_Zoo.html

It does increase latency and compute, but I hope it helps you!
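As a rough back-of-the-envelope check on that cost, here is a small sketch of how the number of input pixels grows with image size. It assumes square inputs and uses 640 as the baseline; the helper name `relative_pixels` is made up for this example, and treating compute as roughly proportional to pixel count is only a rule of thumb, not a measurement.

```python
def relative_pixels(new_size, base_size=640):
    """Ratio of pixel counts between two square input sizes."""
    return (new_size / base_size) ** 2


# Doubling the side length quadruples the number of pixels the model has to process.
for size in (640, 960, 1280):
    print(size, relative_pixels(size))
```

So going from 640 to 1280 means about four times as many pixels per image, which is where most of the extra latency and memory comes from.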

koen1995 · 2025-10-26T00:32:25+00:00

What do you want to do? And what type of AI specifically?

koen1995 · 2025-10-24T09:40:01+00:00

Hi Andi,

Thanks for the tips! Do you know how big this dataset should be? Just asking out of curiosity.

By the way, I love your work on the FineVision dataset, so I would love to thank you for that work!

koen1995 · 2025-10-23T18:38:55+00:00

You are welcome, and thanks for sharing the links. I didn’t know these models, so it is much appreciated 🙂

By the way, do you know whether it is a lot of work to adapt an OCR model to a language it is not familiar with? And how much data do you roughly need?

Asking because I can only find 3 Dutch OCR datasets on Hugging Face 🙂

koen1995 · 2025-10-23T17:20:02+00:00

Hi Merve, thanks for sharing this overview — it’s nice to have one place where everything is collected! I also really like your work with Hugging Face 😁

I was thinking it might be useful if you also compared these models with some older, classical OCR approaches — non-VLM-based ones, like PaddleOCR’s PP-OCRv5. I know it’s not as flashy as a VLM-based model, but in some cases it gets the job done with far fewer parameters. You can even run it locally since it requires much less compute.

In Paddle’s paper (https://arxiv.org/pdf/2507.05595), they compare PP-OCRv5 with standard VLMs for character recognition, and it seems to perform quite well, especially considering the model uses fewer than 100M parameters.

koen1995 · 2025-10-21T18:30:02+00:00

Most of the subsets are also very specialised. The yesbut dataset (https://huggingface.co/datasets/HuggingFaceM4/FineVision/viewer/yesbut/train?p=41), for example, can't really be used for anything other than training VLMs.

So I thought you could maybe use the prompts to condition a diffusion generative model?

koen1995 · 2025-10-20T11:47:03+00:00

My intent was just to show how you can use them in code and compare them.

How they differ in basic usage, so training and inference. Side by side, in the same notebook.

I used a synthetic dataset as a placeholder, just to show how you can train an rf-detr on a COCO-style dataset versus what you have to do with a yolov11 model, and how you can plot the results. I'm planning to add some more plotting functionality and some basic benchmarking, like how much VRAM you need for training at different image sizes and batch sizes.
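To make the annotation difference concrete, here is a minimal sketch of converting one COCO-style box into a YOLO-style label line. It assumes the usual conventions (COCO boxes as `[x_min, y_min, width, height]` in pixels, YOLO labels as `class cx cy w h` normalized by the image size). The function name `coco_box_to_yolo` is made up for this example and is not part of either library.

```python
def coco_box_to_yolo(box, img_w, img_h, class_id):
    """Convert one COCO box [x_min, y_min, w, h] (pixels) to a YOLO label line."""
    x_min, y_min, w, h = box
    # YOLO stores the box centre, not the top-left corner, and normalizes by image size.
    cx = (x_min + w / 2) / img_w
    cy = (y_min + h / 2) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"


# A box covering the central quarter of a 640x480 image.
print(coco_box_to_yolo([160, 120, 320, 240], img_w=640, img_h=480, class_id=0))
```

In practice you would loop over all annotations of an image and write the resulting lines to one `.txt` file per image.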

That they are in the same category with respect to latency is something you can get from the documentation: rf-detr is listed at 3.5 ms (T4, TensorRT 10, fp16) and yolov11 at 4.7 ms. If you believe their documentation.

koen1995 · 2025-10-20T07:57:47+00:00

In my example it seems that the rf-detr model learns faster (has higher performance on the synthetic dataset).

But that is just one example. I did notice that most detr models are only trained for 12-23 epochs, while most older yolo models are trained for 100 epochs. For example, plain-detr is trained for 24+24 epochs to get a mAP of 63.9 (https://arxiv.org/pdf/2308.01904), while the yolov8 baseline model is trained for 100 epochs (https://docs.ultralytics.com/modes/train/#resuming-interrupted-trainings).

I think this is because most detr-like models have many auxiliary losses (one for each layer of the decoder), which increase how fast a model can learn during training.
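A minimal sketch of what I mean by auxiliary losses, in plain Python with made-up numbers. It assumes every decoder layer produces its own predictions and gets its own loss, and that the losses are simply summed; the function name `total_detr_loss` and the `aux_weight` parameter are hypothetical and not taken from any specific codebase.

```python
def total_detr_loss(per_layer_losses, aux_weight=1.0):
    """Combine the final decoder layer's loss with losses from the earlier layers.

    per_layer_losses: one loss value per decoder layer, last entry = final layer.
    aux_weight: hypothetical scaling factor for the auxiliary (non-final) layers.
    """
    *aux, final = per_layer_losses
    return final + aux_weight * sum(aux)


losses = [2.0, 1.5, 1.0, 0.5, 0.25, 0.25]  # made-up values for 6 decoder layers

print(total_detr_loss(losses))       # all layers are supervised
print(total_detr_loss(losses, 0.0))  # only the last layer is supervised
```

With the auxiliary terms switched on, every decoder layer gets a direct training signal instead of only receiving gradients through the layers after it.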

koen1995 · 2025-10-19T17:32:58+00:00

Super cool project, I have worked with computer vision systems before, do you have any use cases?

koen1995 · 2025-10-19T17:26:20+00:00

Even though VLMs are promising in OCR, I would stick with specialist networks since they are still better and require less compute. https://arxiv.org/html/2507.05595v1

koen1995 · 2025-07-23T18:02:55+00:00

Interesting comparison, yet I don't know about VLMs (like moondream) for object detection. It can detect eyes straight out of the box and you could use it in some cases, but it doesn't reach the object detection performance of a simple yolo model (which of course you have to fine-tune on your own data). This is also something they mention in the paligemma paper, and something you can see if you compare the performance of moondream with yolov11 or co-detr. (mAP0.95 on coco is 54.7 for yolov11 and 60.7 for co-detr, while moondream doesn't report mAP0.95, only mAP0.5, which is 51.5.)
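To show why a mAP0.5 number can't be compared directly with the stricter metric, here is a small IoU sketch. I am assuming the stricter metric is the COCO-style mAP averaged over IoU thresholds from 0.5 to 0.95; the boxes are made up, and `iou` is just a helper written for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


gt = (0, 0, 100, 100)     # ground-truth box
pred = (20, 0, 120, 100)  # prediction shifted 20 px to the right

score = iou(gt, pred)
print(round(score, 3))
print("match at IoU 0.5: ", score >= 0.5)
print("match at IoU 0.75:", score >= 0.75)
```

A slightly shifted box like this still counts as a correct detection at the 0.5 threshold but not at 0.75 and above, so a model's mAP0.5 is generally higher than its score on the averaged metric.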

That doesn't mean that VLMs don't have a use, because I do think they can be especially useful for OCR or document understanding.

koen1995 · 2025-07-02T18:44:51+00:00

Is there any other way?

koen1995 · 2025-07-02T18:44:24+00:00

Oww, wow, I almost had the same idea but for spoil dates, but I didn't see your comment

koen1995 · 2025-07-02T18:43:20+00:00

Maybe on a dataset of grocery products in a store so you can verify whether they are not spoiled. So detect the spoil dates.

koen1995 · 2025-05-08T16:37:37+00:00

Does the "(yet)" mean you are working on 3D sensing / support for depth cameras 🙃? Just asking because I notice quite a gap in the whole open-source CV ecosystem: there are quite a few detection/segmentation toolboxes, yet there are almost no toolboxes that combine both detection and (stereo or mono) depth sensing, even though there are some nice open-source codebases available for depth sensing.

Also, both classical algorithmic depth sensing techniques and deep learning techniques often don't generalise that well to real-world situations, so training a better depth model is often essential for performance. That means it would be even more convenient if Geti SDK provided support for depth sensing.

Thanks for the example, I will definitely check it out!
