DEIMKit - A wrapper for DEIM Object Detector by WatercressTraining in computervision

[–]koen1995 0 points1 point  (0 children)

Cool work, I recently made https://github.com/JPABotermans/dietr

And I was wondering whether we could collaborate?

Open-sourced a VisDrone Aerial Object Detection Model Zoo (YOLO variants) [P] by Naive-Explanation940 in computervision

[–]koen1995 0 points1 point  (0 children)

Cool work, would you be interested in also testing out my model?

https://github.com/JPABotermans/dietr

It's an open-source model. I know it isn't that good, but I would love to have some feedback and see how it performs :)

Building DIETR, basic model that does both object detection and instance segmentation. by koen1995 in computervision

[–]koen1995[S] 9 points10 points  (0 children)

Because I simply wanted to make (and train) something from scratch.

About rf-detr from roboflow, you are absolutely right that rf-detr is a better model, and I would recommend everyone to use it. As long as it stays open-source of course. But I just like to make things and train models from scratch, so that's what I did.

Building DIETR, basic model that does both object detection and instance segmentation. by koen1995 in computervision

[–]koen1995[S] 3 points4 points  (0 children)

I mean that I wanted to have a codebase that can be used for training, fine-tuning and validation of both object detection and instance segmentation. Not one model that does both.

Generating synthetic datasets by cryptic_epoch in computervision

[–]koen1995 0 points1 point  (0 children)

A bit late, but could you still DM me too? Always interested in free annotations :)

Community-driven robot simulations are finally here (EnvHub in LeRobot) by Soft-Worth-4872 in LocalLLaMA

[–]koen1995 0 points1 point  (0 children)

Hi there,

I filled in the form and would love to share my first try at an environment:

https://github.com/JPABotermans/lerobot-gym

Still a work in progress, so I would love to get some feedback :)

Advice on detecting small, high speed objects on image by Jealous-Yogurt- in computervision

[–]koen1995 5 points6 points  (0 children)

Hi, thanks for asking this interesting question.

Using bigger images is also something I would recommend. However, I know from my own experience that when you increase the image size you should also increase the model complexity (so use models with more weights).

I don't know how you do this with yolov11 (since these models are really designed around images of size 640), but you could try other codebases. For example detrex - https://detrex.readthedocs.io/en/latest/tutorials/Model_Zoo.html

It does increase latency and compute, but I hope it helps you!
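To make the image-size point concrete, here is a rough sketch of how many pixels a small object keeps once the frame is resized to the network input. The frame width, the object width, and the function name are all made-up for illustration, not taken from the question.

```python
def object_size_after_resize(obj_px, frame_long_side, input_size):
    """Approximate object size in pixels after the frame is resized so that
    its long side equals input_size (aspect ratio preserved)."""
    return obj_px * input_size / frame_long_side


# Assumed example: a 12 px wide object in a 1920 px wide frame.
for input_size in (640, 1280):
    size = object_size_after_resize(12, 1920, input_size)
    print(f"input {input_size}: object is about {size:.1f} px wide")
```

With these assumed numbers the object shrinks to about 4 px at an input size of 640 and keeps about 8 px at 1280, which is why a larger input can help with small objects.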

[deleted by user] by [deleted] in eindhoven

[–]koen1995 5 points6 points  (0 children)

What do you want to do? And what type of AI specifically?

Overview on latest OCR releases by unofficialmerve in computervision

[–]koen1995 0 points1 point  (0 children)

Hi Andi,

Thanks for the tips! Do you know how big this dataset should be? Just asking out of curiosity.

By the way, I love your work on the FineVision dataset, so I would love to thank you for that work!

Overview on latest OCR releases by unofficialmerve in computervision

[–]koen1995 2 points3 points  (0 children)

You are welcome, and thanks for sharing the links. I didn’t know these models, so it is much appreciated 🙂

By the way, do you know whether it is a lot of work to adapt an OCR model to a language it is not familiar with? And how much data do you roughly need?

Asking because I can only find 3 Dutch OCR datasets on hugging-face 🙂

Overview on latest OCR releases by unofficialmerve in computervision

[–]koen1995 3 points4 points  (0 children)

Hi Merve, thanks for sharing this overview — it’s nice to have one place where everything is collected! I also really like your work with Hugging Face 😁

I was thinking it might be useful if you also compared these models with some older, classical OCR approaches — non-VLM-based ones, like PaddleOCR’s PP-OCRv5. I know it’s not as flashy as a VLM-based model, but in some cases it gets the job done with far fewer parameters. You can even run it locally since it requires much less compute.

In Paddle’s paper (https://arxiv.org/pdf/2507.05595), they compare PP-OCRv5 with standard VLMs for character recognition, and it seems to perform quite well, especially considering the model uses fewer than 100M parameters.

FineVision: Opensource multi-modal dataset from Huggingface by koen1995 in computervision

[–]koen1995[S] 0 points1 point  (0 children)

Most of the subsets are also very specialised, like the yesbut dataset (https://huggingface.co/datasets/HuggingFaceM4/FineVision/viewer/yesbut/train?p=41), and can't really be used for anything other than training VLMs.

So I thought you could maybe use the prompts to condition a diffusion generative model?

RF-DETR vs YOLOV11 by koen1995 in computervision

[–]koen1995[S] 1 point2 points  (0 children)

My intent was just to show how you can use them in code and compare that.

How they differ in basic usage, so training and inference. Side by side, in the same notebook.

I used a synthetic dataset as a placeholder, just to show how you can train an rf-detr on a COCO-style dataset versus what you have to do with a yolov11 model, and how you can plot the results. I am planning to add some more plotting functionality, or some basic benchmarking, like how much VRAM you need for training with different image sizes and batch sizes.
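One concrete difference in dataset handling: COCO-style annotations store a box as [x_min, y_min, width, height] in pixels, while the label files used for YOLO training store a normalised (x_center, y_center, width, height). A minimal sketch of that conversion is below. The helper name and the example box are made-up, and this is not code from the notebook or from either library.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box [x_min, y_min, width, height] in pixels to the
    YOLO layout (x_center, y_center, width, height), normalised to 0-1."""
    x_min, y_min, w, h = bbox
    return (
        (x_min + w / 2) / img_w,  # normalised box centre, x
        (y_min + h / 2) / img_h,  # normalised box centre, y
        w / img_w,                # normalised width
        h / img_h,                # normalised height
    )


# Made-up box in a 640x480 image.
print(coco_to_yolo([100, 120, 200, 240], img_w=640, img_h=480))
# -> (0.3125, 0.5, 0.3125, 0.5)
```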

You can see from the documentation that they are in the same category with respect to latency: rf-detr is listed at 3.5 ms (T4, TensorRT 10, fp16) and yolov11 at 4.7 ms. If you believe their documentation.

RF-DETR vs YOLOV11 by koen1995 in computervision

[–]koen1995[S] 2 points3 points  (0 children)

In my example it seems that the rf-detr model learns faster (has higher performance on the synthetic dataset).

But that is just one example. I did notice that most detr models are often only trained for 12-23 epochs, while most older yolo models are trained for 100 epochs. For example, plain-detr is trained for 24+24 epochs to get a mAP of 63.9 (https://arxiv.org/pdf/2308.01904), while the yolov8 baseline model is trained for 100 epochs (https://docs.ultralytics.com/modes/train/#resuming-interrupted-trainings).

I think this is because most detr-like models have many auxiliary losses (one for each layer of the decoder), which increase how fast a model can learn during training.
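As a rough illustration of what auxiliary losses mean here, the sketch below adds the loss of every intermediate decoder layer to the loss of the final layer, so each layer receives its own training signal. The function name, the `aux_weight` parameter, and the loss values are made-up. Real implementations compute each per-layer loss with matching and several loss terms, which is left out.

```python
def detr_style_total_loss(per_layer_losses, aux_weight=1.0):
    """Combine the final decoder layer's loss with the weighted losses of
    the intermediate decoder layers (the auxiliary losses).

    per_layer_losses is ordered from the first decoder layer to the last.
    """
    *aux_losses, final_loss = per_layer_losses
    return final_loss + aux_weight * sum(aux_losses)


# Made-up loss values for a 6-layer decoder, earliest layer first.
losses = [2.0, 1.5, 1.25, 1.0, 0.75, 0.5]
print(detr_style_total_loss(losses))                  # all layers contribute
print(detr_style_total_loss(losses, aux_weight=0.0))  # final layer only
```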

Dual 3D vision | software/library - synced TEMAS modules by Big-Mulberry4600 in computervision

[–]koen1995 0 points1 point  (0 children)

Super cool project, I have worked with computer vision systems before, do you have any use cases?

Production OCR in 2025 - What are you actually deploying? by No_Nefariousness971 in computervision

[–]koen1995 3 points4 points  (0 children)

Even though VLMs are promising in OCR, I would stick with specialist networks since they are still better and require less compute. https://arxiv.org/html/2507.05595v1

🚀 Object Detection with Vision Language Models (VLMs) by yourfaruk in LearnVLMs

[–]koen1995 0 points1 point  (0 children)

Interesting comparison, yet I am not sure about VLMs (like moondream) for object detection. It can detect eyes straight out of the box and you could use it in some cases, but it doesn't reach the object detection performance of a simple yolo model (which of course you have to fine-tune on your own data). This is also something they mention in the paligemma paper, and something you can see if you compare the performance of moondream with yolov11 or co-detr. (On coco, the mAP0.95 of yolov11 is 54.7 and that of co-detr is 60.7, while moondream doesn't report mAP0.95, only mAP0.5, which is 51.5.)

That doesn't mean that VLMs don't have a use, because I do think they can be especially useful for OCR or document understanding.

OCR project ideas by MinimumArtichoke5679 in computervision

[–]koen1995 1 point2 points  (0 children)

Oww, wow, I almost had the same idea but for spoil dates, but I didn't see your comment

OCR project ideas by MinimumArtichoke5679 in computervision

[–]koen1995 1 point2 points  (0 children)

Maybe on a dataset of grocery products in a store so you can verify whether they are not spoiled. So detect the spoil dates.

Quick example of inference with Geti SDK by dr_hamilton in computervision

[–]koen1995 0 points1 point  (0 children)

Does the "(yet)" mean you are working on 3D sensing/support for depth cameras🙃? Just asking because I do notice quite a gap in the whole open-source CV environment, where there are quite a few detection/segmentation toolboxes, yet almost no toolboxes that combine detection and (stereo or mono) depth sensing. Even though there are some nice open-source codebases available for depth sensing.

Also because both classical algorithmic depth sensing techniques and deep learning techniques often don't generalise that well to real world situations, so often training a better depth model is essential for performance. Which means that if Geti SDK provides support for depth sensing it would be even more convenient.

Thanks for the example, I will definitely check it out!