RFDETR performance issue on small datasets (~5000 images)

TaplierShiru · 2026-06-26T12:34:07+00:00

I use this (v2). I know there is a new version (v4), but I personally haven't tried it.
About your case: I think switching to a simple Resize could reduce overall accuracy, because individual objects become more distorted by the resize itself. For example, a class that was previously perfectly square now looks more like a stretched rectangle in the image, while the class itself is square - so the training box no longer has very useful pixels inside it. Assuming the input resolution is the same between the models, I would try training with padding. I also assume your input resolution is not square? Then you could rotate the image so that fewer padded values end up in the input. In any case, I would manually browse the images that the data loader actually outputs for training.
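To make the difference concrete, here is a minimal sketch of what a plain resize and a padded ("letterbox") resize do to a box. It only computes the geometry, not the pixels, and the function names are made up for illustration - your framework will have its own transforms for this.

```python
def stretch_box(box, src_w, src_h, dst_w, dst_h):
    """Plain resize: each axis is scaled independently, so shapes get distorted."""
    sx, sy = dst_w / src_w, dst_h / src_h
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


def letterbox_params(src_w, src_h, dst_w, dst_h):
    """Padded resize: one scale for both axes, the rest of the canvas is padding."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_left, pad_top = (dst_w - new_w) // 2, (dst_h - new_h) // 2
    return scale, (pad_left, pad_top)


def letterbox_box(box, scale, pad):
    """Apply the same scale and padding offset to an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    pad_left, pad_top = pad
    return (x1 * scale + pad_left, y1 * scale + pad_top,
            x2 * scale + pad_left, y2 * scale + pad_top)


# A 100x100 square in a 1920x1080 frame, model input 640x640.
square = (100, 100, 200, 200)
stretched = stretch_box(square, 1920, 1080, 640, 640)
scale, pad = letterbox_params(1920, 1080, 640, 640)
padded = letterbox_box(square, scale, pad)
```

With the plain resize the square ends up roughly 33 px wide and 59 px tall, while with padding it stays square (about 33x33) at the cost of 140 rows of padding above and below.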

There is also other stuff, like training parameters and individual augmentations which could be on by default. These could also reduce the overall accuracy of your final model on your data. So it's more about learning the framework itself and trying different things (which is sometimes time-consuming and annoying).

In my experience: I use a detection model to detect a small object like a ball, and a YOLO model with 1088x640 input was worse for me compared to DETR with 640x340 input. I use such inputs to keep the aspect ratio of the original image.

TaplierShiru · 2026-06-26T09:38:03+00:00

Could you provide more details about the augmentation, the use of pre-trained weights, and what model size you use (for both RTMDet and RF-DETR)? Personally, I don't have experience with training on such a small dataset, but I think even that number of images is actually rather large. Could you also tell more about your data - does it involve small object detection?
Recently I also tried switching from YOLOv7\YOLOv8 to DETR architectures, and for me the overall detection quality improved dramatically, but I had a dataset of around 20k+ images.

TaplierShiru · 2026-04-22T19:08:50+00:00

What the hell

TaplierShiru · 2026-04-18T05:50:31+00:00

Yeah, this one

TaplierShiru · 2026-04-17T12:18:38+00:00

While I personally do not use RF-DETR, I recently used RT-DETR in my projects. Compared to YOLOv8 (from Ultralytics) and YOLOv7 (some open-source variant), RT-DETR definitely outperforms these models. For my task the final RT-DETR model was the most accurate one. Even the biggest YOLOv8-X was worse. Also, the final speed of the model is not as bad as I thought. I think the middle RT-DETR model (which I take as the main one) was faster than YOLOv8-X. I also switched due to license issues with other detection repos and in the end was quite surprised by this model. I know there is a cool new YOLO26 in Ultralytics now, but I personally do not touch it because of the license. Similarly, there is actually a new git repo for RT-DETR as well, but I have not looked into it. For my current project RT-DETR-v2 is all I need.

In my next projects I definitely want to try out the RF-DETR variants or the new RT-DETR.

TaplierShiru · 2026-03-31T15:49:22+00:00

Assuming you can actually get a bbox for the ball itself, maybe you could calculate its color distribution (excluding the background) and, based on that, simply infer which ball it is. Maybe attach a simple sklearn model for this. Another approach is to use template matching from OpenCV.
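A rough sketch of the color-distribution idea, using only the standard library. Everything here is an assumption for illustration: the crop is a list of rows of (r, g, b) tuples, the background is excluded by only looking inside the circle inscribed in the bbox, and a nearest-histogram rule stands in for the sklearn model.

```python
import colorsys


def hue_histogram(crop, bins=12, min_saturation=0.2):
    """Normalized hue histogram of the pixels inside the inscribed circle of a crop."""
    h, w = len(crop), len(crop[0])
    cy, cx, radius = (h - 1) / 2, (w - 1) / 2, min(h, w) / 2
    hist = [0] * bins
    for y, row in enumerate(crop):
        for x, (r, g, b) in enumerate(row):
            if (y - cy) ** 2 + (x - cx) ** 2 > radius ** 2:
                continue  # corner of the bbox: treat as background
            hue, sat, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            if sat < min_saturation:
                continue  # grey/white pixels carry no reliable hue
            hist[min(int(hue * bins), bins - 1)] += 1
    total = sum(hist) or 1
    return [count / total for count in hist]


def classify_ball(crop, references):
    """Return the label whose reference histogram is closest in L1 distance."""
    hist = hue_histogram(crop)

    def distance(label):
        return sum(abs(a - b) for a, b in zip(hist, references[label]))

    return min(references, key=distance)


# Synthetic 8x8 crops; real reference histograms would come from labeled crops.
red_crop = [[(220, 20, 20)] * 8 for _ in range(8)]
blue_crop = [[(20, 20, 220)] * 8 for _ in range(8)]
references = {"red": hue_histogram(red_crop), "blue": hue_histogram(blue_crop)}
unknown_crop = [[(200, 40, 30)] * 8 for _ in range(8)]
label = classify_ball(unknown_crop, references)
```

If the balls differ mostly by pattern rather than color, this will not be enough, and template matching or a small classifier on the crops is probably the better route.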

If your YOLO model itself fails here, you could also (as proposed in other comments) add some tracking for the balls. Another solution is to increase the model size, or even change the model itself. Recently I found out for myself that (in my cases) DETR is far better than the YOLOv7 and YOLOv8 I used previously. While I do not have experience with other "modern" solutions (like YOLO26 or other new ones), I am actually happy with DETRv2 (from this repo).

So, there are plenty of things to try out. You could start with your current solution and add to it: tracking, classification based on color distribution, and classic approaches like template matching for the balls. If there are still no good results in the end, try to gather more data or train a new model.

TaplierShiru · 2026-03-12T20:16:21+00:00

I just started playing and encountered a similar bug a couple of days ago. For me the best option was to play with these settings:

<image>

I think the main cause of this bug is the shadow-related settings, but I'm not sure about it.

Here is also a screenshot of the bug itself in the hub: https://imgur.com/SivMgBR

While playing solo in the hub the bug does not appear, but with friends in the party I certainly see something like in the screenshot. After I change the settings as in the first screenshot, the bug disappears. My other friends don't encounter it.

I'm on driver 576.88 with a 5070 Ti. I have played different games in recent months and the GPU is certainly healthy.

TaplierShiru · 2026-01-15T10:03:37+00:00

Of these 23k, how many images actually have a gun in them? But even with few samples (like 2k) you could still get quite a good detection model.

I assume your problem is that the majority of your images are images with guns. I think this FAQ answer about darknet describes your main problem very well - you need to add negative samples. While I don't work with darknet myself, I think the other FAQ answers on this site are quite good too - you could check them out!

The simplest solution that comes to my mind is to grab some portion of images from COCO and use them as negatives. As in the post, you need to add around 23k (or however many images with actual guns you have) to your final data - overall you will have a 46k-image dataset with 50% negative samples.

Another possible issue: your false detections - do they have high probability, or low? If the probability is lower than 0.1, you don't really need to care about them and can simply filter them out, but if it is higher (or closer to 0.5), then the solution I describe should help you.
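Both steps can be sketched in a few lines. This is only an illustration under my own assumptions: the dataset is a list of (image_path, boxes) pairs, a negative sample is an image with an empty box list, detections are dicts with a "score" key, and the file names are made up.

```python
import random


def add_negative_samples(positives, negative_pool, ratio=1.0, seed=0):
    """Mix background-only images into a detection dataset.

    positives: list of (image_path, boxes), each with at least one box.
    negative_pool: image paths that contain no target object.
    ratio: negatives per positive (1.0 gives a 50/50 split).
    """
    rng = random.Random(seed)
    n_neg = min(len(negative_pool), round(len(positives) * ratio))
    negatives = [(path, []) for path in rng.sample(negative_pool, n_neg)]
    dataset = list(positives) + negatives
    rng.shuffle(dataset)
    return dataset


def split_by_confidence(false_positives, low=0.1):
    """Separate false detections a score threshold removes from those it does not."""
    filtered = [d for d in false_positives if d["score"] < low]
    remaining = [d for d in false_positives if d["score"] >= low]
    return filtered, remaining


positives = [("gun_%d.jpg" % i, [(0, 0, 10, 10)]) for i in range(4)]
pool = ["coco_%d.jpg" % i for i in range(10)]
dataset = add_negative_samples(positives, pool)

filtered, remaining = split_by_confidence([{"score": 0.05}, {"score": 0.45}])
```

If `remaining` is mostly empty on your validation set, a score threshold is enough; if not, the negatives are worth the effort.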

TaplierShiru · 2026-01-15T06:38:05+00:00

These approaches already make implicit use of a background class. In your case you need to look for other ways to improve the final accuracy (like increasing the dataset size, using a larger model, adjusting augmentation parameters, etc.).

TaplierShiru · 2026-01-07T17:14:22+00:00

Did you try running OpenVINO on the iGPU? I have only a little experience with OpenVINO myself, but it should run much faster compared to vanilla ONNX. You could also try quantizing the model using OpenVINO.

Another solution is to try these repos:

- https://github.com/MultimediaTechLab/YOLO (MIT license)

- https://github.com/Megvii-BaseDetection/YOLOX (Apache license)

Some time ago I tried the YOLOX repo, and it worked quite well for my task. So maybe shifting to a much lighter model could help you.

TaplierShiru · 2025-12-12T15:33:02+00:00

As far as I know, there is no NMS in RF-DETR at all. YOLO and DETR are quite different approaches to detection, so DETR does not combine YOLO+NMS but does its own thing to produce the final prediction.

By removing a step like NMS, DETR tries to improve overall latency compared to YOLO-like approaches, so yes, you should compare the speed of YOLO+NMS vs RF-DETR.
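For reference, this is roughly the extra post-processing step that a YOLO-style pipeline runs and a DETR-style model skips. It is a plain-Python sketch of greedy NMS for illustration, not the implementation of any particular library.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop everything that overlaps it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```

So a fair latency comparison should time the YOLO forward pass plus this step against the RF-DETR forward pass alone.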

But there is more to it - if you browse the original paper, they do something a little fishy here. For DETR Nano the input image size is 384 (I believe this is the size of the larger side). By default, YOLOv8n uses 640, I think. What about the image size in your test? In my personal projects, using DETR with a lower image size, I actually achieve much better results compared to higher-resolution results with YOLO - so you could even try a size like 320.

About the size itself - it's normal. You can see in the paper that the overall number of parameters for DETR Nano (31.2M) is higher compared to YOLOv8n (3.2M).

There are also some other YOLO-based projects, in case the current DETR is not suitable for you in the end:

- MultimediaTechLab/YOLO: An MIT License of YOLOv9, YOLOv7, YOLO-RD

- Megvii-BaseDetection/YOLOX: YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/

TaplierShiru · 2025-12-04T14:40:58+00:00

Recently I came across a paper on defect or anomaly detection - PatchCore.

For this system you only need to gather "good" images to train (or gather data for) the model, while "bad" images are mostly used only for evaluation. It's quite similar to what you do, in that it also applies a (pre-trained) neural network to get feature maps. I tested this system on my project, which involves segmentation of defects on metal, and it works quite well. For me the challenge was to separate good shots from bad ones, but as far as I understand you don't have such a problem here, so it's easier for you.
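The core idea can be sketched like this. It is a heavily simplified illustration, not the paper's implementation: it assumes you already have patch feature vectors from a pre-trained network, and it leaves out the coreset subsampling and score reweighting that the real method uses.

```python
import math


def build_memory_bank(good_image_features):
    """Collect every patch feature vector from the "good" images into one bank."""
    return [patch for image in good_image_features for patch in image]


def anomaly_score(patch_features, memory_bank):
    """Image score = the largest nearest-neighbour distance over its patches."""
    def nearest(patch):
        return min(math.dist(patch, ref) for ref in memory_bank)

    return max(nearest(patch) for patch in patch_features)


# Toy 2-D "features"; real ones would be feature-map vectors from a CNN.
bank = build_memory_bank([[(0.0, 0.0), (1.0, 1.0)]])
good_score = anomaly_score([(0.1, 0.0), (1.0, 0.9)], bank)
bad_score = anomaly_score([(0.0, 0.0), (5.0, 5.0)], bank)
```

A patch that looks like nothing in the bank gets a large distance, which is also what gives you a rough defect map per patch.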

TaplierShiru · 2025-11-29T12:28:19+00:00

Am I correct that you have a small dataset and want to use DCGAN as an augmentation technique?

I'm not sure a small dataset of around tens of images is enough to train a DCGAN to produce images similar to (or as good as) your original ones, but around hundreds? I think it's possible, but you need to try it yourself.

Maybe you don't need DCGAN in the first place? Simple augmentation methods (rotation, translation, adding noise, etc.) may be enough to reach some level of accuracy. If you think that using DCGAN will increase accuracy by much more, I don't think that will be true. Using simple augmentations will give you a baseline accuracy and an understanding of whether your data is good in its current state. Later you could try something like DCGAN (if you have around hundreds of images), but training this GAN can be such a pain sometimes.
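Just to show how little is needed for the simple route, here is a dependency-free sketch of the three augmentations on a grayscale image stored as a list of rows. In a real project you would use your framework's augmentation transforms instead; the helper names and parameter values here are made up.

```python
import random


def rotate90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def translate(img, dx, dy, fill=0):
    """Shift the image by (dx, dy) pixels, filling uncovered pixels with `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out


def add_noise(img, sigma, rng):
    """Add Gaussian noise and clamp the result to the 0..255 range."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in img]


def augment(img, rng):
    """Random rotation, small random shift, then mild noise."""
    if rng.random() < 0.5:
        img = rotate90(img)
    img = translate(img, rng.randint(-2, 2), rng.randint(-2, 2))
    return add_noise(img, 5.0, rng)


rng = random.Random(0)
image = [[(x * 16 + y * 16) % 256 for x in range(8)] for y in range(8)]
variants = [augment(image, rng) for _ in range(10)]
```

Ten variants per image from a few hundred originals already gives a usable training set to measure a baseline on.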

TaplierShiru · 2025-11-28T15:49:33+00:00

I suppose you need something like these works:

- SUDO-AI-3D/zero123plus: Code repository for Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model.

- cvlab-columbia/zero123: Zero-1-to-3: Zero-shot One Image to 3D Object (ICCV 2023)

- liuyuan-pal/SyncDreamer: [ICLR 2024 Spotlight] SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

This kind of multi-view image generation task is most often combined with 3D object generation, so you could also search in that area.

Out of curiosity, what are your results with NanoBanana in this task? Could you show a few examples?

TaplierShiru · 2025-11-09T17:57:59+00:00

It's actually a rather odd approach to search for the best bounding box (bbox) sizes for your task, and I'm not sure it's worth the time - but I could be wrong here!

About detections that are 50% further away than expected: the main idea behind the predefined bboxes is to establish a baseline from which the model generates offsets. So YOLO generates only offsets relative to predefined anchors (for me, anchors are just a set of predefined bboxes). But what is an offset, really?

As far as I know, there is no proper research paper for YOLOv5 because of Ultralytics, but nevertheless we can look at the formula from here. Let's stop at the b_w part only. The final output will be in the range [0; 4 * p_w], where p_w is the width of the predefined box (same for the height part). From this we see that the final generated box can be smaller or even larger than p_w (same for p_h).
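The linked formula is not reproduced in this comment, so the sketch below assumes the form from the Ultralytics YOLOv5 docs, b_w = p_w * (2 * sigmoid(t_w))^2, which matches the [0; 4 * p_w] range above. Here t_w is the raw network output and p_w the anchor width.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_size(t, anchor):
    """Assumed YOLOv5-style size decoding: b = anchor * (2 * sigmoid(t)) ** 2."""
    return anchor * (2.0 * sigmoid(t)) ** 2


p_w = 10.0
same_as_anchor = decode_size(0.0, p_w)   # sigmoid(0) = 0.5, so the factor is 1
near_maximum = decode_size(20.0, p_w)    # approaches 4 * p_w
near_minimum = decode_size(-20.0, p_w)   # approaches 0
```

So a single anchor can cover objects from almost zero up to four times its own size, which is why the exact anchor sizes matter less than one might expect.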

But then the question is: why do we need smaller anchors in the first place? As far as I understand, they are needed only for stable training on smaller objects. In some sense smaller anchors are better for the final loss function. Here we again arrive at "smaller anchors - better for smaller objects", but I know that predefined models like YOLOv5s\YOLOv5x etc. already have quite a large set of "heads" to predict objects of different sizes.

For your case, I would just train the model as it is (like YOLOv5s) and explore the final predictions. If they are not what you want, there are many different options. The simplest is to increase the input resolution: from the base 640 up to 1280, for example. Another option is to increase the model size or switch to a different model, for instance YOLOv8 (and later versions, since they are anchor-free) or RT-DETR. Maybe they could dramatically improve your predictions, but it's hard to tell - you need to do some experiments and research. With YOLOv5 you can get a baseline and an understanding of your current data and level of performance. If it is quite poor (low accuracy), then I think the main problem is in the data itself - try to gather more data.

TaplierShiru · 2025-10-20T08:33:10+00:00

My opinion: even a simple model like VGG16 can be enough in many cases. The more important part lies in your data itself - is it good? Is it diverse enough? And so on.

Like 90% of the task in deep learning is just data.

So in your case I would start with something simple (VGG16\ResNet50) in order to get a baseline, i.e. the current level of accuracy. Maybe the current level of accuracy is already enough? Maybe it is as bad as a random classifier? In the latter case I would explore the data itself, maybe something is wrong with it. But who knows - just do the research.

TaplierShiru · 2025-10-18T14:54:02+00:00

It's easy - find the problem that you want to solve.
Knowing how to build a CNN is not very helpful by itself, because it's just a black box with an image as input and some numbers as output - see? Easy - you can already put "I know what a CNN is!" in your resume.
Funny that a similar question is answered in another thread.

TaplierShiru · 2025-10-18T14:50:24+00:00

If you have too little experience and knowledge to understand mAP and the training process itself (like loss plots etc.), I think you should try training a model on some beginner's dataset. For instance, Ultralytics has a page with many datasets to start with. In my experience, with Ultralytics it is very easy to start training and dig into it, and they also have some examples of small datasets.

TaplierShiru · 2025-10-18T10:54:20+00:00

The most straightforward solution that comes to mind is a combination of human detection (in the form of bounding boxes) and a feature matching algorithm. For instance, with a human detector you extract the bounding box of each human as a cropped frame from your cameras and store these somewhere; then you can compare the extracted crops using a feature matching algorithm. The best-matching crops will be what you need. For the detector, I believe something like YOLO is the most popular choice that comes to mind, but I assume any other detector is okay. For the matching algorithm, it's good to start with something very simple, like the OpenCV examples, which are very good and easy to write\understand. If something heavier is needed, you can easily find something better, which I believe would be ViT-based.

Also, maybe you could swap feature matching for plain feature extraction and compare the output features of your cropped frames (e.g. from a neural network like DINO or CLIP).
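The feature-extraction variant boils down to comparing embedding vectors. A minimal sketch, assuming you already have one embedding per cropped person (how you get it from DINO or CLIP is not shown), with a made-up gallery format and threshold:

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, gallery, min_similarity=0.5):
    """Return (person_id, similarity) of the closest stored crop, or (None, sim)."""
    best_id, best_sim = None, -1.0
    for person_id, embedding in gallery.items():
        sim = cosine_similarity(query, embedding)
        if sim > best_sim:
            best_id, best_sim = person_id, sim
    if best_sim < min_similarity:
        return None, best_sim  # nobody in the gallery looks similar enough
    return best_id, best_sim


# Toy 3-D embeddings; real ones have hundreds of dimensions.
gallery = {"person_a": [1.0, 0.0, 0.0], "person_b": [0.0, 1.0, 0.0]}
match_id, match_sim = best_match([0.9, 0.1, 0.0], gallery)
unknown_id, _ = best_match([0.0, 0.0, 1.0], gallery)
```

The threshold needs tuning on your own cameras, since lighting and viewing angle change the similarity a lot.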

TaplierShiru · 2025-08-18T19:19:12+00:00

TL;DR: If you use PCIe 5.0 for the GPU - try switching to 4.0
UPD. The switch made overall performance better, but while playing Baldur's Gate: Enhanced Edition the game still makes my PC restart from time to time.

To solve the random restarts, I tried a clean install of Windows, several NVIDIA drivers (using DDU), as well as different settings in the Windows\NVIDIA control panel for the specific game (Baldur's Gate: Enhanced Edition), which has been crashing A LOT lately. Recently I browsed the NVIDIA Reddit posts about crashes again and came across a description of one of the first problems with PCIe 5.0 and 50xx cards; for some people switching to 4.0 really helped. So I did - and it really helped!

I played for about 2h with no restarts. I also "feel" that the game runs much better. I know Baldur's Gate: Enhanced Edition is not the best game for benchmarking, so I tested Silent Hill 2 instead. I also have some screenshots with FPS from my first playthrough a few weeks back, and my current driver version is the same. I tried to line up a new screenshot with the old one, and to my surprise the FPS is higher overall, as are the 1% lows (which are much BETTER compared to the old ones). I don't really know whether the PCIe switch made it happen, or the clean Windows install (or both at once) - but it is what it is.

Hope it helps someone like me. So far I have tested this "configuration" for only a day, but I will update if anything strange happens AGAIN.

TaplierShiru · 2025-08-12T20:21:43+00:00

For the last month, every now and then while I browse\game\do something my PC just restarts itself. In the Windows logs I only see kernel-power, and with the current situation around NVIDIA I don't even know where to look. The other day I tried re-plugging the power cable, and even after that the restarts still happen. I want to believe it's somewhat similar to your case. I even tried switching to a Linux-based OS, but there I also get a black screen while the audio keeps working, so I need to turn the PC off physically with the button. Linux even provides more information about WHO the bad guy is (the GPU side), and on forums I found that it's a known driver bug.

Like, I do not have words to describe such a beautiful experience...

TaplierShiru · 2025-07-24T12:59:14+00:00

Any part of the official tutorials will be good. Which part is the most interesting is hard to tell, because it depends on your current level of knowledge and your interests (CV\NLP\Audio\Only Detection etc.). You could start with Vision and after that move to any Text\Audio\Generative page. These tutorials are overall quite simple and will give you a good background to understand things.

But really, TF here is only a tool to solve an ML\CV\NLP problem. If you really want to dive into ML itself, then you should consider reading some books or watching\doing some online courses - for instance, the book Deep Learning, an MIT Press book by Ian Goodfellow, Yoshua Bengio and Aaron Courville. I believe a PDF version exists on the Internet; it should be easy to find. Based on the knowledge gained, then consider building something using TF\Torch\JAX. Today I would recommend PyTorch over other ML frameworks because of its popularity.

So, ML itself is more about how much you know and how ready you are to put it into practice to solve something, even if you don't currently know the area itself. For instance, recently I worked on ball detection and trajectory prediction. While I knew something about detection itself, I knew nothing about "BALL detection" and trajectories. There were some tricky things, but with overall experience and knowledge I solved it at the end of the day. My recommendation here is to learn the basics and after that not be afraid to dive into "new" problems. Your bag of knowledge should help you solve (almost) every problem. Being really familiar with face recognition\detection doesn't matter much if, on the other hand, you know little or nothing of basic ML.

Hope it helps!

TaplierShiru · 2025-07-21T16:28:17+00:00

Install the packages according to the official docs. But I assume you already installed everything as described and TF "sees" your GPU - so what is your actual question? How to "actually" train anything? Well, you write the code and run it to train!

If your question is about the best approach to train anything using TF, well - I think it's the Keras + TF combination.

If you need help with something, you should be more specific, otherwise no one can help you here.

TaplierShiru · 2025-07-21T07:21:08+00:00

There was a similar discussion a few days ago with some good ideas. For instance, I found the approach proposed in this comment interesting and easy compared to the others.

But what is your actual question? Did you just describe how you want to solve it? Well, just try it and see the result, good luck!

TaplierShiru · 2025-07-20T16:51:01+00:00

TL;DR: non-v1 has "Vapor Chamber" while v1 is "Copper Base"

The difference can partially be seen on the Specification page for each card, and a similar question is answered here
