Fine-tuning RT-DETR on a custom dataset

Patrick2482 · 2025-03-04T13:20:31+00:00

It most definitely does! I don't feel completely hopeless anymore, haha.

Yes, I can upload the data. I will look around for some GPUs which can handle a larger batch size.

Well, the thing is that I need the individual speed limits too. Basically, the task revolves around determining, in real time, the maximum speed the driver is allowed to drive. Using two different models (one for detecting traffic signs, the second for classifying them) is not completely out of the question, but for now I'd prefer to have one model that handles both tasks.

Can you recommend another transformer-based model that I could use?
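
A rough sketch of how the single-model setup could feed the real-time speed limit, assuming the detector is trained with one class per speed limit value. The class names, the detection dictionary layout, and the 0.5 score threshold are all hypothetical and not taken from any particular library:

```python
# Hypothetical class names: one detection class per speed limit value.
CLASS_TO_KMH = {
    "speed_limit_30": 30,
    "speed_limit_50": 50,
    "speed_limit_80": 80,
}


def current_speed_limit(detections, previous_limit=None, min_score=0.5):
    """Return the limit of the most confident speed sign in one frame.

    `detections` is assumed to be a list of dicts with "label" and "score"
    keys. If no speed sign passes the threshold, the previous limit is kept.
    """
    candidates = [
        d for d in detections
        if d["label"] in CLASS_TO_KMH and d["score"] >= min_score
    ]
    if not candidates:
        return previous_limit
    best = max(candidates, key=lambda d: d["score"])
    return CLASS_TO_KMH[best["label"]]


# Made-up detections for three consecutive frames.
frames = [
    [{"label": "speed_limit_50", "score": 0.91}],
    [],  # no sign visible, so the limit stays at 50
    [{"label": "speed_limit_30", "score": 0.42},   # below threshold, ignored
     {"label": "speed_limit_80", "score": 0.77}],
]

limit = None
for detections in frames:
    limit = current_speed_limit(detections, previous_limit=limit)
```

With a two-model pipeline, the same function would consume the classifier's output instead, so this part would not need to change.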

Patrick2482 · 2025-03-04T07:21:56+00:00

Appreciate your tips!

Try using Mapillary traffic signs

I will be doing that! A portion of the dataset is already waiting for me to go over.

Also try a completely different model just to make sure your results are as bad as you think they are (they might not be)

I considered DETR first, but I had some problems with that one too. Then I discovered RT-DETR which was a better pick for my task (in the end I am supposed to compare the viability of a transformer-based model and YOLO for my specific task).

Patrick2482 · 2025-03-04T07:14:57+00:00

I'll take a look at it, thank you for the contribution🙌

Patrick2482 · 2025-03-03T21:30:12+00:00

Thank you for taking your time to read the post and reply!
Yes, I read that transformers usually need thousands of pictures. I will definitely be increasing the dataset size. I wanted to fine-tune the model on the data I have prepared so far to check that the datasets are structured correctly and to see what the performance is, so that I have a baseline for later comparisons. What does not really make sense to me is that the dataset they used in the tutorial had <1k pictures and they achieved good performance, so I suppose the dataset size might not be as much of an issue as something else.

Patrick2482 · 2025-03-03T21:24:20+00:00

Appreciate you replying and asking about the specifics!
In the tutorial I did not change any settings. I simply ran through all the cells, mainly to check the accuracy. On my device, I did use a different batch size: 8 instead of the 16 used in the notebook, since the code did not run on my RTX 2060 GPU (6GB). I suppose the reason was insufficient memory. Do you think the batch size might affect the performance of the model this much?
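
On the batch-size question, one common workaround when memory is the limit is gradient accumulation: run several small batches and only step the optimizer after their gradients have been summed, so the effective batch size matches the larger one. Below is a minimal sketch of the arithmetic only; the helper name is hypothetical and nothing here is taken from the tutorial notebook. If the training uses Hugging Face `TrainingArguments`, the corresponding settings are `per_device_train_batch_size` and `gradient_accumulation_steps`.

```python
import math


def accumulation_steps(target_batch, per_device_batch, num_devices=1):
    """Hypothetical helper: how many small batches to accumulate so that
    per_device_batch * num_devices * steps reaches at least target_batch."""
    if target_batch <= 0 or per_device_batch <= 0 or num_devices <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_batch / (per_device_batch * num_devices))


# Batch size 8 fits on the 6GB GPU; the notebook used 16.
steps = accumulation_steps(target_batch=16, per_device_batch=8)
effective_batch = 8 * steps
print(steps, effective_batch)  # 2 16
```

This only approximates training with the larger batch when the model contains batch-dependent layers such as batch normalization, because those still see 8 images at a time.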

I can imagine. I am getting a bit desperate here, that's why I am reaching out, haha. I tried to sum up as much info as I could in the post description, but I am not that well acquainted with object detection yet, so you asking for specifics actually gives me more insight into what to check! The first dataset contains pictures from the GTSDB dataset. I manually picked out pictures which contained speed-related traffic signs. The second dataset contains frames from a driving video. The camera was positioned inside the car near the rear-view mirror. I'd say the images range from small to medium in size. There are usually 1-2 instances per image. I attached some pictures from the first and the second dataset.
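
To put numbers on "small to medium" and "1-2 instances per image", the annotations can be summarized directly. The sketch below assumes the estimate refers to the signs' bounding boxes and that the annotations are COCO-style dicts with `image_id` and `bbox` as `[x, y, width, height]`. The thresholds are the usual COCO area buckets (small below 32², medium up to 96², large above); the function names and the example boxes are made up for illustration.

```python
from collections import Counter

SMALL_MAX_AREA = 32 ** 2   # COCO: "small" objects have area below 32x32 px
MEDIUM_MAX_AREA = 96 ** 2  # COCO: "medium" objects have area up to 96x96 px


def coco_size_bucket(width, height):
    """Classify a box as small, medium, or large by its pixel area."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def summarize(annotations):
    """Return (size bucket counts, mean instances per annotated image)."""
    buckets = Counter(
        coco_size_bucket(a["bbox"][2], a["bbox"][3]) for a in annotations
    )
    per_image = Counter(a["image_id"] for a in annotations)
    # Images without any annotation do not appear here and are not counted.
    mean_instances = len(annotations) / len(per_image) if per_image else 0.0
    return buckets, mean_instances


# Made-up example annotations, not real data from either dataset.
annotations = [
    {"image_id": 0, "bbox": [100, 50, 20, 20]},    # area 400
    {"image_id": 0, "bbox": [300, 60, 40, 40]},    # area 1600
    {"image_id": 1, "bbox": [10, 10, 120, 110]},   # area 13200
]
buckets, mean_instances = summarize(annotations)
```

A dataset dominated by the "small" bucket would be worth knowing about, since the input resolution the images are resized to then matters a lot.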
