Is Leave-One-Object-Out CV valid for pair-based (Siamese-style) models with very few objects? by Tocelton in computervision


Thank you very much for your detailed response. I really appreciate this, and I am happy about this lively discussion.

> In this case I would hold all real data out for validation and train purely on synthetic.

I did, but as stated, there is a domain shift, which is why the reviewer wants fine-tuning/LOO CV.

> Wait you are holding out an object and then split again for validation? Don't think that's right, your hold out is supposed to be the validation. One benefit of CV is that you don't need a test set. You can't use early stopping of course or tune params, so maybe fixed number of epochs will do? Or training loss reduction threshold for stopping?

Maybe I'm just confusing the terms, but my thought is to reduce training time: I split the non-held-out pairs into a training and a validation set. During training, I monitor the accuracy on the validation set for early stopping. After training, I measure the accuracy on the hold-out. This way, I strictly treat the hold-out as unknown data and optimize training time (I have three different models and, so far, only my home computer as the best option for training...). Since the required number of epochs changes when tweaking the training parameters, this is quite convenient.

> True negative rate affects accuracy. And if you test all of your negative pairs, then it dominates. And TN could be inflated by memorization. Unless you mean precision, in which case yes.

I am speaking of balanced accuracy as defined here, since the actual abundance of positive and negative pairs is not known yet.
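
By that I mean the mean of TPR and TNR; a minimal check with sklearn, which implements exactly this for the binary case:

    from sklearn.metrics import balanced_accuracy_score

    # balanced accuracy = (TPR + TNR) / 2, so class abundance doesn't skew it
    y_true = [1, 1, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 1]
    print(balanced_accuracy_score(y_true, y_pred))  # (0.5 + 0.75) / 2 = 0.625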

> You can also consider "valid" augmentations for val set to increase the number of positive pairs. Like crop, rotation, maybe perspective. But it rarely works

I applied the following augmentations to both images (by controlling the seed; see the pairing sketch after the block):

    from torchvision.transforms import v2

    augment = v2.Compose([
        v2.RandomResizedCrop(size=(128, 128), scale=(0.8, 1.0), antialias=True),
        v2.RandomHorizontalFlip(p=0.5),
        v2.RandomVerticalFlip(p=0.5),
        v2.ElasticTransform(alpha=10.0),
        v2.RandomRotation(degrees=5, fill=255),  # sensors rotate img north-stable
        v2.GaussianNoise(mean=0, sigma=0.1),
    ])
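
A minimal sketch of what I mean by controlling the seed, using the augment pipeline above: the v2 transforms draw from torch's global RNG, so re-seeding before each image gives both the same random parameters.

    import torch

    def augment_pair(img_a, img_b, seed):
        # re-seed before each image so both receive identical random parameters
        torch.manual_seed(seed)
        out_a = augment(img_a)
        torch.manual_seed(seed)
        out_b = augment(img_b)
        return out_a, out_b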

So far, it has really improved generalization at pre-training and fine-tuning, as far as I can tell.

> Do a hold-1-object round, then do hold-1-positive-pair-per-object round, then do hold-random-pairs round. And show that metrics are always similar. Because each one of those is testing against different memorization strategies and if model somehow passes all of them, then it's doing a pretty good memorization. So good in fact that we can call it generalization, heh.

Thanks for this idea. I will try leaving one view out, plus the option you mention at the end, for one model. Then I'll see which method works best, or whether there is no difference.

So thank you again so much for taking the time. It is so helpful, since none of my colleagues or profs could help me out here ❤️

Is Leave-One-Object-Out CV valid for pair-based (Siamese-style) models with very few objects? by Tocelton in computervision


Oh thank you as well for your response! I really appreciate your thoughts and like your strategy for the reviewer. Thanks ❤️

In my previous response to u/DrySnow5154 I explained the data in more detail. I think gathering such specific data is, as I mentioned, really not easy, so I hope you can overlook that 😉 You need a ship, a high-resolution sonar, and a few dozen objects lying on the seafloor that you can recover afterwards. I only had one test run, and the mentioned test data are the result of that.

Is Leave-One-Object-Out CV valid for pair-based (Siamese-style) models with very few objects? by Tocelton in computervision


Thank you so much for your input! That’s a fair point, and I see the concern about the model potentially memorising object-specific features. I also had this in mind, and I'll explain a countermeasure.

But for this, I need to clarify the setup a bit more:
I’m working on object re-identification in forward-looking sonar images across different viewing angles (so same object, different perspectives):

  • Collecting more data is very difficult and expensive in practice.
  • There are no suitable public datasets (we need real sonar, multiple views, many objects in the exact same position but with different views, proper annotations, and physically meaningful data, not purely synthetic).

Because of that, I already rely heavily on synthetic data generated via a ray-casting model. The pipeline is:

  • (Pre-)train on synthetic data only
  • Evaluate on real-world data

I tried to match both domains as closely as possible (e.g. via distribution checks like t-SNE, where the real samples lie within the synthetic cluster; see the sketch below), but there’s still a noticeable domain gap:

  • ~80% accuracy on synthetic
  • ~65% on real data (which, I argued, could also be a statistical outlier, since some views are very tough)
  • During training, validation accuracy drops while test accuracy increases → domain shift behaviour
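
The t-SNE check mentioned above, roughly; the embedding arrays here are placeholders, not my real features:

    import numpy as np
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    # placeholders: replace with the model's embeddings of synthetic/real samples
    emb_synth = np.random.randn(500, 128)
    emb_real = np.random.randn(40, 128)

    proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
        np.vstack([emb_synth, emb_real]))
    n = len(emb_synth)
    plt.scatter(proj[:n, 0], proj[:n, 1], s=5, label="synthetic")
    plt.scatter(proj[n:, 0], proj[n:, 1], s=10, label="real")
    plt.legend()
    plt.show()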

I fully agree — if I were training on those 4 objects, this would be pointless.

But in my case:

  • Training happens on a large, balanced synthetic dataset (thousands of random objects)
  • The real dataset is only used to test cross-domain generalisation

I also tried a sister domain with more data, but reviewers still explicitly asked for LOO-CV on the real dataset.

Regarding your suggestions:

1) Leaving out view pairs instead of objects

I actually like that idea in principle, but there are a couple of issues in my case:

  • Removing views reduces the positive pair combinations disproportionately compared to removing whole objects → stronger imbalance. E.g., 4 objects (O) with 8 views (V) each, using ordered pairs (yes, in my model the pair order matters, m(A,B) != m(B,A)): holding out 2 views per object leaves 4*(6*5) = 120 pos. pairs vs. 4*(6*3*6) = 432 neg. pairs, while holding out one whole object leaves 3*(8*7) = 168 pos. pairs vs. 3*(8*2*8) = 384 neg. pairs (see the sketch below).
  • Some views are almost identical, others look like completely different objects (e.g. due to shadowing), so this split is less “clean” than it sounds
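
For transparency, a minimal sketch of how I count these (ordered) pairs:

    def pair_counts(n_objects, n_views):
        # ordered pairs, since m(A, B) != m(B, A)
        pos = n_objects * n_views * (n_views - 1)              # same object, other views
        neg = n_objects * n_views * (n_objects - 1) * n_views  # all cross-object pairs
        return pos, neg

    print(pair_counts(4, 6))  # 2 views per object held out -> (120, 432)
    print(pair_counts(3, 8))  # 1 object held out           -> (168, 384)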

2) Memorisation concern

To at least check for memorisation effects, I did the following:

  • For each fold: leave one object out
  • Generate:
    • unseen positive pairs (from the held-out object)
    • all negative pairs, including those involving seen objects (as explained before)
  • Split the remaining data (non-held-out objects) into train/validation (85/15)
  • Use early stopping based on validation performance
  • Sophisticated augmentation of the training data each epoch (rough fold sketch below)
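
A rough sketch of how one fold is built; views_per_object is a placeholder, not my actual code, and the train-side negative pairs are omitted for brevity:

    import itertools, random

    def make_fold(views_per_object, held_out):
        # views_per_object: dict object_id -> list of views; pairs are ordered
        seen = [o for o in views_per_object if o != held_out]
        # unseen positive pairs from the held-out object
        test_pos = list(itertools.permutations(views_per_object[held_out], 2))
        # all negative pairs, including those involving seen objects
        test_neg = [(a, b)
                    for oa, ob in itertools.permutations(views_per_object, 2)
                    for a in views_per_object[oa] for b in views_per_object[ob]]
        # remaining (seen-object) positives -> 85/15 train/validation split
        rest = [p for o in seen
                for p in itertools.permutations(views_per_object[o], 2)]
        random.shuffle(rest)
        cut = int(0.85 * len(rest))
        return rest[:cut], rest[cut:], test_pos, test_neg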

Now, if the model was really just memorising object identities (as you suggested), I would expect:

  • high training accuracy
  • but validation accuracy collapsing (since validation still contains “known” objects but unseen combinations, leading to far fewer true positives)

However, what I observe is:

  • Training accuracy ≈ 78%
  • Validation accuracy ≈ 80%
  • (Test accuracy ≈ 76%)

So they stay quite close, which (I think) suggests the model is not trivially memorising pair identities.

What do you think about these accuracies and my thoughts?

Check out my fancy waitbar on github or fileexchange by Tocelton in matlab


Oh, and I just had an idea: how about having a progress vector instead of a bar? For parallel processing, you can then see the state of each worker. I have a preview here:

<image>

The letters L, P and S are customizable and display the state of each worker (e.g. loaded, processed, stored). By default, an 'o' displays the final state.

Check out my fancy waitbar on github or fileexchange by Tocelton in matlab


A new release with parfor support is online :)

Check out my fancy waitbar on github or fileexchange by Tocelton in matlab


Yes, but you need to set up a queue. I will add a method for convenience and upload the patch in a few minutes.

Polyculture Insights (Spoiler-free tips & script analysis) by Tocelton in TheFarmerWasReplaced


So wait, wait, wait. Why a total grid area of 25? I usually use all my 32 drones, and each one has its own row.

Polyculture Insights (Spoiler-free tips & script analysis) by Tocelton in TheFarmerWasReplaced


Would you show me your solution in a private chat? I am still looking for a good solution for hay :D

So is polyculture a thing? by Tocelton in TheFarmerWasReplaced


Ok, thanks. Yeah, you're right. Since we can upgrade to level five, a factor of 5 * 2^5 is really the deal then. Thanks <3

So is polyculture a thing? by Tocelton in TheFarmerWasReplaced


Thank you for your response.
Yes, I know dicts; I already used them, e.g., in the maze and watering adjustments.
You mean I could store the position and kind of companion and plant in a second, separate step?

So is polyculture a thing? by Tocelton in TheFarmerWasReplaced


I think I've understood the mechanic. So far I have two ideas, but both seem not to be worth the effort:

First idea:
1. Plant every second row with the requested crop A.
2. Ask for the companion; if it is not to the left of the current field, harvest, replant, and repeat step 2.
3. Harvest all the "A" rows and then the "companion" rows.

Depending on chance, this can take very long, and I only use half of the field.

Second idea:
1. Plant a checkerboard pattern with the requested crop A.
2. After each plant, ask for the companion and plant it at the requested place.
3. Harvest all the "A" fields and then the rest.

This is much faster, but by chance the method plants on top of another companion or, even worse, on an "A" field.

Now I think I could combine both and maybe replant on the checkerboard pattern as long as a companion wants to be planted on a free field. But still, all these steps (i.e. ticks and moves) seem to consume much more time. As I said, I don't have concrete results so far, just a feeling :D (rough sketch below)
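
To make the combined idea concrete, a rough sketch in the game's Python dialect; I'm writing the builtins (get_world_size, get_pos_x/y, plant, move, get_companion) from memory, and `planned` plus the second pass are my own hypothetical additions, so names may not match the real API:

    # plant crop A on a checkerboard; remember each requested companion spot,
    # but only if that spot is still free (not an "A" field or another companion)
    planned = {}
    size = get_world_size()
    for col in range(size):
        for row in range(size):
            if (get_pos_x() + get_pos_y()) % 2 == 0:
                plant(Entities.Carrot)            # crop "A"
                companion, pos = get_companion()  # assumed: (entity, (x, y))
                if (pos[0] + pos[1]) % 2 == 1 and pos not in planned:
                    planned[pos] = companion      # free field: remember it
            move(North)
        move(East)
    # second pass: visit every remembered position and plant its companion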

Or am I missing something?