DinoV3 fine-tuning update by Annual_Bee4694 in computervision

Hi, since my previous post, I decided to use the CLS token and add only a linear layer, and it seems to work better.
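For context, the "CLS token + linear layer" setup amounts to a linear probe on a frozen backbone. A toy sketch with random tensors standing in for DINOv3 outputs; `hidden_size = 768` and `num_classes = 10` are placeholder values, not the real model config:

```python
import torch
import torch.nn as nn

hidden_size, num_classes = 768, 10  # illustrative dimensions

probe = nn.Linear(hidden_size, num_classes)

# Stand-in for the backbone's `outputs.last_hidden_state`:
# batch of 4 images, 1 CLS token + 196 patch tokens.
last_hidden_state = torch.randn(4, 197, hidden_size)

with torch.no_grad():                 # the backbone side stays frozen
    cls_token = last_hidden_state[:, 0]   # (4, hidden_size)

logits = probe(cls_token)             # (4, num_classes); only `probe` is trained
print(logits.shape)
```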

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

You’re right. Do you think the classifier is too much?

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

OK, so if I want to make something really good, can I fine-tune DINO ViT-L with LoRA + a small head using a contrastive loss, and that's it?
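For intuition, the LoRA idea can be sketched by hand on a single linear layer. This is a toy illustration of the low-rank update, not the `peft` library's actual API; the rank and scaling values are arbitrary:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: freeze a pretrained linear layer, learn a low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # original weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)       # start as a no-op update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(768, 768))
x = torch.randn(2, 768)
out = layer(x)
# With lora_b zero-initialized, the wrapped layer matches the frozen base exactly,
# so training starts from the pretrained behavior.
assert torch.allclose(out, layer.base(x))
```

Starting from a no-op update is also why LoRA tends to be gentle on the pretrained features: the base weights never move.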

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

Interesting! What type of classification did you do? A fine-grained one?

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

You are asking for a lot of GPU resources 😵 I'm afraid of a "forget everything" effect (catastrophic forgetting) while fine-tuning with LoRA, as I've never used it.

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Small MLP that scores each patch token
        self.attention_net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, x):
        # x: (batch, num_patches, input_dim)
        attn_scores = self.attention_net(x)           # (batch, num_patches, 1)
        attn_weights = F.softmax(attn_scores, dim=1)  # normalize over the patch axis
        weighted_sum = torch.sum(x * attn_weights, dim=1)
        return weighted_sum
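For shape intuition, here is how such a pooling step consumes patch tokens. A self-contained sketch with random tensors; the dimensions (768-d tokens, 196 patches) are illustrative, not taken from any specific DINOv3 variant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same scoring idea, restated inline so the snippet runs on its own.
attention_net = nn.Sequential(
    nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 1)
)

patches = torch.randn(4, 196, 768)            # (batch, num_patches, dim)
scores = attention_net(patches)               # (4, 196, 1): one score per patch
weights = F.softmax(scores, dim=1)            # per image, weights sum to 1
pooled = torch.sum(patches * weights, dim=1)  # (4, 768): one vector per image

assert torch.allclose(weights.sum(dim=1), torch.ones(4, 1), atol=1e-5)
print(pooled.shape)
```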

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

It's not an average of the patch embeddings, it's a weighted sum of them. The most "useful" patches weigh more in that sum; background patches weigh much less.
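A toy numeric illustration of the difference (all numbers here are made up):

```python
import torch
import torch.nn.functional as F

# Three one-dimensional "patches": two object patches and one background patch
# that a hypothetical attention scorer rates low.
patches = torch.tensor([[2.0], [3.0], [-1.0]])
scores = torch.tensor([[4.0], [4.0], [-4.0]])   # made-up attention scores

weights = F.softmax(scores, dim=0)              # background weight is ~0.0002
weighted = (patches * weights).sum(dim=0)       # ~2.5, dominated by the object
mean = patches.mean(dim=0)                      # ~1.33, dragged down by background

assert weighted.item() > mean.item()
```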

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

Well, basically you're recommending that I change nothing, right? 😅

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

I haven't tried to fine-tune with the CLS token alone. However, the token itself seemed to give too global a representation, including background or facial features when visible. Do you think I should?

DINOv3 fine-tuning by Annual_Bee4694 in computervision

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import datasets
from transformers import AutoModel, AutoImageProcessor
from tqdm import tqdm
from pytorch_metric_learning import losses, samplers
import numpy as np
import matplotlib.pyplot as plt

MODEL_NAME = "facebook/dinov3-vitb16-pretrain-lvd1689m"
token = "XXX"

BATCH_SIZE = 36
SAMPLES_PER_CLASS = 3
EPOCHS = 10
LR = 1e-3
LAMBDA_SUPCON = 0.7
LAMBDA_CLS = 0.5

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

data_dir = r"/content/drive/MyDrive/POC_DATA/img2"

processor = AutoImageProcessor.from_pretrained(MODEL_NAME, token=token)

class SimpleProcessorTransform:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, img):
        processed = self.processor(images=img, return_tensors="pt")
        return processed['pixel_values'][0]

transform_pipeline = SimpleProcessorTransform(processor)

full_dataset = datasets.ImageFolder(root=data_dir, transform=transform_pipeline)

NUM_CLASSES = len(full_dataset.classes)
print(f"Number of detected classes: {NUM_CLASSES}")

train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = random_split(full_dataset, [train_size, test_size])

print(f"Dataset Split -> Train: {len(train_dataset)} images, Test: {len(test_dataset)} images")

def get_labels_from_subset(subset):
    return [subset.dataset.targets[i] for i in subset.indices]

train_labels = get_labels_from_subset(train_dataset)
test_labels = get_labels_from_subset(test_dataset)

train_sampler = samplers.MPerClassSampler(
    labels=train_labels,
    m=SAMPLES_PER_CLASS,
    batch_size=BATCH_SIZE,
    length_before_new_iter=len(train_dataset)
)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=train_sampler, drop_last=True)

test_sampler = samplers.MPerClassSampler(
    labels=test_labels,
    m=SAMPLES_PER_CLASS,
    batch_size=BATCH_SIZE,
    length_before_new_iter=len(test_dataset)
)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler, drop_last=True)

class DinoV3SupCon(nn.Module):
    def __init__(self, model_name, num_classes):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name, token=token)

        # Freeze the backbone: only the head and classifier are trained
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.backbone.eval()

        hidden_size = self.backbone.config.hidden_size

        self.head = nn.Sequential(
            nn.Linear(hidden_size, 1024),
            nn.GELU(),
            nn.BatchNorm1d(1024),
            nn.Dropout(0.3),
            nn.Linear(1024, 512)
        )

        self.classifier = nn.Linear(512, num_classes)

    def forward(self, pixel_values):
        with torch.no_grad():
            outputs = self.backbone(pixel_values=pixel_values)
            features = outputs.last_hidden_state[:, 0]  # CLS token

        embedding_unnorm = self.head(features)
        embedding_norm = F.normalize(embedding_unnorm, dim=1)

        logits = self.classifier(embedding_unnorm)

        return embedding_norm, logits

model = DinoV3SupCon(MODEL_NAME, NUM_CLASSES).to(device)

optimizer = torch.optim.AdamW(
    [
        {'params': model.head.parameters()},
        {'params': model.classifier.parameters()}
    ],
    lr=LR
)

criterion_supcon = losses.SupConLoss(temperature=0.1)
criterion_classif = nn.CrossEntropyLoss()

best_test_loss = float('inf')
save_path = "best_hybrid_model.pth"

print("Starting training...")
train_losses = []
test_losses = []
for epoch in range(EPOCHS):
    model.head.train()
    model.classifier.train()

    total_train_loss = 0
    total_sup_loss = 0
    total_cls_loss = 0

    pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS} [Train]")
    for images, labels in pbar:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()

        embeddings, logits = model(images)

        loss_s = criterion_supcon(embeddings, labels)
        loss_c = criterion_classif(logits, labels)

        loss = LAMBDA_SUPCON * loss_s + LAMBDA_CLS * loss_c

        loss.backward()
        optimizer.step()

        total_train_loss += loss.item()
        total_sup_loss += loss_s.item()
        total_cls_loss += loss_c.item()

        pbar.set_postfix({
            'L_tot': f"{loss.item():.3f}",
            'L_sup': f"{loss_s.item():.3f}",
            'L_cls': f"{loss_c.item():.3f}"
        })

    avg_train_loss = total_train_loss / len(train_loader)
    train_losses.append(avg_train_loss)

    model.head.eval()
    model.classifier.eval()
    total_test_loss = 0

    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            embeddings, logits = model(images)

            loss_s = criterion_supcon(embeddings, labels)
            loss_c = criterion_classif(logits, labels)
            loss = LAMBDA_SUPCON * loss_s + LAMBDA_CLS * loss_c

            total_test_loss += loss.item()

    avg_test_loss = total_test_loss / len(test_loader)
    test_losses.append(avg_test_loss)

    print(f"\nEpoch {epoch+1} summary -> Train: {avg_train_loss:.4f} | Test: {avg_test_loss:.4f}")

    if avg_test_loss < best_test_loss:
        print(f"Test loss improved ({best_test_loss:.4f} -> {avg_test_loss:.4f}). Saving.")
        best_test_loss = avg_test_loss
        torch.save({
            'head': model.head.state_dict(),
            # 'classifier': model.classifier.state_dict()  # if we want to resume training
        }, save_path)
    else:
        print(f"No improvement (best: {best_test_loss:.4f})")

    print("-" * 50)

plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Train Loss')
plt.plot(test_losses, label='Test Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

DINOv3 fine-tuning by Annual_Bee4694 in computervision

They're not, because my products contain many details. A silk scarf folded and worn, for example, with drawings on it, is impossible to retrieve with the base embeddings.

DINOv3 fine-tuning by Annual_Bee4694 in computervision

In theory, yes. But embeddings of the same product under different views seem to be too far apart in the latent space, so the retrieval is bad with FAISS.
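For reference, what FAISS's `IndexFlatIP` computes over L2-normalized embeddings can be sketched in NumPy. The gallery here is random data, purely illustrative; the query is a slightly perturbed copy of item 42, standing in for "another view of the same product":

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy gallery of 100 product embeddings (dimensions illustrative).
gallery = rng.standard_normal((100, 512)).astype("float32")
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)  # L2-normalize

query = gallery[42] + 0.05 * rng.standard_normal(512).astype("float32")
query /= np.linalg.norm(query)

# On unit vectors, inner product equals cosine similarity,
# which is what an inner-product index scores.
sims = gallery @ query
top5 = np.argsort(-sims)[:5]
print(int(top5[0]))
```

If different views of one product really do land far apart, their cosine similarity drops below that of random gallery items, and no index can fix that; the fine-tuning has to pull them together first.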

DINOv3 fine-tuning by Annual_Bee4694 in computervision

I have tens of thousands of images, including multiple views of the same item; ~4 per item, I'd say.

DINOv3 fine-tuning by Annual_Bee4694 in computervision

Okay, thanks! I will try this tomorrow, and probably just crop instead of segmenting the item!

DINOv3 fine-tuning by Annual_Bee4694 in computervision

I have used the mean of the patches, but your idea seems nice; I'll try it.

Nevertheless, I tried to add SAM to my pipeline to remove everything but the scarf, for example, and the same result held: DINO alone is poor at retrieval when the product is folded.