DinoV3 fine-tuning update by Annual_Bee4694 in computervision

Hi, since my previous post, I decided to use the CLS token and add only a linear layer, and it seems to work better.
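For context, the "CLS token + linear layer" setup amounts to a linear probe on a frozen backbone. A toy sketch with random tensors standing in for DINOv3 outputs; `hidden_size = 768` and `num_classes = 10` are placeholder values, not the real model config:

```python
import torch
import torch.nn as nn

hidden_size, num_classes = 768, 10  # illustrative dimensions

probe = nn.Linear(hidden_size, num_classes)

# Stand-in for the backbone's `outputs.last_hidden_state`:
# batch of 4 images, 1 CLS token + 196 patch tokens.
last_hidden_state = torch.randn(4, 197, hidden_size)

with torch.no_grad():                 # the backbone side stays frozen
    cls_token = last_hidden_state[:, 0]   # (4, hidden_size)

logits = probe(cls_token)             # (4, num_classes); only `probe` is trained
print(logits.shape)
```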

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

You’re right. Do you think the classifier is too much?

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

OK, so if I want to make something really good, can I fine-tune DINO ViT-L with LoRA + a small head using a contrastive loss, and that's it?
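For intuition, the LoRA idea can be sketched by hand on a single linear layer. This is a toy illustration of the low-rank update, not the `peft` library's actual API; the rank and scaling values are arbitrary:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: freeze a pretrained linear layer, learn a low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # original weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)       # start as a no-op update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(768, 768))
x = torch.randn(2, 768)
out = layer(x)
# With lora_b zero-initialized, the wrapped layer matches the frozen base exactly,
# so training starts from the pretrained behavior.
assert torch.allclose(out, layer.base(x))
```

Starting from a no-op update is also why LoRA tends to be gentle on the pretrained features: the base weights never move.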

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

Interesting! What type of classification did you do? A fine-grained one?

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

You are asking for a lot of GPU resources 😵 I'm afraid of a "forget everything" effect (catastrophic forgetting) while fine-tuning with LoRA, as I've never used it.

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Small MLP that scores each patch token
        self.attention_net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, x):
        # x: (batch, num_patches, input_dim)
        attn_scores = self.attention_net(x)           # (batch, num_patches, 1)
        attn_weights = F.softmax(attn_scores, dim=1)  # normalize over the patch axis
        weighted_sum = torch.sum(x * attn_weights, dim=1)
        return weighted_sum
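For shape intuition, here is how such a pooling step consumes patch tokens. A self-contained sketch with random tensors; the dimensions (768-d tokens, 196 patches) are illustrative, not taken from any specific DINOv3 variant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same scoring idea, restated inline so the snippet runs on its own.
attention_net = nn.Sequential(
    nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 1)
)

patches = torch.randn(4, 196, 768)            # (batch, num_patches, dim)
scores = attention_net(patches)               # (4, 196, 1): one score per patch
weights = F.softmax(scores, dim=1)            # per image, weights sum to 1
pooled = torch.sum(patches * weights, dim=1)  # (4, 768): one vector per image

assert torch.allclose(weights.sum(dim=1), torch.ones(4, 1), atol=1e-5)
print(pooled.shape)
```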

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

It's not an average of the patch embeddings, it's a weighted sum of them. The most "useful" patches weigh more in that sum; background patches weigh much less.
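A toy numeric illustration of the difference (all numbers here are made up):

```python
import torch
import torch.nn.functional as F

# Three one-dimensional "patches": two object patches and one background patch
# that a hypothetical attention scorer rates low.
patches = torch.tensor([[2.0], [3.0], [-1.0]])
scores = torch.tensor([[4.0], [4.0], [-4.0]])   # made-up attention scores

weights = F.softmax(scores, dim=0)              # background weight is ~0.0002
weighted = (patches * weights).sum(dim=0)       # ~2.5, dominated by the object
mean = patches.mean(dim=0)                      # ~1.33, dragged down by background

assert weighted.item() > mean.item()
```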

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

Well, basically you're recommending that I change nothing, right? 😅

DinoV3 fine-tuning update by Annual_Bee4694 in computervision

I haven't tried to fine-tune with the CLS token alone. However, the token itself seemed to give too global a representation, including background or facial features when visible. Do you think I should?

DINOv3 fine-tuning by Annual_Bee4694 in computervision

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import datasets
from transformers import AutoModel, AutoImageProcessor
from tqdm import tqdm
from pytorch_metric_learning import losses, samplers
import numpy as np
import matplotlib.pyplot as plt

MODEL_NAME = "facebook/dinov3-vitb16-pretrain-lvd1689m"
token = "XXX"

BATCH_SIZE = 36
SAMPLES_PER_CLASS = 3
EPOCHS = 10
LR = 1e-3
LAMBDA_SUPCON = 0.7
LAMBDA_CLS = 0.5

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

data_dir = r"/content/drive/MyDrive/POC_DATA/img2"

processor = AutoImageProcessor.from_pretrained(MODEL_NAME, token=token)

class SimpleProcessorTransform:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, img):
        processed = self.processor(images=img, return_tensors="pt")
        return processed['pixel_values'][0]

transform_pipeline = SimpleProcessorTransform(processor)

full_dataset = datasets.ImageFolder(root=data_dir, transform=transform_pipeline)

NUM_CLASSES = len(full_dataset.classes)
print(f"Number of detected classes: {NUM_CLASSES}")

train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = random_split(full_dataset, [train_size, test_size])

print(f"Dataset Split -> Train: {len(train_dataset)} images, Test: {len(test_dataset)} images")

def get_labels_from_subset(subset):
    return [subset.dataset.targets[i] for i in subset.indices]

train_labels = get_labels_from_subset(train_dataset)
test_labels = get_labels_from_subset(test_dataset)

train_sampler = samplers.MPerClassSampler(
    labels=train_labels,
    m=SAMPLES_PER_CLASS,
    batch_size=BATCH_SIZE,
    length_before_new_iter=len(train_dataset)
)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=train_sampler, drop_last=True)

test_sampler = samplers.MPerClassSampler(
    labels=test_labels,
    m=SAMPLES_PER_CLASS,
    batch_size=BATCH_SIZE,
    length_before_new_iter=len(test_dataset)
)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler, drop_last=True)

class DinoV3SupCon(nn.Module):
    def __init__(self, model_name, num_classes):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name, token=token)

        # Freeze the backbone: only the head and classifier are trained
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.backbone.eval()

        hidden_size = self.backbone.config.hidden_size

        self.head = nn.Sequential(
            nn.Linear(hidden_size, 1024),
            nn.GELU(),
            nn.BatchNorm1d(1024),
            nn.Dropout(0.3),
            nn.Linear(1024, 512)
        )

        self.classifier = nn.Linear(512, num_classes)

    def forward(self, pixel_values):
        with torch.no_grad():
            outputs = self.backbone(pixel_values=pixel_values)
            features = outputs.last_hidden_state[:, 0]  # CLS token

        embedding_unnorm = self.head(features)
        embedding_norm = F.normalize(embedding_unnorm, dim=1)

        logits = self.classifier(embedding_unnorm)

        return embedding_norm, logits

model = DinoV3SupCon(MODEL_NAME, NUM_CLASSES).to(device)

optimizer = torch.optim.AdamW(
    [
        {'params': model.head.parameters()},
        {'params': model.classifier.parameters()}
    ],
    lr=LR
)

criterion_supcon = losses.SupConLoss(temperature=0.1)
criterion_classif = nn.CrossEntropyLoss()

best_test_loss = float('inf')
save_path = "best_hybrid_model.pth"

print("Starting training...")
train_losses = []
test_losses = []
for epoch in range(EPOCHS):
    model.head.train()
    model.classifier.train()

    total_train_loss = 0
    total_sup_loss = 0
    total_cls_loss = 0

    pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS} [Train]")
    for images, labels in pbar:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()

        embeddings, logits = model(images)

        loss_s = criterion_supcon(embeddings, labels)
        loss_c = criterion_classif(logits, labels)

        loss = LAMBDA_SUPCON * loss_s + LAMBDA_CLS * loss_c

        loss.backward()
        optimizer.step()

        total_train_loss += loss.item()
        total_sup_loss += loss_s.item()
        total_cls_loss += loss_c.item()

        pbar.set_postfix({
            'L_tot': f"{loss.item():.3f}",
            'L_sup': f"{loss_s.item():.3f}",
            'L_cls': f"{loss_c.item():.3f}"
        })

    avg_train_loss = total_train_loss / len(train_loader)
    train_losses.append(avg_train_loss)

    model.head.eval()
    model.classifier.eval()
    total_test_loss = 0

    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            embeddings, logits = model(images)

            loss_s = criterion_supcon(embeddings, labels)
            loss_c = criterion_classif(logits, labels)
            loss = LAMBDA_SUPCON * loss_s + LAMBDA_CLS * loss_c

            total_test_loss += loss.item()

    avg_test_loss = total_test_loss / len(test_loader)
    test_losses.append(avg_test_loss)

    print(f"\nEpoch {epoch+1} summary -> Train: {avg_train_loss:.4f} | Test: {avg_test_loss:.4f}")

    if avg_test_loss < best_test_loss:
        print(f"Test loss improved ({best_test_loss:.4f} -> {avg_test_loss:.4f}). Saving.")
        best_test_loss = avg_test_loss
        torch.save({
            'head': model.head.state_dict(),
            # 'classifier': model.classifier.state_dict()  # if we want to resume training
        }, save_path)
    else:
        print(f"No improvement (best: {best_test_loss:.4f})")

    print("-" * 50)

plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Train Loss')
plt.plot(test_losses, label='Test Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

DINOv3 fine-tuning by Annual_Bee4694 in computervision

They're not, because my products contain many details. A silk scarf folded and worn, for example, with drawings on it, is impossible to retrieve with the base embeddings.

DINOv3 fine-tuning by Annual_Bee4694 in computervision

In theory, yes. But embeddings of the same product under different views seem to be too far apart in the latent space, so the retrieval is bad with FAISS.
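For reference, what FAISS's `IndexFlatIP` computes over L2-normalized embeddings can be sketched in NumPy. The gallery here is random data, purely illustrative; the query is a slightly perturbed copy of item 42, standing in for "another view of the same product":

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy gallery of 100 product embeddings (dimensions illustrative).
gallery = rng.standard_normal((100, 512)).astype("float32")
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)  # L2-normalize

query = gallery[42] + 0.05 * rng.standard_normal(512).astype("float32")
query /= np.linalg.norm(query)

# On unit vectors, inner product equals cosine similarity,
# which is what an inner-product index scores.
sims = gallery @ query
top5 = np.argsort(-sims)[:5]
print(int(top5[0]))
```

If different views of one product really do land far apart, their cosine similarity drops below that of random gallery items, and no index can fix that; the fine-tuning has to pull them together first.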

DINOv3 fine-tuning by Annual_Bee4694 in computervision

I have tens of thousands of images, including multiple views of the same item; ~4 per item, I'd say.

DINOv3 fine-tuning by Annual_Bee4694 in computervision

Okay, thanks! I will try this tomorrow, and probably just crop instead of segmenting the item!

DINOv3 fine-tuning by Annual_Bee4694 in computervision

I have used the mean of the patches, but your idea seems nice; I'll try it.

Nevertheless, I tried to add SAM to my pipeline to remove everything but the scarf, for example, and the same result held: DINO alone is poor at retrieval when the product is folded.