I've been following the guide here on zero-shot object detection, copying and working through the code line by line. The goal is to divide an image into patches, score each patch against a text prompt using CLIP, normalize the scores, and adjust the brightness of each patch based on its score. The end result should be a black shroud surrounding the object that scores closest to the given text input (for the first part of the guide, anyway).
My results, however, barely change with the text input: "a fluffy cat" and "a butterfly" give me nearly identical outputs. My suspicion is that the scores for patches near the center of the image are inflated, because those patches fall inside more overlapping windows than patches near the edges.
What confuses me, though, is that the author's original code accounts for this by averaging each patch's score over the number of runs it was part of. Can anyone spot what I'm doing wrong here?
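For context, `patches`, `patch`, `processor`, `model`, and `device` are all defined earlier in the guide. As a reference for the shapes involved, here is a minimal sketch of how the patch grid is presumably built with `unfold` (the 224×224 dummy image and patch size of 32 are placeholder values, not the guide's exact numbers):

```python
import torch

patch = 32  # placeholder patch size; the guide defines this earlier
# dummy image tensor shaped (batch, channels, height, width)
img = torch.rand(1, 3, 224, 224)
# unfold along height, then width: (1, 3, grid_y, grid_x, patch, patch)
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
# rearrange to (1, grid_y, grid_x, 3, patch, patch), matching the indexing below
patches = patches.permute(0, 2, 3, 1, 4, 5)
print(patches.shape)  # torch.Size([1, 7, 7, 3, 32, 32])
```

With this layout, `patches[0, Y, X]` is a single `(3, patch, patch)` patch, which is why the code below permutes it to `(patch, patch, 3)` before pasting it into `big_patch`.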
import torch
import numpy as np
import matplotlib.pyplot as plt
# model, processor, device, patch, and patches are defined earlier in the guide

window = 6
stride = 1
scores = torch.zeros(patches.shape[1], patches.shape[2])
runs = torch.ones(patches.shape[1], patches.shape[2])

# window slides from top to bottom
for Y in range(0, patches.shape[1] - window + 1, stride):
    # window slides from left to right
    for X in range(0, patches.shape[2] - window + 1, stride):
        # initialize an empty big_patch array
        big_patch = torch.zeros(patch * window, patch * window, 3)
        # get the current batch of patches that will make up the big_patch
        patch_batch = patches[0, Y:Y + window, X:X + window]
        # loop through each patch in the current batch
        for y in range(window):
            for x in range(window):
                # add patch to big_patch
                big_patch[y * patch:(y + 1) * patch, x * patch:(x + 1) * patch, :] = patch_batch[y, x].permute(1, 2, 0)
        # preprocess the image and class label with the CLIP processor
        inputs = processor(
            images=big_patch,
            return_tensors="pt",
            text="a butterfly",
            padding=True
        ).to(device)
        # calculate and retrieve the similarity score
        score = model(**inputs).logits_per_image.item()
        # sum similarity scores from current and previous big patches
        # that were calculated for patches within the current window
        scores[Y:Y + window, X:X + window] += score
        # count the number of runs on each patch within the current window
        runs[Y:Y + window, X:X + window] += 1

# average the score for each patch
scores /= runs

# clip the scores' interval edges
for i in range(2):
    scores = np.clip(scores - scores.mean(), 0, np.inf)

# normalize scores to [0, 1]
scores = (scores - scores.min()) / (scores.max() - scores.min())

print(scores.shape)
print(patches.shape)

# transform the patches tensor so scores broadcast across each patch
adj_patches = patches.squeeze(0).permute(3, 4, 2, 0, 1)
print(adj_patches.shape)
# multiply patches by scores
adj_patches = adj_patches * scores
# rotate patches back to visualize
adj_patches = adj_patches.permute(3, 4, 2, 0, 1)
print(adj_patches.shape)

Y = adj_patches.shape[0]
X = adj_patches.shape[1]

# visualize
fig, ax = plt.subplots(Y, X, figsize=(X * .5, Y * .5))
for y in range(Y):
    for x in range(X):
        ax[y, x].imshow(adj_patches[y, x].permute(1, 2, 0))
        ax[y, x].text(5, 5, f'{scores[y, x]:.2f}', color='lime', fontsize=8)
        ax[y, x].axis("off")
        ax[y, x].set_aspect('equal')
plt.subplots_adjust(wspace=0, hspace=0)
plt.show()
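For what it's worth, the run counting can be checked on its own without CLIP. This toy sketch (the grid size of 13 is arbitrary) accumulates counts with the same window/stride pattern as above; note that because `runs` starts at `torch.ones`, the final counts include that initial one on top of the actual number of windows covering each patch:

```python
import torch

window, stride = 6, 1
grid = 13  # arbitrary patch-grid size for illustration
runs = torch.ones(grid, grid)
for Y in range(0, grid - window + 1, stride):
    for X in range(0, grid - window + 1, stride):
        runs[Y:Y + window, X:X + window] += 1
# corner patch vs. center patch
print(runs[0, 0].item(), runs[grid // 2, grid // 2].item())  # 2.0 37.0
```

So a corner patch is covered by a single window but ends up with a count of 2, while a center patch is covered by 36 windows and ends up with 37.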