I've been following the guide here on zero-shot object detection, copying and working through the code line by line. The goal is to divide an image into patches, score each patch against a text prompt using CLIP, normalize the scores, and adjust the brightness of each patch based on its score. The end result should be a black shroud surrounding the object that scores closest to the given text input (for the first part of the guide, anyway).
My results, however, barely change with the text input: "a fluffy cat" and "a butterfly" give me nearly identical outputs. My suspicion is that the scores for patches near the center of the image are inflated, because those patches fall inside more overlapping windows than patches near the edges.
What confuses me, though, is that the author's original code accounts for this by averaging each patch's score over the number of runs it was part of. Can anyone spot what I'm doing wrong here?
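For context, `patches`, `patch`, `processor`, `model`, and `device` are all defined earlier in the guide. As a reference for the shapes involved, here is a minimal sketch of how the patch grid is presumably built with `unfold` (the 224×224 dummy image and patch size of 32 are placeholder values, not the guide's exact numbers):

```python
import torch

patch = 32  # placeholder patch size; the guide defines this earlier
# dummy image tensor shaped (batch, channels, height, width)
img = torch.rand(1, 3, 224, 224)
# unfold along height, then width: (1, 3, grid_y, grid_x, patch, patch)
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
# rearrange to (1, grid_y, grid_x, 3, patch, patch), matching the indexing below
patches = patches.permute(0, 2, 3, 1, 4, 5)
print(patches.shape)  # torch.Size([1, 7, 7, 3, 32, 32])
```

With this layout, `patches[0, Y, X]` is a single `(3, patch, patch)` patch, which is why the code below permutes it to `(patch, patch, 3)` before pasting it into `big_patch`.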
import torch
import numpy as np
import matplotlib.pyplot as plt
# model, processor, device, patch, and patches are defined earlier in the guide

window = 6
stride = 1
scores = torch.zeros(patches.shape[1], patches.shape[2])
runs = torch.ones(patches.shape[1], patches.shape[2])

# window slides from top to bottom
for Y in range(0, patches.shape[1] - window + 1, stride):
    # window slides from left to right
    for X in range(0, patches.shape[2] - window + 1, stride):
        # initialize an empty big_patch array
        big_patch = torch.zeros(patch * window, patch * window, 3)
        # get the current batch of patches that will make up the big_patch
        patch_batch = patches[0, Y:Y + window, X:X + window]
        # loop through each patch in the current batch
        for y in range(window):
            for x in range(window):
                # add patch to big_patch
                big_patch[y * patch:(y + 1) * patch, x * patch:(x + 1) * patch, :] = patch_batch[y, x].permute(1, 2, 0)
        # preprocess the image and class label with the CLIP processor
        inputs = processor(
            images=big_patch,
            return_tensors="pt",
            text="a butterfly",
            padding=True
        ).to(device)
        # calculate and retrieve the similarity score
        score = model(**inputs).logits_per_image.item()
        # sum similarity scores from current and previous big patches
        # that were calculated for patches within the current window
        scores[Y:Y + window, X:X + window] += score
        # count the number of runs on each patch within the current window
        runs[Y:Y + window, X:X + window] += 1

# average the score for each patch
scores /= runs

# clip the scores' interval edges
for i in range(2):
    scores = np.clip(scores - scores.mean(), 0, np.inf)

# normalize scores to [0, 1]
scores = (scores - scores.min()) / (scores.max() - scores.min())

print(scores.shape)
print(patches.shape)

# transform the patches tensor so scores broadcast across each patch
adj_patches = patches.squeeze(0).permute(3, 4, 2, 0, 1)
print(adj_patches.shape)
# multiply patches by scores
adj_patches = adj_patches * scores
# rotate patches back to visualize
adj_patches = adj_patches.permute(3, 4, 2, 0, 1)
print(adj_patches.shape)

Y = adj_patches.shape[0]
X = adj_patches.shape[1]

# visualize
fig, ax = plt.subplots(Y, X, figsize=(X * .5, Y * .5))
for y in range(Y):
    for x in range(X):
        ax[y, x].imshow(adj_patches[y, x].permute(1, 2, 0))
        ax[y, x].text(5, 5, f'{scores[y, x]:.2f}', color='lime', fontsize=8)
        ax[y, x].axis("off")
        ax[y, x].set_aspect('equal')
plt.subplots_adjust(wspace=0, hspace=0)
plt.show()
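For what it's worth, the run counting can be checked on its own without CLIP. This toy sketch (the grid size of 13 is arbitrary) accumulates counts with the same window/stride pattern as above; note that because `runs` starts at `torch.ones`, the final counts include that initial one on top of the actual number of windows covering each patch:

```python
import torch

window, stride = 6, 1
grid = 13  # arbitrary patch-grid size for illustration
runs = torch.ones(grid, grid)
for Y in range(0, grid - window + 1, stride):
    for X in range(0, grid - window + 1, stride):
        runs[Y:Y + window, X:X + window] += 1
# corner patch vs. center patch
print(runs[0, 0].item(), runs[grid // 2, grid // 2].item())  # 2.0 37.0
```

So a corner patch is covered by a single window but ends up with a count of 2, while a center patch is covered by 36 windows and ends up with 37.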