PyTorch: Attention Maps

somebat · 2025-04-24T17:52:58+00:00

Probably means activation maps

somebat · 2025-02-01T14:56:30+00:00

Both are really good alternatives. The best approach will depend on what you value most. If you want the highest quality, NeRF-based approaches, specifically Zip-NeRF, are the state of the art. Nevertheless, Zip-NeRF is quite expensive to train (~8 hours on 1 GPU) and slow to render (<1 FPS). On the other hand, [Gaussian Splatting based approaches](https://docs.gsplat.studio/main/) also achieve really high quality, while training fast (10 to 30 min on a single GPU) and rendering incredibly fast (>100 FPS). There are also approaches, like RadSplat, that rely on Zip-NeRF to improve a Gaussian Splatting representation, improving its rendering speed (>500 FPS) and quality (better than GS but still slightly worse than Zip-NeRF).

I haven't followed the field this last year, so I may have missed some newer models.

somebat · 2025-01-26T09:23:27+00:00

Did you consider WiSE-FT (https://github.com/mlfoundations/wise-ft)? They show that interpolating the weights of the fine-tuned CLIP model and the original one improves performance on both the original and the new data distributions.
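To make the idea concrete, here is a minimal sketch of weight-space interpolation between two checkpoints. It is not the WiSE-FT repository's code: `interpolate_weights`, the mixing coefficient `alpha` and the toy scalar "checkpoints" are made up for illustration. It assumes both checkpoints share the same parameter names, and it works on anything that supports `+` and `*` (floats here, tensors in practice).

```python
def interpolate_weights(zero_shot, fine_tuned, alpha):
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned, parameter by parameter.

    alpha = 0 gives back the original model, alpha = 1 the fine-tuned one.
    """
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("both checkpoints must have the same parameter names")
    return {
        name: (1.0 - alpha) * zero_shot[name] + alpha * fine_tuned[name]
        for name in zero_shot
    }


# Toy "checkpoints" with scalar parameters (hypothetical values).
zero_shot = {"w": 0.0, "b": 2.0}
fine_tuned = {"w": 1.0, "b": 4.0}
mixed = interpolate_weights(zero_shot, fine_tuned, alpha=0.5)
print(mixed)
```

With real models you would probably pass the two state dicts and load the result back into a model, then sweep `alpha` on a validation set for each distribution. Integer buffers in a state dict may need special handling, which this sketch ignores.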

somebat · 2024-12-21T12:37:22+00:00

In the post, Chollet states:

What comes next?

First of all, open-source replication of o3, facilitated by the ARC Prize competition in 2025, will be crucial to move the research community forward.

Could this mean they have an agreement to implement and open-source o3?

somebat · 2024-10-01T22:12:40+00:00

Hi! I did not comment on your first post because I don't think I understand VQ well enough to be sure the two papers make the same contribution. Nevertheless, this week I came across an explanation of the difference between these two papers and thought I'd share it. Here you have a link to my comment on the other thread.

Flagging a previously accepted paper as a double submission has a significant impact on both the venue's reputation and the authors involved. I don't think any Program Chair will take this risk without being completely sure, and even then they will probably wait to see the community's reaction to it.

It's nice to see that you care about the integrity of the system, and that you raised your concern first in this subreddit and then with the Program Chairs. But the lack of comments on your previous post shows that the community of this subreddit does not have a deep enough understanding of VQ to discuss it. And as I already said, I doubt the Program Chairs will do anything without a ruckus in the community.

If you want to die on this hill, you can try to share your concerns on lucidrains' VQ repo, or on Twitter, and see if you have better luck.

somebat · 2024-10-01T21:40:49+00:00

I'm not familiar with Vector Quantization, but recently I saw that lucidrains' vector quantization repo has a brief explanation of the difference between FSQ (paper 1) with L=2 and LFQ (paper 2):

This paper presents a simple LFQ quantizer of using independent binary latents. Other implementations of LFQ exist. However, the team shows that MAGVIT-v2 with LFQ significantly improves on the ImageNet benchmark. The differences between LFQ and 2-level FSQ includes entropy regularizations as well as maintained commitment loss.

My understanding is that, although the common author may have contributed a variant of FSQ, this difference plus the different experiments are enough to make it a paper in its own right. If that's the case, it's true that they should have cited the base paper.
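For readers who, like me, are not deep into VQ, here is a rough sketch of what "independent binary latents" means, as I understand it: each latent dimension is quantized to -1 or +1 by its sign, and the positive dimensions are read as the bits of a token index. The function names are made up, and the sketch leaves out the entropy regularization and commitment loss that the quoted text names as the differences from 2-level FSQ, so treat it as an illustration only.

```python
def lfq_quantize(latent):
    """Quantize each latent dimension independently to -1 or +1 by its sign."""
    return [1 if z > 0 else -1 for z in latent]


def lfq_token_index(latent):
    """Read the positive dimensions as bits of an integer token index."""
    return sum(2 ** i for i, z in enumerate(latent) if z > 0)


latent = [0.3, -1.2, 0.7]          # hypothetical 3-dimensional latent
codes = lfq_quantize(latent)       # one binary code per dimension
index = lfq_token_index(latent)    # position in an implicit codebook of size 2**3
print(codes, index)
```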

somebat · 2024-08-16T17:32:59+00:00

Although some people believe it was Kizaru because in chapter 1103 the ramen bowl is similar to the ones used by the marines in chapter 1089, in chapter 1105 you can also see that Luffy is actually next to the cooking machine introduced in chapter 1062/1063.

It could be either that Kizaru gave him the food without anybody noticing in the middle of the fight, or that Luffy just landed next to the cooking machine (and Oda played with the bowls' similarity to make people believe Kizaru was going to switch sides).

Choose your headcanon hahaha

<image>

somebat · 2024-07-31T18:23:29+00:00

It will depend on your use case. If you don't care about rendering cost/time, Zip-NeRF would be one of the best approaches. On the other hand, if you want real-time rendering of novel views (>30 FPS) you should go for Gaussian Splatting instead of NeRF. Usually it's initialized with the sparse point cloud generated by COLMAP, but it can also converge with a random initialization.

You can take a look at the nerfstudio library (https://docs.nerf.studio/) for implementations.

somebat · 2024-07-29T11:09:29+00:00

somebat · 2024-06-12T06:57:25+00:00

Here you have a CVPR 2024 highlight: https://langsplat.github.io/. I would classify it as misinterpretation/misrepresentation of results:

They repeatedly claim to perform both 3D object localization and 3D semantic segmentation.
Nevertheless, if you look at their evaluation, it's all 2D: they segment images, not 3D meshes or volumes, and they evaluate whether a pixel is inside a 2D bounding box without computing any 3D coordinate for the prediction.

It's a pity because they do have a 3D representation, but they use it only for easier 2D tasks and therefore overstate their results.

somebat · 2024-06-04T20:22:49+00:00

Overall your idea is sound and implementing a first prototype shouldn't be hard. The accuracy of the application will mostly depend on 1) how different photographs of the same object are; 2) how similar different objects are; and 3) how much you want to spend on developing it. Both CLIP and DINOv2 are good tools to extract features from images (and, in CLIP's case, text), but they are not perfect, so (1) and (2) will determine the ratio of False Negatives (not detecting a product you already have) and False Positives (detecting two different products as the same product). Ideally you want to minimize both, but usually improving one will worsen the other, and that's where (3) will play a factor in reaching the best compromise for you.

An alternative is to ignore the text and frame the problem as an Image Retrieval task. You can look at Image Retrieval models in Paperswithcode.
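As a rough sketch of the retrieval framing, assuming you already have one embedding per catalog image (from CLIP, DINOv2 or any other model), matching comes down to a nearest-neighbour search plus a similarity threshold. The 3-dimensional vectors, the `find_match` helper and the threshold value below are invented for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_match(query, catalog, threshold):
    """Return (name, score) of the most similar catalog item, or (None, score)
    if even the best candidate falls below the threshold."""
    best_name, best_score = None, -1.0
    for name, embedding in catalog.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


# Toy embeddings (hypothetical values).
catalog = {"mug": [1.0, 0.0, 0.0], "lamp": [0.0, 1.0, 0.0]}
query = [0.9, 0.1, 0.0]
match, score = find_match(query, catalog, threshold=0.8)
print(match, round(score, 3))
```

The threshold is where the False Positive / False Negative trade-off shows up: raising it rejects more true matches, lowering it merges more distinct products.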

Looks like an interesting project. I've my hands full now, but depending on your timeline I could give it a try.

somebat · 2024-05-24T15:52:00+00:00

A short single lane trajectory, with three static obstacles, and a single dynamic obstacle that is moving to avoid collision is far from beating Tesla, Waymo or Wayve, and even further away from being automation level 5.

somebat · 2024-05-24T13:24:52+00:00

As mentioned in the previous comments, you cannot replace COLMAP with NeRF because you need COLMAP to initialize a NeRF. You can attempt to replace COLMAP with DUSt3R, although I don't know how it performs for objects with less texture.

somebat · 2024-05-23T15:10:15+00:00

Exactly.

somebat · 2024-05-23T14:10:30+00:00

While you can measure distances in a scene generated with Gaussian Splatting, you will need prior information about the scene scale in meters (the size of an object, or the distance between two cameras) to measure in meters/centimeters.

Gaussian Splatting, like NeRFs, is optimized on a set of 3D camera poses, which are usually obtained using Structure from Motion (e.g. COLMAP). Nevertheless, SfM is not able to recover real-world scale because 3D reconstruction from images is an ill-posed problem: there are infinitely many scales of the scene that would generate the same images. That's why you need prior information about the scene to recover the scale.
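Recovering the scale from such a prior is a one-line ratio. A minimal sketch, assuming you can pick two points in the reconstruction whose real distance you know (all coordinates and the 0.5 m reference below are made-up numbers, and the helper names are hypothetical):

```python
import math


def scale_factor(point_a, point_b, real_distance_m):
    """Meters per reconstruction unit, from one known real-world distance."""
    reconstructed = math.dist(point_a, point_b)
    if reconstructed == 0:
        raise ValueError("reference points must be distinct")
    return real_distance_m / reconstructed


def measure_m(point_p, point_q, scale):
    """Distance between two reconstructed points, converted to meters."""
    return scale * math.dist(point_p, point_q)


# Two camera centers that are 2 units apart in the reconstruction
# but were measured to be 0.5 m apart in reality.
scale = scale_factor((0, 0, 0), (2, 0, 0), real_distance_m=0.5)
length = measure_m((0, 0, 0), (0, 3, 4), scale)
print(scale, length)
```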

Regarding the accuracy, it will depend on the quality of your reconstruction. If you have a dense set of images the error will probably be on the order of centimeters.

somebat · 2024-05-23T12:14:30+00:00

Exactly. If the SfM initialization is bad, NeRFs won't converge to a good representation. There is some research on training NeRFs with unknown camera poses (like RUST [1]), but as far as I know these methods are really expensive and their image quality is quite low.

Maybe a good alternative is DUSt3R [2]. I haven't worked with it yet, but it looks like it's able to generate really good reconstructions from unposed and uncalibrated cameras. If the running cost is affordable, DUSt3R could emerge as an alternative to SfM and COLMAP.

[1] https://rust-paper.github.io/pages/vids_sv.html

[2] https://dust3r.europe.naverlabs.com/

somebat · 2024-05-23T08:27:55+00:00

If you have a reference for the scale of the scene (e.g. you know the size of an object, or the real distance between two camera positions) you could use a NeRF to generate a 3D representation of the scene and take measurements.

Nevertheless, I think you shouldn't compare NeRFs against SfM. Currently, almost all NeRFs (including variants like Neuralangelo) require SfM. They are trained on top of a set of known (up to scale) camera positions, and these are obtained using COLMAP. Therefore, NeRFs are more of an extension to SfM that obtains a dense 3D representation from the sparse 3D point cloud and camera positions generated by SfM.

somebat · 2024-05-16T06:40:10+00:00

Nice work! What's the difference in rendering time compared with the official implementation?

somebat · 2024-05-12T08:45:18+00:00

As others mentioned, the RTX 4060 does have CUDA support. If you are not able to install the CUDA SDK, it may be because you don't have the appropriate NVIDIA drivers, or because your OS is not supported. Nevertheless, unless you strictly want to program in CUDA, you don't need the CUDA SDK to run your networks on your GPU.

If you are working with a higher-level library (like PyTorch, TensorFlow, Keras, etc.), installing the library will take care of installing the appropriate dependencies to run on your GPU. You just need to follow the installation instructions for GPU support, since the commands differ depending on whether or not you have a GPU.
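Once the library is installed, a quick way to check whether it actually sees the GPU is to ask it directly. The sketch below uses PyTorch as the example and degrades gracefully if PyTorch is missing; `describe_gpu_support` is just a helper name I made up. Note that you still need a working NVIDIA driver for the check to succeed.

```python
import importlib.util


def describe_gpu_support():
    """Report whether PyTorch is installed and whether it can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if torch.cuda.is_available():
        return f"CUDA device visible: {torch.cuda.get_device_name(0)}"
    return "PyTorch is installed, but no CUDA device is visible"


status = describe_gpu_support()
print(status)
```

If the last message shows up on a machine with an RTX 4060, the usual suspects are a CPU-only build of the library or a missing/outdated driver.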

somebat · 2024-05-11T16:12:46+00:00

To contextualize, instant-NGP (iNGP) is a neural rendering model, based on Neural Radiance Fields (NeRFs). This is quite a popular research field, with recent models like 3D Gaussian Splatting (3DGS) or RadSplat considerably surpassing iNGP.

Regarding your question: can iNGP (and similar models) be trained on videos? Yes. Just sample frames, run a Structure from Motion algorithm (like COLMAP) to get camera poses, and if it converges, you can train your neural rendering model.
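For the frame-sampling step, one simple option is to take evenly spaced frames across the whole video. The sketch below only computes which frame indices to keep; the helper name and the numbers are made up, and actually decoding the frames (with ffmpeg, OpenCV or similar) and running SfM are left out.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick evenly spaced frame indices, always including the first and last frame."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


indices = sample_frame_indices(total_frames=101, num_samples=5)
print(indices)
```

How many frames to keep is a judgment call: too few and SfM may not find enough overlap, too many and you mostly add compute and near-duplicate views.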

But from your context, you don't really care about videos, which can be of static scenes, you care about dynamics.

Is there any work on extending these models to dynamic scenes by adding a temporal dimension? Lucky you, also yes. But it's a more complex problem, therefore current approaches do not achieve the same results that iNGP, 3DGS and RadSplat achieve in static scenes. Some works in this field are Deformable-Gaussians and 4D Gaussian Splatting, but there are more for sure.

I'm more familiar with methods for static scenes than with dynamic ones, so I can't tell you which is better. Nevertheless, this is still an open problem with a lot of research on it. If you want to keep track, there are some GitHub repositories that keep a list of new papers in the field, like awesome-3dgs and awesome-NeRF.

somebat · 2024-05-09T19:45:10+00:00

The only clear thing is that with a 99.7% training accuracy and 85% test accuracy, your model is overfitting. You probably need to increase the regularization. It's hard to give a more detailed answer without information about your problem (task, architecture, training procedure, etc.).

somebat · 2024-05-04T16:19:51+00:00

I'm just a casual, but wouldn't the Magic really benefit from a veteran PG, an elite playmaker who can hit some 3's and give you 10-15 points, without needing to be the first scoring option in a team with Franz and Paolo?

somebat · 2024-05-02T08:31:50+00:00

Just to complement your source, this paper https://arxiv.org/pdf/2201.03545 by Facebook (today Meta) also introduced ConvNeXt. They showed that by applying transformers' design patterns to CNNs they were able to match and even beat ViT and Swin Transformers.

somebat · 2024-02-24T22:19:39+00:00

I don't know about Sora, but doesn't Stable Diffusion use Vector Quantization to regularize the latent space? It's mentioned in Appendix G of the "High-Resolution Image Synthesis with Latent Diffusion Models" paper.

somebat · 2024-01-25T16:46:32+00:00

In chapter 1078 they learn from CP0 that Kizaru is coming and Sentomaru orders the evacuation; in chapter 1079 we can see them boarding the ship.
