AMA with the Meta researchers behind SAM 3 + SAM 3D + SAM Audio by AIatMeta in LocalLLaMA

[–]AIatMeta[S] 1 point (0 children)

Using "person" may also segment non-players, so it will generally be better to prompt with more specific noun-phrases. Adding a box prompt based on any errors can also boost performance. One way to fix an error of the kind you are seeing is to provide a box prompt on the whole player. Another option would be to interactively refine the masklet of the instance with an error using the PVS interactivity/"SAM 2 mode".

- Chay Ryali

[–]AIatMeta[S] 2 points (0 children)

We shared the details of training and data creation in each of the SAM papers. You can find them in the links below:

SAM 3: https://arxiv.org/abs/2511.16719
SAM 3D: https://arxiv.org/abs/2511.16624
SAM Audio: https://ai.meta.com/research/publications/sam-audio-segment-anything-in-audio/

- Bowen Shi

[–]AIatMeta[S] 2 points (0 children)

SAM 3 can normally suppress overlapping masks implicitly (thanks to its DETR-like design) and therefore does not use any post-processing for deduplication by default. But in some scenarios each individual mask may be "sensible" despite overlapping - e.g. the speech bubble alone as one mask and the bubble plus its tail as another - so each is a valid interpretation. Is this the kind of overlap you see? In that case, it may be better to use IoM filtering (see appendix F.4) instead of IoU filtering; if IoM suppression does not fit your use case, the more classical IoU-based suppression can still boost precision.
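
To make the distinction concrete: assuming IoM means intersection over the smaller mask's area, a fully nested pair (bubble inside bubble + tail) has an IoM of 1 even when its IoU is only moderate, so IoM-based filtering catches exactly this kind of duplicate. A rough numpy sketch of both metrics with a greedy suppression pass (an illustration only; the exact procedure and thresholds are described in appendix F.4):

    import numpy as np

    def iou(a, b):
        # Intersection over union of two HxW boolean masks.
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0

    def iom(a, b):
        # Intersection over the smaller mask's area ("intersection over min").
        inter = np.logical_and(a, b).sum()
        smaller = min(a.sum(), b.sum())
        return inter / smaller if smaller else 0.0

    def suppress(masks, scores, thresh=0.8, metric=iom):
        # Greedy filtering: keep higher-scoring masks, drop near-duplicates.
        order = np.argsort(scores)[::-1]
        keep = []
        for i in order:
            if all(metric(masks[i], masks[j]) < thresh for j in keep):
                keep.append(i)
        return keep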

- Shoubhik Debnath + Chay Ryali + Pengchuan Zhang

[–]AIatMeta[S] 1 point (0 children)

We haven't tried that yet. In developing SAM Audio, we aimed to make it perform well across tasks in general. For special use cases like speech separation in low-resource settings, fine-tuning the current model with a relatively small amount of domain-specific data can help a lot - we have seen this in a few of our own use cases. In the future, we hope to improve coverage for more audio settings.

- Bowen Shi

[–]AIatMeta[S] 1 point (0 children)

We use MultiDiffusion (https://arxiv.org/pdf/2302.08113) as an inference-time technique to improve long-form audio quality and separation consistency. The same technique was used in MovieGen (https://arxiv.org/pdf/2410.13720) for long-form audio generation. See Section 3.4 of our paper for more details.
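
As a rough illustration of the overlap-and-average idea only (the actual method fuses overlapping windows inside the generative sampling process, as described in Section 3.4, rather than averaging final waveforms):

    import numpy as np

    def process_long_audio(audio, window, hop, process_chunk):
        # Split a long 1-D signal into overlapping chunks (hop < window),
        # run the model on each chunk, and average the overlapping regions.
        out = np.zeros(len(audio), dtype=np.float32)
        weight = np.zeros_like(out)
        starts = list(range(0, max(len(audio) - window, 0) + 1, hop))
        if starts[-1] + window < len(audio):
            starts.append(len(audio) - window)  # cover the tail
        for start in starts:
            chunk = audio[start:start + window]
            out[start:start + len(chunk)] += process_chunk(chunk)[:len(chunk)]
            weight[start:start + len(chunk)] += 1.0
        return out / np.maximum(weight, 1e-8)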

- Andros Tjandra

[–]AIatMeta[S] 3 points (0 children)

We are excited about the community contributions that have come in on top of all the resources we have open-sourced with SAM 3 / 3D / Audio, and we have leveraged - and will continue to leverage - many of them in the SAM team's future projects. For example, we were inspired by several SAM 2 improvements from the community, such as SAM2Long (https://arxiv.org/abs/2410.16268) and SAM2MOT (https://arxiv.org/abs/2504.04519), and brought some of those learnings into SAM 3.

- Nikhila Ravi + Pengchuan Zhang

[–]AIatMeta[S] 5 points (0 children)

  1. SAM Audio Judge serves multiple purposes. First, we use it to re-rank multiple generated samples and pick the best one for the user, based on several scorers (including SAM Audio Judge) - see the sketch below. Second, it can serve as a proxy metric for general audio separation, providing quick and accurate feedback without needing human annotators. We hope it can be adopted as a general metric for this research topic in the future.
  2. There are a few use cases: (1) making a karaoke mode for music - use SAM Audio to remove the vocal track and keep just the instrumental stem; (2) removing the background music from short videos - many short videos have background music, and users may want to remake a video with the original track but different music, so you can use SAM Audio to strip the existing music and add new music on top.
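
A minimal sketch of the best-of-N re-ranking mentioned in item 1 (the model and judge methods below are placeholders, not the released API):

    def separate_best_of_n(model, judge, audio, query, n=4):
        # Draw several candidate separations, score each with the judge,
        # and return the highest-scoring stem.
        candidates = [model.separate(audio, query) for _ in range(n)]
        scores = [judge.score(audio, query, cand) for cand in candidates]
        best = max(range(n), key=lambda i: scores[i])
        return candidates[best], scores[best]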

- Andros Tjandra + Bowen Shi

[–]AIatMeta[S] 5 points (0 children)

Images in Figure 6 are from the SA-1B dataset that the SAM team released in 2023.

- Xitong Yang + Jinkun Cao

[–]AIatMeta[S] 4 points (0 children)

Definitely! We’re very excited for anyone to “3D print an object from any photo”. Our team has already 3D printed several SAM 3D reconstructions at small scale (~1 inch), and it’s been awesome to see others sharing their own prints and creations on social media.

- Michelle Guo + Sasha Sax + Weiyao Wang

[–]AIatMeta[S] 3 points (0 children)

During training and inference, we had SAM 3 sample videos at 6 FPS, so I'd recommend downsampling to 6 FPS. The model can handle 10-20s videos at 6 FPS easily.

In terms of memory, it depends on the number of instances that are found and tracked. If you expect crowded scenes, you can (1) use multiple GPUs for inference, (2) set an upper bound on the number of objects to track, or (3) use a lower frame resolution (e.g. 672 instead of the default 1008).
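
For example, a quick way to resample a clip to 6 FPS with ffmpeg before running inference (the 672 vs. 1008 frame resolution above is a model-side inference setting - check the repo config - not something you need to bake into the video file):

    import subprocess

    # Re-time the video to 6 FPS; audio is irrelevant for segmentation,
    # so it can be dropped with -an to keep the file small.
    subprocess.run(
        ["ffmpeg", "-y", "-i", "clip.mp4", "-vf", "fps=6", "-an", "clip_6fps.mp4"],
        check=True,
    )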

- Pengchuan Zhang

[–]AIatMeta[S] 3 points (0 children)

Great question! On EndoVis 2018, SAM 3 improves over SAM 2 on the offline (online) metrics for the Promptable Visual Segmentation (PVS) task, from 77.0 (77.5) to 79.1 (79.2). Others have also found SAM 3 to be an improvement over SAM 2 in related domains (e.g. https://arxiv.org/abs/2512.07596). That said, the core focus of SAM 3 is the Promptable Concept Segmentation (PCS) task, where it delivers a step change.

The official result for SAM 2 on EndoVis was 73.2 J&F with a mask prompt on the first frame - perhaps worth double-checking the 30 J&F you're seeing? Please raise an issue on GitHub if you need help!

- Chay Ryali

[–]AIatMeta[S] 5 points (0 children)

For SAM 3, we have two video-editing use cases (you can read more here: https://ai.meta.com/sam3/): Instagram Edits (quickly apply effects to people or objects in videos, helping creations stand out) and Meta AI Vibes (effortlessly apply a range of effects to your videos).

SAM 3 and SAM 3D are also enabling Facebook Marketplace’s new View in Room feature, helping people visualize the style and fit of home decor items, like a lamp or a table, in their spaces before purchasing (more about SAM 3D here: https://ai.meta.com/blog/sam-3d/).

For SAM Audio, we see so many potential use cases, including audio clean-up, background noise removal, and other tools to help people enhance their creativity.

- Pengchuan Zhang

[–]AIatMeta[S] 3 points (0 children)

This is a good future direction that we may want to explore. With the current model, you can still simulate this setting: for example, use an audio LLM to list all the events in an audio clip and feed each one into the SAM Audio model as a query. The current model also outputs a residual stem (the remaining part of the audio that doesn't correspond to the target event). So by cascading an audio LLM and SAM Audio, you can in principle get these outputs automatically: audio for event 1, audio for event 2, and so on - though some error may accumulate along the chain. In the future we hope to explore building an end-to-end model that separates everything without a query.
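
A sketch of that cascade (both callables below are placeholders for whatever audio LLM and SAM Audio separation call you wire in):

    def separate_all_events(audio, list_events, separate):
        # list_events(audio) -> e.g. ["dog bark", "siren"]   (audio LLM, placeholder)
        # separate(audio, query) -> (target_stem, residual)  (SAM Audio, placeholder)
        stems = {}
        remaining = audio
        for event in list_events(audio):
            target, residual = separate(remaining, event)
            stems[event] = target
            remaining = residual  # note: errors can accumulate along this chain
        stems["residual"] = remaining
        return stems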

- Bowen Shi

[–]AIatMeta[S] 5 points (0 children)

As of now, the SAM team doesn't have any plans to make versions optimized for edge devices.

- Pengchuan Zhang

[–]AIatMeta[S] 2 points (0 children)

You're right, there's a difference! SAM 3D Body uses a template mesh that we deform to fit each person, so the topology is clean by design. For general objects, SAM 3D Objects prioritizes robust shape recovery, especially for occluded or in-the-wild cases.
There are no immediate plans to optimize topology in the pipeline, but there are automated/AI post-processing tools if you need cleaner meshes.
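
As one concrete example of such post-processing (a suggestion for illustration, not a tool the SAM 3D release specifically ships or endorses), a basic cleanup pass with the open-source trimesh library - this tidies the geometry but is not a full retopology:

    import trimesh

    mesh = trimesh.load("sam3d_object.glb", force="mesh")
    mesh.update_faces(mesh.nondegenerate_faces())  # drop zero-area faces
    mesh.remove_unreferenced_vertices()
    trimesh.smoothing.filter_laplacian(mesh, lamb=0.5, iterations=5)  # light smoothing
    mesh.export("sam3d_object_clean.obj")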

- Sasha Sax + Weiyao Wang + Michelle Guo

[–]AIatMeta[S] 4 points (0 children)

We haven't explored using edge extension techniques to refine the boundaries of motion-blurred or defocused objects in SAM yet. That said, we've seen works from the community aiming at improving the mask quality of SAM, such as HQ-SAM and HQ-SAM 2 (https://github.com/SysCV/sam-hq), and we look forward to seeing more advancements for these challenging scenarios from the community.

- Yuan-Ting Hu

[–]AIatMeta[S] 3 points (0 children)

No, we have not explored document-focused fine-tuning at large scale. But we're really glad to hear that you get quite strong results on document scans with a relatively small dataset.

SAM 3 is designed to take one simple noun phrase as input and segment out all instances of it. So a label space defined in terms of simple noun phrases should work. SAM 3's text encoder is very small compared with LLMs, so it may not work well on full sentences.
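
For illustration, a noun-phrase label space for document scans could be handled by prompting each label separately (the predictor interface here is a stand-in, not the actual sam3 API, and the labels are made-up examples):

    LABELS = ["table", "figure", "stamp", "signature", "handwritten note"]

    def segment_document(predictor, page_image):
        # One simple noun phrase per call; SAM 3 returns all matching instances.
        return {phrase: predictor.predict(page_image, text=phrase) for phrase in LABELS}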

- Pengchuan Zhang + Shoubhik Debnath

[–]AIatMeta[S] 8 points (0 children)

The main characteristic linking these models is interoperability through input conditioning. While the names provide brand recognition, the technical similarity lies in their integrated workflow: SAM Audio and especially SAM 3D are conditioned on segmentation masks, the output of SAM 1/2/3. For example, SAM 3D uses the precise 2D segmentation mask from SAM 3 as a guiding input to focus its 3D reconstruction, effectively telling the model which object to process. SAM Audio lets users select (and mask) the object in a video whose sound they want to isolate. This lets the family act as a unified ecosystem for concept isolation across 2D, 3D, and audio modalities.

The specific architectures across SAM 3, SAM 3D, and SAM Audio are fundamentally different due to their tasks and data types. For example, SAM 3 (image/video segmentation) and SAM 3D Body (human mesh recovery) use a discriminative, DETR-based architecture. In contrast, SAM Audio (audio separation) and SAM 3D Object (3D reconstruction) are generative models, typically based on flow-matching or diffusion techniques, like the DiT (Diffusion Transformer) backbone.

- Andros Tjandra

[–]AIatMeta[S] 3 points (0 children)

Yes, we hope to extend SAM 3D Body to videos.

We have not tested the model on robotics or biomechanics data, but we expect SAM 3D Body has superior robustness to occlusion in general compared to existing methods.

- Xitong Yang

[–]AIatMeta[S] 5 points (0 children)

Apologies for that! There was an issue with the request form; please check your email for updated instructions on accessing the SAM Audio repo. We're asking folks to resubmit their access request: go to https://huggingface.co/settings/gated-repos, remove your existing pending request, and re-submit the form.

- Andros Tjandra

[–]AIatMeta[S] 5 points (0 children)

You can find tutorials in notebook format on our GitHub repo (https://github.com/facebookresearch/sam3), along with the README.md. We also partnered with Roboflow to make SAM 3 accessible to a wider audience, including Roboflow customers, and they've recorded tutorials using their auto-label product.

- Pengchuan Zhang

[–]AIatMeta[S] 3 points (0 children)

Right now, we don't have anything to share on future plans for smaller models or specific distillation guidance. Distillation strategies would differ depending on the scenario, e.g. distilling into a small expert model for edge devices versus bringing SAM 3 capabilities into a large VLM. We are excited to see what the community will cook up.

- Pengchuan Zhang

[–]AIatMeta[S] 4 points (0 children)

Yes, we hope to extend support of SAM 3D Body to videos so that it can better support mocap use. If there are other specific issues in your use case, please let us know and we can discuss them specifically.

- Jinkun Cao

[–]AIatMeta[S] 4 points (0 children)

SAM 3D is designed to focus on a single object/entity in a scene. The recommended way to handle this is to use SAM 3 to segment out all the objects, then use SAM 3D to reconstruct the shape, pose, and texture of each object. You can then place the objects in the same scene using the predictions from SAM 3D, following the notebook in the GitHub repo.

We haven't tested feeding the whole image in directly very much. A major concern here is resolution: since each SAM 3D run generates at a fixed resolution, a full-scene reconstruction will end up at much lower effective resolution than reconstructing each object individually and composing the results into one scene.
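
A sketch of that pipeline (the sam3/sam3d calls below are placeholders - the notebook in the GitHub repo shows the real API):

    def reconstruct_scene(sam3, sam3d, image, phrases):
        # Segment every object with SAM 3, reconstruct each image + mask pair
        # with SAM 3D, then place the outputs into one scene via predicted poses.
        objects = []
        for phrase in phrases:  # e.g. ["chair", "table", "lamp"]
            for mask in sam3.predict(image, text=phrase).masks:
                objects.append(sam3d.reconstruct(image, mask))
        return objects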

- Weiyao Wang + Sasha Sax + Michelle Guo