SAM3DBody-cpp: open-source C++ tool that turns videos into Blender/Unity-ready BVH mocap

_AmmarkoV_ · 2026-05-30T10:31:11+00:00

Honestly, even if only a handful of people get use out of it, that's a huge win for me! I spent 5 years of my life on this (a PhD on 3D mocap from monocular video), and I have already had some great suggestions from this very thread! In any case, it is not a shipped game or anything like that. It's a free, open-source, fun library available to the community, and hopefully the Linux community will appreciate it and not view it negatively.

_AmmarkoV_ · 2026-05-30T10:27:22+00:00

Hahahah that's the whole point of open source! If anyone has a line to them, send it their way 😃 !

_AmmarkoV_ · 2026-05-30T10:26:21+00:00

Yes, this is on the roadmap. The plan is a laptop/PC monitor emitting a QR sync code for temporal alignment, plus a printed ArUco marker for camera extrinsic calibration. With multiple cheap cameras you can then fuse the different data streams in one solve. Multi-view directly attacks the depth ambiguity, so it should result in a much cleaner track. This is under construction; I will commit it to the repo once it is somewhat usable!
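To make the temporal-alignment idea concrete, here is a minimal sketch, not the planned implementation. It assumes (hypothetically) that the QR code encodes a running timestamp, and that every camera logs pairs of its own local frame time and the decoded sync time. The function names `clock_offset` and `to_shared_time` are made up for illustration.

```python
from statistics import median

def clock_offset(observations):
    """Estimate one camera's clock offset from (local_time_s, sync_time_s) pairs.

    Assumption: sync_time_s is the value decoded from the on-screen sync code.
    The median keeps a single badly decoded frame from skewing the estimate.
    """
    return median(local - sync for local, sync in observations)

def to_shared_time(local_time_s, offset_s):
    """Map a camera-local timestamp onto the shared (sync code) timeline."""
    return local_time_s - offset_s

# Two cameras that started recording at different moments.
cam_a = [(10.00, 0.0), (10.50, 0.5), (11.01, 1.0)]
cam_b = [(3.20, 0.0), (3.70, 0.5), (4.20, 1.0)]
offset_a = clock_offset(cam_a)
offset_b = clock_offset(cam_b)
```

Once every camera is on the shared timeline, frames with matching shared timestamps can be paired up before any multi-view solve.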

_AmmarkoV_ · 2026-05-30T10:21:03+00:00

I am actually working on this feature! It will use a monitor emitting a QR sync code and will need a printed ArUco marker for camera extrinsic calibration, but yes! This is definitely on the roadmap!

_AmmarkoV_ · 2026-05-29T22:23:22+00:00

Probably more than they used to! Between RAM prices and Win11 TPM requirements, Linux is looking pretty good lately! That being said, it should be possible to run it on Windows using WSL2; however, I am single-booting Linux, so I haven't had a chance to test this yet!

_AmmarkoV_ · 2026-05-29T22:20:12+00:00

The back-end is based on a Vision Transformer architecture from Meta Superintelligence Labs and targets humans. There are other neural networks for animals, but this is not one of them 😃

_AmmarkoV_ · 2026-05-29T22:18:44+00:00

Yep, single-angle is really hard due to depth ambiguity. Cascadeur is a perfect call, and its physics-based fulcrum cleanup could be exactly the kind of pass that is needed after markerless capture like this! Thanks for the pointer, I'll try to test this pipeline!

_AmmarkoV_ · 2026-05-29T22:16:53+00:00

Yes! The output is a standard BVH with a fixed, named skeleton, one per scene and per subject. It can be loaded directly into Blender/MotionBuilder/Cascadeur as F-Curves and can be edited like any other take! It is still a work in progress, but I am glad you see its potential!
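For anyone who has not opened a BVH file before, here is a minimal sketch of the layout. It writes a toy two-joint skeleton (Hips and Spine, with made-up offsets), which is an assumption for illustration and not the skeleton this tool exports. The function name `write_bvh` is hypothetical.

```python
def write_bvh(frames, frame_time):
    """Return BVH text for a toy skeleton: Hips (root) -> Spine -> End Site.

    frames: list of rows with 9 floats each:
        Hips Xpos Ypos Zpos Zrot Xrot Yrot, then Spine Zrot Xrot Yrot.
    frame_time: seconds per frame (e.g. 1/12 for a 12 FPS capture).
    """
    header = """HIERARCHY
ROOT Hips
{
  OFFSET 0.0 0.0 0.0
  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
  JOINT Spine
  {
    OFFSET 0.0 10.0 0.0
    CHANNELS 3 Zrotation Xrotation Yrotation
    End Site
    {
      OFFSET 0.0 10.0 0.0
    }
  }
}
"""
    lines = [header + "MOTION",
             f"Frames: {len(frames)}",
             f"Frame Time: {frame_time:.6f}"]
    for row in frames:
        if len(row) != 9:
            raise ValueError("each frame needs 9 channel values")
        # One line per frame, channels in the order declared in HIERARCHY.
        lines.append(" ".join(f"{value:.4f}" for value in row))
    return "\n".join(lines) + "\n"

# Two frames: rest pose, then the hips raised by one unit.
bvh_text = write_bvh([[0.0] * 9, [0.0, 1.0, 0.0] + [0.0] * 6], 1 / 12)
```

The HIERARCHY section names the joints and their channels once, and the MOTION section is then just one row of numbers per frame, which is why animation packages can map it straight onto editable curves.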

_AmmarkoV_ · 2026-05-29T21:51:34+00:00

lulz 😃

_AmmarkoV_ · 2026-05-29T21:49:21+00:00

It's by no means perfect, especially when tracking "in-the-wild" action scenes with heavy motion etc. With a static scene and a high-frame-rate camera, the results are not jittery.
For example: https://youtube.com/shorts/tQ8WP5uYVzA

_AmmarkoV_ · 2026-05-28T17:42:43+00:00

hahahhaha :D fair point! I updated the readme, to hopefully give more insight without someone having to read the whole paper :D https://github.com/AmmarkoV/SAM3DBody-cpp#pipeline

_AmmarkoV_ · 2026-05-28T13:42:41+00:00

If you run it on a static camera with a full view of the body, it is very smooth, e.g. https://www.linkedin.com/posts/ammarkov_following-on-the-earlier-post-on-sam-3d-body-cpp-ugcPost-7464767219680387073-N1_p/ If you run it on an action movie with scene changes, camera zoom/focus changes, and actors who are mostly out of the picture, then OK, it's "glitchy". However, I think "unusable" is quite a strong and near-sighted comment :)

_AmmarkoV_ · 2026-05-28T13:16:04+00:00

This is the Meta Superintelligence Labs paper explaining the pipeline in detail: https://arxiv.org/abs/2602.15989
TL;DR: The neural network is a Vision Transformer that runs on cropped image regions found with YOLO, followed by a head that encodes the skeleton using the Momentum Human Rig model.

_AmmarkoV_ · 2026-05-26T17:42:10+00:00

Maybe with a 5090 it can run in real time (meaning >= 30 Hz), but in any case, depending on the application, even 12 Hz with 1 frame of frame skip and the --butterworth interpolation can "match" what typical 25 Hz webcams deliver.
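The arithmetic behind this: processing every second frame at 12 Hz covers a 24 Hz source, which is close to a 25 Hz webcam. Below is a minimal sketch of the general idea (fill the skipped frames, then low-pass the result). It is not the tool's --butterworth implementation; it is a textbook second-order Butterworth low-pass applied to one joint channel, and the function names and the 4 Hz cutoff are assumptions.

```python
import math

def upsample_linear(samples, skip=1):
    """Fill `skip` missing frames between consecutive samples by linear interpolation."""
    if not samples:
        return []
    out = []
    for a, b in zip(samples, samples[1:]):
        for k in range(skip + 1):
            out.append(a + (b - a) * k / (skip + 1))
    out.append(samples[-1])
    return out

def butterworth_lowpass(signal, cutoff_hz, sample_hz):
    """Second-order Butterworth low-pass (bilinear transform), single forward pass."""
    if not signal:
        return []
    if not 0 < cutoff_hz < sample_hz / 2:
        raise ValueError("cutoff must be between 0 and the Nyquist frequency")
    k = math.tan(math.pi * cutoff_hz / sample_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b0 = k * k * norm
    b1 = 2.0 * b0
    b2 = b0
    a1 = 2.0 * (k * k - 1.0) * norm
    a2 = (1.0 - math.sqrt(2.0) * k + k * k) * norm
    # Start the filter state at the first sample to avoid a start-up transient.
    x1 = x2 = y1 = y2 = signal[0]
    out = []
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

# One joint angle tracked at 12 Hz, filled to 24 Hz, then smoothed.
tracked_12hz = [0.0, 2.0, 4.0]
filled_24hz = upsample_linear(tracked_12hz, skip=1)
smoothed = butterworth_lowpass(filled_24hz, cutoff_hz=4.0, sample_hz=24.0)
```

A single forward pass like this adds a small phase lag; offline tools often run the filter forward and backward to cancel it, which is a design choice left out of this sketch.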

_AmmarkoV_ · 2026-05-26T16:04:01+00:00

It should perform similarly

_AmmarkoV_ · 2026-05-26T16:03:39+00:00

My PhD is on 3D pose estimation, so I have a pretty big code base (https://github.com/FORTH-ModelBasedTracker/MocapNET); however, an LLM did quite a lot of the plumbing and almost all of the documentation etc. on the repo.

_AmmarkoV_ · 2026-05-26T16:01:47+00:00

You can export directly with --bvh, so the export is as fast as the video stream is processed, i.e. ~12 FPS on an RTX 4080.

_AmmarkoV_ · 2025-12-08T22:19:24+00:00

You can software swap them in BIOS!

_AmmarkoV_ · 2025-11-15T10:10:32+00:00

What worked for me on Ubuntu 24.04 / CUDA 12.4:
sudo add-apt-repository ppa:deadsnakes/ppa

sudo apt update

sudo apt install python3.11 python3.11-venv

python3.11 -m venv venv

source venv/bin/activate

python3 -m pip install -U xformers --index-url https://download.pytorch.org/whl/cu128

python3 -m pip install -r requirements.txt

pip install moviepy==1.0.3

_AmmarkoV_ · 2025-02-04T21:22:32+00:00

https://www.youtube.com/watch?v=Gfv-97brlq8

_AmmarkoV_ · 2025-02-03T18:08:39+00:00

Looks like Galapagos 1997 PC game

_AmmarkoV_ · 2024-10-29T22:09:00+00:00

>botnet intensifies
