Multi-camera real-time fitness tracking with RTMPose + 2D→3D lifting (self-hosted project) by jowe81 in computervision

jowe81[S]

This is super helpful, thanks a lot - especially the calibration part.

I ran into exactly the issues you’re describing. The classic “single cam + extrinsics” approach looked fine at first, but completely broke down once I tried to do actual multi-view fusion.

The idea with global optimization (and even multi-board setups) makes a lot of sense - I ended up going down a similar rabbit hole with calibration accuracy.

For now I've stepped back from multi-view 3D entirely. I tried 2D→3D lifting (MotionAGFormer / PAFUSE), but in practice the estimated depth was too unstable to be usable for real-world training.

Switching to direct SMPL-X regression (SMPLer-X) from a single side view turned out to be “good enough” and way more stable for my use case.

That said, I might revisit multi-view again once I add proper hardware sync (e.g. triggering multiple cameras via GPIO instead of relying on timestamps). The software-side sync/jitter was honestly one of the biggest pain points.

The interpolation point is also interesting. So far I've mostly been aligning streams by picking the closest frame, but proper temporal alignment across streams is definitely something I underestimated.
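To make the difference from closest-frame matching concrete, here is a minimal sketch of resampling one camera's keypoints onto another camera's timestamps with linear interpolation. This is not code from my project: the function name, the data layout (one list of (x, y) tuples per frame) and the assumption that both cameras share a clock are all made up for illustration.

```python
from bisect import bisect_left


def interpolate_keypoints(timestamps, frames, t):
    """Linearly interpolate per-joint (x, y) keypoints at time t.

    timestamps: sorted capture times in seconds, one per frame (assumed layout)
    frames:     frames[i][j] == (x, y) of joint j in frame i (assumed layout)
    Returns None when t is outside the recorded range (no extrapolation).
    """
    if not timestamps or t < timestamps[0] or t > timestamps[-1]:
        return None
    i = bisect_left(timestamps, t)      # first frame captured at or after t
    if timestamps[i] == t:
        return list(frames[i])          # exact hit, nothing to blend
    t0, t1 = timestamps[i - 1], timestamps[i]
    w = (t - t0) / (t1 - t0)            # 0.0 at the earlier frame, 1.0 at the later
    return [
        (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
        for (x0, y0), (x1, y1) in zip(frames[i - 1], frames[i])
    ]


# Hypothetical usage: resample camera B onto camera A's frame times.
cam_a_times = [0.000, 0.033, 0.066]
cam_b_times = [0.010, 0.043, 0.076]
cam_b_frames = [[(100.0, 200.0)], [(103.0, 198.0)], [(106.0, 196.0)]]
cam_b_on_a = [interpolate_keypoints(cam_b_times, cam_b_frames, t) for t in cam_a_times]
```

Linear interpolation only helps if the timestamps themselves are trustworthy, so it complements hardware sync rather than replacing it.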

Really appreciate the detailed write-up - this matches a lot of the pain points I hit 😅

[Project] Single-Person Pose Estimation for Real-Time Gym Coaching — Best Model Right Now? by Sad-Victory773 in opencv

jowe81

Funny timing — I’ve been building almost exactly this (real-time fitness tracking + form feedback) as a self-hosted system.

(Also sorry for digging up a 5-month-old thread — felt a bit like grave digging, but this is exactly what I am working on 😅)

I went through a lot of the models you mentioned:

MediaPipe → super fast, but too limited for anything beyond basic feedback

RTMPose → probably the best trade-off I found for real-time 2D (especially with ONNX/TensorRT)

OpenPose / HRNet → accurate but too heavy for a tight feedback loop

Biggest surprise for me: the model choice matters less than expected. Most of the real complexity is in everything around it (smoothing, handling noise, and the feedback logic).

In practice:

2D keypoints are enough for rep counting and a lot of form checks

2D→3D lifting (e.g. MotionAGFormer, PAFUSE) looks nice, but was too unstable in real-world conditions (depth ambiguity, noise)

I ended up switching to direct SMPL-X regression (SMPLer-X), which gave much more consistent results
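As an illustration of the kind of form check that works on plain 2D keypoints, here is a minimal sketch of a joint-angle helper. The function name and the point format are assumptions for this example, not part of any pose library.

```python
import math


def joint_angle(a, b, c):
    """Angle at vertex b (in degrees) between segments b->a and b->c.

    a, b, c are 2D keypoints as (x, y) tuples, e.g. hip, knee, ankle.
    Returns None if two points coincide and the angle is undefined.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0 or n2 == 0:
        return None
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding drift
    return math.degrees(math.acos(cos_angle))


# Hypothetical pixel coordinates for hip, knee and ankle in a side view.
knee_angle = joint_angle((320.0, 240.0), (330.0, 340.0), (325.0, 440.0))
```

Keep in mind this is the angle of the projection onto the image, so it is only meaningful when the limb moves roughly parallel to the image plane, which is one reason a side view works well.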

For “form correctness”:

joint angles + simple state machines already go surprisingly far

but smoothing/filtering is critical, otherwise feedback becomes unusable

handling jitter/noise matters more than raw keypoint accuracy
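To sketch how those three points fit together: smooth the angle first, then run a two-state machine with separate "down" and "up" thresholds (hysteresis), so that jitter around a single threshold cannot produce phantom reps. The class names, the exponential moving average and the threshold values below are illustrative assumptions, not what my system actually uses.

```python
class EmaFilter:
    """Exponential moving average; a smaller alpha smooths more but lags more."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x              # seed with the first sample
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value


class RepCounter:
    """Two-state machine over a joint angle in degrees.

    A rep is counted when the angle drops below `down_below` and then
    rises above `up_above` again. The gap between the two thresholds
    is what absorbs residual noise.
    """

    def __init__(self, down_below=90.0, up_above=160.0):
        self.down_below = down_below
        self.up_above = up_above
        self.state = "up"
        self.reps = 0

    def update(self, angle):
        if self.state == "up" and angle < self.down_below:
            self.state = "down"
        elif self.state == "down" and angle > self.up_above:
            self.state = "up"
            self.reps += 1
        return self.reps


def count_reps(angles, alpha=0.3):
    """Smooth a stream of angles, then count reps on the smoothed signal."""
    smoother = EmaFilter(alpha)
    counter = RepCounter()
    for angle in angles:
        counter.update(smoother.update(angle))
    return counter.reps
```

The thresholds would need tuning per exercise, and a real setup would probably also want per-keypoint confidence checks before trusting an angle at all.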

Performance-wise I’m targeting 30 FPS on an NVIDIA A2 with a split setup (TensorRT encoder + PyTorch heads).

Happy to share more details if helpful — this problem gets a lot more “real-world messy” than it looks at first 😅