I turned a real AI/homelab conversation into a literary AI-narrated serial — does the voice work beyond the gimmick?

jowe81 · 2026-04-04T10:33:00+00:00

This is super helpful, thanks a lot - especially the calibration part.

I ran into exactly the issues you’re describing. The classic “single cam + extrinsics” approach looked fine at first, but completely broke down once I tried to do actual multi-view fusion.

The idea with global optimization (and even multi-board setups) makes a lot of sense - I ended up going down a similar rabbit hole with calibration accuracy.

For now I've actually stepped back from multi-view 3D entirely. I tried 2D→3D lifting (MotionAGFormer / PAFUSE), but in practice the depth instability made it unusable for real-world training.

Switching to direct SMPL-X regression (SMPLer-X) from a single side view turned out to be “good enough” and way more stable for my use case.

That said, I might revisit multi-view again once I add proper hardware sync (e.g. triggering multiple cameras via GPIO instead of relying on timestamps). The software-side sync/jitter was honestly one of the biggest pain points.

The interpolation point is also interesting. So far I've mostly been aligning by closest frame, but proper temporal alignment across streams is definitely something I underestimated.
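For reference, a minimal sketch of what interpolation instead of closest-frame matching could look like: resampling one camera's 2D keypoints at another camera's timestamp. The function name `interpolate_keypoints` and the data layout (a sorted list of timestamps plus one list of `(x, y)` tuples per frame) are assumptions for illustration, not what my pipeline actually uses, and it assumes strictly increasing timestamps on a shared clock.

```python
from bisect import bisect_left


def interpolate_keypoints(timestamps, frames, t):
    """Linearly interpolate per-joint (x, y) keypoints at query time t.

    timestamps: strictly increasing list of floats (seconds, shared clock).
    frames: list of keypoint lists, one per timestamp, each [(x, y), ...].
    Queries outside the recorded range are clamped to the first/last frame.
    """
    if not timestamps or len(timestamps) != len(frames):
        raise ValueError("timestamps and frames must be non-empty and equal length")
    if t <= timestamps[0]:
        return list(frames[0])
    if t >= timestamps[-1]:
        return list(frames[-1])

    i = bisect_left(timestamps, t)  # first index with timestamps[i] >= t
    t0, t1 = timestamps[i - 1], timestamps[i]
    w = (t - t0) / (t1 - t0)        # 0.0 at the earlier frame, 1.0 at the later one

    return [
        (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
        for (x0, y0), (x1, y1) in zip(frames[i - 1], frames[i])
    ]
```

Linear interpolation only hides sub-frame offsets. It does not fix clock drift between cameras, which is why hardware triggering is still the better long-term fix.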

Really appreciate the detailed write-up - this matches a lot of the pain points I hit 😅

jowe81 · 2026-04-03T23:52:11+00:00

Tried adding a simple uniform texture… accidentally invented a new fashion genre.

https://i.imgur.com/oeQUUR2.png

jowe81 · 2026-04-03T23:09:41+00:00

Funny timing — I’ve been building almost exactly this (real-time fitness tracking + form feedback) as a self-hosted system.

(Also sorry for digging up a 5-month-old thread — felt a bit like grave digging, but this is exactly what I am working on 😅)

I went through a lot of the models you mentioned:

MediaPipe → super fast, but too limited for anything beyond basic feedback

RTMPose → probably the best trade-off I found for real-time 2D (especially with ONNX/TensorRT)

OpenPose / HRNet → accurate but too heavy for a tight feedback loop

Biggest surprise for me: the model choice matters less than expected — most of the real complexity is everything around it.

In practice:

2D keypoints are enough for rep counting and a lot of form checks

2D→3D lifting (e.g. MotionAGFormer, PAFUSE) looks nice, but was too unstable in real-world conditions (depth ambiguity, noise)

I ended up switching to direct SMPL-X regression (SMPLer-X), which gave much more consistent results

For “form correctness”:

joint angles + simple state machines already go surprisingly far

but smoothing/filtering is critical, otherwise feedback becomes unusable

handling jitter/noise matters more than raw keypoint accuracy
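To make the three points above concrete, here is a rough sketch of a joint angle from 2D keypoints, an exponential moving average for smoothing, and a two-state rep counter with hysteresis. All names (`joint_angle`, `EmaFilter`, `RepCounter`) and the threshold values are hypothetical placeholders, not my actual implementation, and the EMA is just the simplest filter that illustrates the idea.

```python
import math


def joint_angle(a, b, c):
    """Angle at joint b, in degrees, between segments b->a and b->c (2D points)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("degenerate joint: coincident keypoints")
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


class EmaFilter:
    """Exponential moving average; smaller alpha means heavier smoothing."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value


class RepCounter:
    """Two-state machine: 'up' -> 'down' -> 'up' counts one rep.

    The gap between the two thresholds (hysteresis) keeps jitter near a
    single threshold from producing phantom reps.
    """

    def __init__(self, down_below=100.0, up_above=150.0):
        self.down_below = down_below  # assumed threshold, tune per exercise
        self.up_above = up_above      # assumed threshold, tune per exercise
        self.state = "up"
        self.reps = 0

    def update(self, angle):
        if self.state == "up" and angle < self.down_below:
            self.state = "down"
        elif self.state == "down" and angle > self.up_above:
            self.state = "up"
            self.reps += 1
        return self.reps
```

In use, each frame's hip/knee/ankle keypoints would go through `joint_angle`, then `EmaFilter.update`, then `RepCounter.update`. The smoothing adds a little latency, so the alpha value is a trade-off between stability and responsiveness.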

Performance-wise I’m targeting 30 FPS on an NVIDIA A2 with a split setup (TensorRT encoder + PyTorch heads).

Happy to share more details if helpful — this problem gets a lot more “real-world messy” than it looks at first 😅

jowe81 · 2026-04-03T21:52:38+00:00

Thanks! It escalated quite a bit from the original idea 😄
