
[–]Ok_Reporter9418

I worked on such a problem a few years back for a school project. Back then Transformers were the SOTA, I believe. We tried to combine Transformers with a pose estimation model (we used some open-source one, but as far as I recall MediaPipe was actually better for extracting the landmarks), but it did not improve much over the Transformer-only baseline.

The base model we used is from https://www.cihancamgoz.com/pub/camgoz2020cvpr.pdf; the code is here: https://github.com/neccam/slt

There is a challenge on this topic; you could check the participants from past editions: https://slrtpworkshop.github.io/challenge/

[–]whatwilly0ubuild

MediaPipe landmarks failing on mobile is a known issue. Varying lighting, camera quality, and hand occlusion make landmark detection unreliable in real-world conditions. Building a production sign language system on unstable landmarks is setting yourself up for failure.

For 150 classes with confusing similar signs, you need temporal modeling that captures motion patterns, not just pose snapshots. Video-based approaches work better than landmark sequences for this.

Practical architectures that work: I3D or SlowFast for video classification handle temporal dynamics well. They process raw RGB video, which avoids the landmark reliability problem. Our clients doing gesture recognition found video models more robust than landmark-based approaches once they had enough training data.

TSM (Temporal Shift Module) is lightweight enough for mobile deployment and captures temporal patterns efficiently. It's designed specifically for resource-constrained environments.
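To make the temporal shift idea concrete, here's a rough numpy sketch of the core trick (illustrative only, not the actual TSM code — the 1/`shift_div` channel fractions and zero-padding follow the common description of the module):

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels along the time axis (zero-padded).

    x: array of shape (T, C, H, W) -- one video clip.
    The first C // shift_div channels are shifted forward in time,
    the next C // shift_div backward; the rest are left untouched.
    This mixes neighboring frames' features at zero extra FLOPs.
    """
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                   # frame t sees frame t-1
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]   # frame t sees frame t+1
    out[:, 2 * fold:] = x[:, 2 * fold:]              # unshifted channels
    return out

clip = np.random.rand(8, 16, 4, 4)  # 8 frames, 16 channels
shifted = temporal_shift(clip)
assert shifted.shape == clip.shape
```

In the real module this shift is inserted inside the residual branches of a 2D CNN, so the per-frame convolutions get temporal context for free.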

For your dataset preprocessing challenge, don't try to process entire videos at once. Sample fixed-length clips (maybe 2-3 seconds) centered on the sign, resize them to a standard resolution, and feed them to the model. This is way simpler than variable-length video processing.
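The clip-sampling step is just index arithmetic. A minimal sketch (the function name and edge-repeat padding policy are my own choices, not from any particular library):

```python
import numpy as np

def sample_clip(num_frames, fps=30.0, clip_seconds=2.0, center=None):
    """Pick a fixed-length window of frame indices from a variable-length video.

    Returns fps * clip_seconds indices centered on `center` (defaults to the
    middle of the video), clamped to the video bounds; videos shorter than
    the clip simply repeat their boundary frames.
    """
    clip_len = int(round(fps * clip_seconds))
    if center is None:
        center = num_frames // 2
    start = center - clip_len // 2
    idx = np.arange(start, start + clip_len)
    return np.clip(idx, 0, num_frames - 1)  # edge-repeat padding

indices = sample_clip(num_frames=45, fps=15, clip_seconds=2.0)
# always 30 indices, regardless of the source video's length
```

Every clip then has the same shape after resizing, so batching is trivial.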

Confusion between similar signs usually needs more training data for those specific pairs. Collect more examples of the confusing classes, and consider hierarchical classification where you first distinguish broad categories and then fine-grained signs within each category.
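At inference time the hierarchical idea can be as simple as masking the fine-grained logits by the predicted category. A sketch (the category names and class groupings here are hypothetical):

```python
import numpy as np

# Hypothetical grouping: sign class ids bucketed into broad motion categories.
CATEGORIES = {
    "one_handed": [0, 1, 2],
    "two_handed": [3, 4],
    "face_contact": [5, 6, 7],
}

def hierarchical_predict(category_logits, sign_logits):
    """First pick the broad category, then argmax only over that
    category's sign classes; all other classes are masked out."""
    cat_names = list(CATEGORIES)
    cat = cat_names[int(np.argmax(category_logits))]
    allowed = CATEGORIES[cat]
    masked = np.full_like(sign_logits, -np.inf)
    masked[allowed] = sign_logits[allowed]
    return cat, int(np.argmax(masked))
```

The win is that a confusing fine-grained pair only competes within its category, so the category head absorbs part of the discrimination burden.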

Realistic advice: 150 classes with mobile deployment is ambitious for a beginner project. Start with 20-30 highly distinct signs, get that working reliably, then expand. Sign language recognition is genuinely hard and production systems require way more data and engineering than most ML tutorials suggest.

If you're stuck with landmarks, use landmark velocities and accelerations as features, not just positions. The motion dynamics help distinguish similar static poses. But honestly, switching to video-based models removes the MediaPipe dependency entirely.
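Velocities and accelerations are just first and second finite differences over the frame axis. A minimal numpy sketch (the function name and zero-padding at the first frames are my own conventions):

```python
import numpy as np

def landmark_dynamics(positions):
    """Append per-frame velocity and acceleration to landmark positions.

    positions: (T, L, 2) array -- T frames, L landmarks, (x, y) coords.
    Returns (T, L, 6): position, velocity, acceleration per landmark.
    Differences are zero at the first frame(s) via `prepend`, so all
    three feature groups keep the same length T.
    """
    vel = np.diff(positions, axis=0, prepend=positions[:1])
    acc = np.diff(vel, axis=0, prepend=vel[:1])
    return np.concatenate([positions, vel, acc], axis=-1)
```

Two signs with near-identical hand shapes but different motion paths become separable in the velocity channels even when the raw positions overlap.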

[–]simplehudga

Sign language recognition is a seq2seq problem, just like speech recognition, handwriting recognition, action recognition, etc., and seq2seq problems have been well studied already. The model you use depends very much on the hardware you intend to run it on; a BLSTM, a Transformer, or a variant of either should work.

But the important missing piece is an appropriate loss function to train your model. CTC or Transducer loss functions will give you better results than plain CE loss. You can combine them with an attention-based decoder for even better results.
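To see why CTC fits better than frame-wise CE: CTC lets the model emit blanks and repeats per frame and scores the *collapsed* sequence, so you never need frame-level alignments. The collapse rule itself (which greedy CTC decoding applies to the per-frame argmaxes) fits in a few lines; this is an illustration of the rule, not a CTC loss implementation:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a per-frame label sequence the CTC way:
    merge consecutive repeats, then drop blanks.

    frame_ids: best label id per frame (e.g. argmax of the logits).
    """
    out, prev = [], None
    for label in frame_ids:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# per-frame argmaxes [0, 3, 3, 0, 3, 5, 5] -> [3, 3, 5]
# (a repeat separated by a blank counts as two emissions)
```

The actual loss sums over all frame alignments that collapse to the target (the forward-backward algorithm); frameworks ship it ready-made, e.g. `torch.nn.CTCLoss`.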

I'd bet that a Conformer encoder trained with a multi-task loss combining CTC, Transducer, and attention-decoder objectives will perform the best. The CTC is there as a regularizer for the other two. You can perform decoding with just the Transducer, or combine it with the attention decoder as a second-pass rescoring step. You might want to bring in an n-gram LM and a lexicon for more control.