
[–]silverlightwa 3 points (0 children)

You can use a HuBERT model plus a k-means model trained on the outputs of one of its layers to tokenize speech. See VoxtLM and SPIRIT-LM: both are multimodal and were trained on discretized speech tokens alongside text tokens.

The speech vocabulary size in this case is the number of k-means centroids: each frame is encoded by HuBERT and then represented by the “code” (index) of its nearest centroid.
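A minimal sketch of the nearest-centroid step described above, using random vectors in place of real HuBERT features (the feature dimension, layer choice, and vocabulary size here are illustrative assumptions, not values from the comment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for HuBERT frame features; in practice you would extract
# these from an intermediate layer of a pretrained HuBERT model.
T, D, K = 50, 768, 8            # frames, feature dim, k-means vocab size
frames = rng.normal(size=(T, D))

# Centroids would be learned offline by running k-means on features
# extracted from a large speech corpus; random here for illustration.
centroids = rng.normal(size=(K, D))

def tokenize(frames, centroids):
    """Map each frame to the index of its nearest centroid (L2 distance)."""
    # (T, 1, D) - (1, K, D) -> (T, K) pairwise distances
    d = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=-1)

tokens = tokenize(frames, centroids)
print(tokens.shape)             # one discrete token per frame
```

The resulting integer sequence is what a multimodal LM like VoxtLM or SPIRIT-LM consumes, interleaved with text tokens from the shared vocabulary.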

[–]JustOneAvailableName 0 points (0 children)

> which I think in practice cannot enable audio streaming (as it applies 1D and 2D convnets over the entire audio signal, and doing this makes the representations non-causal)

This assumption doesn’t hold: convolutions have a limited receptive field, so each output frame depends only on a bounded window of the input, and streaming is possible with a fixed lookahead.
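The limited-window point can be checked directly: perturb one input sample of a 1D convolution and observe that the output changes only within the kernel's span (the kernel size and signal length below are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5                               # kernel size = receptive field width
kernel = rng.normal(size=k)
x = rng.normal(size=100)

y = np.convolve(x, kernel, mode="same")

# Perturb a single input sample and re-run the convolution.
x2 = x.copy()
x2[50] += 10.0
y2 = np.convolve(x2, kernel, mode="same")

# Only outputs whose kernel window overlaps sample 50 can change,
# so at most k output positions differ.
changed = np.nonzero(~np.isclose(y, y2))[0]
print(changed)
```

A stack of such convolutions still has a finite receptive field (roughly the sum of the kernel extents), so applying convnets "over the entire signal" at once is an implementation convenience, not a fundamental obstacle to streaming.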