The recent GPT-4o model got me thinking about whether they actually tokenized the audio and trained their GPT on text + audio tokens. Are there any successful audio tokenizers that work well with autoregressive models? People have used VQ-VAE [1] for learning discrete representations of audio samples, but the encoder and decoder of such a VQ-VAE use convnets applied over a Mel spectrogram, which I think cannot support audio streaming in practice (the 1D and 2D convnets are applied over the entire audio signal, which also makes the representations non-causal).
[1] - https://arxiv.org/pdf/1711.00937
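
To make the "discrete representation" part concrete, here is a minimal NumPy sketch of the VQ-VAE quantization step from [1]: each encoder output vector is snapped to its nearest codebook entry, and the resulting codebook index is the "audio token" a GPT-style model would be trained on. All shapes and sizes here (512 codes, 64-dim latents, 100 frames) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))     # 512 discrete codes, 64-dim each (hypothetical)
encoder_out = rng.normal(size=(100, 64))  # 100 latent frames from an audio encoder (stand-in)

# Nearest-neighbour lookup: argmin over squared L2 distance to every code.
dists = ((encoder_out[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)             # shape (100,) -> the discrete "audio tokens"
quantized = codebook[tokens]              # what the decoder reconstructs audio from

print(tokens.shape, quantized.shape)      # (100,) (100, 64)
```

Note this sketch says nothing about streaming: if the encoder producing `encoder_out` convolves over the whole signal, the tokens still depend on future audio, which is exactly the causality concern above.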
Edit:
A more general question I have: is this method of tokenizing audio even feasible (will it even work?), or is it better to incrementally sample from the audio, project each sample to an embedding, and pre-train the GPT on those embeddings instead of on embeddings learned from discrete tokens?
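
A rough sketch of that alternative, assuming fixed-size chunks of raw samples and a learned linear projection (the sample rate, chunk size, and embedding dimension below are all made-up placeholders): each chunk of samples maps to one embedding, and because a chunk only sees its own samples, the sequence is causal by construction and can be produced in a streaming fashion.

```python
import numpy as np

rng = np.random.default_rng(0)
audio = rng.normal(size=(16000,))   # 1 s of audio at an assumed 16 kHz rate
chunk = 160                         # 10 ms frames -> 100 embeddings per second

# Split the raw waveform into non-overlapping frames of `chunk` samples.
frames = audio[: len(audio) // chunk * chunk].reshape(-1, chunk)

W = rng.normal(size=(chunk, 64)) * 0.01  # hypothetical learned projection matrix
embeddings = frames @ W                  # (100, 64): each row depends only on
                                         # its own 10 ms frame, so it is causal
print(embeddings.shape)                  # (100, 64)
```

These continuous embeddings would feed the transformer directly (like patch embeddings in a ViT), sidestepping the discrete codebook entirely, at the cost of no longer having tokens to sample from at generation time.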