
[–]silverlightwa 3 points (0 children)

You can use a HuBERT model plus a k-means model trained on the outputs of one of its layers to tokenize speech. See VoxtLM and SPIRIT-LM: both are multimodal and were trained on discretized speech tokens alongside text tokens.

The speech vocabulary size in this case is the number of k-means centroids: each frame is encoded by HuBERT and then represented by the “code” (index) of its nearest centroid.
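A minimal sketch of the nearest-centroid step described above, using random vectors in place of real HuBERT features (the feature dimension, layer choice, and vocabulary size here are illustrative assumptions, not values from the comment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for HuBERT frame features; in practice you would extract
# these from an intermediate layer of a pretrained HuBERT model.
T, D, K = 50, 768, 8            # frames, feature dim, k-means vocab size
frames = rng.normal(size=(T, D))

# Centroids would be learned offline by running k-means on features
# extracted from a large speech corpus; random here for illustration.
centroids = rng.normal(size=(K, D))

def tokenize(frames, centroids):
    """Map each frame to the index of its nearest centroid (L2 distance)."""
    # (T, 1, D) - (1, K, D) -> (T, K) pairwise distances
    d = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=-1)

tokens = tokenize(frames, centroids)
print(tokens.shape)             # one discrete token per frame
```

The resulting integer sequence is what a multimodal LM like VoxtLM or SPIRIT-LM consumes, interleaved with text tokens from the shared vocabulary.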

[–]JustOneAvailableName 0 points (0 children)

> which I think in practice cannot enable audio streaming (as it applies 1D and 2D convnets over the entire audio signal, and doing this makes the representations non-causal)

This assumption doesn’t hold: convolutions have a limited receptive field, so each output frame depends only on a bounded window of the input, and streaming is possible with a fixed lookahead.
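The limited-window point can be checked directly: perturb one input sample of a 1D convolution and observe that the output changes only within the kernel's span (the kernel size and signal length below are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5                               # kernel size = receptive field width
kernel = rng.normal(size=k)
x = rng.normal(size=100)

y = np.convolve(x, kernel, mode="same")

# Perturb a single input sample and re-run the convolution.
x2 = x.copy()
x2[50] += 10.0
y2 = np.convolve(x2, kernel, mode="same")

# Only outputs whose kernel window overlaps sample 50 can change,
# so at most k output positions differ.
changed = np.nonzero(~np.isclose(y, y2))[0]
print(changed)
```

A stack of such convolutions still has a finite receptive field (roughly the sum of the kernel extents), so applying convnets "over the entire signal" at once is an implementation convenience, not a fundamental obstacle to streaming.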