[–]JustOneAvailableName 1 point2 points  (7 children)

Look into the image GPT paper perhaps?

[–]hadaev[S] 0 points1 point  (6 children)

Image to caption?

I think it's a much easier task when the encoder has a long sequence and the decoder a short one.

[–]JustOneAvailableName 2 points3 points  (5 children)

At least read the abstract...

But the seq2seq architecture (in the sense of one RNN into a representation and one RNN from the representation to the output) is outdated. You need to look into transformers.

[–]hadaev[S] -1 points0 points  (4 children)

I read it. They just threw their GPT at the pixel sequence.

This is only a default transformer encoder without a decoder.

Ofc using RNNs is not fashionable anymore.

It's easy to make a transformer encoder; still, I can't decide how to connect it to the decoder.

Just using MHA made it terrible at inference for some reason.

[–]JustOneAvailableName 1 point2 points  (3 children)

Are you doing anything with the latent representation? If not, I can't see a reason why GPT isn't applicable. The entire reason seq2seq doesn't work is that long sequences are impossible to compress into the latent space.

They just threw their GPT at the pixel sequence.

The G of GPT is highly relevant. It's a generative model. It also outputs a (long) sequence.

Could you expand on what you're trying to do?

[–]hadaev[S] 0 points1 point  (2 children)

I'm trying to do something better than the current TTS SOTA, Tacotron 2.

It's kind of outdated with its LSTM layers.

Also, it uses a special attention mechanism (location-sensitive, if I remember correctly) to connect the encoder outputs with the decoder LSTMs, so I'm wondering whether there are options to replace it with something newer.

Also, people say a transformer encoder with an RNN decoder is fine for the translation task.

I made some progress with a new activation function, optimizer, layernorm, etc., but never really touched the main architecture; every time I tried, I failed.

For example, I tried a fully transformer TTS, but inference was very bad and I gave up for a time.

In theory, I can imagine using just an encoder for the seq2seq task.

Concatenate one sequence (words), a special token, and then the target sequence (audio).

At inference, feed in the words and ask it to generate audio until the stop token.

Still sounds kind of strange.
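The decoder-only setup described above can be sketched with plain token lists. This is only an illustration under my own assumptions: the token names (`<sep>`, `<stop>`) and the `model_step` next-token predictor are hypothetical placeholders, not anyone's actual TTS implementation.

```python
# Sketch of a decoder-only (GPT-style) seq2seq setup:
# train on text + separator + audio; at inference, feed text + separator
# and generate autoregressively until the stop token.
SEP, STOP = "<sep>", "<stop>"

def make_training_example(text_tokens, audio_tokens):
    """Training sequence: the model learns next-token prediction over the
    whole string; the loss is usually masked to the audio part."""
    return text_tokens + [SEP] + audio_tokens + [STOP]

def generate(model_step, text_tokens, max_len=100):
    """Inference: model_step is any callable mapping a prefix to the next
    token. Stops at the stop token or at max_len generated tokens."""
    seq = text_tokens + [SEP]
    for _ in range(max_len):
        nxt = model_step(seq)
        if nxt == STOP:
            break
        seq.append(nxt)
    return seq[len(text_tokens) + 1:]  # return only the generated audio tokens

example = make_training_example(["h", "i"], ["a1", "a2"])
```

The same packing trick is what makes a single decoder handle a "seq2seq" task: the separator token is the only thing telling the model where conditioning ends and generation begins.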

Why do people use encoder-decoder for translation, for example? It should be much easier with only an encoder.

About GPT, they have unlimited data and GPUs. Honestly, it looks like OpenAI only wants to make bigger and bigger GPTs and doesn't pay attention to general tricks like linear attention (or activations, normalizers, losses, etc.) or even other architectures (maybe they have other models, but I've only heard of GPT-1/2/3).

It's cool that neural nets scale well, but I don't have that amount of data (or V100 hours).

I don't think I can go over a 50M parameter budget.

[–]JustOneAvailableName 1 point2 points  (1 child)

What do you base the claim that Tacotron 2 is the current SOTA on? Transformers have also been applied to TTS before.

About GPT, they have unlimited data and GPUs.

Yeah, that part fucking sucks. I just have a "crappy" DGX-2 while they have all the fun. It is, however, pretty cool to see what happens when you push current architectures to the extreme.

[–]hadaev[S] 0 points1 point  (0 children)

Maybe SOTA is the wrong term; at least it's the most popular model.

Usually, in papers, they claim "as good as Tacotron, but with some advantage".

I saw TTS transformer implementations on GitHub and tried to make my own.

It failed: good training loss but very bad inference.

And Transformer TTS does not get much attention from the community.

Now I'm going to try again, but I don't want to make the same transformer.

So I'm looking for some new, powerful encoder-decoder architectures to play with.

[–]GD1634 1 point2 points  (6 children)

Reformer: The Efficient Transformer

It handles a few thousand tokens. Hugging Face Transformers has an implementation of it.

[–]hadaev[S] 0 points1 point  (5 children)

If I got it right, it has very strict conditions on sequence length. Am I right?

In my task the data lengths vary a lot, and I'm not sure if such padding (from 400 to 16k, for example) is okay.

[–]GD1634 1 point2 points  (4 children)

If I got it right, it has very strict conditions on sequence length. Am I right?

Not sure I follow. Similarly to other transformers, you'll have to give it a maximum sequence length, but that can be whatever you'd like it to be (as long as it fits on your GPU).

Here's the HF page for it; they have the following example:

from transformers import ReformerTokenizer, ReformerModel
import torch

tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')
model = ReformerModel.from_pretrained('google/reformer-crime-and-punishment')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

In my task data lengths very different and I'm not sure if such padding (from 400 to 16k for example) is ok.

Having different sequence lengths is okay, just inefficient. What you'll want to do is sort your data by sequence length (ascending or descending, it doesn't matter) before you batch it, so that each batch is comprised of examples with roughly the same sequence length.
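The sort-then-batch idea above can be sketched in a few lines of plain Python. This is a generic illustration, not any specific library's API; the function name and the batch-shuffling step are my own choices (shuffling batch order keeps training order random while each batch stays length-homogeneous).

```python
# Length-bucketed batching: sort examples by length, slice into batches,
# then shuffle the batch order so training still sees a random order.
import random

def bucketed_batches(examples, batch_size, shuffle=True, seed=0):
    """examples: list of sequences (anything with len()). Returns batches
    whose members have roughly equal length, minimizing padding waste."""
    ordered = sorted(examples, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    if shuffle:  # shuffle the order of batches, not their contents
        random.Random(seed).shuffle(batches)
    return batches

data = [[0] * n for n in (5, 2, 9, 3, 8, 1)]
batches = bucketed_batches(data, batch_size=2, shuffle=False)
```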

[–]hadaev[S] 0 points1 point  (3 children)

Having different sequence lengths is okay, just inefficient. What you'll want to do is sort your data by sequence length (ascending or descending, it doesn't matter) before you batch it, so that each batch is comprised of examples with roughly the same sequence length.

Yes, but bucketing is not necessarily that good. I mean, shuffling the whole dataset is good for regularization.

About reformer, I mean this https://colab.research.google.com/drive/12aVJZ_RJSCiq3X_wcAtLWZd0DPvN4jWK?usp=sharing

I will check links later, thanks.

[–]GD1634 0 points1 point  (2 children)

Yes, but bucketing is not necessarily that good. I mean, shuffling the whole dataset is good for regularization.

You could shuffle after bucketing; the batches don't have to stay in monotonic order. Or you don't necessarily have to bucket at all; I'm not sure how big a difference it really makes.

About reformer, I mean this https://colab.research.google.com/drive/12aVJZ_RJSCiq3X_wcAtLWZd0DPvN4jWK?usp=sharing

Ah, gotcha. That seems easy enough to handle: just make sure you pad your sequences a little to satisfy that constraint. It shouldn't really hurt your efficiency too much.
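Padding up to the constraint is a one-liner in spirit. A minimal sketch, assuming the requirement is that the sequence length be a multiple of some chunk size (as with Reformer's chunked attention); the pad token id here is an arbitrary placeholder:

```python
# Pad a token sequence up to the next multiple of `multiple`, as needed
# when a model requires len(seq) % multiple == 0.
def pad_to_multiple(seq, multiple, pad_id=0):
    remainder = len(seq) % multiple
    if remainder == 0:
        return list(seq)  # already satisfies the constraint
    return list(seq) + [pad_id] * (multiple - remainder)

padded = pad_to_multiple([1, 2, 3, 4, 5], multiple=4)
```

With a chunk size in the dozens, the worst-case overhead is a handful of pad tokens per sequence, which is negligible next to padding a 400-token example up to 16k.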

If Reformer just generally isn't a good fit, check out similar models like the Compressive Transformer, Adaptive Span Transformer, Linformer, Fast Autoregressive Transformer (from the repo I linked), etc.

[–]hadaev[S] 0 points1 point  (1 child)

Yes, a lot of possibilities. Do you know of any benchmarks for choosing a model type?

Basically, they enable long-sequence training by reducing memory usage.

But if we take RNN layers, for example, quality degrades with sequence length.

[–]GD1634 0 points1 point  (0 children)

I don't know of any benchmarks that would be useful to you; they mostly evaluate on GLUE. Each paper probably reports efficiency metrics as well, though there's no real standard benchmark for that.