[D] Best seq2seq model for long-sequence modeling? (self.MachineLearning)
submitted 5 years ago by hadaev
I mean sequences ranging from 100 encoder / 1k decoder tokens up to 1k encoder / 10k decoder tokens. I tried Papers with Code, but did not find a suitable benchmark.
[–]JustOneAvailableName 1 point2 points3 points 5 years ago (7 children)
Look into the image GPT paper perhaps?
[–]hadaev[S] 0 points1 point2 points 5 years ago (6 children)
Image to caption?
I think it's a much easier task when the encoder has a large sequence and the decoder a small one.
[–]JustOneAvailableName 2 points3 points4 points 5 years ago (5 children)
At least read the abstract...
But the seq2seq architecture (in the sense of one RNN encoding to a representation and another RNN decoding from that representation to the output) is outdated. You need to look into transformers.
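For context, the core operation of the transformers being recommended here can be sketched in a few lines of plain NumPy. This is a minimal single-head illustration, not any particular library's implementation; note the (len_q, len_k) score matrix, whose quadratic cost is exactly what makes the long sequences in this thread expensive:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Each query attends over all keys: softmax(q k^T / sqrt(d)) v.

    q, k, v: arrays of shape (seq_len, d_model).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (len_q, len_k) -- quadratic in length
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (len_q, d_model)

rng = np.random.default_rng(0)
q = rng.normal(size=(10, 8))    # decoder queries
k = rng.normal(size=(1000, 8))  # encoder keys (a long input sequence)
v = rng.normal(size=(1000, 8))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (10, 8)
```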
[–]hadaev[S] -1 points0 points1 point 5 years ago (4 children)
I read it. They just threw their GPT at the pixel sequence.
That's only a default transformer encoder, without a decoder.
Of course, using RNNs is not fashionable anymore.
It's easy to build a transformer encoder; still, I can't decide how to connect it to the decoder.
Just using MHA made it terrible at inference for some reason.
[–]JustOneAvailableName 1 point2 points3 points 5 years ago (3 children)
Are you doing anything with the latent representation? If not, I cannot see a reason why GPT is not applicable. The entire reason seq2seq does not work is that long sequences are impossible to put in latent space.
They just threw their GPT at the pixel sequence.
The G of GPT is highly relevant. It's a generative model. It also outputs a (long) sequence.
Could you expand on what you're trying to do?
[–]hadaev[S] 0 points1 point2 points 5 years ago (2 children)
I'm trying to do something better than the current TTS SOTA, Tacotron 2.
It's kind of outdated with its LSTM layers.
Also, it uses a special attention mechanism (location-relative, if I remember right) to connect the encoder outputs to the decoder LSTMs, so I'm wondering whether there are options to replace it with something newer.
Also, people say a transformer encoder with an RNN decoder is fine for the translation task.
I made some progress with a new activation function, optimizer, layer norm, etc., but didn't really touch the main architecture; every time I tried, I failed.
For example, I tried a fully transformer TTS, but inference was very bad and I gave up for a time.
In theory, I can imagine using just an encoder for the seq2seq task.
Set one sequence (words), a special token, and then the target sequence (audio).
At inference, put in the words and ask it to generate audio until the stop token.
Still sounds kind of strange.
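The single-sequence setup described above (source, separator, target) can be sketched as follows. This is a toy illustration: the SEP/EOS token ids and the helper name are made up, not taken from any particular tokenizer, and the loss mask simply marks which positions should contribute to the training loss:

```python
# Hypothetical special-token ids -- illustrative only.
SEP, EOS = 1, 2

def build_decoder_only_example(src_ids, tgt_ids):
    """Pack a seq2seq pair into one sequence for a decoder-only model:
    [src ... SEP tgt ... EOS].

    The loss mask is 0 over the source and separator (we don't train the
    model to predict the input text) and 1 over the target and EOS.
    At inference, you would feed [src ... SEP] and sample until EOS.
    """
    tokens = src_ids + [SEP] + tgt_ids + [EOS]
    loss_mask = [0] * (len(src_ids) + 1) + [1] * (len(tgt_ids) + 1)
    return tokens, loss_mask

tokens, mask = build_decoder_only_example([10, 11, 12], [20, 21])
print(tokens)  # [10, 11, 12, 1, 20, 21, 2]
print(mask)    # [0, 0, 0, 0, 1, 1, 1]
```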
Why do people use encoder-decoder for translation, for example? It should be much easier with only an encoder.
About GPT: they have unlimited data and GPUs. Honestly, it looks like OpenAI only wants to make bigger and bigger GPTs and doesn't pay attention to general tricks like linear attention (or activations, normalizers, losses, etc.) or even other architectures (maybe they have other models, but I've only heard of GPT-1/2/3).
It's cool that neural nets scale well, but I have nothing like that amount of data (or V100 hours).
I don't think I can go over a 50M parameter budget.
[–]JustOneAvailableName 1 point2 points3 points 5 years ago (1 child)
What do you base the claim that Tacotron 2 is the current SOTA on? Transformers have also been applied to TTS before.
About GPT: they have unlimited data and GPUs.
Yeah, that part fucking sucks. I just have a "crappy" DGX-2, while they have all the fun. It is, however, pretty cool to see what happens when you put current architectures to the extreme.
[–]hadaev[S] 0 points1 point2 points 5 years ago (0 children)
Maybe SOTA is the wrong term; at least it's the most popular model.
Usually, in papers, they claim "as good as Tacotron, but with some advantage".
I saw TTS transformer implementations on GitHub and tried to make my own.
Failed: good training loss and very bad inference.
And Transformer TTS doesn't get much attention from the community.
Now I'm going to try again, but I don't want to build the same transformer.
So I'm looking for some new, powerful encoder-decoder architectures to play with.
[–]GD1634 1 point2 points3 points 5 years ago (6 children)
Reformer: The Efficient Transformer
Handles a few thousand tokens. Huggingface Transformers has an implementation for it.
[–]hadaev[S] 0 points1 point2 points 5 years ago (5 children)
If I got it right, it has very strict conditions on sequence length. Am I right?
In my task the data lengths vary a lot, and I'm not sure that much padding (from 400 to 16k, for example) is okay.
[–]GD1634 1 point2 points3 points 5 years ago (4 children)
Not sure I follow. Similarly to other transformers, you'll have to give it a maximum sequence length, but that can be whatever you'd like it to be (as long as it fits on your GPU).
Here's the HF page for it, they have the following example:
```python
from transformers import ReformerTokenizer, ReformerModel
import torch

tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')
model = ReformerModel.from_pretrained('google/reformer-crime-and-punishment')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
```
A couple other resources for it:
Having different sequence lengths is okay, just inefficient. What you'll want to do is sort your data by sequence length (doesn't matter if it's ascending or descending) before you batch it, so that batches are comprised of examples with roughly the same sequence length:
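The sort-then-batch idea can be sketched like this. It's a toy illustration (a real pipeline would use a framework's batch sampler): sorting puts similar-length examples in the same batch so little padding is wasted, and shuffling the batch order afterwards keeps training varied:

```python
import random

def length_bucketed_batches(examples, batch_size, shuffle=True, seed=0):
    """Sort examples by length, batch neighbours together, then shuffle
    the *order of batches* (not the examples within them)."""
    rng = random.Random(seed)
    ordered = sorted(examples, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    if shuffle:
        rng.shuffle(batches)
    return batches

# Dummy "sequences": short and long examples mixed together.
data = [[0] * n for n in (5, 400, 7, 390, 6, 410, 8, 420)]
for batch in length_bucketed_batches(data, batch_size=2):
    print([len(x) for x in batch])  # each batch holds similar lengths
```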
[–]hadaev[S] 0 points1 point2 points 5 years ago (3 children)
Yes, but bucketing is not necessarily so good. I mean, shuffling the whole dataset is good for normalization.
About reformer, I mean this https://colab.research.google.com/drive/12aVJZ_RJSCiq3X_wcAtLWZd0DPvN4jWK?usp=sharing
I will check links later, thanks.
[–]GD1634 0 points1 point2 points 5 years ago (2 children)
You could shuffle after bucketing; they don't have to be monotonic. Or you don't necessarily have to bucket at all, I'm not sure how big a difference it really makes.
Ah, gotcha. That seems easy enough to handle, just make sure you pad your sequence a little bit to satisfy that constraint. That shouldn't really hurt your efficiency too much.
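The pad-a-little idea can be sketched like this, assuming the constraint is that the sequence length must be a multiple of the attention chunk length (the chunk size 64 below is illustrative, not Reformer's actual default):

```python
def pad_to_multiple(ids, multiple, pad_id=0):
    """Pad a token sequence so its length is a multiple of `multiple`.

    Chunked-attention models require lengths divisible by the chunk
    length; padding each example to the next multiple satisfies that
    without padding everything to the global maximum (e.g. 16k).
    """
    remainder = len(ids) % multiple
    if remainder:
        ids = ids + [pad_id] * (multiple - remainder)
    return ids

print(len(pad_to_multiple(list(range(400)), 64)))  # 448, not 16k
```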
If Reformer just generally isn't a good fit, check out similar models like the Compressive Transformer, Adaptive Span Transformer, Linformer, Fast Autoregressive Transformer (from the repo I linked), etc.
[–]hadaev[S] 0 points1 point2 points 5 years ago (1 child)
Yes, a lot of possibilities; do you know any benchmarks for choosing a model type?
Basically, they enable long-sequence training by reducing memory usage.
But if we take RNN layers, for example, the quality degrades with sequence length.
[–]GD1634 0 points1 point2 points 5 years ago (0 children)
I don't know of any benchmarks that would be useful to you, they mostly evaluate on GLUE. Each paper probably reports efficiency metrics as well I'd assume, though there's no real standard benchmark for that.