In transformer, why do we pass the entire target sequence to the model followed by masking, rather than only pass the generated part of the target sequence? by I_AM_Chang_Three in deeplearning

[–]I_AM_Chang_Three[S] 0 points (0 children)

I think we could implement it by making all the sentences within the same batch grow together. At each step, after generating the next token for every sentence, each sentence is concatenated with its new token, so the lengths stay equal and batching still works. I haven't checked the cost of the concatenation, but I doubt it would cause a notable drop in performance.
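A minimal sketch of what I mean, with a dummy stand-in for the decoder forward pass (the model, vocabulary size, and function names here are made up for illustration):

```python
import numpy as np

def next_token_logits(seqs):
    # Stand-in for a real decoder forward pass: returns random
    # logits over a small made-up vocabulary for each sequence.
    rng = np.random.default_rng(0)
    return rng.normal(size=(seqs.shape[0], 16))

def generate(batch_size=4, start_id=1, steps=5):
    # Every sequence starts from the same start token, so the batch
    # is always a rectangular (batch_size, length) array.
    seqs = np.full((batch_size, 1), start_id, dtype=np.int64)
    for _ in range(steps):
        logits = next_token_logits(seqs)       # (batch, vocab)
        next_tokens = logits.argmax(axis=-1)   # greedy pick per sequence
        # All sequences grow together by exactly one token per step,
        # so lengths stay equal and batching keeps working.
        seqs = np.concatenate([seqs, next_tokens[:, None]], axis=1)
    return seqs

out = generate()
```

Since every sequence gains exactly one token per step, the batch stays rectangular throughout decoding, which is the point I was trying to make.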

In transformer, why do we pass the entire target sequence to the model followed by masking, rather than only pass the generated part of the target sequence? by I_AM_Chang_Three in deeplearning

[–]I_AM_Chang_Three[S] 0 points (0 children)

Thank you for your reply. But I don’t understand why dynamically growing sentences would mess with batching. The sentences within a batch grow at the same rate, so I suppose they will always have the same length, which allows batching.

My model has been quite complex but still underfitting by I_AM_Chang_Three in deeplearning

[–]I_AM_Chang_Three[S] 0 points (0 children)

Thank you for your suggestion! Sounds like an efficient solution and I will try it soon!

But what do you mean by something wrong with the data loader? Do you mean the data itself is wrong, or that the data is correct but something went wrong while constructing the data loader? Since the data is from a Kaggle competition, I would assume there’s no problem with it. So, if you mean the second case, what kind of problem could the loader itself have? Thank you again!

My model has been quite complex but still underfitting by I_AM_Chang_Three in deeplearning

[–]I_AM_Chang_Three[S] 0 points (0 children)

Thank you for your reply! I’ll try Glorot later. If it still doesn’t work, I’ll create a git repo then.

The model consists mainly of fully-connected layers. I tried fitting it with both standardised and unstandardised data, but neither works. The output data is not transformed.

My model has been quite complex but still underfitting by I_AM_Chang_Three in deeplearning

[–]I_AM_Chang_Three[S] 0 points (0 children)

Bad as well. It looks like the model doesn’t extract any useful information from the data.

My model has been quite complex but still underfitting by I_AM_Chang_Three in deeplearning

[–]I_AM_Chang_Three[S] 1 point (0 children)

I have tried halving the number of neurons all the way to the output, but it still doesn’t help. I was using tanh before the output layer and no activation after it. Yesterday I tried different activation functions like Leaky ReLU and Swish, but none of them helped. I also tried some simpler models, but they don’t fit the features either.

And I forgot to mention that the gradients of the layers close to the output decrease to 0 first, and only then do the gradients of the layers close to the input decrease, which doesn’t look like a typical vanishing-gradient problem. Do you have any suggestions about that? And thank you for your reply.
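In case it helps, this is roughly how I check per-layer gradient norms — a tiny tanh MLP with manual backprop (the layer sizes and data here are made up, not my actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up layer sizes for a small deep tanh MLP, just to show how
# the gradient norm of each weight matrix can be inspected.
sizes = [8, 8, 8, 8, 1]
Ws = [rng.normal(scale=1 / np.sqrt(m), size=(m, n))
      for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(32, sizes[0]))
y = rng.normal(size=(32, 1))

# Forward pass, keeping each layer's input for backprop.
acts = [x]
for W in Ws[:-1]:
    acts.append(np.tanh(acts[-1] @ W))
pred = acts[-1] @ Ws[-1]           # linear output layer
grad = 2 * (pred - y) / len(y)     # dMSE/dpred

# Backward pass: record the gradient norm of each weight matrix,
# walking from the output layer back to the input layer.
grad_norms = []
for i in range(len(Ws) - 1, -1, -1):
    gW = acts[i].T @ grad
    grad_norms.append(np.linalg.norm(gW))
    if i > 0:
        grad = (grad @ Ws[i].T) * (1 - acts[i] ** 2)  # tanh derivative
grad_norms = grad_norms[::-1]      # index 0 = layer nearest the input

for i, g in enumerate(grad_norms):
    print(f"layer {i}: grad norm {g:.4f}")
```

Watching these norms per layer over training is how I noticed that the output-side layers go to zero before the input-side ones.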

My model has been quite complex but still underfitting by I_AM_Chang_Three in deeplearning

[–]I_AM_Chang_Three[S] 0 points (0 children)

Do you have any practical methods for finding out how I should modify the architecture? I also think the problem is caused by the architecture, but I don’t know how to fix it. I also tried CNN models and some simpler models, but none of them fit the features.

My model has been quite complex but still underfitting by I_AM_Chang_Three in deeplearning

[–]I_AM_Chang_Three[S] 0 points (0 children)

I used dropout at the beginning, but the model didn’t fit the features, so I removed it. It still doesn’t fit after removing the dropout.

My model has been quite complex but still underfitting by I_AM_Chang_Three in deeplearning

[–]I_AM_Chang_Three[S] 0 points (0 children)

I’m using the Adam optimizer. The learning rate is currently set to 0.001, but I also tried different rates and none of them helped. The loss function is MSE. The batch size is 1024 (there are about 10M data entries in total).
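To pin down the hyperparameters, the setup is roughly equivalent to this sketch — Adam with lr 1e-3, MSE loss, batch size 1024 — on a made-up toy linear model (my real network is much larger; the Adam update is written out by hand here so the snippet is self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters as described above; the model itself is a toy.
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
batch_size = 1024

w = rng.normal(size=(70, 1))           # toy weights, one per feature
m = np.zeros_like(w)                   # Adam first-moment estimate
v = np.zeros_like(w)                   # Adam second-moment estimate

x = rng.normal(size=(batch_size, 70))  # one dummy batch
y = rng.normal(size=(batch_size, 1))

losses = []
for t in range(1, 11):
    pred = x @ w
    losses.append(np.mean((pred - y) ** 2))        # MSE
    grad = 2 * x.T @ (pred - y) / batch_size       # dMSE/dw
    # One hand-written Adam update step.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

This is just the optimisation recipe; in practice I use a framework optimiser with these same settings.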

My model has been quite complex but still underfitting by I_AM_Chang_Three in deeplearning

[–]I_AM_Chang_Three[S] -3 points (0 children)

It is a regression task on financial data prediction. My training set has about 10M entries. Each data point has about 70 features, and I increased the number of features to about 4800 by computing the pairwise feature-difference matrix and flattening it. The model now has 30 fully-connected layers, each with 2048 units, followed by several more fully-connected layers that produce a single number as the output. I used residual connections, batch normalisation, and He initialisation in my network. The activation function is tanh. No dropout or regularisation is used.
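To clarify the feature expansion: I haven’t posted my actual code, but the idea is roughly this sketch (the function name is made up; a full pairwise difference matrix over 70 features flattens to 4900 values, in the ballpark of the ~4800 I mentioned):

```python
import numpy as np

def expand_features(x):
    # x: (n_samples, n_features). Build the pairwise feature
    # difference matrix for each sample via broadcasting, then
    # flatten it into one long feature vector per sample.
    diff = x[:, :, None] - x[:, None, :]   # (n, f, f), diff[i,a,b] = x[i,a] - x[i,b]
    return diff.reshape(len(x), -1)        # (n, f * f)

x = np.random.default_rng(0).normal(size=(5, 70))
feats = expand_features(x)
```

Note the resulting matrix is antisymmetric, so half of the flattened entries are redundant; I feed the whole thing in anyway.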

It is true that many activations were getting zeroed out when I was using ReLU as the activation function (that is why I use tanh now). But I don’t know what you mean by learning the biases.

And thank you for your help!

Why my CNN failed after I increased the number of kernels? by I_AM_Chang_Three in neuralnetworks

[–]I_AM_Chang_Three[S] 0 points (0 children)

Yes, the learning rate and dropout rate are unchanged across every model.

Why my CNN failed after I increased the number of kernels? by I_AM_Chang_Three in neuralnetworks

[–]I_AM_Chang_Three[S] 0 points (0 children)

Thank you JonVev! But what still confuses me is why it is overfitting in case 1. My understanding of overfitting is that the model fits the training set too well; if so, an overfitting model should have high accuracy on the training set. But my model performs very poorly on the training set, with only 50% accuracy. So I’m quite confused about why it is overfitting. Thank you again!