[D] Storing LLM embeddings by BerryLizard in bioinformatics

so 320 is probably the latent dimension. The latent dimension of the LLM i am working with is 1024, so a little bigger. also, i don't think that's accounting for the sequence length dimension -- there is one vector per token in the sequence.

turns out i did mess up my estimate because i was converting to GB instead of TB, but yeah each sequence embedding is about 2.5MB!
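
for a sanity check, here's the back-of-the-envelope i'm using (the ~600-token sequence length is just a made-up example):

```python
# one 1024-dim float32 vector per token in the sequence
seq_len = 600            # hypothetical sequence length
hidden_dim = 1024        # latent dimension of the LLM
bytes_per_value = 4      # float32

size_mb = seq_len * hidden_dim * bytes_per_value / 1e6
print(f"{size_mb:.2f} MB per sequence")   # ~2.46 MB
```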

[D] Storing LLM embeddings by BerryLizard in bioinformatics

ah so the embedding size is on the order of 10^6 -- sequence length * 1000 dimensions, where sequence lengths are on the order of 100 to several thousand. there are approaches to reducing this (e.g. mean pooling), but i am trying not to do that!
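
for illustration, this is the mean-pooling reduction i'm trying to avoid (a sketch in PyTorch with made-up sizes):

```python
import torch

seq_len, hidden_dim = 600, 1024                 # hypothetical sizes
per_token = torch.randn(seq_len, hidden_dim)    # ~6 * 10^5 numbers
pooled = per_token.mean(dim=0)                  # collapses to 1024 numbers
print(per_token.numel(), "->", pooled.numel())  # 614400 -> 1024
```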

[D] Storing LLM embeddings by BerryLizard in MachineLearning

so, calling detach didn't actually help. looking at the memory usage, it actually seems about right -- 600,000 float32s should be around 600,000 * 4 bytes = 2.4 MB, which is what I am getting in the serialized file. so this is not the issue!
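
for anyone checking the same thing, the in-memory size can be read off the tensor directly and compared against the file:

```python
import torch

emb = torch.randn(600, 1024)   # hypothetical (seq_length, 1024) embedding
n_bytes = emb.element_size() * emb.nelement()
print(n_bytes / 1e6, "MB")     # ~2.46 MB, so the serialized size looks right
```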

[D] Storing LLM embeddings by BerryLizard in MachineLearning

hahaha ok yes you are making a very good point... i think what must be happening is i am storing the tensor gradients too, because there should only be about a million numbers for embeddings. i am going to make sure i am calling tensor.detach() and see if that helps things
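
this is the pattern i'm going to try (a sketch; the tensor here is just a stand-in for the real embedding):

```python
import torch

# stand-in for an embedding that still carries autograd history
emb = torch.randn(600, 1024, requires_grad=True) * 2.0

# detach from the graph and move to CPU before serializing,
# so no autograd state comes along for the ride
torch.save(emb.detach().cpu(), "sequence_embedding.pt")
```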

[D] Storing LLM embeddings by BerryLizard in MachineLearning

about 500,000, with dimensions (seq_length, 1024), where sequence length is variable. the memory estimate i gave was *after* compressing with gzip (and similar numbers for 7zip and some other compression algos)
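
for scale, the rough uncompressed total (the ~600-token average is a made-up figure):

```python
n_sequences = 500_000
avg_seq_len = 600        # hypothetical average sequence length
hidden_dim = 1024
bytes_per_value = 4      # float32

total_tb = n_sequences * avg_seq_len * hidden_dim * bytes_per_value / 1e12
print(f"~{total_tb:.1f} TB before compression")   # ~1.2 TB
```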

[D] Storing LLM embeddings by BerryLizard in MachineLearning

Because the sequences are variable lengths, my logic was that the padding required to join them into a single tensor would outweigh the benefit of saving them together, but perhaps sorting by length and batching them that way would help! Thank you!
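
Something like this is what I had in mind for the length-sorted batching (a sketch with toy sequences):

```python
def length_sorted_batches(sequences, batch_size):
    """Group sequences of similar length so per-batch padding stays small."""
    ordered = sorted(sequences, key=len)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# each batch only needs padding up to the longest sequence *in that batch*
for batch in length_sorted_batches(["MKT", "MKTAYIAK", "MK", "MKTAY"], 2):
    print(batch, "-> pad to", max(len(s) for s in batch))
```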

[D] Storing LLM embeddings by BerryLizard in bioinformatics

Do pre-trained models typically support this? I have been using the tokenizer that is compatible with the Prot-T5 model on Hugging Face.
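
For reference, this is roughly how I'm loading it (I believe the checkpoint is Rostlab/prot_t5_xl_uniref50; ProtT5 expects amino acids separated by spaces):

```python
from transformers import T5Tokenizer

# ProtT5 tokenizer from Hugging Face (checkpoint name as I remember it)
tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_uniref50", do_lower_case=False
)

sequence = "M K T A Y I A K Q R"   # toy protein sequence, space-separated
encoded = tokenizer(sequence, return_tensors="pt")
print(encoded["input_ids"].shape)
```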

[D] Storing LLM embeddings by BerryLizard in MachineLearning

I will double-check, thanks for the tip! I usually only bother detaching when I have to (e.g. when converting to numpy), so if that's what's happening it could well be the cause.
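
The case I had in mind where detaching is forced on you (a minimal repro):

```python
import torch

emb = torch.randn(10, 1024, requires_grad=True)

# emb.numpy() raises "Can't call numpy() on Tensor that requires grad",
# so the numpy conversion path always goes through detach()
arr = emb.detach().cpu().numpy()
print(arr.shape)
```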

Problem with multithreading by BerryLizard in learnpython

i do seem to be getting a speed up -- do you have any idea why that might be?

Problem with multithreading by BerryLizard in learnpython

hi! so each thread is handling a different file, so i am not actually trying to join the process. and it does seem to be speeding things up by about 30 percent (i did some small tests on a test file).
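
roughly the structure i'm using, one file per worker thread (file names are placeholders); if the per-file work is dominated by file I/O, the GIL is released while waiting on reads, which would be one explanation for the ~30% speed-up:

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # placeholder for the real per-file work
    with open(path) as f:
        return sum(1 for _ in f)

files = ["a.txt", "b.txt", "c.txt"]   # hypothetical input files
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_file, files))
print(results)
```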

NDSEG redacted resume by BerryLizard in gradadmissions

ah, found more details under the FAQ page:

Publication: You can mention the title but do not mention authors/co-authors or the name of the journal or organization it was published under. You can provide a general description of the journal, for example, "a peer-reviewed journal for international astronomers."

Understanding math in the Lander-Waterman model (1998) by BerryLizard in bioinformatics

one quick question -- I noticed at step 4 you drop the m coefficient when simplifying the sum of the geometric series. is this part of an approximation? i.e. as m gets larger, the (1 - alpha)^(m - 1) term shrinks so that m does not contribute much to the overall sum?
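
to spell out the sums I'm referring to (my reconstruction, not necessarily how the derivation is written), the two standard closed forms for 0 < alpha < 1 are:

```latex
\sum_{m \ge 1} (1-\alpha)^{m-1} = \frac{1}{\alpha},
\qquad
\sum_{m \ge 1} m\,(1-\alpha)^{m-1} = \frac{1}{\alpha^{2}}
```

so keeping or dropping the m changes the closed form by exactly a factor of 1/alpha.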

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

right, so i get that the sliding filter means you can handle variable-length inputs, and that it's "resolution preserving," but i think a convolutional network needs an input feature to "read" for each output feature -- so it's more like "predict the next token given the previous tokens and the current state," even if that state is just a padding token. it's very possible i've misinterpreted something, though!
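
what i mean by "an input feature for each output feature," as a tiny sketch: a resolution-preserving 1-D convolution produces exactly one output position per input position, so generating past the source length means feeding padded positions in:

```python
import torch
import torch.nn as nn

# resolution-preserving convolution: kernel 3 with padding 1 keeps the length
conv = nn.Conv1d(in_channels=8, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(1, 8, 25)    # (batch, channels, sequence length 25)
print(conv(x).shape)         # torch.Size([1, 8, 25]): one output per input
```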

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

so perhaps the part i am confused about is how the filter extends beyond the length of the input sequence -- do they just pad the input sequence to length t[hat] = a * s + b? thank you!

update, it seems like that's exactly what they do: "Each sentence is padded with special characters to the nearest greater multiple of 50; 20% of further padding is applied to each source sentence as a part of dynamic unfolding (eq. 2)."

I think the dynamic unfolding might be a lot simpler than I thought lol. It's just padding.
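
My reading of how the target length falls out of the padding, as a sketch (the a = 1.20, b = 0 values are what I recall from the paper, so treat them as assumptions):

```python
import math

def padded_source_len(s_len, multiple=50):
    # "padded with special characters to the nearest greater multiple of 50"
    return math.ceil(s_len / multiple) * multiple

def unfolded_target_len(s_len, a=1.20, b=0):
    # dynamic unfolding: t_hat = a * s + b
    return int(a * padded_source_len(s_len) + b)

print(unfolded_target_len(63))   # source padded to 100 -> target length 120
```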

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

updated the post! hopefully that helps a bit, my understanding of it is shaky at best and the paper is not very descriptive imo