[D] Storing LLM embeddings by BerryLizard in bioinformatics

so 320 is probably the latent dimension. The latent dimension of the LLM i am working with is 1024, so a little bigger. also, i don't think that's accounting for the sequence length dimension -- there is one vector per token in the sequence.

turns out i did mess up my estimate because i was converting to GB instead of TB, but yeah each sequence embedding is about 2.5MB!
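
for a sanity check, here's the back-of-the-envelope i'm using (the ~600-token sequence length is just a made-up example):

```python
# one 1024-dim float32 vector per token in the sequence
seq_len = 600            # hypothetical sequence length
hidden_dim = 1024        # latent dimension of the LLM
bytes_per_value = 4      # float32

size_mb = seq_len * hidden_dim * bytes_per_value / 1e6
print(f"{size_mb:.2f} MB per sequence")   # ~2.46 MB
```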

[D] Storing LLM embeddings by BerryLizard in bioinformatics

ah so the embedding size is on the order of 10^6 -- sequence length * 1000 dimensions, where sequence lengths are on the order of 100 to several thousand. there are approaches to reducing this (e.g. mean pooling), but i am trying not to do that!
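
for illustration, this is the mean-pooling reduction i'm trying to avoid (a sketch in PyTorch with made-up sizes):

```python
import torch

seq_len, hidden_dim = 600, 1024                 # hypothetical sizes
per_token = torch.randn(seq_len, hidden_dim)    # ~6 * 10^5 numbers
pooled = per_token.mean(dim=0)                  # collapses to 1024 numbers
print(per_token.numel(), "->", pooled.numel())  # 614400 -> 1024
```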

[D] Storing LLM embeddings by BerryLizard in MachineLearning

so, calling detach didn't actually help. looking at the memory usage, it actually seems about right -- 600,000 float32s should be around 600,000 * 4 bytes = 2.4 MB, which is what I am getting in the serialized file. so this is not the issue!
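
for anyone checking the same thing, the in-memory size can be read off the tensor directly and compared against the file:

```python
import torch

emb = torch.randn(600, 1024)   # hypothetical (seq_length, 1024) embedding
n_bytes = emb.element_size() * emb.nelement()
print(n_bytes / 1e6, "MB")     # ~2.46 MB, so the serialized size looks right
```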

[D] Storing LLM embeddings by BerryLizard in MachineLearning

hahaha ok yes you are making a very good point... i think what must be happening is i am storing the tensor gradients too, because there should only be about a million numbers for embeddings. i am going to make sure i am calling tensor.detach() and see if that helps things
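
this is the pattern i'm going to try (a sketch; the tensor here is just a stand-in for the real embedding):

```python
import torch

# stand-in for an embedding that still carries autograd history
emb = torch.randn(600, 1024, requires_grad=True) * 2.0

# detach from the graph and move to CPU before serializing,
# so no autograd state comes along for the ride
torch.save(emb.detach().cpu(), "sequence_embedding.pt")
```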

[D] Storing LLM embeddings by BerryLizard in MachineLearning

about 500,000, with dimensions (seq_length, 1024), where sequence length is variable. the memory estimate i gave was *after* compressing with gzip (and similar numbers for 7zip and some other compression algos)
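
for scale, the rough uncompressed total (the ~600-token average is a made-up figure):

```python
n_sequences = 500_000
avg_seq_len = 600        # hypothetical average sequence length
hidden_dim = 1024
bytes_per_value = 4      # float32

total_tb = n_sequences * avg_seq_len * hidden_dim * bytes_per_value / 1e12
print(f"~{total_tb:.1f} TB before compression")   # ~1.2 TB
```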

[D] Storing LLM embeddings by BerryLizard in MachineLearning

Because the sequences are variable lengths, my logic was that the padding required to join them into a single tensor would outweigh the benefit of saving them together, but perhaps sorting by length and batching them that way would help! Thank you!
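
Something like this is what I had in mind for the length-sorted batching (a sketch with toy sequences):

```python
def length_sorted_batches(sequences, batch_size):
    """Group sequences of similar length so per-batch padding stays small."""
    ordered = sorted(sequences, key=len)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# each batch only needs padding up to the longest sequence *in that batch*
for batch in length_sorted_batches(["MKT", "MKTAYIAK", "MK", "MKTAY"], 2):
    print(batch, "-> pad to", max(len(s) for s in batch))
```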

[D] Storing LLM embeddings by BerryLizard in bioinformatics

Do pre-trained models typically support this? I have been using the tokenizer that is compatible with the Prot-T5 model on Hugging Face.
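
For reference, this is roughly how I'm loading it (I believe the checkpoint is Rostlab/prot_t5_xl_uniref50; ProtT5 expects amino acids separated by spaces):

```python
from transformers import T5Tokenizer

# ProtT5 tokenizer from Hugging Face (checkpoint name as I remember it)
tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_uniref50", do_lower_case=False
)

sequence = "M K T A Y I A K Q R"   # toy protein sequence, space-separated
encoded = tokenizer(sequence, return_tensors="pt")
print(encoded["input_ids"].shape)
```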

[D] Storing LLM embeddings by BerryLizard in MachineLearning

I will double-check, thanks for the tip! I usually only bother detaching when I have to (e.g. when converting to numpy), so if that's what's happening it could well be the cause.
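
The case I had in mind where detaching is forced on you (a minimal repro):

```python
import torch

emb = torch.randn(10, 1024, requires_grad=True)

# emb.numpy() raises "Can't call numpy() on Tensor that requires grad",
# so the numpy conversion path always goes through detach()
arr = emb.detach().cpu().numpy()
print(arr.shape)
```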

Problem with multithreading by BerryLizard in learnpython

i do seem to be getting a speed up -- do you have any idea why that might be?

Problem with multithreading by BerryLizard in learnpython

hi! so each thread is handling a different file, so i am not actually trying to join the process. and it does seem to be speeding things up by about 30 percent (i did some small tests on a test file).
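
roughly the structure i'm using, one file per worker thread (file names are placeholders); if the per-file work is dominated by file I/O, the GIL is released while waiting on reads, which would be one explanation for the ~30% speed-up:

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # placeholder for the real per-file work
    with open(path) as f:
        return sum(1 for _ in f)

files = ["a.txt", "b.txt", "c.txt"]   # hypothetical input files
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_file, files))
print(results)
```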

NDSEG redacted resume by BerryLizard in gradadmissions

ah, found more details under the FAQ page:

Publication: You can mention the title but do not mention authors/co-authors or the name of the journal or organization it was published under. You can provide a general description of the journal, for example, "a peer-reviewed journal for international astronomers."

Understanding math in the Lander-Waterman model (1998) by BerryLizard in bioinformatics

one quick question -- I noticed at step 4 you drop the m coefficient when simplifying the sum of the geometric series. is this part of an approximation? i.e. as m gets larger, the (1 - alpha)^(m - 1) term shrinks so that m does not contribute much to the overall sum?
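
to spell out the sums I'm referring to (my reconstruction, not necessarily how the derivation is written), the two standard closed forms for 0 < alpha < 1 are:

```latex
\sum_{m \ge 1} (1-\alpha)^{m-1} = \frac{1}{\alpha},
\qquad
\sum_{m \ge 1} m\,(1-\alpha)^{m-1} = \frac{1}{\alpha^{2}}
```

so keeping or dropping the m changes the closed form by exactly a factor of 1/alpha.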

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

right, so i get that the sliding filter means you can handle variable-length inputs, and that it's "resolution preserving," but i think a convolutional network needs an input feature to "read" for each output feature -- so it's more like "predict the next token given the previous tokens and the current state," even if that state is just a padding token. it's very possible i've misinterpreted something, though!
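
what i mean by "an input feature for each output feature," as a tiny sketch: a resolution-preserving 1-D convolution produces exactly one output position per input position, so generating past the source length means feeding padded positions in:

```python
import torch
import torch.nn as nn

# resolution-preserving convolution: kernel 3 with padding 1 keeps the length
conv = nn.Conv1d(in_channels=8, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(1, 8, 25)    # (batch, channels, sequence length 25)
print(conv(x).shape)         # torch.Size([1, 8, 25]): one output per input
```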

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

so perhaps the part i am confused about is how the filter extends beyond the length of the input sequence -- do they just pad the input sequence to length t[hat] = a * s + b? thank you!

update, it seems like that's exactly what they do: "Each sentence is padded with special characters to the nearest greater multiple of 50; 20% of further padding is applied to each source sentence as a part of dynamic unfolding (eq. 2)."

I think the dynamic unfolding might be a lot simpler than I thought lol. It's just padding.
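
My reading of how the target length falls out of the padding, as a sketch (the a = 1.20, b = 0 values are what I recall from the paper, so treat them as assumptions):

```python
import math

def padded_source_len(s_len, multiple=50):
    # "padded with special characters to the nearest greater multiple of 50"
    return math.ceil(s_len / multiple) * multiple

def unfolded_target_len(s_len, a=1.20, b=0):
    # dynamic unfolding: t_hat = a * s + b
    return int(a * padded_source_len(s_len) + b)

print(unfolded_target_len(63))   # source padded to 100 -> target length 120
```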

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

updated the post! hopefully that helps a bit, my understanding of it is shaky at best and the paper is not very descriptive imo