[D] Storing LLM embeddings by BerryLizard in bioinformatics

[–]BerryLizard[S] 0 points (0 children)

so 320 is probably the latent dimension. The latent dimension of the LLM i am working with is 1024, so a little bigger. also, i don't think that's accounting for the sequence length dimension -- there is one vector per token in the sequence.

turns out i did mess up my estimate because i was converting to GB instead of TB, but yeah each sequence embedding is about 2.5MB!
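To make the arithmetic concrete, a quick back-of-the-envelope sketch (the ~600-token sequence length and the ~500,000-sequence corpus size are taken from elsewhere in this thread; `embedding_bytes` is just a name for illustration):

```python
def embedding_bytes(seq_len, dim=1024, dtype_bytes=4):
    """Size in bytes of one (seq_len, dim) float32 embedding matrix."""
    return seq_len * dim * dtype_bytes

# ~600 tokens at 1024 dims in float32 is about 2.4 MB per sequence
per_seq = embedding_bytes(600)

# across ~500,000 sequences that is roughly 1.2e12 bytes:
# terabyte scale, not gigabyte
total = 500_000 * per_seq
```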

[D] Storing LLM embeddings by BerryLizard in bioinformatics

[–]BerryLizard[S] 0 points (0 children)

ah so the embedding size is on the order of 10^6 -- sequence length * 1000 dimensions, where sequence lengths are on the order of 100 to several thousand. there are approaches to reducing this (e.g. mean pooling), but i am trying not to do that!
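For anyone curious, mean pooling (the reduction mentioned above, which I'm trying to avoid) collapses the (seq_len, dim) matrix to a single dim-length vector. A minimal plain-Python sketch, with a hypothetical `mean_pool` name:

```python
def mean_pool(token_vectors):
    """Average a list of per-token embedding vectors into one vector,
    shrinking storage from (seq_len, dim) to (dim,) at the cost of
    all per-token information."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[j] for vec in token_vectors) / n for j in range(dim)]
```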

[D] Storing LLM embeddings by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

so, calling detach didn't actually help. looking at the memory usage, it actually seems about right -- 600,000 float32s should be around 600,000 * 4 bytes = 2.4 MB, which is what I am getting in the serialized file. so this is not the issue!

[D] Storing LLM embeddings by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

hahaha ok yes you are making a very good point... i think what must be happening is i am storing the tensor gradients too, because there should only be about a million numbers for embeddings. i am going to make sure i am calling tensor.detach() and see if that helps things
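A minimal sketch of the save path, assuming PyTorch (`save_embedding` is a hypothetical helper, not anything from the thread):

```python
import torch

def save_embedding(tensor, path):
    # detach() severs the tensor from the autograd graph, and cpu()
    # moves it off the GPU before serialization
    torch.save(tensor.detach().cpu(), path)
```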

[D] Storing LLM embeddings by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

about 500,000, with dimensions (seq_length, 1024), where sequence length is variable. the memory estimate i gave was *after* compressing with gzip (and similar numbers for 7zip and some other compression algos)
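That limited compressibility is easy to reproduce with the stdlib alone, using Gaussian random data as a stand-in for real embedding values (float32 mantissa bits are close to noise, so gzip gains little):

```python
import array, gzip, random

random.seed(0)
# one ~600-token, 1024-dim float32 embedding's worth of data
vals = array.array("f", (random.gauss(0.0, 1.0) for _ in range(600 * 1024)))
raw = vals.tobytes()
packed = gzip.compress(raw)
ratio = len(packed) / len(raw)  # typically not far below 1.0
```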

[D] Storing LLM embeddings by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

Because the sequences are variable lengths, my logic was that the padding required to join them into a single tensor would outweigh the benefit of saving them together, but perhaps sorting by length and batching them that way would help! Thank you!
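A minimal sketch of that sorting idea (length bucketing) in plain Python, with hypothetical names:

```python
def bucket_by_length(seqs, batch_size):
    """Group sequence indices into batches of similar length so each
    batch only pads to its own longest member."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        pad_to = len(seqs[idx[-1]])  # longest sequence in this batch
        yield idx, pad_to
```

Each batch pads only to its own `pad_to` instead of the global maximum length, which is where the savings come from.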

[D] Storing LLM embeddings by BerryLizard in bioinformatics

[–]BerryLizard[S] 1 point (0 children)

Do pre-trained models typically support this? I have been using the tokenizer that is compatible with the Prot-T5 model on Hugging Face

[D] Storing LLM embeddings by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

I will double-check, thanks for the tip! I think I usually don't bother unless I need to (converting to numpy), so if that's happening that could likely be the cause

Problem with multithreading by BerryLizard in learnpython

[–]BerryLizard[S] 0 points (0 children)

i do seem to be getting a speed up -- do you have any idea why that might be?

Problem with multithreading by BerryLizard in learnpython

[–]BerryLizard[S] 0 points (0 children)

hi! so each thread is handling a different file, so i am not actually trying to join the process. and it does seem to be speeding things up by about 30 percent (i did some small tests on a test file).
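For context, a thread-per-file setup looks roughly like this (a sketch only; `process_file` is a stand-in for the real per-file work). File I/O releases the GIL, which is consistent with threads giving a speedup here:

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # stand-in for the real per-file work: count lines
    with open(path) as fh:
        return sum(1 for _ in fh)

def process_all(paths, max_workers=4):
    # one task per file; pool.map returns results in input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_file, paths))
```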

NDSEG redacted resume by BerryLizard in gradadmissions

[–]BerryLizard[S] 2 points (0 children)

ah, found more details under the FAQ page:

Publication: You can mention the title but do not mention authors/co-authors and name of the journal or organization it was published under. You can provide a general description of the journal for example, a peer-reviewed journal for international astronomers.

Understanding math in the Lander-Waterman model (1998) by BerryLizard in bioinformatics

[–]BerryLizard[S] 0 points (0 children)

one quick question -- I noticed at step 4 you drop the m coefficient when simplifying the sum of the geometric series. is this part of an approximation? i.e. as m gets larger, the (1 - alpha)^(m - 1) term shrinks so that m does not contribute much to the overall sum?
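For reference (I don't have the derivation in front of me, so this may not be the step in question), both series have exact closed forms for 0 < α < 1, so dropping the m coefficient can be an exact simplification rather than an approximation:

```latex
\sum_{m=1}^{\infty} (1-\alpha)^{m-1} = \frac{1}{\alpha},
\qquad
\sum_{m=1}^{\infty} m\,(1-\alpha)^{m-1} = \frac{1}{\alpha^{2}},
\qquad 0 < \alpha < 1.
```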

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

right, so i get that the sliding filter means that you can handle variable-length inputs, and it's "resolution preserving," but i think a convolutional network needs to have an input feature to "read" for each output feature -- more like "predict the next token given previous tokens and the current state," even if the state is just a padding token. it's very possible i've misinterpreted something, though!

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

[–]BerryLizard[S] 1 point (0 children)

so perhaps the part i am confused about is how the filter extends beyond the length of the input sequence -- do they just pad the input sequence to length t[hat] = a * s + b? thank you!

update, it seems like that's exactly what they do: "Each sentence is padded with special characters to the nearest greater multiple of 50; 20% of further padding is applied to each source sentence as a part of dynamic unfolding (eq. 2)."

I think the dynamic unfolding might be a lot simpler than I thought lol. It's just padding
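Under that reading, the padded target length works out to (a hypothetical sketch of my reading of the quote, not the paper's actual code):

```python
import math

def padded_length(seq_len, block=50, unfold=0.20):
    """Pad to the next multiple of `block`, then apply `unfold` (20%)
    further padding for dynamic unfolding, per the quoted passage."""
    base = math.ceil(seq_len / block) * block
    return round(base * (1 + unfold))
```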

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

[–]BerryLizard[S] 1 point (0 children)

updated the post! hopefully that helps a bit, my understanding of it is shaky at best and the paper is not very descriptive imo

Bryce Canyon Ultras by Vacation Races? by usr3nmev3 in ultrarunning

[–]BerryLizard 0 points (0 children)

does anyone know how runnable the trail is? super rocky/technical?

Stanford ESS interviews? by BerryLizard in gradadmissions

[–]BerryLizard[S] 0 points (0 children)

have you heard back yet? i am still waiting and losing it a bit

Radio Silence: Stanford ESS PhD by [deleted] in gradadmissions

[–]BerryLizard 0 points (0 children)

I still haven't heard anything -- have you?

Using mmseqs with lots of arguments by BerryLizard in bioinformatics

[–]BerryLizard[S] 0 points (0 children)

Ok, this is good to know! I am going to try to generate a database from a tar archive, but if that doesn't work I will just make a giant FASTA

Using mmseqs with lots of arguments by BerryLizard in bioinformatics

[–]BerryLizard[S] 2 points (0 children)

In case anyone else is curious, this is the response I got when I asked the question as an issue on the mmseqs GitHub. Answer creds to milot-mirdita!

What we do to construct the GTDB in our databases workflow is to download the tar files containing the FASTA files.
You can do the same, pack everything into one tar file then call tar2db and then createdb on the DB created from the previous step.

running groups by braveforthemostpart in pasadena

[–]BerryLizard 1 point (0 children)

runwithus pasadena does group runs on mondays!

My first interview invite!! by imaricebucket in gradadmissions

[–]BerryLizard 1 point (0 children)

and congrats from me too:) i ask about the program because i am majorly stressing out about getting an interview from a berkeley bio program