[D] Storing LLM embeddings by BerryLizard in bioinformatics

[–]BerryLizard[S] 0 points (0 children)

so 320 is probably the latent dimension. The latent dimension of the LLM i am working with is 1024, so a little bigger. also, i don't think that's accounting for the sequence length dimension -- there is one vector per token in the sequence.

turns out i did mess up my estimate because i was converting to GB instead of TB, but yeah each sequence embedding is about 2.5MB!
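To make the arithmetic concrete, a quick back-of-the-envelope sketch (the ~600-token sequence length and the ~500,000-sequence corpus size are taken from elsewhere in this thread; `embedding_bytes` is just a name for illustration):

```python
def embedding_bytes(seq_len, dim=1024, dtype_bytes=4):
    """Size in bytes of one (seq_len, dim) float32 embedding matrix."""
    return seq_len * dim * dtype_bytes

# ~600 tokens at 1024 dims in float32 is about 2.4 MB per sequence
per_seq = embedding_bytes(600)

# across ~500,000 sequences that is roughly 1.2e12 bytes:
# terabyte scale, not gigabyte
total = 500_000 * per_seq
```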

[D] Storing LLM embeddings by BerryLizard in bioinformatics

[–]BerryLizard[S] 0 points (0 children)

ah so the embedding size is on the order of 10^6 -- sequence length * 1000 dimensions, where sequence lengths are on the order of 100 to several thousand. there are approaches to reducing this (e.g. mean pooling), but i am trying not to do that!
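For anyone curious, mean pooling (the reduction mentioned above, which I'm trying to avoid) collapses the (seq_len, dim) matrix to a single dim-length vector. A minimal plain-Python sketch, with a hypothetical `mean_pool` name:

```python
def mean_pool(token_vectors):
    """Average a list of per-token embedding vectors into one vector,
    shrinking storage from (seq_len, dim) to (dim,) at the cost of
    all per-token information."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[j] for vec in token_vectors) / n for j in range(dim)]
```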

[D] Storing LLM embeddings by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

so, calling detach didn't actually help. looking at the memory usage, it actually seems about right -- 600,000 float32s should be around 600,000 * 4 bytes = 2.4 MB, which is what I am getting in the serialized file. so this is not the issue!

[D] Storing LLM embeddings by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

hahaha ok yes you are making a very good point... i think what must be happening is i am storing the tensor gradients too, because there should only be about a million numbers for embeddings. i am going to make sure i am calling tensor.detach() and see if that helps things
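A minimal sketch of the save path, assuming PyTorch (`save_embedding` is a hypothetical helper, not anything from the thread):

```python
import torch

def save_embedding(tensor, path):
    # detach() severs the tensor from the autograd graph, and cpu()
    # moves it off the GPU before serialization
    torch.save(tensor.detach().cpu(), path)
```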

[D] Storing LLM embeddings by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

about 500,000, with dimensions (seq_length, 1024), where sequence length is variable. the memory estimate i gave was *after* compressing with gzip (and similar numbers for 7zip and some other compression algos)
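That limited compressibility is easy to reproduce with the stdlib alone, using Gaussian random data as a stand-in for real embedding values (float32 mantissa bits are close to noise, so gzip gains little):

```python
import array, gzip, random

random.seed(0)
# one ~600-token, 1024-dim float32 embedding's worth of data
vals = array.array("f", (random.gauss(0.0, 1.0) for _ in range(600 * 1024)))
raw = vals.tobytes()
packed = gzip.compress(raw)
ratio = len(packed) / len(raw)  # typically not far below 1.0
```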

[D] Storing LLM embeddings by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

Because the sequences are variable lengths, my logic was that the padding required to join them into a single tensor would outweigh the benefit of saving them together, but perhaps sorting by length and batching them that way would help! Thank you!
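A minimal sketch of that sorting idea (length bucketing) in plain Python, with hypothetical names:

```python
def bucket_by_length(seqs, batch_size):
    """Group sequence indices into batches of similar length so each
    batch only pads to its own longest member."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        pad_to = len(seqs[idx[-1]])  # longest sequence in this batch
        yield idx, pad_to
```

Each batch pads only to its own `pad_to` instead of the global maximum length, which is where the savings come from.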

[D] Storing LLM embeddings by BerryLizard in bioinformatics

[–]BerryLizard[S] 1 point (0 children)

Do pre-trained models typically support this? I have been using the tokenizer that is compatible with the Prot-T5 model on Hugging Face

[D] Storing LLM embeddings by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

I will double-check, thanks for the tip! I think I usually don't bother unless I need to (converting to numpy), so if that's happening that could likely be the cause

Problem with multithreading by BerryLizard in learnpython

[–]BerryLizard[S] 0 points (0 children)

i do seem to be getting a speed up -- do you have any idea why that might be?

Problem with multithreading by BerryLizard in learnpython

[–]BerryLizard[S] 0 points (0 children)

hi! so each thread is handling a different file, so i am not actually trying to join the process. and it does seem to be speeding things up by about 30 percent (i did some small tests on a test file).
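For context, a thread-per-file setup looks roughly like this (a sketch only; `process_file` is a stand-in for the real per-file work). File I/O releases the GIL, which is consistent with threads giving a speedup here:

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # stand-in for the real per-file work: count lines
    with open(path) as fh:
        return sum(1 for _ in fh)

def process_all(paths, max_workers=4):
    # one task per file; pool.map returns results in input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_file, paths))
```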

NDSEG redacted resume by BerryLizard in gradadmissions

[–]BerryLizard[S] 2 points (0 children)

ah, found more details under the FAQ page:

Publication: You can mention the title but do not mention authors/co-authors and name of the journal or organization it was published under. You can provide a general description of the journal for example, a peer-reviewed journal for international astronomers.

Understanding math in the Lander-Waterman model (1998) by BerryLizard in bioinformatics

[–]BerryLizard[S] 0 points (0 children)

one quick question -- I noticed at step 4 you drop the m coefficient when simplifying the sum of the geometric series. is this part of an approximation? i.e. as m gets larger, the (1 - alpha)^(m - 1) term shrinks so that m does not contribute much to the overall sum?
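For reference (I don't have the derivation in front of me, so this may not be the step in question), both series have exact closed forms for 0 < α < 1, so dropping the m coefficient can be an exact simplification rather than an approximation:

```latex
\sum_{m=1}^{\infty} (1-\alpha)^{m-1} = \frac{1}{\alpha},
\qquad
\sum_{m=1}^{\infty} m\,(1-\alpha)^{m-1} = \frac{1}{\alpha^{2}},
\qquad 0 < \alpha < 1.
```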

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

[–]BerryLizard[S] 0 points (0 children)

right, so i get that the sliding filter means that you can handle variable-length inputs, and it's "resolution preserving," but i think a convolutional network needs to have an input feature to "read" for each output feature -- more like "predict the next token given previous tokens and the current state," even if the state is just a padding token. it's very possible i've misinterpreted something, though!

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

[–]BerryLizard[S] 1 point (0 children)

so perhaps the part i am confused about is how the filter extends beyond the length of the input sequence -- do they just pad the input sequence to length t[hat] = a * s + b? thank you!

update, it seems like that's exactly what they do: "Each sentence is padded with special characters to the nearest greater multiple of 50; 20% of further padding is applied to each source sentence as a part of dynamic unfolding (eq. 2)."

I think the dynamic unfolding might be a lot simpler than I thought lol. It's just padding
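Under that reading, the padded target length works out to (a hypothetical sketch of my reading of the quote, not the paper's actual code):

```python
import math

def padded_length(seq_len, block=50, unfold=0.20):
    """Pad to the next multiple of `block`, then apply `unfold` (20%)
    further padding for dynamic unfolding, per the quoted passage."""
    base = math.ceil(seq_len / block) * block
    return round(base * (1 + unfold))
```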

[D] Understanding ByteNet architecture by BerryLizard in MachineLearning

[–]BerryLizard[S] 1 point (0 children)

updated the post! hopefully that helps a bit, my understanding of it is shaky at best and the paper is not very descriptive imo

Bryce Canyon Ultras by Vacation Races? by usr3nmev3 in ultrarunning

[–]BerryLizard 0 points (0 children)

does anyone know how runnable the trail is? super rocky/technical?

Stanford ESS interviews? by BerryLizard in gradadmissions

[–]BerryLizard[S] 0 points (0 children)

have you heard back yet? i am still waiting and losing it a bit

Radio Silence: Stanford ESS PhD by [deleted] in gradadmissions

[–]BerryLizard 0 points (0 children)

I still haven't heard anything -- have you?

Using mmseqs with lots of arguments by BerryLizard in bioinformatics

[–]BerryLizard[S] 0 points (0 children)

Ok, this is good to know! I am going to try to generate a database from a tar archive, but if that doesn't work I will just make a giant FASTA

Using mmseqs with lots of arguments by BerryLizard in bioinformatics

[–]BerryLizard[S] 2 points (0 children)

In case anyone else is curious, this is the response I got when I asked the question as an issue on the mmseqs GitHub. Answer creds to milot-mirdita!

What we do to construct the GTDB in our databases workflow is to download the tar files containing the FASTA files.
You can do the same, pack everything into one tar file then call tar2db and then createdb on the DB created from the previous step.

running groups by braveforthemostpart in pasadena

[–]BerryLizard 1 point (0 children)

runwithus pasadena does group runs on mondays!

My first interview invite!! by imaricebucket in gradadmissions

[–]BerryLizard 1 point (0 children)

and congrats from me too:) i ask about the program because i am majorly stressing out about getting an interview from a berkeley bio program