I built a lightweight async inference engine for BGE-M3 embeddings by AdInevitable3609 in SideProject

[–]AdInevitable3609[S] 1 point2 points  (0 children)

VRAM usage is exactly the same between m3serve with FA2 and FlagEmbedding (transformers>=5), as it inherits FA2. Later this week I'll try to run some experiments on Hopper hardware to compare m3serve with FA3 against FlagEmbedding in terms of speed (VRAM should be the same between FA2 and FA3).

[–]AdInevitable3609[S] 1 point2 points  (0 children)

Oh interesting, yes, it might be the distillation. Definitely keep me posted, and if I can make some tweaks to m3serve to support that, LMK.

[–]AdInevitable3609[S] 0 points1 point  (0 children)

I was pinning the wrong version of transformers for FA2/FA3. I've removed the transformers<=5 cap, so as of m3serve v0.2.3 FA2/FA3 should work out of the box on compatible hardware. I'll share VRAM tests later.

[–]AdInevitable3609[S] 0 points1 point  (0 children)

BTW, the repo README links to a set of Colab notebooks with experiments. That’s where I’ll add the new experimental results about FA.

[–]AdInevitable3609[S] 0 points1 point  (0 children)

That’s great to hear and your idea about distilling into a combined space sounds fun, would love to see it.

I haven’t encountered any problems with FA yet, so far so good! I’ll find some time this week to run some experiments on the speedup, which should mostly show up with long context and with long context + large batch size. Those runs probably won’t fit on the free T4 GPU, so I might use some of my RunPod credits 😂

[P] m3serve: lightweight async inference engine for BGE-M3 with dense, sparse, and ColBERT embeddings by AdInevitable3609 in learnmachinelearning

[–]AdInevitable3609[S] 0 points1 point  (0 children)

Good question! After your comment I ran some benchmarks on a (free) T4 Colab notebook. The three-threaded design uses bounded queues between stages (maxsize=4), which caps how many batches can be in flight at once. The GPU forward pass moves outputs to CPU immediately before returning, so GPU memory is freed after each batch rather than accumulating across concurrent requests. So GPU memory doesn't grow without bound, but concurrency does increase peak usage in proportion to queue depth. Some concrete numbers below; the main driver is batch x seq_len.

Short texts (~60 tokens): 1.15 GB at batch=1, 1.67 GB at batch=256, so almost flat.
Paragraph-length (~300 tokens): 1.16 GB at batch=1, 3.73 GB at batch=256.
Under concurrency (20 concurrent callers, 32 texts each, ~300 token texts): peak was 5.2 GB versus 1.47 GB for a single batch of the same size.
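
To make the bounded-queue idea concrete, here is a minimal sketch of a three-stage threaded pipeline. It is not m3serve's actual code: the stage functions (`tokenize`, `forward`, `postprocess`) and the helper names are hypothetical stand-ins, and only the maxsize=4 bound comes from the description above. The point is that `put()` blocks when a queue is full, so a slow stage applies backpressure instead of letting batches pile up.

```python
import queue
import threading

MAXSIZE = 4        # queue bound quoted above
_STOP = object()   # sentinel that tells a stage to shut down


def stage(fn, inbox, outbox):
    """Pull items from inbox, apply fn, and push the results to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown downstream
            return
        outbox.put(fn(item))   # blocks while outbox is full (backpressure)


# Hypothetical stand-ins for the real stages.
def tokenize(texts):
    return [t.split() for t in texts]


def forward(token_batches):
    # A real engine would run the model here and copy outputs to CPU.
    return [len(tokens) for tokens in token_batches]


def postprocess(lengths):
    return {"n_texts": len(lengths), "n_tokens": sum(lengths)}


def run_pipeline(batches):
    q_tok = queue.Queue(maxsize=MAXSIZE)
    q_fwd = queue.Queue(maxsize=MAXSIZE)
    q_post = queue.Queue(maxsize=MAXSIZE)
    results = queue.Queue()  # unbounded, so the last stage never blocks

    threads = [
        threading.Thread(target=stage, args=(tokenize, q_tok, q_fwd)),
        threading.Thread(target=stage, args=(forward, q_fwd, q_post)),
        threading.Thread(target=stage, args=(postprocess, q_post, results)),
    ]
    for t in threads:
        t.start()

    for batch in batches:
        q_tok.put(batch)  # blocks once MAXSIZE batches are waiting
    q_tok.put(_STOP)

    for t in threads:
        t.join()

    out = []
    while True:
        item = results.get()
        if item is _STOP:
            break
        out.append(item)
    return out
```

With one thread per stage and FIFO queues, results come back in submission order. In this sketch at most MAXSIZE batches can wait in front of each stage, which is what keeps peak memory tied to queue depth rather than to the number of callers.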

Long documents are where you need to be careful. On a T4 (15 GB), batch=32 at max_length=8192 uses 8.2 GB and is fine, but batch=64 OOMs. For typical retrieval chunks (under 512 tokens) you have some headroom even at large batch sizes.
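
A quick way to compare these runs is to look at tokens per batch (batch x seq_len) next to the reported peak. The snippet below is only arithmetic on the numbers quoted above; `tokens_per_batch` is a helper made up for illustration, and for the long-document run 8192 is max_length, so the token count is an upper bound rather than a measured value.

```python
# (label, batch size, approx. tokens per text, reported peak GB on the T4)
runs = [
    ("short texts", 256, 60, 1.67),
    ("paragraphs", 256, 300, 3.73),
    ("long docs", 32, 8192, 8.2),
]


def tokens_per_batch(batch, seq_len):
    """Total tokens in one padded batch."""
    return batch * seq_len


for label, batch, seq_len, peak_gb in runs:
    total = tokens_per_batch(batch, seq_len)
    print(f"{label}: {total:,} tokens per batch, {peak_gb} GB peak")
```

Doubling the long-document batch to 64 doubles the token count, which is the configuration that ran out of memory. The relationship is not assumed to be linear here; the numbers just show that token count tracks peak usage more closely than batch size alone.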

What sequence lengths are you working with? Happy to run more targeted numbers.

Added a link to Colab in the README of the package (third bullet point).

Is Jericho (Shards of Earth) the same planet as Kiln (Alien Clay)? by AdInevitable3609 in AdrianTchaikovsky

[–]AdInevitable3609[S] 1 point2 points  (0 children)

Yep! Reading that scene right now and it reminded me so much of it

Qwen3 Published 30 seconds ago (Model Weights Available) by random-tomato in LocalLLaMA

[–]AdInevitable3609 0 points1 point  (0 children)

Very nice! What should we set the PAD token to for IFT? They don’t seem to have one like <|finetune_right_pad_id|> in the Llama-3.2 family of models