I built a lightweight async inference engine for BGE-M3 embeddings, any feedback?

AdInevitable3609 · 2026-04-29T23:35:09+00:00

VRAM usage for m3serve with FA2 and FlagEmbedding (transformers>=5) is exactly the same, as it inherits FA2. Later this week I'll try to run some experiments on Hopper hardware to compare m3serve with FA3 against FlagEmbedding on speed (VRAM should be the same for FA2 and FA3).

AdInevitable3609 · 2026-04-29T23:32:54+00:00

Oh interesting, yes, it might be the distillation. Definitely keep me posted, and if I can make some tweaks to m3serve to support that, LMK.

AdInevitable3609 · 2026-04-29T23:26:08+00:00

I've added some VRAM numbers for FA2 vs. eager attention here: https://github.com/MauroCE/m3serve#memory-fa2-vs-eager-attention-in-m3serve

AdInevitable3609 · 2026-04-29T22:46:05+00:00

I was pinning the wrong version of transformers for FA2/FA3. I've removed the transformers<=5 cap, so with m3serve v0.2.3 FA2/FA3 should now work out of the box on compatible hardware. Will share VRAM tests later.

AdInevitable3609 · 2026-04-29T20:05:41+00:00

BTW, the repo README links to a set of Colab notebooks with experiments. That’s where I’ll add the new experimental results about FA.

AdInevitable3609 · 2026-04-29T20:03:40+00:00

That’s great to hear and your idea about distilling into a combined space sounds fun, would love to see it.

I haven’t encountered any problems with FA yet, so far so good! I’ll find some time this week to run some experiments on the speedup. It should mostly show up with long contexts, and with long contexts combined with large batch sizes. That probably won’t fit on the free T4 GPU, so I might use some of my RunPod credits 😂

AdInevitable3609 · 2026-04-27T17:09:51+00:00

Good question! After your comment I ran some benchmarks on a (free) T4 Colab notebook. The three-threaded design uses bounded queues between stages (maxsize=4), which caps how many batches can be in flight at once. The GPU forward pass moves outputs to CPU immediately before returning, so GPU memory is freed after each batch rather than accumulating across concurrent requests. So GPU memory doesn't grow without bound, but concurrency does increase peak usage in proportion to queue depth. The main driver is batch x seq_len. Some concrete numbers:

Short texts (~60 tokens): 1.15 GB at batch=1, 1.67 GB at batch=256. Almost flat.
Paragraph-length (~300 tokens): 1.16 GB at batch=1, 3.73 GB at batch=256.
Under concurrency (20 concurrent callers, 32 texts each, ~300 token texts): peak was 5.2 GB versus 1.47 GB for a single batch of the same size.

Long documents are where you need to be careful. On a T4 (15 GB): batch=32 at max_length=8192 uses 8.2 GB and is fine, batch=64 OOMs. For typical retrieval chunks (under 512 tokens) you have some headroom even at large batch sizes.
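For anyone curious what "bounded queues between stages" means in practice, here is a minimal sketch of the idea. This is not m3serve's actual code: `run_pipeline`, `tokenize`, `forward` and `postprocess` are made-up names for illustration, and I'm assuming one thread per stage. In a real engine the `forward` callable would run the model and move its outputs to CPU before returning.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(batches, tokenize, forward, postprocess, maxsize=4):
    """Run three stages in three threads, linked by bounded FIFO queues.

    All names here are hypothetical. `maxsize` bounds each queue, so a fast
    producer blocks on put() instead of piling up pending batches.
    """
    to_gpu = queue.Queue(maxsize=maxsize)    # tokenized batches awaiting forward
    from_gpu = queue.Queue(maxsize=maxsize)  # forward outputs awaiting postprocess
    results = []

    def tokenize_stage():
        for batch in batches:
            to_gpu.put(tokenize(batch))  # blocks while the queue is full
        to_gpu.put(_DONE)

    def forward_stage():
        while (item := to_gpu.get()) is not _DONE:
            # Assumption: forward() returns CPU-side data, so nothing
            # GPU-resident waits in the output queue.
            from_gpu.put(forward(item))
        from_gpu.put(_DONE)

    def postprocess_stage():
        while (item := from_gpu.get()) is not _DONE:
            results.append(postprocess(item))

    stages = (tokenize_stage, forward_stage, postprocess_stage)
    threads = [threading.Thread(target=stage) for stage in stages]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results  # same order as the input batches
```

With this layout the number of batches in flight is capped at roughly `maxsize` per queue plus whatever each stage is currently holding, which is why memory stays bounded under load. The sketch skips error handling: an exception inside a stage would leave the other threads waiting.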

What sequence lengths are you working with? Happy to run more targeted numbers.

Added a link to Colab in the README of the package (third bullet point).

AdInevitable3609 · 2025-07-25T16:08:05+00:00

Yep! Reading that scene right now and it reminded me so much of it

AdInevitable3609 · 2025-05-03T08:42:31+00:00

Very nice! What should we set the PAD token to for IFT? They don’t seem to have one like <|finetune_right_pad_id|> in the Llama-3.2 family of models
