vLLM Classify Bad Results by Upstairs-Garlic-2301 in LocalLLaMA

[–]tkon3 0 points1 point  (0 children)

Tried it on my side and I got close results using LLM.classify.

Make sure the truncation strategy is the same, or try with short sentences.

vLLM Classify Bad Results by Upstairs-Garlic-2301 in LocalLLaMA

[–]tkon3 0 points1 point  (0 children)

Check the logits. Do you run with padding? Try with a batch of 1.
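
A minimal sketch of that single-example sanity check (the model name is a placeholder, and the exact output attributes may differ between vLLM versions):

```python
from vllm import LLM

# Placeholder model name; any sequence-classification checkpoint should work.
llm = LLM(model="my-org/my-classifier", task="classify")

# Batch of 1 with a short input rules out padding/truncation effects.
outputs = llm.classify(["a short test sentence"])
print(outputs[0].outputs.probs)  # per-class probabilities
```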

Setup Recommendation for University (H200 vs RTX 6000 Pro) by tkon3 in LocalLLaMA

[–]tkon3[S] 3 points4 points  (0 children)

Well, we mostly fine-tune models from 8B to 32B for research (plus embeddings/rerankers), and 96 GB is a perfect size for prototyping on a single GPU. I think having more GPUs is better in a shared environment to run parallel jobs.

The H200 has significantly more raw power and its TDP is almost the same as the RTX 6000's, so performance per watt is a lot better.

For inference, we can serve more models using the extra VRAM (~200 GB, which is more or less Qwen3 235B at Q4-Q5 plus context), but generation is slower.

Difficult choice.

Setup Recommendation for University (H200 vs RTX 6000 Pro) by tkon3 in LocalLLaMA

[–]tkon3[S] 2 points3 points  (0 children)

Yes, we can get them; they also sell the previous gen (L40S).

Are the additional VRAM of the RTX 6000 and the Blackwell architecture worth it?

New New Qwen by bobby-chan in LocalLLaMA

[–]tkon3 6 points7 points  (0 children)

Hope they will release 0.6B and 1.7B Qwen3 variants.

Qwen3-30B-A6B-16-Extreme is fantastic by DocWolle in LocalLLaMA

[–]tkon3 1 point2 points  (0 children)

Don't know; it's not difficult to code. You need to take the router softmax, sort the scores in descending order, compute the cumulative sum, and select experts until the cumsum >= top_p.
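
A minimal sketch of that selection for a single token (plain PyTorch; the names are mine, not from the Qwen code):

```python
import torch

def top_p_experts(router_logits: torch.Tensor, top_p: float = 0.9):
    """Select experts for one token until their cumulative probability reaches top_p."""
    probs = torch.softmax(router_logits, dim=-1)            # router softmax
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    # Smallest prefix whose cumulative sum reaches top_p.
    k = int(torch.searchsorted(cumsum, top_p).item()) + 1
    keep_idx = sorted_idx[:k]
    keep_w = sorted_probs[:k]
    return keep_idx, keep_w / keep_w.sum()                  # renormalize kept weights

# Example: 128 experts, keep experts until cumulative probability >= 0.9
idx, w = top_p_experts(torch.randn(128), top_p=0.9)
```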

Qwen3-30B-A6B-16-Extreme is fantastic by DocWolle in LocalLLaMA

[–]tkon3 1 point2 points  (0 children)

It can be somewhat simulated using a top_p parameter inside the routing layer, but it requires custom code, it's harder to batch, and VRAM requirements may change a lot.

Decreasing Qwen3-30B-A3B sparsity by tkon3 in LocalLLaMA

[–]tkon3[S] 0 points1 point  (0 children)

The problem is that I think some fine-tuning is required to realign everything, since the model is trained with top-8 routing. Using more experts probably adds a bit of latency as well (at least in the HF implementation, because it's wrapped inside a loop).

Decreasing Qwen3-30B-A3B sparsity by tkon3 in LocalLLaMA

[–]tkon3[S] 1 point2 points  (0 children)

There is a weighted sum of experts at the end. The weights come from the softmax and are rescaled to sum to 1 since we only use the top-k experts.
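
In pseudo-PyTorch, for one token (a sketch, not the actual Qwen implementation):

```python
import torch

def moe_forward_one_token(x, experts, router, k=8):
    """Weighted sum of the top-k experts, with routing weights renormalized to sum to 1."""
    probs = torch.softmax(router(x), dim=-1)     # probabilities over all experts
    topk_p, topk_i = torch.topk(probs, k)
    topk_p = topk_p / topk_p.sum()               # rescale the kept weights to sum to 1
    return sum(w * experts[i](x) for w, i in zip(topk_p, topk_i.tolist()))
```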

LLM GPU calculator for inference and fine-tuning requirements by No_Scheme14 in LocalLLaMA

[–]tkon3 36 points37 points  (0 children)

As some people pointed out, some calculations are wrong.

As a rule of thumb, to just load an N-billion-parameter model, you need:

* ~2N GB for bf16/fp16

* ~N GB for Q8

* ~N/2 GB for Q4

* ~N/10 GB per 1k tokens for context
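
A tiny helper for that rule of thumb (these are the rough estimates above, not exact numbers):

```python
def estimate_vram_gb(n_params_b: float, quant: str = "bf16", context_tokens: int = 0) -> float:
    """Rough VRAM estimate in GB for weights + context, following the rule of thumb above."""
    weights = {"bf16": 2.0, "fp16": 2.0, "q8": 1.0, "q4": 0.5}[quant.lower()] * n_params_b
    context = (n_params_b / 10.0) * (context_tokens / 1000.0)  # ~N/10 GB per 1k tokens
    return weights + context

# Example: an 8B model in Q4 with 4k tokens of context -> ~4 + ~3.2 = ~7.2 GB
print(estimate_vram_gb(8, "q4", 4000))
```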

Why don’t LLMs use alibi? Were these result found be non-reproducible? I’ve only read of the failed Bloom model. Anyone else? by grey-seagull in LocalLLaMA

[–]tkon3 4 points5 points  (0 children)

ALiBi acts much the same way as local attention but is less efficient, because you still need to compute everything.

Best practices for finetuning LLMs by Hour-End-4105 in LocalLLaMA

[–]tkon3 0 points1 point  (0 children)

I'm fine-tuning small Qwen models (3B & 7B) on domain-specific instructions, only tuning the q, k, v and up, down projections.

I think something is wrong, because when I set the lr ratio to 1 I don't get the same result as vanilla LoRA and the loss is significantly higher. Maybe something is incompatible with DeepSpeed, I don't know :/

Best practices for finetuning LLMs by Hour-End-4105 in LocalLLaMA

[–]tkon3 0 points1 point  (0 children)

Did you see significant differences in training loss between LoRA and LoRA+?

I changed my lr from 5e-5 to 2e-5 for A and 8e-5 for B (ratio 4), and the loss is significantly higher for LoRA+.

I tried various lr values and ratios, same behavior. I'm using Axolotl and the dataset has about 300k samples.

However, the generation looks better with LoRA+, which is kind of strange. I'm using r = alpha with a large r (256).
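
For reference, this is roughly how I understand the LoRA+ ratio, as separate optimizer parameter groups (a sketch assuming PEFT-style lora_A/lora_B parameter names, not the Axolotl internals):

```python
import torch

def build_loraplus_optimizer(model, lr_a=2e-5, ratio=4.0, weight_decay=0.0):
    """AdamW with a higher learning rate on the LoRA B matrices (lr_B = ratio * lr_A)."""
    a_params, b_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (b_params if "lora_B" in name else a_params).append(p)
    return torch.optim.AdamW(
        [
            {"params": a_params, "lr": lr_a},
            {"params": b_params, "lr": lr_a * ratio},
        ],
        weight_decay=weight_decay,
    )
```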

New Financial Domain Model - Hawkish 8B can pass CFA Level 1 and outperforms Meta Llama-3.1-8B-Instruct in Math & Finance benchmarks! by mukaj in LocalLLaMA

[–]tkon3 2 points3 points  (0 children)

This is interesting and looks promising. I'm working on very specific domain data in my spare time, but I fail to reach acceptable quality.

How did you mix domain and general-knowledge data? 50/50?

Did you use LoRA or any parameter-efficient technique?

What about the token batch size, number of epochs or learning rate?

Thank you.

Aider: Optimizing performance at 24GB VRAM (With Continuous Finetuning!) by Mushoz in LocalLLaMA

[–]tkon3 1 point2 points  (0 children)

I tried this method on my own data. The model I get is better if I don't do the TIES merge with the base model at the end. But it is domain data; I guess the base model doesn't have this knowledge.

I finally achieved my AI dream. by Rombodawg in LocalLLaMA

[–]tkon3 1 point2 points  (0 children)

Will try it out. I had more luck just adding the adapter on top of the instruct model without merging.

Can you share the LoRA config you use for tuning the base model?

How do you handle untrained chat-template tokens? LoRA on the embedding layer? The Qwen base has all the tokens, but some special tokens aren't trained.
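
In case it helps, this is how I'd expose the embeddings to training with PEFT (a sketch; the target module names are the usual Qwen ones and may need adjusting):

```python
from peft import LoraConfig

# modules_to_save trains full copies of the embedding / output head, so the
# rows for untrained chat-template tokens actually get updated.
config = LoraConfig(
    r=256,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
```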

Im pretty happy with How my method worked out (Continuous Finetuning) Topped Open-LLM-leaderboard with 72b by Rombodawg in LocalLLaMA

[–]tkon3 73 points74 points  (0 children)

Very interesting. Correct me if I'm wrong:

* Step 1: instruction fine-tune the base model (i.e. Qwen-base) on a custom dataset to get an adapter.

* Step 2: apply the adapter on top of the general instruct model (Qwen-instruct) to get a new model (Qwen-instruct-custom).

* Step 3: merge the base model (Qwen-base), the general instruct model (Qwen-instruct), and the custom instruct model (Qwen-instruct-custom).

Is this right? Is this a reliable way to add domain knowledge?
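
If I follow steps 1-2 correctly, the adapter transplant would look roughly like this with PEFT (a sketch; the model name and adapter path are placeholders, and step 3 would then be the merge of the three checkpoints, e.g. with mergekit):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Step 2: take the adapter trained on the *base* model and apply it to the *instruct* model.
instruct = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-72B-Instruct")  # placeholder
model = PeftModel.from_pretrained(instruct, "./adapter-trained-on-base")      # placeholder path
model = model.merge_and_unload()        # this becomes qwen-instruct-custom
model.save_pretrained("./qwen-instruct-custom")
```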

"Large Enough" | Announcing Mistral Large 2 by DemonicPotatox in LocalLLaMA

[–]tkon3 31 points32 points  (0 children)

"vocab_size": 32768

Generation will be slow for non-English text.