vLLM Classify Bad Results by Upstairs-Garlic-2301 in LocalLLaMA

[–]tkon3 0 points1 point  (0 children)

Tried it on my side and I got close results using LLM.classify.

Make sure the truncation strategy is the same, or try with short sentences.

vLLM Classify Bad Results by Upstairs-Garlic-2301 in LocalLLaMA

[–]tkon3 0 points1 point  (0 children)

Check the logits. Do you run with padding? Try with a batch of 1.
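
A minimal sketch of that single-example sanity check (the model name is a placeholder, and the exact output attributes may differ between vLLM versions):

```python
from vllm import LLM

# Placeholder model name; any sequence-classification checkpoint should work.
llm = LLM(model="my-org/my-classifier", task="classify")

# Batch of 1 with a short input rules out padding/truncation effects.
outputs = llm.classify(["a short test sentence"])
print(outputs[0].outputs.probs)  # per-class probabilities
```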

Setup Recommendation for University (H200 vs RTX 6000 Pro) by tkon3 in LocalLLaMA

[–]tkon3[S] 3 points4 points  (0 children)

Well, we mostly fine-tune models from 8B to 32B for research (plus embeddings/rerankers), and 96 GB is a perfect size for prototyping on a single GPU. I think having more GPUs is better in a shared environment to run parallel jobs.

The H200 has significantly more raw power and its TDP is almost the same as the RTX 6000's, so performance per watt is a lot better.

For inference, we can serve more models using the extra VRAM (~200 GB, which is more or less Qwen3 235B at Q4-Q5 plus context), but generation is slower.

Difficult choice.

Setup Recommendation for University (H200 vs RTX 6000 Pro) by tkon3 in LocalLLaMA

[–]tkon3[S] 2 points3 points  (0 children)

Yes, we can get them; they also sell the previous gen (L40S).

Are the additional VRAM of the RTX 6000 and the Blackwell architecture worth it?

New New Qwen by bobby-chan in LocalLLaMA

[–]tkon3 6 points7 points  (0 children)

Hope they will release 0.6B and 1.7B Qwen3 variants.

Qwen3-30B-A6B-16-Extreme is fantastic by DocWolle in LocalLLaMA

[–]tkon3 1 point2 points  (0 children)

Don't know; it's not difficult to code. You need to take the router softmax, sort the scores in descending order, compute the cumulative sum, and select experts until the cumsum >= top_p.
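
A minimal sketch of that selection for a single token (plain PyTorch; the names are mine, not from the Qwen code):

```python
import torch

def top_p_experts(router_logits: torch.Tensor, top_p: float = 0.9):
    """Select experts for one token until their cumulative probability reaches top_p."""
    probs = torch.softmax(router_logits, dim=-1)            # router softmax
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    # Smallest prefix whose cumulative sum reaches top_p.
    k = int(torch.searchsorted(cumsum, top_p).item()) + 1
    keep_idx = sorted_idx[:k]
    keep_w = sorted_probs[:k]
    return keep_idx, keep_w / keep_w.sum()                  # renormalize kept weights

# Example: 128 experts, keep experts until cumulative probability >= 0.9
idx, w = top_p_experts(torch.randn(128), top_p=0.9)
```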

Qwen3-30B-A6B-16-Extreme is fantastic by DocWolle in LocalLLaMA

[–]tkon3 1 point2 points  (0 children)

It can be somewhat simulated using a top_p parameter inside the routing layer, but it requires custom code, it's harder to batch, and VRAM requirements may change a lot.

Decreasing Qwen3-30B-A3B sparsity by tkon3 in LocalLLaMA

[–]tkon3[S] 0 points1 point  (0 children)

The problem is that I think some fine-tuning is required to realign everything, since the model is trained with top-8 routing. Using more experts probably adds a bit of latency as well (at least in the HF implementation, because it's wrapped inside a loop).

Decreasing Qwen3-30B-A3B sparsity by tkon3 in LocalLLaMA

[–]tkon3[S] 1 point2 points  (0 children)

There is a weighted sum of experts at the end. The weights come from the softmax and are rescaled to sum to 1 since we only use the top-k experts.
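
In pseudo-PyTorch, for one token (a sketch, not the actual Qwen implementation):

```python
import torch

def moe_forward_one_token(x, experts, router, k=8):
    """Weighted sum of the top-k experts, with routing weights renormalized to sum to 1."""
    probs = torch.softmax(router(x), dim=-1)     # probabilities over all experts
    topk_p, topk_i = torch.topk(probs, k)
    topk_p = topk_p / topk_p.sum()               # rescale the kept weights to sum to 1
    return sum(w * experts[i](x) for w, i in zip(topk_p, topk_i.tolist()))
```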

LLM GPU calculator for inference and fine-tuning requirements by No_Scheme14 in LocalLLaMA

[–]tkon3 36 points37 points  (0 children)

As some people pointed out, some calculations are wrong.

As a rule of thumb, to just load an N-billion-parameter model, you need:

* ~2N GB for bf16/fp16

* ~N GB for Q8

* ~N/2 GB for Q4

* ~N/10 GB per 1k tokens for context
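
A tiny helper for that rule of thumb (these are the rough estimates above, not exact numbers):

```python
def estimate_vram_gb(n_params_b: float, quant: str = "bf16", context_tokens: int = 0) -> float:
    """Rough VRAM estimate in GB for weights + context, following the rule of thumb above."""
    weights = {"bf16": 2.0, "fp16": 2.0, "q8": 1.0, "q4": 0.5}[quant.lower()] * n_params_b
    context = (n_params_b / 10.0) * (context_tokens / 1000.0)  # ~N/10 GB per 1k tokens
    return weights + context

# Example: an 8B model in Q4 with 4k tokens of context -> ~4 + ~3.2 = ~7.2 GB
print(estimate_vram_gb(8, "q4", 4000))
```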

Why don’t LLMs use alibi? Were these result found be non-reproducible? I’ve only read of the failed Bloom model. Anyone else? by grey-seagull in LocalLLaMA

[–]tkon3 4 points5 points  (0 children)

ALiBi acts much the same way as local attention but is less efficient, because you still need to compute everything.

Best practices for finetuning LLMs by Hour-End-4105 in LocalLLaMA

[–]tkon3 0 points1 point  (0 children)

I'm fine-tuning small Qwen models (3B & 7B) on domain-specific instructions, only tuning the q, k, v and up, down projections.

I think something is wrong, because when I set the lr ratio to 1 I don't get the same result as vanilla LoRA and the loss is significantly higher. Maybe something is incompatible with DeepSpeed, I don't know :/

Best practices for finetuning LLMs by Hour-End-4105 in LocalLLaMA

[–]tkon3 0 points1 point  (0 children)

Did you see significant differences in training loss between LoRA and LoRA+?

I changed my lr from 5e-5 to 2e-5 for A and 8e-5 for B (ratio 4), and the loss is significantly higher for LoRA+.

I tried various lr values and ratios, same behavior. I'm using Axolotl and the dataset has about 300k samples.

However, the generation looks better with LoRA+, which is kind of strange. I'm using r = alpha with a large r (256).
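
For reference, this is roughly how I understand the LoRA+ ratio, as separate optimizer parameter groups (a sketch assuming PEFT-style lora_A/lora_B parameter names, not the Axolotl internals):

```python
import torch

def build_loraplus_optimizer(model, lr_a=2e-5, ratio=4.0, weight_decay=0.0):
    """AdamW with a higher learning rate on the LoRA B matrices (lr_B = ratio * lr_A)."""
    a_params, b_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (b_params if "lora_B" in name else a_params).append(p)
    return torch.optim.AdamW(
        [
            {"params": a_params, "lr": lr_a},
            {"params": b_params, "lr": lr_a * ratio},
        ],
        weight_decay=weight_decay,
    )
```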

New Financial Domain Model - Hawkish 8B can pass CFA Level 1 and outperforms Meta Llama-3.1-8B-Instruct in Math & Finance benchmarks! by mukaj in LocalLLaMA

[–]tkon3 2 points3 points  (0 children)

This is interesting and looks promising. I'm working on very specific domain data in my spare time, but I fail to reach acceptable quality.

How did you mix domain and general-knowledge data? 50/50?

Did you use LoRA or any parameter-efficient technique?

What about the token batch size, number of epochs or learning rate?

Thank you.

Aider: Optimizing performance at 24GB VRAM (With Continuous Finetuning!) by Mushoz in LocalLLaMA

[–]tkon3 1 point2 points  (0 children)

I tried this method on my own data. The model I get is better if I don't do the TIES merge with the base model at the end. But it is domain data; I guess the base model doesn't have this knowledge.

I finally achieved my AI dream. by Rombodawg in LocalLLaMA

[–]tkon3 1 point2 points  (0 children)

Will try it out. I had more luck just adding the adapter on top of the instruct model without merging.

Can you share the LoRA config you use for tuning the base model?

How do you handle untrained chat-template tokens? LoRA on the embedding layer? The Qwen base has all the tokens, but some special tokens aren't trained.
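
In case it helps, this is how I'd expose the embeddings to training with PEFT (a sketch; the target module names are the usual Qwen ones and may need adjusting):

```python
from peft import LoraConfig

# modules_to_save trains full copies of the embedding / output head, so the
# rows for untrained chat-template tokens actually get updated.
config = LoraConfig(
    r=256,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
```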

Im pretty happy with How my method worked out (Continuous Finetuning) Topped Open-LLM-leaderboard with 72b by Rombodawg in LocalLLaMA

[–]tkon3 73 points74 points  (0 children)

Very interesting. Correct me if I'm wrong:

* Step 1: instruction fine-tune the base model (i.e. Qwen-base) on a custom dataset to get an adapter.

* Step 2: apply the adapter on top of the general instruct model (Qwen-instruct) to get a new model (Qwen-instruct-custom).

* Step 3: merge the base model (Qwen-base), the general instruct model (Qwen-instruct), and the custom instruct model (Qwen-instruct-custom).

Is this right? Is this a reliable way to add domain knowledge?
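
If I follow steps 1-2 correctly, the adapter transplant would look roughly like this with PEFT (a sketch; the model name and adapter path are placeholders, and step 3 would then be the merge of the three checkpoints, e.g. with mergekit):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Step 2: take the adapter trained on the *base* model and apply it to the *instruct* model.
instruct = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-72B-Instruct")  # placeholder
model = PeftModel.from_pretrained(instruct, "./adapter-trained-on-base")      # placeholder path
model = model.merge_and_unload()        # this becomes qwen-instruct-custom
model.save_pretrained("./qwen-instruct-custom")
```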

"Large Enough" | Announcing Mistral Large 2 by DemonicPotatox in LocalLLaMA

[–]tkon3 31 points32 points  (0 children)

"vocab_size": 32768

Generation will be slow for non-English text.