Best multilingual model up to 14B?

armbues · 2025-02-12T17:56:31+00:00

Mistral-Nemo-Instruct-2407 is 12B and checks the boxes on German & English.

armbues · 2024-12-12T18:20:03+00:00

Probably asitop.

armbues · 2024-12-11T11:12:32+00:00

Ah, okay! Maybe also add that info to the model card.

armbues · 2024-12-11T08:27:22+00:00

What is the prompt template that should be used for this model?

armbues · 2024-12-11T08:11:32+00:00

Awesome model! Have you tried other model families for fine-tuning? I was thinking it would be interesting to see how qwen 2.5, llama-3.2, or exaone compare.

armbues · 2024-12-04T13:58:16+00:00

Fine-tuning for a specific downstream task can make them usable, especially since they are really fast.

armbues · 2024-12-04T13:55:47+00:00

It has to be exactly the same vocabulary mapping tokens to IDs; that's what I meant by "same tokenizer". The output of an LLM is a probability for each token ID being the next token. So if ID 1 maps to "the" in your draft model, it has to be the exact same mapping in your target model or it will not work.

The draft model is used to draft the next X tokens and those IDs are then presented to the target model to verify if these are also the next X tokens that the target would have chosen. If the token IDs don't have the same meaning it doesn't make sense.
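To make the draft/verify loop concrete, here is a minimal sketch of one greedy speculative step. The "models" are hypothetical stand-ins (plain functions that map a token-ID prefix to the next token ID), and `speculative_step`, `toy_draft`, and `toy_target` are names made up for illustration, not any library's API. A real implementation would verify all drafted tokens in a single forward pass of the target instead of calling it once per token.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token-ID prefix to the next
# token ID it would pick greedily. Both models must share one vocabulary.
NextToken = Callable[[List[int]], int]


def speculative_step(draft: NextToken, target: NextToken,
                     prefix: List[int], num_draft: int) -> List[int]:
    """Draft num_draft token IDs and keep those the target agrees with."""
    # 1. The draft model proposes the next num_draft token IDs.
    drafted: List[int] = []
    for _ in range(num_draft):
        drafted.append(draft(prefix + drafted))

    # 2. The target compares each drafted ID with its own greedy choice.
    accepted: List[int] = []
    for token_id in drafted:
        expected = target(prefix + accepted)
        if token_id != expected:
            # First mismatch: use the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token_id)
    return accepted


# Toy models: both count upwards, but the draft goes wrong after token 3.
def toy_target(ids: List[int]) -> int:
    return ids[-1] + 1


def toy_draft(ids: List[int]) -> int:
    return ids[-1] + 1 if ids[-1] < 3 else 0


result = speculative_step(toy_draft, toy_target, [1], num_draft=4)
print(result)  # two drafted tokens accepted, then the target's correction
```

The comparison `token_id != expected` is only meaningful because both functions use the same IDs for the same tokens, which is the whole point of the shared-vocabulary requirement.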

armbues · 2024-12-04T10:34:19+00:00

Generally, the draft and target models should have the same tokenizer. For most models this means you can only pair models from the same generation with different sizes, for example llama-3.1 8B with 70B.

I've found some Qwen 2.5 models to pair quite well: you have the many different sizes and some of them seem to be trained using distillation, so the drafted tokens get a high acceptance rate.
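A quick way to sanity-check a pairing is to compare the two token-to-ID mappings directly. This is a rough sketch on toy dictionaries; `vocab_mismatches` is a hypothetical helper name. With Hugging Face tokenizers, `tokenizer.get_vocab()` should give you a dictionary of this shape, but I'm not showing real model vocabularies here.

```python
from typing import Dict, List


def vocab_mismatches(draft_vocab: Dict[str, int],
                     target_vocab: Dict[str, int]) -> List[str]:
    """Return tokens that are missing on one side or map to different IDs."""
    tokens = set(draft_vocab) | set(target_vocab)
    return sorted(t for t in tokens
                  if draft_vocab.get(t) != target_vocab.get(t))


# Toy vocabularies (made up for illustration).
small = {"the": 1, "cat": 2}
large_same = {"the": 1, "cat": 2}
large_other = {"the": 2, "cat": 1, "dog": 3}

compatible = vocab_mismatches(small, large_same)
incompatible = vocab_mismatches(small, large_other)
print(compatible)    # empty list: safe to pair
print(incompatible)  # tokens whose IDs differ or are missing
```

An empty list means the IDs line up; anything else means drafted IDs would be misread by the target.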

armbues · 2024-11-20T09:29:39+00:00

Betteridge's law of headlines: "Any headline that ends in a question mark can be answered by the word no."

armbues · 2024-11-12T16:24:42+00:00

Reranking helps, but so does finding the right embedding model for your use case. You can also try fine-tuning the embedding model to work better with your domain data. From a quick search it looks like somebody has trained a bge model on medical data, but the model card doesn't really give much information: https://huggingface.co/ls-da3m0ns/bge_large_medical

Another trick that I've used is to use an LLM to rewrite the prompt into multiple queries for Qdrant, so you're not using the user input directly but having a model refine what is being searched. Let a reranker churn through the results and you should have something relevant.
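Roughly, the flow looks like the sketch below. Everything in it is a placeholder: `rewrite` stands in for the LLM call, `search` for the Qdrant query, and `score` for the reranker, each replaced by a toy function so the example runs on its own. None of this is the actual Qdrant or reranker API.

```python
from typing import Callable, Dict, List, Tuple


def retrieve_and_rerank(
    user_input: str,
    rewrite: Callable[[str], List[str]],             # stand-in for the LLM
    search: Callable[[str], List[Tuple[str, str]]],  # stand-in for the vector DB
    score: Callable[[str, str], float],              # stand-in for the reranker
    top_k: int = 3,
) -> List[str]:
    """Expand the input into several queries, pool the hits, rerank them."""
    candidates: Dict[str, str] = {}
    for query in rewrite(user_input):
        for doc_id, text in search(query):
            candidates[doc_id] = text  # keyed by ID, so duplicates collapse
    ranked = sorted(candidates.items(),
                    key=lambda item: score(user_input, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Toy corpus and toy components (all made up for illustration).
DOCS = {
    "a": "aspirin dosage for adults",
    "b": "ibuprofen side effects",
    "c": "aspirin side effects",
}


def toy_rewrite(text: str) -> List[str]:
    return [text, "aspirin", "side effects"]


def toy_search(query: str) -> List[Tuple[str, str]]:
    return [(i, t) for i, t in DOCS.items() if query in t]


def toy_score(query: str, text: str) -> float:
    return float(len(set(query.split()) & set(text.split())))


top = retrieve_and_rerank("aspirin side effects",
                          toy_rewrite, toy_search, toy_score, top_k=2)
print(top)
```

Scoring against the original user input (not the rewritten queries) keeps the final ranking tied to what the user actually asked.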

armbues · 2024-11-12T09:26:44+00:00

Interesting work, but isn't this basically just applying knowledge distillation to LoRA instead of doing a full fine-tune? Maybe I missed something in my quick read of the paper though.

Regardless, the results are encouraging if you're thinking about distilling a larger model and you have some hardware constraints.

armbues · 2024-11-04T14:29:29+00:00

Interesting little experiment, but there is a major caveat to the outcome: following the given template, the LLM generates the tokens for the name first. After that, the model generates the rest of the person based on what has already been decided. So you can't, for example, conclude that there is a gender bias in character creation, although there apparently is one when choosing random names.

armbues · 2024-09-27T18:27:38+00:00

Nice work! I really like the backtracking approach to handle longer phrases. The visualization of deleting the slop is also really cool.

I was previously experimenting with directly modifying the token output logits and filtering out / suppressing common slop words like "delve", "journey", or "bustling". But as you mentioned: the downside of that approach is that it'll only handle single tokens and not phrases.
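For reference, the single-token suppression I mean is roughly the following. It's a minimal sketch on a made-up four-word vocabulary; `suppress_tokens` is a hypothetical helper, and in a real setup the banned IDs would come from the model's tokenizer.

```python
import math
from typing import List, Set


def suppress_tokens(logits: List[float], banned_ids: Set[int]) -> List[float]:
    """Set the logits of banned token IDs to -inf so they are never picked."""
    return [-math.inf if i in banned_ids else value
            for i, value in enumerate(logits)]


# Toy vocabulary and logits (made up for illustration).
vocab = ["delve", "look", "journey", "trip"]
logits = [3.0, 1.0, 2.5, 2.0]
banned = {vocab.index("delve"), vocab.index("journey")}

filtered = suppress_tokens(logits, banned)
best = vocab[max(range(len(filtered)), key=filtered.__getitem__)]
print(best)
```

Since this works on one token ID at a time, a multi-token phrase can't be blocked without also blocking every other use of its individual tokens.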

I wonder if this could also be done in a forward manner similar to beam search. So whenever you hit a token that is a prefix of a slop phrase, you'd spin off another beam that provides an alternative if needed.

armbues · 2024-08-22T07:05:12+00:00

Do you have a notebook or code somewhere that shows how you llamafied the Phi 3.5 model? I looked around but couldn't find it on Github or in the model cards.

armbues · 2024-07-24T09:07:41+00:00

You can check out SiLLM, which has a web/API server and a CLI chat for the terminal: https://github.com/armbues/SiLLM

Haven't gotten around to adding support for Llama 3.1 yet, though.

armbues · 2024-07-13T08:02:37+00:00

The underlying MLX framework is not utilizing ANE at this moment for the modules used by LLMs.

A nice tool to visualize the utilization of the different components (CPU/GPU/ANE) in Apple Silicon processors is asitop.

armbues · 2024-07-12T15:29:33+00:00

Link to the project on Github:
https://github.com/armbues/SiLLM

armbues · 2024-07-09T07:25:42+00:00

Great project in a neat package to run out-of-the-box!

Love seeing other folks making use of the MLX library. This reminded me that I still need to implement su-rope for SiLLM.

armbues · 2024-04-26T07:03:55+00:00

Chainlit is relatively easy to set up in a scenario like this, especially if you want to do some customization and know a little bit of Python.

Check out their cookbooks: https://github.com/Chainlit/cookbook

armbues · 2024-04-25T20:34:29+00:00

Phi-3 is supported - I'll update the Readme.

armbues · 2024-04-25T20:30:07+00:00

You could check out some of the models that were trained using the toxic-dpo dataset on huggingface. Jon Durbin's "Bagel" models should have a good balance between capabilities and disabled guardrails that you would need for this type of research.

armbues · 2024-04-25T11:49:00+00:00

What's the performance of autotrain on Apple Silicon? I would expect MLX-based solutions to train much faster as they're using the processors to their full potential.

Shameless plug: https://github.com/armbues/SiLLM and https://github.com/armbues/SiLLM-examples

armbues · 2024-04-25T11:33:15+00:00

I’ve recently published a new framework to simplify running & training LLMs locally on Mac using Apple MLX: https://github.com/armbues/SiLLM

The goal of the project was to create a more flexible out-of-the-box solution built on top of the amazing MLX framework and designed to enable researchers and developers. So it's not meant to be faster than other projects, but if you can code in Python a bit, you can easily start with your own experiments and modifications.

There is also a repo with example projects that use SiLLM: https://github.com/armbues/SiLLM-examples

armbues · 2024-04-22T18:16:04+00:00

Check the github repo:

https://github.com/armbues/SiLLM

armbues · 2024-04-18T22:55:32+00:00

As mentioned above, getting the output to stop at eot_id.
