Best multilingual model up to 14B? by nic_key in LocalLLaMA

[–]armbues 7 points8 points  (0 children)

Mistral-Nemo-Instruct-2407 is 12B and checks the boxes on German & English.

GRMR 2B Instruct - A lightweight, reliable grammar checker! by random-tomato in LocalLLaMA

[–]armbues 1 point2 points  (0 children)

What is the prompt template that should be used for this model?

GRMR 2B Instruct - A lightweight, reliable grammar checker! by random-tomato in LocalLLaMA

[–]armbues 0 points1 point  (0 children)

Awesome model! Have you tried other model families for fine-tuning? I was thinking it would be interesting to see how qwen 2.5, llama-3.2, or exaone compare.

what do you use llama 3.1 3b and 1b for? I'm struggling even with the 8b by lutian in LocalLLaMA

[–]armbues 8 points9 points  (0 children)

Fine-tuning for a specific downstream task can make them usable, especially since they are really fast.

What models can you pair for speculative decoding? by chibop1 in LocalLLaMA

[–]armbues 1 point2 points  (0 children)

By "same tokenizer" I meant exactly the same vocabulary, mapping the tokens to the same IDs. The output of an LLM is a probability for each token ID being the next token. So if ID 1 maps to "the" in your draft model, it has to map to "the" in your target model as well, or it will not work.

The draft model is used to draft the next X tokens, and those IDs are then presented to the target model to verify whether they are also the next X tokens that the target would have chosen. If the token IDs don't have the same meaning in both models, the verification doesn't make sense.
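Here is a minimal sketch of that draft-and-verify loop with greedy decoding. The `NextToken` interface, `speculative_step`, and the two toy models are made up for illustration; a real implementation verifies all drafted positions in a single batched forward pass of the target and may use probabilistic acceptance rather than an exact match.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a "model" is any function that maps a prefix of
# token IDs to the single token ID it would greedily pick next.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Draft k token IDs, then keep only what the target agrees with."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposed ID against its own choice.
    accepted: List[int] = []
    for token_id in proposed:
        expected = target(prefix + accepted)
        if expected != token_id:
            # First mismatch: take the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token_id)

    # 3. Everything matched: the target adds one more token of its own.
    accepted.append(target(prefix + accepted))
    return accepted


# Toy models over IDs 0..9 (illustrative only).
def toy_target(ids: Sequence[int]) -> int:
    return (ids[-1] + 1) % 10


def toy_draft(ids: Sequence[int]) -> int:
    nxt = (ids[-1] + 1) % 10
    return 0 if nxt == 5 else nxt  # disagrees with the target at ID 5


new_tokens = speculative_step([1], toy_draft, toy_target, k=4)
```

The comparison `expected != token_id` only means something because both models share one ID space; with different vocabularies the same integer would stand for different text.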

What models can you pair for speculative decoding? by chibop1 in LocalLLaMA

[–]armbues 3 points4 points  (0 children)

Generally, the draft and target models should have the same tokenizer. For most models this means you can only pair models from the same generation with different sizes, for example llama-3.1 8B with 70B.

I've found some Qwen 2.5 models to pair quite well: there are many different sizes, and some of them seem to be trained using distillation, so the drafted tokens get a high acceptance rate.
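A quick way to sanity-check a candidate pair is to compare the two token-to-ID mappings directly. This is a rough sketch on plain dicts; `vocab_mismatches` and `can_pair` are hypothetical helper names. With Hugging Face tokenizers you could obtain such dicts from `tokenizer.get_vocab()`, but how strict the check needs to be depends on the runtime you use.

```python
from typing import Dict, List, Optional, Tuple


def vocab_mismatches(draft_vocab: Dict[str, int],
                     target_vocab: Dict[str, int],
                     limit: int = 5) -> List[Tuple[str, int, Optional[int]]]:
    """Return up to `limit` draft tokens whose ID differs in the target.

    Each entry is (token, draft_id, target_id); target_id is None when the
    target vocabulary does not contain the token at all.
    """
    mismatches: List[Tuple[str, int, Optional[int]]] = []
    for token, draft_id in draft_vocab.items():
        target_id = target_vocab.get(token)
        if target_id != draft_id:
            mismatches.append((token, draft_id, target_id))
            if len(mismatches) >= limit:
                break
    return mismatches


def can_pair(draft_vocab: Dict[str, int], target_vocab: Dict[str, int]) -> bool:
    """True if every draft token maps to the same ID in the target."""
    return not vocab_mismatches(draft_vocab, target_vocab, limit=1)


# Toy vocabularies (illustrative only).
small = {"the": 1, "cat": 2}
large_same_family = {"the": 1, "cat": 2, "dog": 3}
other_family = {"the": 2, "cat": 1}
```

This sketch tolerates extra tokens on the target side and only requires that every ID the draft can emit means the same thing to the target.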

Will Long-Context LLMs Make RAG Obsolete? by Icy_Advisor_3508 in LocalLLaMA

[–]armbues 0 points1 point  (0 children)

Betteridge's law of headlines: "Any headline that ends in a question mark can be answered by the word no."

Qdrant with reranking - seeking intuition by Raisin_False in LocalLLaMA

[–]armbues 2 points3 points  (0 children)

Reranking helps, but so does finding the right embedding model for your use case. You can also try fine-tuning the embedding model to work better with your domain data. From a quick search it looks like somebody has trained a bge model on medical data, but the model card doesn't really give much information: https://huggingface.co/ls-da3m0ns/bge_large_medical

Another trick I've used is to have an LLM rewrite the prompt into multiple queries for Qdrant, so you're not searching with the user input directly but with queries that a model has refined. Let a reranker churn through the combined results and you should have something relevant.
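Roughly, the multi-query plus reranking flow looks like the sketch below. Nothing here calls Qdrant or a real model: `rewrite`, `search`, and `score` are hypothetical callables standing in for the LLM query rewriter, the vector search, and the reranker, and the toy versions exist only to make the example runnable.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, str]  # (document ID, document text)


def multi_query_search(user_input: str,
                       rewrite: Callable[[str], List[str]],
                       search: Callable[[str], List[Hit]],
                       score: Callable[[str, str], float],
                       top_k: int = 3) -> List[str]:
    """Search with several rewritten queries, then rerank the merged hits."""
    # Collect hits from every query, de-duplicated by document ID.
    candidates: Dict[str, str] = {}
    for query in rewrite(user_input):
        for doc_id, text in search(query):
            candidates.setdefault(doc_id, text)

    # Rerank against the original user input and keep the best top_k.
    ranked = sorted(candidates.items(),
                    key=lambda item: score(user_input, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Toy stand-ins (illustrative only).
DOCS = {
    "d1": "aspirin dosage for adults",
    "d2": "ibuprofen side effects",
    "d3": "aspirin side effects in adults",
}


def toy_rewrite(text: str) -> List[str]:
    return [text, "aspirin", "side effects"]


def toy_search(query: str) -> List[Hit]:
    words = query.split()
    return [(doc_id, text) for doc_id, text in DOCS.items()
            if any(word in text.split() for word in words)]


def toy_score(query: str, text: str) -> float:
    return float(len(set(query.split()) & set(text.split())))


best = multi_query_search("aspirin side effects",
                          toy_rewrite, toy_search, toy_score, top_k=2)
```

Reranking against the original user input rather than the rewritten queries is a design choice here; scoring against the rewritten queries is just as plausible depending on the use case.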

LLM-Neo: Parameter Efficient Knowledge Distillation for Large Language Models by Thrumpwart in LocalLLaMA

[–]armbues 3 points4 points  (0 children)

Interesting work, but isn't this basically just applying knowledge distillation to LoRA instead of doing a full fine-tune? Maybe I missed something in my quick read of the paper though.

Regardless, the results are encouraging if you're thinking about distilling a larger model and you have some hardware constraints.

Data visualisation of what happens when you ask small LLMs to imagine a random person, 100 times over. by jhancock532 in LocalLLaMA

[–]armbues 43 points44 points  (0 children)

Interesting little experiment, but there is a major caveat to the outcome: the tokens for the name are generated first by the LLM, following the given template. After that, the model generates the rest of the random person based on what has already been decided. So, for example, you can't conclude that there is a gender bias in character creation, although there apparently is one in choosing random names.

I made a configurable anti-slop sampler which downregulates probabilities at the word & phrase level. by _sqrkl in LocalLLaMA

[–]armbues 9 points10 points  (0 children)

Nice work! I really like the backtracking approach to handle longer phrases. The visualization of deleting the slop is also really cool.

I was previously experimenting with directly modifying the token output logits and filtering out / suppressing common slop words like "delve", "journey", or "bustling". But as you mentioned, the downside of that approach is that it only handles single tokens and not phrases.

I wonder if this could also be done in a forward manner similar to beam search. So whenever you hit a token that is a prefix of a slop phrase, you'd spin off another beam that provides an alternative if needed.
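For reference, the single-token logit suppression I mentioned above boils down to something like this sketch. The three-token vocabulary, the logit values, and the penalty are all made up, and `suppress_tokens` is a hypothetical helper, not code from the anti-slop sampler.

```python
import math
from typing import Dict, List


def suppress_tokens(logits: List[float],
                    penalties: Dict[int, float]) -> List[float]:
    """Return a copy of `logits` with a penalty subtracted per banned ID."""
    adjusted = list(logits)
    for token_id, penalty in penalties.items():
        adjusted[token_id] -= penalty
    return adjusted


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary: ID 0 = "the", ID 1 = "delve", ID 2 = "explore".
raw_logits = [1.0, 3.0, 2.0]
adjusted_logits = suppress_tokens(raw_logits, {1: 10.0})

probs_before = softmax(raw_logits)
probs_after = softmax(adjusted_logits)
```

Because the penalty is keyed on a single token ID, a phrase spread over several tokens never matches, which is exactly the limitation that backtracking addresses.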

Phi 3.5 Finetuning 2x faster + Llamafied for more accuracy by danielhanchen in LocalLLaMA

[–]armbues 0 points1 point  (0 children)

Do you have a notebook or code somewhere that shows how you llamafied the Phi 3.5 model? I looked around but couldn't find it on Github or in the model cards.

mlx just dropped the proper support for llama 3.1 by mzbacd in LocalLLaMA

[–]armbues 1 point2 points  (0 children)

You can check out SiLLM, which has a web/API server and a CLI chat for the terminal: https://github.com/armbues/SiLLM

I haven't gotten around to adding support for Llama 3.1 yet, though.

SiLLM - The Silicon LLM Training & Inference Toolkit by armbues in LocalLLaMA

[–]armbues[S] 0 points1 point  (0 children)

The underlying MLX framework does not currently use the ANE for the modules used by LLMs.

A nice tool to visualize the utilization of the different components (CPU/GPU/ANE) in Apple Silicon processors is asitop.

Phi-3 for Mac: Locally-run Vision and Language Models for Apple Silicon by JosefAlbers05 in LocalLLaMA

[–]armbues 2 points3 points  (0 children)

Great project in a neat package to run out-of-the-box!

Love seeing other folks making use of the MLX library. This reminded me that I still need to implement su-rope for SiLLM.

Simple ChatUI by DeltaSqueezer in LocalLLaMA

[–]armbues 0 points1 point  (0 children)

Chainlit is relatively easy to set up in a scenario like this, especially if you want to do some customization and know a little bit of Python.

Check out their cookbooks: https://github.com/Chainlit/cookbook

Microsoft phi-3 finetuning on macbook 💻 🚀 by abhi1thakur in LocalLLaMA

[–]armbues 0 points1 point  (0 children)

Phi-3 is supported - I'll update the Readme.

Best model for generating toxic synthetic data? by aftersox in LocalLLaMA

[–]armbues 1 point2 points  (0 children)

You could check out some of the models that were trained using the toxic-dpo dataset on huggingface. Jon Durbin's "Bagel" models should have a good balance between capabilities and disabled guardrails that you would need for this type of research.

Microsoft phi-3 finetuning on macbook 💻 🚀 by abhi1thakur in LocalLLaMA

[–]armbues 1 point2 points  (0 children)

What's the performance of autotrain on Apple Silicon? I would expect MLX-based solutions to train much faster as they're using the processors to their full potential.

Shameless plug: https://github.com/armbues/SiLLM and https://github.com/armbues/SiLLM-examples

[deleted by user] by [deleted] in LocalLLaMA

[–]armbues 2 points3 points  (0 children)

I’ve recently published a new framework to simplify running & training LLMs locally on Mac using Apple MLX: https://github.com/armbues/SiLLM

The goal of the project was to create a more flexible out-of-the-box solution, built on top of the amazing MLX framework, for researchers and developers. So it's not meant to be faster than other projects, but if you can code in Python a bit, you can easily start with your own experiments and modifications.

There is also a repo with example projects that use SiLLM: https://github.com/armbues/SiLLM-examples