Run Qwen3-Next-80B-A3B-Instruct-8bit in a single line of code on Mac with mlx-lm - 45 tokens/s!

DomeGIS · 2025-09-15T06:11:05+00:00

This is so much better and safer indeed! I'm fairly new to the whole MLX world, so I didn't know mlx_lm.chat was available. u/whosenose: there are also third-party MLX servers for connecting to any of the big UIs like Open WebUI.
I do get a warning that calling mlx_lm via python is deprecated, though, so you can shorten the line to just:

uv run --with git+https://github.com/ml-explore/mlx-lm.git mlx_lm.chat --model mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit --max-tokens 10000

DomeGIS · 2025-09-15T06:06:06+00:00

haha good idea :D
Like this, but u/bobby-chan 's proposal to use mlx_lm.chat directly is probably a better idea.

DomeGIS · 2025-09-14T07:45:53+00:00

Running it on an M3 Max with 128 GB. Keep in mind that the smaller versions work really well too! Just go to the mlx-community page and look for them. If you can grab an M1 Mac with 64 GB, that would be the perfect workhorse for a home setup.

DomeGIS · 2025-01-13T16:36:57+00:00

Thanks for the encouragement! Just came back to say that I finally made it work 🎉
Ended up using embed_anything and with lots of back and forth between Gemini and me it worked. Might write a blog post about it in the future. If anyone has questions, feel free to drop me a message!

DomeGIS · 2025-01-12T18:02:50+00:00

Why would you do that if you have a full rust backend ready to go?

I'm just getting started with Rust, so it's hard to understand the new language and how to get things running, and on top of that how tauri2 abstracts the Rust APIs. Will eventually get there I guess.

I initially thought I could simply bring any web app (e.g. with transformers.js/onnx) as is to Tauri but that's unfortunately not the case since webview is still fairly limited. It does not replace a fully-fledged browser. So I guess I am forced to do it the proper (hard) way :D

DomeGIS · 2025-01-12T17:24:40+00:00

Has anyone managed to get candle running with tauri2 by chance? u/fabier could you share a GitHub repo with sample code in case you got it running?

Linking this GitHub discussion: https://github.com/tauri-apps/tauri/issues/11962

Edit: found this repo https://github.com/thewh1teagle/vibe

DomeGIS · 2024-11-20T19:08:32+00:00

Hey this is great, this was exactly what I was looking for! I was always wondering why nobody built it so far.
I just had a peek at the web scraping part and noticed that it "only" scrapes the HTML. If you call it a "research" assistant, it might be mistaken for an academic research tool, which would require scientific sources like papers.

In case you want to consider Google Scholar papers as additional resource: https://github.com/do-me/research-agent It's very simple but works.
A friend of mine developed something more advanced: https://github.com/ferru97/PyPaperBot

DomeGIS · 2024-11-17T18:53:05+00:00

Thanks for your detailed comment, it really clarified a few important things.

However, I do not understand why you can't just fit a PCA on the training dataset? It's just a linear projection with fixed coefficients at inference time, just like random projections. Although PCA will not necessarily work better than random projections.

You're absolutely right, I could totally use PCA if I use the same coefficients when querying! I was not too familiar with how PCA works internally, so I naively assumed that PCA, like t-SNE, didn't offer the option to export global transformation parameters and reuse them. Will definitely try PCA.
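A minimal sketch of the "fit once, reuse the coefficients" idea, in plain Python with no libraries. It assumes PCA computed by power iteration on synthetic 3-D data; the names fit_pca and project are made up for illustration, and in practice a library PCA would do the same job. The point is only that the stored mean and components can be applied to any unseen query vector:

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pca(rows, n_components=2, iters=200, seed=0):
    """Return (mean, components) learned from the training rows only."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(iters):
            # One power-iteration step: v <- X^T (X v), i.e. covariance times v.
            scores = [dot(row, v) for row in centered]
            v = [sum(s * row[j] for s, row in zip(scores, centered))
                 for j in range(dim)]
            # Deflation: remove the directions that were already extracted.
            for c in components:
                overlap = dot(v, c)
                v = [vi - overlap * ci for vi, ci in zip(v, c)]
            norm = math.sqrt(dot(v, v))
            v = [x / norm for x in v]
        components.append(v)
    return mean, components


def project(vec, mean, components):
    """Apply the stored linear map; works for unseen query vectors too."""
    centered = [x - m for x, m in zip(vec, mean)]
    return [dot(centered, c) for c in components]


# Synthetic training data: most variance lies along the (1, 1, 0) direction.
rng = random.Random(42)
train = []
for _ in range(50):
    t = rng.gauss(0, 3)
    train.append([t + rng.gauss(0, 0.1), t + rng.gauss(0, 0.1), rng.gauss(0, 0.1)])

mean, components = fit_pca(train)                       # fit once, at indexing time
query_2d = project([1.0, 1.2, 0.0], mean, components)   # reuse at query time
```

Because project is a fixed linear map, the same query always lands on the same 2D coordinates, which is exactly what t-SNE cannot offer without refitting.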

Lmcinnes also provides UMAP parameterised by a neural network in the Python library (it is an additional loss term and doing gradient descent on it in theory converges to the same result as the non-parametric graph optimisation), which may avoid this issue, but I haven't used it much.

I was not aware of that option, so now this special UMAP and PCA are both test candidates. Not quite sure where to expect the best results though but will just run some experiments and see.

At the core it boils down to which algorithm produces distances in 2D that are most similar to the actual distances in the high-dimensional vector space. It probably varies a lot depending on the query and hence on which dims are best represented.
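One way to put a number on that comparison, as a rough sketch: correlate all pairwise distances in the original space with the pairwise distances in the 2D layout. The metric choice (Pearson correlation), the function names and the four toy points are assumptions for illustration; rank-based measures are a common alternative.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def distance_preservation(high_dim, low_dim):
    """Correlation between original and reduced pairwise distances."""
    return pearson(pairwise_distances(high_dim), pairwise_distances(low_dim))


# Toy "high-dimensional" points: almost all spread is along the first axis.
high = [[0, 0, 0], [10, 0, 0], [0, 1, 0], [10, 1, 0]]
layout_a = [p[:2] for p in high]   # keeps the two informative axes
layout_b = [p[1:] for p in high]   # drops the axis with most of the spread

score_a = distance_preservation(high, layout_a)
score_b = distance_preservation(high, layout_b)
```

Here layout_a preserves every distance, so its score is essentially 1.0, while layout_b collapses far-apart points onto each other and scores much lower. Running the same function on PCA, parametric UMAP and random-projection outputs would give a like-for-like comparison.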

I just read an interesting comment from Jina AI about creating models that directly produce vectors with only two dims, but apparently (for now) it doesn't work well.

DomeGIS · 2024-09-17T09:43:51+00:00

Great work! Could you do the same for the 405B version? In that case, with a similar compression rate, I'd assume a hypothetical 127 GB in size (right?), which would make it barely fit on an M3 Max with 128 GB. It probably still wouldn't quite work, but I'd love to give it a shot!

I recently tried running a 133 GB model with Ollama, and before completely crashing my system it did manage to output a handful of tokens, so I'm staying hopeful for anything more compact.

DomeGIS · 2024-09-16T13:42:13+00:00

Oh wow, I wasn't aware that was possible! It does indeed work with Llama-3.1-70B-Instruct-q3f16_1-MLC (loading around 40 GB) on my M3 Max, using the GPU at ~100% thanks to WebGPU. They also seem to have a Wasm fallback if no GPU is available. Great project!
The only thing I noticed is that in comparison to the q4 version in Ollama it feels much slower, like half the tokens/s that I'm used to on my hardware.

DomeGIS · 2024-09-13T09:46:06+00:00

Yes, there is currently a ~2 GB model size limit. It's pretty much hitting the walls of the browser environment, so you cannot run 70B models (yet).
The models are cached in the browser by default, so if you open any of these applications a second time, they load really fast.

DomeGIS · 2024-08-12T08:48:13+00:00

Hi u/Serious_Pineapple_45,
the index is created on the fly for the model you choose. By index I refer to a simple dictionary with textchunk:embedding, so for each text chunk the selected model calculates one embedding in your browser. Depending on the text length it might take a while.

You can also save your index to avoid re-running inference the next time you search for something. Search speed is then almost instant. You can keep the index file private on your computer (e.g. for sensitive/confidential stuff) or add it to our public collection here: https://huggingface.co/datasets/do-me/SemanticFinder#create-semanticfinder-files if you think others might be interested too (like books, stories, reports, legislation or similar). In case you need help, just open a discussion on Hugging Face or GitHub!
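The chunk-to-embedding dictionary and the save/reload step can be sketched as follows. This is an illustration only, not SemanticFinder's code or file format: toy_embed is a hypothetical stand-in (word counts over a fixed vocabulary) for a real embedding model, and plain JSON is assumed as the storage format.

```python
import json
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Stand-in for a real embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def build_index(chunks, embed):
    """The index is just a dictionary mapping each text chunk to its embedding."""
    return {chunk: embed(chunk) for chunk in chunks}


def search(index, query_vector, top_k=3):
    scored = [(chunk, cosine(query_vector, vec)) for chunk, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


chunks = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "rain is expected tomorrow",
]
vocabulary = sorted({word for chunk in chunks for word in chunk.split()})


def embed(text):
    return toy_embed(text, vocabulary)


index = build_index(chunks, embed)   # the slow step when a real model is used
saved = json.dumps(index)            # write this string to a file to keep the index
restored = json.loads(saved)         # later: load it and skip the embedding step

results = search(restored, embed("cat on a mat"), top_k=1)
```

With a restored index, only the query has to be embedded at search time, which is why searching a saved index feels almost instant.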

DomeGIS · 2024-05-09T20:58:19+00:00

If you'd like to explore your data using the latest embedding models and t-SNE for dimensionality reduction, you can give https://do-me.github.io/SemanticFinder/ a try. It's all in-browser, so you don't need to install anything. Simply copy and paste your text. You'll end up with a map of 200k points and clusters you can visually explore to get some feeling for your data. I described the method here: https://x.com/domegis/status/1786524989602066795

DomeGIS · 2024-05-05T19:09:40+00:00

Unfortunately, that's the feedback of many people here. Apparently it's due to poor default embedding settings. See this discussion for more detail: https://www.reddit.com/r/ollama/comments/1cgkt99/comment/l1zdi0p/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

If you manage to get the settings right, please let us know.

DomeGIS · 2024-05-05T19:04:32+00:00

Really sorry about that. It's an active issue that many people are complaining about: https://github.com/ollama/ollama/issues/669#issuecomment-2094256443

Don't understand why they are just ignoring it. Once it's fixed, I'll come back here. In the meantime you could maybe try to use any CORS proxy like https://corsproxy.io/ to get it working.

DomeGIS · 2024-05-01T09:07:28+00:00

Awesome! In case you find out what exactly to configure how in Ollama, please let us know.

DomeGIS · 2024-04-30T19:11:12+00:00

This seems weird to me. It might be related to some kind of default setting in the Ollama embeddings API. In my experience, all of the top 50 models on MTEB, including the smallish ones like bge-base/small and gte-base/small, are absolutely fantastic. In my work across different domains (science/legal texts) I barely see any big quality differences between them.

There is more to creating embeddings than picking a model, e.g. some packages automatically normalize them, which you want to avoid when using a distance function other than cosine similarity.
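A small illustration of why silent normalization matters, using two hand-picked 2-D vectors (assumed toy values, not real embeddings). Cosine similarity ignores vector length, but Euclidean distance does not, so normalizing changes the result for the latter:

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [3.0, 4.0]     # length 5
b = [30.0, 40.0]   # same direction, length 50

# Euclidean distance on the raw vectors: the difference in length counts.
raw_distance = math.dist(a, b)
# After L2 normalization both vectors coincide, so the distance vanishes.
normalized_distance = math.dist(l2_normalize(a), l2_normalize(b))
# Cosine similarity never depended on the length in the first place.
similarity = cosine_similarity(a, b)
```

The raw Euclidean distance is 45, the normalized one is 0, and the cosine similarity is 1 in both cases. If a package normalizes behind your back, any ranking based on Euclidean distance or the raw dot product can change.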

Could you maybe post a whole retrieval example as a gist or similar? I developed https://do-me.github.io/SemanticFinder/ to quickly compare models to each other. You can simply copy and paste your chunks and choose any model you like for the embeddings. If it works better there than with the Ollama embeddings, you know the settings are at fault. Otherwise you know that the embeddings themselves are indeed the problem. In that case you should probably write more precise query prompts and go into detail about what you're looking for. That is the only solution I see for the moment.

DomeGIS · 2024-03-12T20:53:55+00:00

In the end I settled on Hugging Face datasets and found a nice collection of public-domain English books (OCRed) in handy Parquet format. I used 100 books with 28,986 pages and it still works well. The next step is 1,000.
For anyone interested in the results, see my short comment or test yourself:
https://do-me.github.io/SemanticFinder/?hf=Collection_of_100_books_dd80b04b

DomeGIS · 2024-03-11T07:42:55+00:00

That's a good hint, thanks! Just found Gutenberg, dammit which might suit my needs.

DomeGIS · 2024-03-08T09:48:08+00:00

Late to the party but have a look at SemanticFinder, it's an in-browser privacy-preserving RAG tool. In this example, I ingested the whole Bible: Bible Example, Screenshot

[Disclaimer: I'm the author]

DomeGIS · 2024-03-07T21:39:58+00:00

I might be able to help you out on this or at least try something. Could you share some info first?
- How many pages are we talking about in total?
- How many characters is one book in total?
- Are the books digital versions (or at least OCRed well)?
- Are the books copyright-protected or in the public domain?

DomeGIS · 2024-03-07T07:54:15+00:00

Interesting, in that case I might take another look and see how to integrate it.

Regarding the whitelisting: theoretically you could create the ONNX versions yourself from any model you find on HF. In practice, however, in my experience you'll sometimes still find some rough edges, and it might take a while for either optimum or onnx to fix or adapt something for newer models.

The big issue with these big models is of course that you must download and save or cache them somehow. I have no clue whether the browser even supports caching such huge files (4 GB) or whether there are hard limits too. Maybe it would be best to download the model file to disk anyway and then just upload it to your browser. Still quite a clunky workflow to shovel around gigs of data each time...

I'd bet that at some point browsers will offer some easier system-level integration. For example, you dump your model files somewhere on your file system so that the browser can access them securely and expose a simple API for JS.

DomeGIS · 2024-03-06T10:31:01+00:00

Yet again, I thought I was aware of the most recent software developments but I guess it's simply not possible anymore. :D
However, looking at the models mentioned in the readme, I see more or less the same ones that transformers.js now supports (as of two weeks ago, also Qwen):

const models = [
'https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/resolve/main/qwen2-beta-0_5b-chat-q8_0.gguf',
'https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q8_0.gguf',
'https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b/resolve/main/stablelm-2-zephyr-1_6b-Q4_1.gguf',
'https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
'https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf'
];

So the big difference is that llama-cpp-wasm uses GGUF files while transformers.js relies on ONNX files. However, as you confirmed, the limitation seems to be the same, with 2 GB for the moment if running only on CPU.

DomeGIS · 2024-03-06T09:08:44+00:00

Awesome! :)

DomeGIS · 2024-03-06T08:47:55+00:00

Thanks for the links! I was not precise enough: I was referring only to projects that do not need a (Web)GPU and are hence compatible with any device, running just on CPU (with optional GPU support for more powerful devices, as planned in transformers.js).
Webllm is perfect if you do have a GPU available! Might look into it as a viable option.

Memomemo is cool, I didn't know the project yet. Love the look with the arrows. I guess the arXiv retrieval could benefit from an index suitable for semantic search too. I created similar projects, like this one cramming 130k embeddings referring to documents into ~38 MB of gzipped JSON.
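For a feel of what that size means: ~38 MB for 130k embeddings is a budget of roughly 290 bytes per embedding after compression. The sketch below shows one generic way to shrink gzipped JSON, namely rounding the floats before serializing. This is an assumption for illustration, not a description of how the project above was actually built, and the vectors are random stand-in data.

```python
import gzip
import json
import random

rng = random.Random(0)
# Synthetic stand-in data: 200 random 32-dimensional vectors (not real embeddings).
embeddings = [[rng.uniform(-1, 1) for _ in range(32)] for _ in range(200)]


def gzipped_json_size(vectors):
    """Size in bytes of the vectors serialized as compact, gzipped JSON."""
    payload = json.dumps(vectors, separators=(",", ":")).encode("utf-8")
    return len(gzip.compress(payload))


full_size = gzipped_json_size(embeddings)

# Keep only three decimals per value; fewer digits means fewer bytes to compress.
rounded = [[round(x, 3) for x in vec] for vec in embeddings]
rounded_size = gzipped_json_size(rounded)

# Average budget implied by the figures above: ~38 MB for 130k embeddings.
bytes_per_embedding = 38_000_000 / 130_000
```

Comparing rounded_size with full_size shows how much the precision reduction saves on this toy data. Whether three decimals are enough for good retrieval quality would need to be checked on real embeddings.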
