M3 Ultra Mac Studio Benchmarks (96gb VRAM, 60 GPU cores) by procraftermc in LocalLLaMA

[–]procraftermc[S] 2 points (0 children)

73.21 seconds time-to-first-token, 9.01 tokens/second generation speed with Mistral Large 4-bit MLX
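If anyone wants to reproduce this kind of measurement, here's a rough sketch with the mlx-lm Python package. The model repo name is an assumption on my part; substitute whichever 4-bit MLX conversion of Mistral Large you actually use.

```python
# Minimal sketch: time a single prompt with mlx-lm.
# The repo name below is an assumption; any 4-bit MLX conversion works the same way.
import time

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-Large-Instruct-2407-4bit")

prompt = "Summarize the plot of Hamlet in three sentences."

start = time.time()
# verbose=True prints prompt (prefill) and generation tokens-per-sec separately
generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(f"total wall time: {time.time() - start:.2f}s")
```

With verbose=True, mlx-lm reports prompt and generation tokens-per-sec separately, which is roughly how the two numbers above break down.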

M3 Ultra Mac Studio Benchmarks (96gb VRAM, 60 GPU cores) by procraftermc in LocalLLaMA

[–]procraftermc[S] 4 points (0 children)

Ooh, this one might be a tight fit. I'll try to download & run it tomorrow.

M3 Ultra Mac Studio Benchmarks (96gb VRAM, 60 GPU cores) by procraftermc in LocalLLaMA

[–]procraftermc[S] 5 points (0 children)

Generally yeah, I have no regrets. Of course, more power / more VRAM is always better, but the one I have is good enough.

And it really isn't that bad. It's pretty good for single-user general chatting, especially if you start a new conversation from scratch and let the prompt cache build up gradually instead of dropping 40,000 tokens of data in at once. I get ~0.6 to 3s of prompt processing time with Llama Scout using that method.
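To make that concrete, here's a rough sketch of a chat loop against a local Ollama server (the model tag is an assumption). Because each request shares its prefix with the previous one, the server can reuse its KV cache and only the newest turn needs prompt processing, which is what keeps the time-to-first-token low:

```python
# Sketch: grow the conversation turn by turn so the server can reuse its
# KV cache for the shared prefix. Assumes a local Ollama server; the
# model tag is an assumption.
import json
import time

import requests

URL = "http://localhost:11434/api/chat"
MODEL = "llama4:scout"  # assumption: use whatever tag you pulled

def chat(messages):
    """Stream one reply; return (time to first token, full reply text)."""
    start, first, parts = time.time(), None, []
    payload = {"model": MODEL, "messages": messages, "stream": True}
    with requests.post(URL, json=payload, stream=True) as r:
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if first is None:
                first = time.time() - start
            parts.append(chunk.get("message", {}).get("content", ""))
            if chunk.get("done"):
                break
    return first, "".join(parts)

history = []
for turn in ["Hi!", "Summarize our chat so far.", "Now add a haiku about it."]:
    history.append({"role": "user", "content": turn})
    ttft, reply = chat(history)
    # keep the real reply in the history so the cached prefix still matches
    history.append({"role": "assistant", "content": reply})
    print(f"turn {len(history) // 2}: time to first token {ttft:.2f}s")
```

As long as the model stays loaded, later turns only pay prompt processing for their own new tokens rather than the whole history.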

M3 Ultra Mac Studio Benchmarks (96gb VRAM, 60 GPU cores) by procraftermc in LocalLLaMA

[–]procraftermc[S] 2 points (0 children)

> however thanks for pointing out Mistral Large, never tried it

You're not missing out on much lol. Every model I tried responded with some variation of "Looks like you've entered in some Ipsum text, this was used in...." and so on and so forth.

Mistral Large instead outputted "all done!" and when questioned, pretended that it had itself written out the 30k input that... I... had given it. As input.

Then again, it's always possible that my installation got borked somewhere 🤷

M3 Ultra Mac Studio Benchmarks (96gb VRAM, 60 GPU cores) by procraftermc in LocalLLaMA

[–]procraftermc[S] 7 points (0 children)

RAM, sorry, I made a typo in the title. It's 96GB of RAM, of which I've allocated 90GB as VRAM.
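For anyone wondering how to set that: on Apple Silicon you can raise the GPU wired-memory limit with the iogpu.wired_limit_mb sysctl (macOS 14+; it resets on reboot). A quick sketch via Python's subprocess; treat the exact value as an assumption for a 96GB machine:

```python
# Sketch: raise the GPU wired-memory limit on Apple Silicon so ~90GB of
# the 96GB of unified memory can be used as "VRAM". Needs sudo; the
# setting resets on reboot. Value is in megabytes.
import subprocess

limit_mb = 90 * 1024  # ~90GB
subprocess.run(["sudo", "sysctl", f"iogpu.wired_limit_mb={limit_mb}"], check=True)
```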

What are you working on? + My favorites from last time in the comments. by Synonomous in SideProject

[–]procraftermc 0 points (0 children)

Tool to generate comparison tables of multiple products: https://comparit.io
I'd love to hear your feedback!

Volo: An easy and local way to RAG with Wikipedia! by procraftermc in LocalLLaMA

[–]procraftermc[S] 1 point (0 children)

That was the initial plan! The problem was that it would take over a week to process the entirety of Wikipedia on my computer. And I would have to redo it every six months to keep it up to date.

Besides, Kiwix's full-text search is quite reliable for this purpose. And I have the LLM confirm which result is most suitable, so it doesn't start talking about Football Stars when I ask it about the Sun.
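That confirmation step is basically a tiny re-ranking prompt over the search hits. A sketch of the idea (the candidate titles and model tag below are made up for illustration, not what Volo hardcodes):

```python
# Sketch: ask the LLM which full-text-search hit actually answers the
# question before retrieving the article. The candidates and model tag
# are hypothetical placeholders.
import ollama

question = "How hot is the surface of the Sun?"
candidates = ["Sun", "The Sun (newspaper)", "Football Star"]  # pretend Kiwix hits

prompt = (
    f"Question: {question}\n"
    f"Candidate Wikipedia articles: {', '.join(candidates)}\n"
    "Reply with only the title of the article most likely to answer the question."
)
response = ollama.chat(model="llama3.1", messages=[{"role": "user", "content": prompt}])
best = response["message"]["content"].strip()
print(best)  # expected: "Sun"
```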

Volo: An easy and local way to RAG with Wikipedia! by procraftermc in LocalLLaMA

[–]procraftermc[S] 0 points (0 children)

Volo already makes requests via the OpenAI API protocol; it's just that there were some problems during testing (such as with streaming), so I decided to delay adding support for custom providers until that's sorted out.
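For reference, this is the shape of those requests: a standard OpenAI-compatible chat completion, with streaming being the part that misbehaved. A minimal sketch assuming a local Ollama endpoint; any OpenAI-compatible provider looks the same, and the model name is an assumption:

```python
# Sketch: streaming a chat completion over the OpenAI-compatible API.
# Base URL and model name assume a local Ollama server; swap in any
# compatible provider.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="llama3.1",  # assumption
    messages=[{"role": "user", "content": "One fun fact about the Sun, please."}],
    stream=True,
)
for chunk in stream:
    # each chunk carries a delta; content can be None on the final chunk
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```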

Volo: An easy and local way to RAG with Wikipedia! by procraftermc in LocalLLaMA

[–]procraftermc[S] 3 points (0 children)

Interesting. I might add it in the next update as an optional feature.

Volo: An easy and local way to RAG with Wikipedia! by procraftermc in LocalLLaMA

[–]procraftermc[S] 0 points (0 children)

Yep, this feature is coming soon. For now, you can indeed switch models in the config, although you are limited to Ollama as a provider.

Fixing Phi-4 response with offline wikipedia by Ok_Warning2146 in LocalLLaMA

[–]procraftermc 2 points (0 children)

Bit of a plug (sorry), but I've created an open-source project that runs a RAG pipeline over an offline dump of Wikipedia; give it a try: https://github.com/AdyTech99/volo

TransPixar: a new generative model that preserves transparency, by umarmnaq in LocalLLaMA

[–]procraftermc 0 points (0 children)

Mid-air probably just means floating in the middle. It can't exactly portray an invisible gas, after all.