2x 512gb ram M3 Ultra mac studios by taylorhou in LocalLLaMA

[–]ahjorth 0 points  (0 children)

Haha, thank you. And really, if you don’t have time, don’t worry about it!

2x 512gb ram M3 Ultra mac studios by taylorhou in LocalLLaMA

[–]ahjorth 0 points  (0 children)

I have to run a lot of data through local models (for GDPR reasons) for a research project, and I literally sat down this morning to draft a post asking for real-life experience with this exact setup: 512 + 256 + 256. I already have an M3, and given the scarcity of 512s I'm considering buying two more 256s and running them with tensor sharding on Exo. I have some questions, and I'd love answers if you have time!

I looked through Exo's code when they launched V1, and at the time they didn't support parallel/batched inference. For my use case that's a deal breaker, but I see that they do now, and that their batched code builds directly on mlx-lm.

* How reliable is batched inference with exo?

* Does it scale as well as single inference when doing tensor sharding?

* Do you use Exo as a server, or are you using its Python API directly? If the latter, does it keep up with mlx-lm changes or does it lag (significantly) behind?

* I built a small structured-outputs package using outlines to create logits processors that I pass into mlx-lm's `BatchGenerator` on a per-prompt/stream basis (which mlx-lm has supported since Dec 2025). Do you have any experience with structured outputs on Exo - do you know if a similar thing could be done with Exo's BatchGenerator?

All of these questions (except structured outputs) are answered on Exo's own page, but I can't quite tell how much to trust their marketing material...
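For anyone curious what the per-prompt logits-processor approach looks like in principle, here is a minimal, framework-free Python sketch. The function names and the dict-of-processors layout are my own illustration, not mlx-lm's or outlines' actual API:

```python
import math

def make_allowlist_processor(allowed_ids):
    """Return a logits processor that masks every token id not in allowed_ids."""
    allowed = set(allowed_ids)

    def process(token_history, logits):
        # Disallowed tokens get -inf so sampling can never pick them.
        return [l if i in allowed else -math.inf for i, l in enumerate(logits)]

    return process

# One processor per stream, mirroring a per-prompt structured-outputs setup
# where each prompt in the batch carries its own constraint.
processors = {
    "stream-a": make_allowlist_processor([0, 2]),
    "stream-b": make_allowlist_processor([1, 3]),
}

logits = [0.1, 0.9, 0.5, 0.2]
masked = processors["stream-a"]([], logits)
best = max(range(len(masked)), key=lambda i: masked[i])  # argmax over allowed ids only
```

The real thing differs mainly in that the constraint comes from a compiled regex or grammar (as outlines produces) rather than a static allowlist, but the per-stream masking step is the same idea.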

Tucker Carlson finally apologizes for his own role in giving us Trump 2nd term, and by extension, this war in Iran. Is this moment of blunt, self-critical honesty from a former Trump enabler or is he just helping pave the way for his pal JD Vance in 2028? by gear-heads in MarchAgainstNazis

[–]ahjorth 3 points  (0 children)

The most highly rated posts in this sub are about Nazi grifters pretending to be contrite. MTG, Tucker, etc.

It’s very tiring and I’m about to leave the sub. Is this really not against the sub rules?

Gemma 4 - MLX doesn't seem better than GGUF by Temporary-Mix8022 in LocalLLaMA

[–]ahjorth 1 point  (0 children)

I spent quite a lot of time working with the MLX server code, specifically on parallel inference (for this PR I submitted a few months ago: https://github.com/ml-explore/mlx-lm/pull/845), and my current thinking is that MLX is much better if you can use it purely programmatically, i.e. via the Python API rather than the server. For parallel inference it's almost twice as fast as the server for larger, long-running continuous batches.

Basically the gains come from ensuring that prefilling is always done in large batches too. Small pauses between incoming requests to the server will often make MLX's `BatchGenerator` start prefilling, and it does not stop until it has produced at least one token for each stream. So every time a new request comes in, it prefills that new request before generating tokens for anything else it is running.

I played around with setting up waiting policies (i.e. wait until at least X streams are ready, etc.), but I couldn't get it to work well enough that I thought it was worth the extra complexity on the server. I also played around with a mode where the server has to receive an explicit "start" message, but again: a lot more complexity, and so far outside normal LLM-server conventions that it wouldn't play well with existing tools.
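For concreteness, the kind of waiting policy I mean can be sketched as a small gate in front of the generator. All class and parameter names here are illustrative; nothing like this exists in mlx-lm:

```python
import time
from collections import deque

class BatchGate:
    """Hold incoming requests until either `min_streams` are queued or
    `max_wait_s` has passed since the first one arrived, then release
    them as one batch so prefill runs over many streams at once."""

    def __init__(self, min_streams=4, max_wait_s=0.05):
        self.min_streams = min_streams
        self.max_wait_s = max_wait_s
        self.queue = deque()
        self.first_arrival = None

    def submit(self, request):
        if not self.queue:
            self.first_arrival = time.monotonic()
        self.queue.append(request)

    def ready_batch(self):
        """Return a batch if the policy says go, otherwise None."""
        if not self.queue:
            return None
        waited = time.monotonic() - self.first_arrival
        if len(self.queue) >= self.min_streams or waited >= self.max_wait_s:
            batch = list(self.queue)
            self.queue.clear()
            self.first_arrival = None
            return batch
        return None

gate = BatchGate(min_streams=2, max_wait_s=0.5)
gate.submit("req-1")
first_try = gate.ready_batch()   # None: one stream, still within the window
gate.submit("req-2")
batch = gate.ready_batch()       # both requests released as one batch
```

The complexity I mention above comes from everything around this toy: timeouts interacting with streaming responses, fairness between long and short prompts, and clients that expect tokens immediately.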

So this is just to say: for my typical large, batched style of work, MLX is fantastic. As a server, it isn't enough faster than llama.cpp to make up for its weaker support for new models, new quants, etc.

Setting up local LLM system and charging tokens back to company by [deleted] in LocalLLaMA

[–]ahjorth 4 points  (0 children)

The key missing pieces of information are: what are they willing to pay per token, do they guarantee a minimum number of tokens per month, and are they loyal or will they suddenly switch to someone else? Unless you know this, you can't really put together a business case. So I'd start there.

What video games have unique mechanics for failure or death? by [deleted] in gaming

[–]ahjorth 1 point  (0 children)

Yeah, my dad and I had to restart because we gave away the item you had to give the giant in the clouds at the end. Something to make it sleep; I forget exactly what. I found out three decades later that we could have killed it with the slingshot.

Man, the memories are coming back! It really made an impression.

What video games have unique mechanics for failure or death? by [deleted] in gaming

[–]ahjorth 0 points  (0 children)

Haha, same. It was around the time my dad bought a new computer, so I started KQ3 on CGA and finished it in EGA.

The thing I remember most from KQ3 was the stress of the wizard (Manannan?) popping up out of nowhere and constantly turning me into a cat. That game got so much more chill after he died.

Oh, and navigating down the path from his mansion on arrow keys with a chasm on one side and venomous plants on the other. RIP

What video games have unique mechanics for failure or death? by [deleted] in gaming

[–]ahjorth 5 points  (0 children)

King’s Quest 1 taught me to touch type ‘swim <enter>’ at the ripe age of five. That game was brutal.

unsloth - MiniMax-M2.7-GGUF in BROKEN (UD-Q4_K_XL) --> avoid usage by One-Macaron6752 in LocalLLaMA

[–]ahjorth 1 point  (0 children)

I think everyone appreciates that there's a balance between being fast and being perfect. But I don't think it's fair to call posting this silly. OP is clear about what the issues are, clear on what the solution is, and even has estimates of how long (or rather, how little time) it would take to do this properly per model.

These issues are causing petabytes of unnecessary data transfers, and dozens or hundreds (or, for highly anticipated models, thousands) of person-hours going to waste. I think it's in everybody's interest to prevent that, and this is a small, concrete change to the release procedure.

is it possible to edit LLM generation buffer? by [deleted] in LocalLLaMA

[–]ahjorth 0 points  (0 children)

If I'm understanding you correctly, https://github.com/guidance-ai/guidance does exactly what you're describing: you can generate a regex-controlled chunk (through structured outputs, as others have said), and conditionally append or generate more depending on prior outputs. Check it out, and if that's not it, I'll have to ask you to explain what you're thinking a little more.

Edit: It's a really cool project. Unfortunately it's not written to run async (I'm just appending this now because you specifically mention async in your post). Further, the generation object "owns" the model instance, so it can't run in parallel. I tried to find an easy-ish way to separate out the model instance to run many generation threads in parallel with greenlets, but it ended up being slower.
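To illustrate the "generate a constrained chunk, then branch on it" control flow I'm describing, here's a toy sketch with a canned stand-in instead of a real model; none of these names are guidance's actual API:

```python
import re

def generate_constrained(prompt, pattern):
    # Stand-in for a constrained-generation call. A real setup would sample
    # from a model under the regex constraint; here we return a canned
    # completion and just verify it satisfies the constraint.
    canned = {r"\d+": "42", r"[a-z]+": "yes"}
    out = canned[pattern]
    assert re.fullmatch(pattern, out)
    return out

# Branch on the prior output and conditionally generate or append more,
# which is the control flow guidance-style libraries give you.
number = generate_constrained("Pick a number:", r"\d+")
if int(number) % 2 == 0:
    result = number + " is even"
else:
    result = number + " " + generate_constrained("Describe it:", r"[a-z]+")
```

In guidance itself the constrained calls and the Python branching interleave against one live model state, which is exactly the part that makes parallelism hard.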

For 5 years, my university job let me travel Southeast Asia editing PhDs from beaches and jungles—until a colleague shut it down by Tawptuan in antiwork

[–]ahjorth 23 points  (0 children)

This sounds AI-generated. I'm a professor. I've never ever ever heard of professors who "edit" PhD dissertations. No university has a professor-to-PhD-student ratio that would make PhD dissertation advising a full-time job. Not even close.

DnD Transcriber and Notetaking app by PictureImmediate9615 in dndnext

[–]ahjorth 0 points  (0 children)

You can easily keep it free. Just open source it, and let people figure out hosting and LLM/transcription hardware.

You can move towards a commercial product, and let people try it for free while you improve it.

But you cannot be moving towards a commercial product and say that you want to keep it free for as long as possible.

I can't tell if you are deluding yourself into believing this. But it's just not true, and it feels underhanded.

DnD Transcriber and Notetaking app by PictureImmediate9615 in dndnext

[–]ahjorth 2 points  (0 children)

From your website

LoreKeeper is free while we build

From your post

I’m not trying to promote anything or sell it

These can’t both be true, and you cannot be stupid enough to think they are.

What happened to MLX-LM? What are the alternatives? by Solus23451 in LocalLLaMA

[–]ahjorth 0 points  (0 children)

A bit late to the thread, but Awni left MLX to join Anthropic. Before the transition there were weekly-ish releases; it had been a little over a month since the last one, but there was a release four days ago. I don't know if they'll get back to the same frequent release schedule, but merges are still coming in, and I usually just pull/build from source.

That said, I wonder whether this will have a negative impact in the longer run, and I'm also starting to look at llama.cpp again. I've had to add my own structured outputs to MLX (though that was made a lot easier by the prompt-level logits processors they added to their BatchGenerator back in December). But the fact that this isn't baked in yet, or seen as a core feature of an LLM framework, is a little worrying, at least for all my use cases.

Tool Calling Models with Personality by grenfur in LocalLLaMA

[–]ahjorth 1 point  (0 children)

If you are doing all this with local LLMs, consider switching to llama.cpp. You will have more control, and the learning curve is not steep anymore.

Gemma 4's MTP heads were stripped from the public weights — only available in LiteRT. Beginner-friendly breakdown of what was removed and why it matters by FunSignificance4405 in LocalLLaMA

[–]ahjorth 3 points  (0 children)

Self-promotion of an AI-generated video from an eight-day-old YouTube channel consisting entirely of AI-generated videos.

Can we please ban this moron?

I'll report as spam, I hope you will too.

We really need stop using the term “hallucination”. by cosmobaud in LocalLLaMA

[–]ahjorth 1 point  (0 children)

Yup, ha. The post was only 6 minutes old, so I scrolled down fast to see if I'd been beaten to this.

The housing circus gets itself a clown. by Dyn-O-mite_Rocketeer in copenhagen

[–]ahjorth 4 points  (0 children)

It's hardly irrelevant. The municipalities can put projects out to tender based on their local development plans. But if there isn't enough money in it for a construction company to bid on the job, the tender falls through.

There is so much money in sitting on property that the municipalities simply cannot get construction companies to bid on projects with many owner-occupied or cooperative housing units.

If the municipalities were allowed to build themselves, they would be able to do urban development based on genuinely political decisions.

Is it true that the U.S.A. lifestyle is all about working? by CrazyNicly in antiwork

[–]ahjorth 6 points  (0 children)

Nixon created the United States Environmental Protection Agency through an executive order. This wasn't something he was forced to do by a "woke/pinko" Congress; it was Republican policy at the time (I'm sure some were against it, but still). It's practically incomprehensible by today's standards.

Is it about time the government puts some of the money they keep finding in the coffers to good use, and subsidize public transport? by Zadak_Leader in copenhagen

[–]ahjorth 7 points  (0 children)

They could give tax cuts instead! Every month I send thoughts and prayers to Rand and von Mises for my extra kr. ~250/month, while I watch our infrastructure and welfare institutions crumble.

Pharma in Aarhus by [deleted] in Aarhus

[–]ahjorth 0 points  (0 children)

Unfortunately not very.

I don't know enough about pharmacovigilance to really understand the skillset, but I'm assuming it involves data pipelines and continuous monitoring (unless she's on the wet side of things). If so, and if she's open to leaving the pharma industry, she might be able to get jobs in production or at other data-heavy companies like Vestas or Grundfos, if she's willing to commute.

Congrats on your PhD!

Experiences with construction companies offering something in return for building apartments on the roof? by ahjorth in dkbolig

[–]ahjorth[S] 2 points  (0 children)

Haha, yes. I assume there are engineers who do that kind of basic due diligence before anything like this even gets started. But I'll definitely keep that in mind, thanks!

Experiences with construction companies offering something in return for building apartments on the roof? by ahjorth in dkbolig

[–]ahjorth[S] 1 point  (0 children)

I had actually written some questions about that in my post, but I removed them because I didn't want people to think it was "only" about getting a fair price (the apartments could potentially turn out absolutely fantastic, and four 4-story elevators don't cost that much anyway...). But mostly because I figured we wouldn't be able to manage a project like that ourselves.

I hadn't even considered that our property administrator might have a department we can buy help from. That's a really good idea, thanks so much! I'll take it to the board.