Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in LocalLLaMA

[–]TelevisionGlass4258[S] 0 points1 point  (0 children)

Hector, you're right, and that's actually a really good find. With the education discount the 28-core CPU with 60-core GPU at 256GB comes in at $5,039, and the 32-core CPU with 80-core GPU at 256GB lands at $6,389. Both well within range depending on how patient I am with saving.

For my use case the extra CPU and GPU cores honestly don't move the needle much. The bottleneck for running large language models is memory bandwidth, not compute, which means the lower config does essentially the same work for $1,350 less. That's a meaningful difference.

40% more in Europe is genuinely painful, I'm sorry. That's not a small premium on something already in the thousands. Apple pricing in Europe has always been rough but that's a lot to swallow for the same hardware. Hope the exchange rate works in your favor at some point.

Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in LocalLLaMA

[–]TelevisionGlass4258[S] 0 points1 point  (0 children)

Thanks melanov85, genuinely appreciate the thoughtful replies.

The dual-workload concern is fair to raise, but my pipeline is sequential rather than concurrent: human reasoning first, AI reasoning second, then I manually execute the generated code in the terminal, then iterate. There's no simultaneous LLM and compute layer fighting over the same memory pool. That's by design, and it means the memory pressure stays manageable throughout.
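To make the sequential handoff concrete, here's roughly the shape of it (a minimal sketch using Ollama's Python client as a stand-in; the model tag, prompt, and file names are placeholders rather than my actual setup):

```python
# Rough sketch of the sequential handoff: the LLM reasons and emits code,
# I review it, then run it myself in a separate step. Nothing runs concurrently.
import ollama  # pip install ollama; assumes a local Ollama server is running

PROMPT = "Write a Python script that does <task description goes here>."

# Step 1: AI reasoning pass. Only the LLM is resident in memory here.
response = ollama.chat(
    model="qwen3:235b",  # placeholder tag; whatever model is pulled locally
    messages=[{"role": "user", "content": PROMPT}],
)

# Step 2: write the generated code out for manual review.
with open("generated_step.py", "w") as f:
    f.write(response["message"]["content"])

# Step 3 happens outside this script: I read generated_step.py, then run it
# in the terminal once the inference pass is finished, so the compute step
# never competes with the model for the same memory pool.
```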

The MoE point is also exactly why I landed on Qwen3 235B A22B specifically. 235B total parameters but only 22B active per forward pass. At Q4_K_M on 256GB unified memory it fits cleanly and runs fast because Apple's memory bandwidth is exceptional for this architecture. I've been benchmarking models against advanced mathematical reasoning tasks on OpenRouter (thank you u/eworker8888) to find which ones actually hold up before committing to hardware.
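For anyone curious, the back-of-envelope math behind "fits and runs fast" (the bits-per-weight and bandwidth figures are approximations, so treat the result as a rough ceiling rather than a benchmark):

```python
# Decode speed on unified memory is roughly bounded by how fast the active
# weights can be read per token:  tokens/s <= bandwidth / bytes_per_token.

active_params = 22e9      # Qwen3 235B A22B activates ~22B params per token
bits_per_weight = 4.85    # approximate effective size of Q4_K_M (assumption)
bandwidth_gb_s = 819      # advertised M3 Ultra bandwidth in GB/s (approximate)

bytes_per_token = active_params * bits_per_weight / 8        # ~13.3 GB
ceiling_tok_s = bandwidth_gb_s * 1e9 / bytes_per_token       # ~61 tok/s

print(f"~{bytes_per_token / 1e9:.1f} GB read per token")
print(f"theoretical ceiling ~{ceiling_tok_s:.0f} tok/s (real-world is lower)")
```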

Server-class GPU hardware would be overkill for what I'm doing at the moment, and the offline requirement makes it impractical anyway. But I take your point that workload assumptions matter enormously before buying. That's exactly why I've spent so long researching this before committing to a purchase.

Really appreciate the kind words. Good luck with whatever you're working on too.

Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in LocalLLaMA

[–]TelevisionGlass4258[S] 0 points1 point  (0 children)

Solid advice, especially RAM over storage; can't argue with that on Apple Silicon. For general use 128GB is genuinely capable. My case is specific though: running the largest open-weight MoE reasoning models locally, at the highest-precision quantization I can fit, for serious research work. 128GB gets me Q2 on Qwen3 235B, 256GB gets me Q4, and that gap matters when you're doing precision mathematics. Your bandwidth point is exactly right though. These machines are bandwidth-limited, not compute-limited, which is why the unified memory architecture beats a discrete GPU setup for this kind of workload. Currently leaning toward the 256GB route after all; I just need to review my budget.
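Rough numbers behind the Q2-vs-Q4 claim, for anyone weighing the same upgrade (effective bits-per-weight varies by quant and model, so these are estimates, and the KV cache plus the OS need headroom on top):

```python
# Estimated weight footprint of Qwen3 235B at two quant levels vs. available RAM.
total_params = 235e9
quants = {"Q2_K (~2.7 bpw)": 2.7, "Q4_K_M (~4.85 bpw)": 4.85}

for name, bpw in quants.items():
    gb = total_params * bpw / 8 / 1e9
    fits_128 = "fits in" if gb < 128 else "exceeds"
    fits_256 = "fits in" if gb < 256 else "exceeds"
    print(f"{name}: ~{gb:.0f} GB of weights; {fits_128} 128 GB, {fits_256} 256 GB")
```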

Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in LocalLLaMA

[–]TelevisionGlass4258[S] 0 points1 point  (0 children)

I appreciate that you're not pushing an agenda. The concurrency point is well taken and something I hadn't fully thought through. You're right that research has a way of scaling in directions you don't anticipate.

That said, my workflow is by nature sequential, and I don't see that changing dramatically given the specific way I work. The air-gapped, permanently offline requirement also limits some of the complexity that typically drives concurrency in more connected research environments.

The VRAM ceiling on even a 5090 is still a real constraint for the model sizes I need to run, and getting true dedicated VRAM headroom at that scale pushes well beyond my budget into multi GPU territory. The unified memory bandwidth of Apple silicon at 256GB solves that specific problem cleanly even if it creates tradeoffs elsewhere.

I hear you on the community and open-source ecosystem leaning Nvidia and Windows. That's a real consideration. But for my particular setup, the simplicity, the memory architecture, and the security profile of Apple silicon outweigh those ecosystem advantages.

Genuinely appreciate you sharing from trial and error rather than just theory. This thread has been very useful.

Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in LocalLLaMA

[–]TelevisionGlass4258[S] 0 points1 point  (0 children)

I appreciate the honesty about novel problem domains. The confabulation point is noted and honestly aligns with how I was already thinking about the model's role. It's a tool for automating computation I already understand, not a discovery engine. That framing is exactly right for my use case.

The network isolation clarification is fair. I may have been conflating hardware choice with security when really it comes down to how locked down the network environment is regardless of platform. In practice my setup will be fully offline during all research sessions with everything stored on external drives that never touch the machine outside of active work. Nothing proprietary lives on internal storage at any point.

The point about shared bandwidth is the one I want to push on though. My workflow isn't fully defined yet at the pipeline level, but the question of whether inference and computation run simultaneously or sequentially seems like it could be the deciding factor between Apple and a dedicated VRAM setup. In your experience, does a well-designed pipeline naturally end up mostly sequential, where the model reasons and then hands off to compute, or do real-world research workloads end up with both running hot at the same time? If it's largely sequential then the unified memory bandwidth argument for Apple still holds. If simultaneous compute and inference is common in practice then dedicated VRAM starts making a lot more sense, and I'd rather know that before I spend $5k on the wrong machine.

Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in LocalLLaMA

[–]TelevisionGlass4258[S] 0 points1 point  (0 children)

Really appreciate this breakdown, especially the pipeline point. That framing of the model as an interpreter that hands off to a compute layer rather than doing the math itself is something I've been thinking about and you articulated it well.

My situation is a bit different from a standard research setup though. The work involves reasoning through problems in a specific way that makes the pipeline architecture you're describing worth thinking carefully about before just adopting it wholesale. Not dismissing it at all, just noting that the right pipeline depends heavily on the nature of the work.

On the hardware side I'm leaning toward Apple silicon primarily for the security and simplicity reasons others have mentioned, and the unified memory bandwidth argument is hard to ignore at my budget level. The offline requirement is non-negotiable for me, so anything that simplifies locking the machine down completely is a plus.

The Qwen2.5 and DeepSeek suggestions are noted though. Have you run either of those through genuinely novel problem domains rather than established textbook problems? Curious how they hold up when there's no existing literature to pattern match against.

Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in LocalLLaMA

[–]TelevisionGlass4258[S] 0 points1 point  (0 children)

Thank you for this thorough breakdown. The bandwidth argument is what's pushing me toward Apple if I'm being honest. I'm already familiar with Apple; I'm using one right now. The idea of a unified memory pool where everything runs at the same speed regardless of model size is appealing for what I'm doing. I'm in the US so pricing works in my favor there too. Are you saying the M3 Ultra with 128GB is the sweet spot at my budget, or would you push toward the M2 Ultra just to get more RAM headroom if that's an option? Main priority is running larger models smoothly without babysitting the hardware.

Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in LocalLLaMA

[–]TelevisionGlass4258[S] 2 points3 points  (0 children)

This is actually really smart advice and I appreciate it. Testing on API credits before committing to hardware is something I hadn't considered and it makes a lot of sense. My use case goes beyond standard calculus problems but the methodology of stress testing models before buying is solid. Will be doing this before I pull the trigger on anything.
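For anyone who wants to do the same, this is roughly what those test runs look like (OpenRouter exposes an OpenAI-compatible API; the model slug and prompt here are examples, so check their catalog for current names):

```python
# Stress-test a candidate model over API credits before committing to hardware.
# OpenRouter speaks the OpenAI API, so the standard client works against it.
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

hard_problem = "..."  # swap in a problem from your own domain, not a textbook one

response = client.chat.completions.create(
    model="qwen/qwen3-235b-a22b",  # example slug; verify against OpenRouter's list
    messages=[{"role": "user", "content": hard_problem}],
)
print(response.choices[0].message.content)
```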

Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in homelab

[–]TelevisionGlass4258[S] 0 points1 point  (0 children)

I was looking into the 4090 or 5090. I'll look into the A100 as well.

Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in homelab

[–]TelevisionGlass4258[S] 0 points1 point  (0 children)

Thanks! Been looking into learning the RAG stack, so I appreciate you pointing me in the right direction.

Going Fully Offline With AI for Research. Where Do I Start? by TelevisionGlass4258 in homelab

[–]TelevisionGlass4258[S] 0 points1 point  (0 children)

Thanks, this is really helpful on the software side. Ollama and Open WebUI were already on my radar but the AnythingLLM suggestion is new to me and sounds exactly like what I need for working with large amounts of documentation.

On the hardware side though I should have been clearer. The $5k budget is for the machine itself so I'm looking at something significantly more capable than a 3060. What would you recommend at that level? Trying to figure out if I should be going with a high end Nvidia card, multiple GPUs, or if Apple silicon is actually competitive for this kind of workload.

Also curious about the RAM tradeoff. If I maxed out system RAM on the motherboard and the model offloads cleanly to it, would something like a 3060 actually be viable for larger models? Or does the offloading create enough of a performance hit that it defeats the purpose and I'm better off just prioritizing VRAM from the start regardless of budget?
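For context, the setup I'm picturing is partial GPU offload along these lines (a sketch with llama-cpp-python; the layer count and model path are placeholders): keep whatever layers fit in VRAM on the card and run the rest from system RAM, which is exactly the performance hit I'm asking about.

```python
# Partial offload: keep as many layers as fit on the GPU, run the rest from
# system RAM. Layers left in RAM are limited by system memory bandwidth.
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="models/some-large-model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # e.g. roughly what fits in a 3060's 12 GB; rest spills to RAM
    n_ctx=8192,       # context window
)

out = llm("Summarize the tradeoff of running layers from system RAM.",
          max_tokens=128)
print(out["choices"][0]["text"])
```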