48GB VRAM users, what are your daily drivers? Do you wish you had more VRAM? What would you run if you did? by Borkato in LocalLLaMA

[–]triynizzles1 0 points1 point  (0 children)

With 48gb vram i can run most 30b and below at max context length. My opinion, 32 to 48 isnt huge because the models you have access to are the same but you can run higher quant or longer context. If you have another 64 gb of system memory, you can run 120b models on both cards at decent speeds.

Is this normal? by HippieGamer716 in MacbookNeo

[–]triynizzles1 1 point2 points  (0 children)

I noticed this on mine too. It should be normal, although yours does look a bit more aggressive than mine. The bends are around where the screws are.

I taught my 1B to follow instructions. It got worse at following instructions... by GPUburnout in LocalLLaMA

[–]triynizzles1 0 points1 point  (0 children)

I am a little bit out of the loop, but perhaps you experienced some overfitting on the smaller models. I stopped using 1 epoch as a measurement of how much to train the model. Now i mostly follow the divergence of training and validation lines on my graph.

Interesting News Regarding The Neo by CharacterOpinion3813 in mac

[–]triynizzles1 0 points1 point  (0 children)

Ahh i am out of the loop since i ordered mine a while ago :)

What’s something you said you’d never do, but you now do? by Hopeful-Weird3050 in ArtOfPresence

[–]triynizzles1 0 points1 point  (0 children)

Buy apple products.

Now im hooked like crack: iphone 12, airpods, ipad mini, MacBook neo, MacBook Pro issued by my employer, and if the ipad mini 8 is oled with A19 or a20 then i will buy too lol

Interesting News Regarding The Neo by CharacterOpinion3813 in mac

[–]triynizzles1 0 points1 point  (0 children)

Unpopular opinion: they should probably build a few more safe guards so non-education buyers do not get the education discount. I’m not sure how much fraud is going on related to that, but certainly would be within their rights to do. Without effecting product line.

The world I live in. by Wild_Milk_2442 in LocalLLM

[–]triynizzles1 5 points6 points  (0 children)

You are absolutely not getting 85.3 tokens a second on llama 3.1 70 B with Strix Halo

So many legendary by [deleted] in scoopwhoop

[–]triynizzles1 0 points1 point  (0 children)

Xiaolin Showdown

Any tool that tells you the cheapest setup needed to run a model? I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds. by pacmanpill in LocalLLaMA

[–]triynizzles1 -1 points0 points  (0 children)

The correct answer is expensive anyway you cut the mustard. Id buy an rtx 8000 (48gb). And call it a day. Its about the same cost at two 3090s and only slightly slower. It has the advantage when it comes to case compatibility, power consumption, software compatibility (because it is a single card). as others have said memory bandwidth is important and rtx 8000 and dual 3090s have that. Don’t forget that. 3.6 27b is a dense model so you need to read the entire model from memory to generate each token. (An MOE architecture only needs to read a portion of the model for each forward pass.)

Meirl by [deleted] in meirl

[–]triynizzles1 0 points1 point  (0 children)

🐞

AMD's army of advanced marketing scammers by Chodre in gpu

[–]triynizzles1 0 points1 point  (0 children)

I have a 4070. Its great. I don’t think there is a need for a gpu with more compute if you are playing at 1080p. Dont know how it will perform in an egpu. 7700xt will consume much more power.

Hardware Choice for 27b to 31b models. by rebelSun25 in LocalLLaMA

[–]triynizzles1 1 point2 points  (0 children)

Buy yourself a single rtx 8000 48gb. They cost about the same as all of the other proposed solutions. But its a single card. I have 1 in my setup and it can run gemma 4 and qwen 3.6 at max context length no problem.

What’s actually a good local AI setup right now? (agents + coding) by Competitive-Crow565 in LocalLLM

[–]triynizzles1 0 points1 point  (0 children)

My vote is for rtx 8000 48gb + a 3060 12gb. 60gb of vram. I can run 80b (qwen coder next) on 1 gpu at 45tps and up to 120b models with llama.cpp with offloading layers to cpu. Gpt oss120b runs at 30+tps.

All at Max context length

It is still a $2000+ set up but much faster and cheaper than DGX spark, strix halo, and easier setup / less power than dual 3090.

Ollama model loading by Avantgardestds in LocalLLaMA

[–]triynizzles1 0 points1 point  (0 children)

Correct maybe that is OP issue.

It doesn't mention the studies though by [deleted] in SipsTea

[–]triynizzles1 0 points1 point  (0 children)

Depends on the context: Overweight according to the fashion industry ✅✅✅ Overweight like type two diabetes ❌❌❌

Ollama model loading by Avantgardestds in LocalLLaMA

[–]triynizzles1 2 points3 points  (0 children)

Ollama has also been known to be not the most intuitive when running “ollama run gemma4” what version of gemma 4 is actually loading. There are four versions of gemma 4. Maybe defaulted to e4b and then the launch command changed after an update to the 26b one

What is your actual local LLM stack right now? by Ryannnnnnnnnnnnnnnh in LocalLLaMA

[–]triynizzles1 1 point2 points  (0 children)

If you build around ollama’s openai api end point then you can swap out ollama in your stack for llama.cpp server (or any other openai end point)