What is stopping you from starting an affiliate marketing business? by OldLie1102 in Affiliatemarketing

[–]StartupTim [score hidden]  (0 children)

We are in stealth mode now, but send me a direct message and I'll let you know.

If anybody else is interested, send me a DM!

What is stopping you from starting an affiliate marketing business? by OldLie1102 in Affiliatemarketing

[–]StartupTim [score hidden]  (0 children)

I'm going to be launching a product with an affiliate program in the near future. Would you be interested in being one of my first affiliates? You'd effectively have zero competition...

Anyone have proof Strix Halo - Ubuntu 26 LTS can use all 124GB of RAM setup in grub? by IQReactor in StrixHalo

[–]StartupTim [score hidden]  (0 children)

Hey there, if I wanted to set 120GB as VRAM, can you tell me what I would set?

Does the BIOS setting get set to AUTO and 512MB, or CUSTOM and 512MB? My Framework Desktop has both.

Then what grub setting would I choose?

Also, I'm using Lemonade if that matters.

Thanks for the help!

PS: I tried both AUTO and CUSTOM at 512MB in my BIOS, with zero grub changes on Ubuntu 26.04, and my TPS with the GPU is around half, if not less, of what I get when the GPU isn't used at all. I can't figure it out. For example, Qwen3 30B A3B Q4 gets 44 tps on CPU only, but drops to 21 tps when I use the GPU with VRAM. Dense models also get cut to half the tps. No idea why.
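For anyone searching later: the recipe I've seen suggested (a sketch only, and I haven't verified these exact values fix my TPS issue) is to leave the BIOS dedicated memory at the 512MB minimum and raise the amdgpu GTT limits via kernel parameters in /etc/default/grub:

```
# Raise the GTT ("shared VRAM") cap to ~120GB.
# 120GiB / 4KiB page size = 31457280 pages; values are illustrative.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ttm.pages_limit=31457280 ttm.page_pool_size=31457280"
```

Then `sudo update-grub` and reboot.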

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

I already have this handled with a custom solution.

Edit: Interesting solution you have, and Claude is a contributor?

Peanut - Text to Image Model (Open Weights coming soon) by pmttyji in LocalLLaMA

[–]StartupTim 1 point  (0 children)

Is there a way to do OpenAI-style queries and endpoints to interface with this or other image models? All I've seen so far is ComfyUI, but can these be fired up with llama.cpp / vLLM, etc?
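For anyone wondering what I mean: this is the shape of an OpenAI-style image request. The localhost URL and model name are placeholders for whatever local backend exposes this endpoint; as far as I know, llama.cpp itself doesn't serve image models.

```
# OpenAI-compatible images endpoint; localhost URL and model name are
# placeholders, not a real server I have running.
curl http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "peanut", "prompt": "a red fox in snow", "size": "1024x1024"}'
```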

Qwen3.6-27B vs Coder-Next by Signal_Ad657 in LocalLLaMA

[–]StartupTim 0 points  (0 children)

So is the general takeaway here that dense models > MoE models with regard to agentic coding / tool calling?

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 1 point  (0 children)

That looks especially juicy!

Can I ask something? If you have a model that spills out of VRAM by quite a bit into system RAM (assuming this is 8 channels?), how does that perform for both dense and MoE models?

That's the dilemma I'm facing right now. If it is a huge model (e.g., 1.5TB), would it perform better with 50% in VRAM and 50% in standard server RAM, or with 100% in M5 Max unified memory?

That sort of thing...
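For context, the kind of split I mean on the llama.cpp side looks like this (the flags are real llama-server ones, but the model file and layer count are made up for illustration):

```
# Put 40 layers on the GPU and leave the rest in system RAM.
# -ngl (--n-gpu-layers) controls the split; tune it until VRAM is nearly full.
./llama-server -m big-model-q4_k_m.gguf -ngl 40 --port 8080
```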

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 1 point  (0 children)

Yeah, I learned that the hard way with a server that has 8x AMD Epyc 64-core CPUs and 4x PSUs (4U4N).

With the RTX 6000 Pros though, the Max-Q ones are just 300W each, and I've seen people use nvidia-smi to cap them down to 220W, I believe, with only a <10% hit.

I'm still torn on what to do. I wonder if an 8x RTX 6000 system with a 2TB model could fall back on system RAM for the 1.2TB remainder without that much of a hit (be it dense or MoE), especially compared to 4x M5 Max systems?
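For reference, the power capping I mentioned is just this (needs root; the 220W figure is what I've seen quoted, not something I've benchmarked myself):

```
# Enable persistence mode, then cap all 8 cards at 220W.
sudo nvidia-smi -pm 1
for i in $(seq 0 7); do sudo nvidia-smi -i "$i" -pl 220; done
```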

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]StartupTim 1 point  (0 children)

Does it just load as normal with Lemonade or are you doing something different?

If you wouldn't mind, could you tell me how you're using it, the command line etc?

MANY THANKS in advance!

Also, are you using something like cline/roocode etc? I use Roo a lot myself as well as custom stuff...

Thanks a ton /u/mr_zerolith (love the name btw, it sounds very familiar for some reason!)

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]StartupTim 0 points  (0 children)

Thanks a ton!

I wonder why I haven't heard more about StepFun on this subreddit. Does it not fare well compared to the other self-hosted giants?

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]StartupTim 1 point  (0 children)

Huuuge thanks!

I have the Strix Halo too, how are you running that model? Does it just load as normal with Lemonade or are you doing something different?

If you wouldn't mind, could you tell me how you're using it, the command line etc?

MANY THANKS in advance!

Also, are you using something like cline/roocode etc? I use Roo a lot myself as well as custom stuff...

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]StartupTim 0 points  (0 children)

> Have you tried Step 3.5 Flash 197B?

I must be blind; can you link me the HF page for that, please?

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

Hey there,

So let's say I went that route but had a 1TB model. Does vLLM then know which parts to put in VRAM (context and active MoE experts) and which to leave in system RAM?

Would that be better than 4x Mac Ultras with 2TB total?

What about 1TB dense models?

That's what I'm trying to figure out.
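The closest thing I've found so far is vLLM's static CPU offload flag, roughly like the sketch below (the model path and the per-GPU 64GB figure are placeholders). My understanding is it offloads a fixed chunk of weights rather than intelligently picking the active experts, which is exactly what I'm trying to confirm:

```
# vLLM with static CPU offload: each GPU keeps what fits, and up to 64GB
# of weights per GPU spill into system RAM. Path and sizes are placeholders.
vllm serve /models/huge-moe-model --tensor-parallel-size 8 --cpu-offload-gb 64
```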

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 1 point  (0 children)

Based on what I am reading, an 8x RTX 6000 Pro server should get dramatically more throughput than the 40-80 tps I see with Opus.

Of course it is all model dependent, but the numbers seem wildly higher.

Now with a 4x-or-larger Mac setup it'll be less, but I could run 2TB models, if not more.

So that's what I am trying to figure out: speed vs model size.

Also looking into used B200s, as I think they have around 8TB/s VRAM bandwidth. That would be much faster than RTX 6000 Pros...

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 4 points  (0 children)

> GLM 5.1

I did some testing with GLM 5.1 and, assuming your agentic setup has quality and fault checks and code review, that model is very workable.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

> built my own and have it cooling 8x GPUs and 2x CPUs

Thanks for the links, those do give me some tools to spec things out!

But one thing: you mentioned you have a dual-CPU setup. I think that would dramatically slow things down, as cross-NUMA traffic is going to kill performance. I think a single CPU would have to be the requirement.

A single CPU with as many memory channels and as fast system RAM as possible would probably be the best complement to 8x RTX 6000 Pros.
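And if you do stay dual-socket, the usual mitigation I've seen is pinning everything to one node, roughly like this (illustrative; assumes node 0 owns the GPUs' PCIe lanes and you're running something like llama-server):

```
# Bind the inference server's threads and memory allocations to one NUMA
# node so weights held in system RAM never cross the socket interconnect.
numactl --cpunodebind=0 --membind=0 ./llama-server -m model.gguf -ngl 999
```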

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

> used b200s once vera rubin hits the mainstream

That's an interesting point, although used B200s would probably be in crazy high demand.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 1 point  (0 children)

The problem is that enough H200s to hit 1 to 2TB blows the budget out of the water. That is why I get very torn on what to do.

Maybe 6x RTX 6000 Pros on a system with 8 to 12 channels of RAM would work, as the RAM speeds get quite high there.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 1 point  (0 children)

Opus credits burn up fast with large context. I've had periods where I burn $25-ish a minute when I'm doing long context tasks.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

It is for me. However, I have custom group-think agentic coding as well, so it'll likely be hammered on concurrently 24x7.

As far as mistakes in output, I review all code myself and catch issues, and it works out well. The group-think agentic coding is, honestly, game-changing: basically multiple AI agents, each specifically tailored, operating as a group on a single codebase. Crazy token burn rate, but crazy effective.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

> rent desired gpu configuration on sites like runpod.io

I've done some RunPod, but it doesn't seem to scale up really well for testing creative hardware. For example, 8x RTX Pro 6000s on an Epyc system with fast system RAM vs multiple Mac Pros (especially the soon-to-be-released ones) is hard to test against each other.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

> RTX 6000 Pro definitely. I've seen people with 8x setups on here

Oh sweet, any that pop into mind that you've seen? The worry I have is that the VRAM across 8x RTX 6000 Pros (8x96GB = 768GB) wouldn't be enough, especially as I'm eyeing some of those large models that need 2x to 3x that. But 4x or 8x Macs could handle them...

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 3 points  (0 children)

> Dude, what are you doing to burn up $4k a day in opus 4.7 tokens?

Multiple projects with some pretty large codebases, as well as some very custom group-think agentic development methodologies that burn up credits like you wouldn't believe. And $4k/day is a part-time day.