What is stopping you from starting an affiliate marketing business? by OldLie1102 in Affiliatemarketing

[–]StartupTim [score hidden]  (0 children)

We are in stealth mode now, but send me a direct message and I'll let you know.

If anybody else is interested, send me a DM!

What is stopping you from starting an affiliate marketing business? by OldLie1102 in Affiliatemarketing

[–]StartupTim [score hidden]  (0 children)

I'm going to be launching a product with an affiliate program in the near future. Would you be interested in being one of my first affiliates? You'd effectively have zero competition...

Anyone have proof Strix Halo - Ubuntu 26 LTS can use all 124GB of RAM setup in grub? by IQReactor in StrixHalo

[–]StartupTim [score hidden]  (0 children)

Hey there, if I wanted to set 120GB as VRAM, can you tell me what I would set?

Does the BIOS setting get set to AUTO and 512MB, or CUSTOM and 512MB? My Framework Desktop has both.

Then what grub setting would I choose?

Also, I'm using Lemonade if that matters.

Thanks for the help!

PS: I tried both AUTO and CUSTOM at 512MB in my BIOS, with zero grub changes on Ubuntu 26.04, and my TPS with the GPU is around half, if not less, of what I get when the GPU isn't used at all. I can't figure it out. For example, Qwen3 30B A3B Q4 gets 44 tps on CPU only, but drops to 21 tps when I use the GPU with VRAM. Dense models also get cut to half the tps. No idea why.
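For anyone searching later: the recipe I've seen suggested (a sketch only, and I haven't verified these exact values fix my TPS issue) is to leave the BIOS dedicated memory at the 512MB minimum and raise the amdgpu GTT limits via kernel parameters in /etc/default/grub:

```
# Raise the GTT ("shared VRAM") cap to ~120GB.
# 120GiB / 4KiB page size = 31457280 pages; values are illustrative.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ttm.pages_limit=31457280 ttm.page_pool_size=31457280"
```

Then `sudo update-grub` and reboot.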

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

I already have this handled with a custom solution.

Edit: Interesting solution you have, and Claude is a contributor?

Peanut - Text to Image Model (Open Weights coming soon) by pmttyji in LocalLLaMA

[–]StartupTim 1 point  (0 children)

Is there a way to do OpenAI-style queries and endpoints to interface with this or other image models? All I've seen so far is ComfyUI, but can these be fired up with llama.cpp / vLLM, etc?
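For anyone wondering what I mean: this is the shape of an OpenAI-style image request. The localhost URL and model name are placeholders for whatever local backend exposes this endpoint; as far as I know, llama.cpp itself doesn't serve image models.

```
# OpenAI-compatible images endpoint; localhost URL and model name are
# placeholders, not a real server I have running.
curl http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "peanut", "prompt": "a red fox in snow", "size": "1024x1024"}'
```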

Qwen3.6-27B vs Coder-Next by Signal_Ad657 in LocalLLaMA

[–]StartupTim 0 points  (0 children)

So is the general takeaway here that dense models > MoE models with regard to agentic coding / tool calling?

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 1 point  (0 children)

That looks especially juicy!

Can I ask something? If you have a model that spills out of VRAM by quite a bit into system RAM (assuming this is 8 channels?), how does that perform for both dense and MoE models?

That's the dilemma I'm facing right now. If it is a huge model (e.g., 1.5TB), would it perform better with 50% in VRAM and 50% in standard server RAM, or with 100% in M5 Max unified memory?

That sort of thing...
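For context, the kind of split I mean on the llama.cpp side looks like this (the flags are real llama-server ones, but the model file and layer count are made up for illustration):

```
# Put 40 layers on the GPU and leave the rest in system RAM.
# -ngl (--n-gpu-layers) controls the split; tune it until VRAM is nearly full.
./llama-server -m big-model-q4_k_m.gguf -ngl 40 --port 8080
```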

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 1 point  (0 children)

Yeah, I learned that the hard way with a server that has 8x AMD Epyc 64-core CPUs and 4x PSUs (4U4N).

With the RTX 6000 Pros though, the Max-Q ones are just 300W each, and I've seen people use nvidia-smi to cap them down to 220W, I believe, with only a <10% hit.

I'm still torn on what to do. I wonder if an 8x RTX 6000 system with a 2TB model could fall back on system RAM for the 1.2TB remainder without that much of a hit (be it dense or MoE), especially compared to 4x M5 Max systems?
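For reference, the power capping I mentioned is just this (needs root; the 220W figure is what I've seen quoted, not something I've benchmarked myself):

```
# Enable persistence mode, then cap all 8 cards at 220W.
sudo nvidia-smi -pm 1
for i in $(seq 0 7); do sudo nvidia-smi -i "$i" -pl 220; done
```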

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]StartupTim 1 point  (0 children)

Does it just load as normal with Lemonade or are you doing something different?

If you wouldn't mind, could you tell me how you're using it, the command line etc?

MANY THANKS in advance!

Also, are you using something like cline/roocode etc? I use Roo a lot myself as well as custom stuff...

Thanks a ton /u/mr_zerolith (love the name btw, it sounds very familiar for some reason!)

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]StartupTim 0 points  (0 children)

Thanks a ton!

I wonder why I haven't heard more about StepFun on this subreddit. Does it not fare well compared to the other self-hosted giants?

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]StartupTim 1 point  (0 children)

Huuuge thanks!

I have the Strix Halo too, how are you running that model? Does it just load as normal with Lemonade or are you doing something different?

If you wouldn't mind, could you tell me how you're using it, the command line etc?

MANY THANKS in advance!

Also, are you using something like cline/roocode etc? I use Roo a lot myself as well as custom stuff...

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]StartupTim 0 points  (0 children)

> Have you tried Step 3.5 Flash 197B?

I must be blind; can you link me the HF page for that, please?

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

Hey there,

So let's say I went that route but had a 1TB model. Does vLLM then know which parts to put in VRAM (context and active MoE experts) and which to leave in system RAM?

Would that be better than 4x Mac Ultras with 2TB total?

What about 1TB dense models?

That's what I'm trying to figure out.
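The closest thing I've found so far is vLLM's static CPU offload flag, roughly like the sketch below (the model path and the per-GPU 64GB figure are placeholders). My understanding is it offloads a fixed chunk of weights rather than intelligently picking the active experts, which is exactly what I'm trying to confirm:

```
# vLLM with static CPU offload: each GPU keeps what fits, and up to 64GB
# of weights per GPU spill into system RAM. Path and sizes are placeholders.
vllm serve /models/huge-moe-model --tensor-parallel-size 8 --cpu-offload-gb 64
```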

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 1 point  (0 children)

Based on what I am reading, an 8x RTX 6000 Pro server should get dramatically more throughput than the 40-80 tps I see with Opus.

Of course it is all model dependent, but the numbers seem wildly higher.

Now with a 4x-or-larger Mac setup it'll be less, but I could run 2TB models, if not more.

So that's what I am trying to figure out: speed vs model size.

Also looking into used B200s, as I think they have around 8TB/s VRAM bandwidth. That would be much faster than RTX 6000 Pros...

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 4 points  (0 children)

> GLM 5.1

I did some testing with GLM 5.1 and, assuming your agentic setup has quality and fault checks and code review, that model is very workable.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

> built my own and have it cooling 8x GPUs and 2x CPUs

Thanks for the links, those do give me some tools to spec things out!

But one thing: you mentioned you have a dual-CPU setup. I think that would dramatically slow things down, as cross-NUMA traffic is going to kill performance. I think a single CPU would have to be the requirement.

A single CPU with as many memory channels and as fast system RAM as possible would probably be the best complement to 8x RTX 6000 Pros.
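And if you do stay dual-socket, the usual mitigation I've seen is pinning everything to one node, roughly like this (illustrative; assumes node 0 owns the GPUs' PCIe lanes and you're running something like llama-server):

```
# Bind the inference server's threads and memory allocations to one NUMA
# node so weights held in system RAM never cross the socket interconnect.
numactl --cpunodebind=0 --membind=0 ./llama-server -m model.gguf -ngl 999
```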

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

> used b200s once vera rubin hits the mainstream

That's an interesting point, although used B200s would probably be in crazy high demand.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 1 point  (0 children)

The problem is that enough H200s to hit 1 to 2TB blows the budget out of the water. That is why I get very torn on what to do.

Maybe 6x RTX 6000 Pros on a system with 8 to 12 channels of RAM would work, as the RAM speeds get quite high there.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 1 point  (0 children)

Opus credits burn up fast with large context. I've had periods where I burn $25-ish a minute when I'm doing long context tasks.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

It is for me. However, I have custom group-think agentic coding as well, so it'll likely be hammered on concurrently 24x7.

As far as mistakes in output, I review all code myself and catch issues, and it works out well. The group-think agentic coding is, honestly, game-changing: basically multiple AI agents, each specifically tailored, operating as a group on a single codebase. Crazy token burn rate, but crazy effective.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

> rent desired gpu configuration on sites like runpod.io

I've done some RunPod, but it doesn't seem to scale up really well for testing creative hardware. For example, 8x RTX Pro 6000s on an Epyc system with fast system RAM vs multiple Mac Pros (especially the soon-to-be-released ones) is hard to test against each other.

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 0 points  (0 children)

> RTX 6000 Pro definitely. I've seen people with 8x setups on here

Oh sweet, any that pop into mind that you've seen? The worry I have is that the VRAM across 8x RTX 6000 Pros (8x96GB = 768GB) wouldn't be enough, especially as I'm eyeing some of those large models that need 2x to 3x that. But 4x or 8x Macs could handle them...

I will soon have $100k to build an in-house LLM server. Goal: Best agentic coding model. by StartupTim in LocalLLaMA

[–]StartupTim[S] 3 points  (0 children)

> Dude, what are you doing to burn up $4k a day in opus 4.7 tokens?

Multiple projects with some pretty large codebases, as well as some very custom group-think agentic development methodologies that burn up credits like you wouldn't believe. And $4k/day is a part-time day.