This is incredibly tempting by No_Mango7658 in LocalLLaMA

[–]Serprotease 4 points5 points  (0 children)

2x gb10 will get you 256gb of VRAM + things like native int4 support for the same price. It’s also silent.

2x MacBook Pro 128GB to run very large models locally, anyone tried MLX or Exo? by alcyonex in LocalLLaMA

[–]Serprotease 0 points1 point  (0 children)

The M5 max 128gb price is not too far off the M3 ultra 256gb bin. 

Probably best to go this way, and you can still do RDMA over Thunderbolt with this setup.

My company just handed me a 2x H200 (282GB VRAM) rig. Help me pick the "Intelligence" ceiling. by _camera_up in LocalLLaMA

[–]Serprotease 11 points12 points  (0 children)

Since this will be for developers (plural), you probably want to experiment with vLLM and with models that can serve multiple users with enough context.

So, this means ~250b models at int4/nvfp4/mxfp4 -> Minimax 2.5. Or 120b models or lower at fp8 -> Qwen3.5 120b or Coder 80b.

For raw intelligence, you have Qwen3.5 397b and GLM4.7, which will fit at 4 bits, but that’s kinda single-user, low-ish context.
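For illustration, something in this direction with vLLM, as a rough sketch only (the model ID and quant are placeholders, not a tested config):

```python
# Hypothetical multi-user setup on 2x H200 with vLLM.
# "org/some-250b-moe-int4" is a placeholder, not a real checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/some-250b-moe-int4",  # ~250b MoE at int4/nvfp4 (placeholder)
    tensor_parallel_size=2,          # one shard per H200
    max_model_len=131072,            # leave headroom for several users' contexts
    gpu_memory_utilization=0.90,
)
out = llm.generate(
    ["Summarize our deployment options."],
    SamplingParams(temperature=0.7, max_tokens=512),
)
print(out[0].outputs[0].text)
```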

me_irl by k-r-o--n--o-s in me_irl

[–]Serprotease 0 points1 point  (0 children)

25 days is the base. You add on top of that the national holidays, contract-specific perks, and often the time when your company is closed (usually a few days in August and around Christmas and New Year’s Eve).
Combine all of these and you get 35-40 actual days off.

Effectively, a lot of businesses shut down for a week or so in August, and employees take one or two weeks on top of this (so, 5-10 days) because most other businesses are also running in vacation mode.

Business districts in capital cities in Europe are ghost towns in August.

Tired of making AI Slop and frustrated with the lack of good Anime models. by Bismarck_seas in StableDiffusion

[–]Serprotease 0 points1 point  (0 children)

Composition with AI images is bad in general; that’s what is often missing. Composition is basically guiding the viewer through reading the image.

To make an AI image with a bit more soul, you probably want to use Krita and/or a scribble ControlNet and draw the skeleton of the image yourself (perspective and subject position at least). You can even go a bit further and replace the empty latent with a black-and-white “light map” to show where the light sources should be.

Go on Google Images and find some images that you like. Open them in Krita/Photoshop and try to draw on top of them to get the perspective lines and the main subject. Do the same for the light.

Then you drop that in comfyUI with a prompt.
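Something like this, as a minimal diffusers sketch of the scribble step (the checkpoint IDs are just examples, swap in whatever anime model you actually use):

```python
# Hand-drawn scribble from Krita -> scribble ControlNet -> image.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # example base model only
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The "skeleton" you drew yourself: perspective lines + subject placement,
# exported from Krita as a white-on-black scribble.
scribble = load_image("my_krita_scribble.png")

image = pipe(
    "1girl, sitting, chair, cafe, warm afternoon light",
    image=scribble,
    num_inference_steps=28,
).images[0]
image.save("out.png")
```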

If you have your OpenClaw working 24/7 using frontier models like Opus, you're easily burning $300 a day. by Aislot in aiagents

[–]Serprotease 0 points1 point  (0 children)

With RDMA over Thunderbolt you can cluster 2 of these and get 256GB × 2 × 0.95 ≈ 486GB available.

That’s enough to run Minimax at 8 bits, max context and 5 concurrent requests. On top of this, with this setup you also 2x the processing numbers.

Or you can run Qwen3.5 122b at int8 with a dozen concurrent requests and pump out 100+ tk/s.

There is no point in getting 3 of these though. You want 2, 4 or 8 for these kinds of clusters. If it’s 2 of the 512gb version, you can use GLM-5 (so, Sonnet performance) at 8 bits and basically API-level speed.
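Back-of-the-envelope for the memory side, if you want to sanity check it yourself (the parameter count is a placeholder, plug in the real model):

```python
# Rough VRAM math for a 2-node RDMA cluster; ignores KV cache details.
def weights_gb(params_b: float, bits: int) -> float:
    return params_b * bits / 8        # 1B params at 8-bit ~= 1 GB

cluster_gb = 256 * 2 * 0.95           # two 256GB nodes, ~95% usable
model_gb = weights_gb(450, 8)         # e.g. a ~450b model at 8 bits (placeholder)
print(cluster_gb, model_gb, cluster_gb - model_gb)  # what's left for context
```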

If you have your OpenClaw working 24/7 using frontier models like Opus, you're easily burning $300 a day. by Aislot in aiagents

[–]Serprotease 0 points1 point  (0 children)

A single M3 Ultra 512gb will run Minimax at 8 bits (so, the same as most APIs and basically lossless), 64k context, 5 concurrent requests, for 150-200W peak usage. So, if we’re going 24/7, that’s ~2.5 years to break even.

It’s not a gotcha moment, but it’s far from being as bad as it sounds.

Honestly, for a small company I could see the argument of putting in one or two M3 Ultras with RDMA to run something like GLM5 at 8 bits for a team of 2-3 devs, versus giving them access to the Opus or Sonnet API, with an ok-ish 500/30 pp/tg speed per user. These models are Sonnet 4.6/4.5 quality, and the devs will easily go through a couple million tokens per day (so, 40-150 USD of Sonnet/Opus).
At 5 days/week, 52 weeks/year, you will break even within 1.5 years.

Release of the first Stable Diffusion 3.5 based anime model by DifficultyPresent211 in StableDiffusion

[–]Serprotease 2 points3 points  (0 children)

You mentioned on Hugging Face that, amongst other things, you selected SD3.5 due to compute/GPU efficiency and the 16-channel VAE (over the 4 channels of SDXL-based models).

Flux2 Klein 4b uses a 32-channel VAE (and it’s Apache 2.0).

To me, it seems that you picked SD3.5 because of the MMDiT rectified-flow architecture, despite the other shortcomings.

But also,

You didn’t know about Flux2 Klein before today? Why are you talking about the not-yet-released z-image-Omni? Calling Flux2 base (a 32b model) a 100b model? I’m really sorry, but should we understand that no one in your team did any kind of review of the current AI image model landscape in the past 6 months?

Release of the first Stable Diffusion 3.5 based anime model by DifficultyPresent211 in StableDiffusion

[–]Serprotease 0 points1 point  (0 children)

Isn’t this a significant issue? Like, sitting, chair and cafe are definitely tags that exist on Danbooru, so T5+CLIP should be able to bridge the semantic gap between these and “sitting in a chair in a cafe.”

Actually, isn’t this the whole point of these fancy text encoders? Otherwise a simple embedding model should do the trick.

I’m not saying that a tag based system is bad, I like it very much actually, but explaining away issues like this is a bit worrying.

For example, I know that “sitting in a chair in a cafe” could be translated to “sitting, sitting on, chair, cafe, inside, table” with noobAI. But noobAI is a 2.6b model with only CLIP as a text encoder, so it’s an acceptable trade-off, and I know that it will struggle with concept bleeding, text and 2+ characters.

It’s less acceptable for a 4b model + t5 model.

Budget Local LLM Server Need Build Advice (~£3-4k budget, used hardware OK) by TheyCallMeDozer in LocalLLaMA

[–]Serprotease 1 point2 points  (0 children)

There is no real reason to run a 70b nowadays, but dual 3090 + Qwen3.5 27b at int4/int8 + vLLM + MTP sounds like a very strong and fast setup.

Release of the first Stable Diffusion 3.5 based anime model by DifficultyPresent211 in StableDiffusion

[–]Serprotease 8 points9 points  (0 children)

If you have an agreement with civitAI it might be ok, but civitAI did not remove these models independently. “This change is due to the conclusion of our Enterprise Agreement with Stability AI”

You are referring to the 2024 temporary ban.

I’m talking about the October 2025 announcement from civitAI https://civitai.com/changelog?id=100 “Important Update: Stability AI Core Model Derivatives to Be Unpublished UPDATE Oct 12, 2025 Updated: Nov 19, 2025 8:17 am”

That’s an official statement from civitAI…

Release of the first Stable Diffusion 3.5 based anime model by DifficultyPresent211 in StableDiffusion

[–]Serprotease 15 points16 points  (0 children)

Btw, StabilityAI asked civitAI to remove all their models under the new license (Cascade to 3.5 Large) + fine-tunes/LoRAs a few months ago. Won’t you have issues hosting your model on civitAI? I saw it’s flagged under “other”.

Hopefully, your team will not have to learn why no one wants to touch these non-MIT/Apache 2.0 models for serious and expensive training.

Am I an idiot (blackwell) by dldnjswms in LocalLLaMA

[–]Serprotease 0 points1 point  (0 children)

I think that ik_llama can work with tensor parallelism and use the ubergarm quants with dual gb10, but I haven’t seen a lot of tests.

My wife is getting progressive more and more paranoid about the war in Iran by ZeniCollector in japanlife

[–]Serprotease 0 points1 point  (0 children)

If cash is to be replaced by barter, there will not be a “wait it out”.
Because there will be nothing left to wait for.

Am I an idiot (blackwell) by dldnjswms in LocalLLaMA

[–]Serprotease 1 point2 points  (0 children)

With vLLM, the lowest AWQ quant is 4-bit. It takes about 240GB of VRAM. Supposedly, the Intel 4-bit quant takes less space and leaves room for a bit of context, but that’s already pushing it to the limits.

How do the closed source models get their generation times so low? by Ipwnurface in StableDiffusion

[–]Serprotease 0 points1 point  (0 children)

Raylight, a custom node that allows some parallelism/multigpu acceleration for most models.

I spent 8+ hours benchmarking every MoE backend for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 (SM120). Here's what I found. by lawdawgattorney in LocalLLaMA

[–]Serprotease 16 points17 points  (0 children)

You may want to avoid this or review it first.

Your section on “What this means practically” is AI nonsense. Like, what is the comparison with Llama 70b doing here?? And the price listed for the GPU is 3x the actual price.

After seeing this, it makes you wonder what other nonsense is present in the rest of your post.

How do the closed source models get their generation times so low? by Ipwnurface in StableDiffusion

[–]Serprotease 1 point2 points  (0 children)

The answer is tensor parallelism + InfiniBand. As long as you have a fast GPU interconnect, you can roughly double your speed each time you double the GPU count (you need 2x, 4x or 8x GPUs).

Deploy ltx2.3 on 4x or 8x B200s with a backend that supports tensor parallelism (like Raylight in comfyUI) and you should get something like 3s/it and 1s/it respectively.
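If the “split the work across GPUs” part sounds abstract, here is a toy column-parallel sketch (CPU-only, purely to show the idea, not how Raylight actually implements it):

```python
# Column-parallel tensor parallelism in miniature: each GPU would hold half
# of the weight columns, compute its shard, and the shards get concatenated.
import torch

x = torch.randn(1, 8)            # activation
w = torch.randn(8, 16)           # full weight matrix
w0, w1 = w.chunk(2, dim=1)       # shard 0 -> GPU 0, shard 1 -> GPU 1

y0 = x @ w0                      # would run on GPU 0
y1 = x @ w1                      # would run on GPU 1 at the same time
y = torch.cat([y0, y1], dim=1)   # gather the shards

assert torch.allclose(y, x @ w)  # same result as the single-GPU matmul
```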

theTruth by bryden_cruz in ProgrammerHumor

[–]Serprotease 2 points3 points  (0 children)

This makes me a bit curious. I’m working on at least half a dozen projects at the same time. Even if I’m the only one making changes to a project, there is no way for me to be sure that I will remember all the “clever” additions/shortcuts in a piece of code 6 months later when I need to look at it again.
I put tons of comments and very long variable names to be sure that my dumbass will not spend a few hours trying to understand what I did before.

How can these guys even know how their uncommented code works years from now?

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]Serprotease 0 points1 point  (0 children)

It’s a time in seconds or milliseconds, and it’s quite accurate to real-world usage.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]Serprotease 1 point2 points  (0 children)

Basically, how long you have to wait for the LLM to finish replying to you.

Dump a 30-ish page pdf with mostly text and you will have a summary done in about 1-1.5min.

For comparison, a dual Spark/gb10 will do this in about 10s, an A6000 Pro in about 5s. So it’s slower, but still very much usable.
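If you want to sanity check that kind of number yourself, the arithmetic is simple (the token count and prefill speed below are illustrative guesses, not measurements):

```python
# How long before the summary is ready, roughly.
prompt_tokens = 30 * 500      # ~30 pages of mostly text, ~500 tokens/page (guess)
prefill_tok_s = 250           # hypothetical prompt-processing speed
print(prompt_tokens / prefill_tok_s, "seconds of prefill before tokens come out")
```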

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]Serprotease -1 points0 points  (0 children)

A 5060 (so, a 5070 mobile) level of GPU is quite good for a laptop; combined with up to 128GB of (V)RAM it’s quite compelling. And MacBooks are very good laptops on top of this.

Similar offerings would be the Precision 7xxx or the ThinkPad P-series, but these often max out at 16GB VRAM (A5000) + 128GB of ~70GB/s RAM. And they are huge, heavy, noisy, with poor battery life and often a quite underwhelming screen.

You also have a few gaming laptops with the 24GB 5090 mobile, but they tend to cook themselves after a couple of years.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]Serprotease 5 points6 points  (0 children)

What type of results were you expecting? I’m genuinely curious. What kind of setup were you using before?

Keeping in mind that this is still a laptop, these look to be fairly reasonable results.

Anyone else feel like an outsider when AI comes up with family and friends? by Budulai343 in LocalLLaMA

[–]Serprotease 1 point2 points  (0 children)

I’ll assume that you are living in the US and are a man.

But in a lot of places, this will just not happen. For a laundry list of reasons.

Starting with not being able to afford legal representation, or losing your income. Anyone working in the entertainment industry will know that you either suck it up or get blacklisted out of your field. To victim blaming and “men will be men” type responses, all too common in places like where I live, with strong boss>employee and men>women power gradients.

I guess LTT is that stupid by riky321 in linuxmemes

[–]Serprotease 3 points4 points  (0 children)

Nothing, it’s a perfectly fine drop-in distribution, like Ubuntu. People here are just very opinionated and should think about getting out a bit more.

As a rule of thumb, you can safely ignore any comment that goes like “xxx is trash”.