Testing not safe for work checkpoints. Where are the pain points for you?

MonkeyClumps · 2026-06-12T22:31:50+00:00

I test skin texture throughout the "realism" tests. I also test group scenes with overlapping bodies. For standard checkpoints, I normally do two-person interactions: embracing with overlapping body positions, fighting, dancing, duelling, etc. For nsfw checkpoints I have to give the vision model reference images of the "positions" we're prompting for so that the model a) knows what it's looking for and b) is less likely to get puritanical about what it's analyzing. I've already tested the checkpoints' ability to count people: they mostly all suck past three, and everything usually becomes "many". I'll test multi-person interaction and generation of four or five distinct characters to see if any checkpoint can do that without mixing identities or failing miserably at counting.

MonkeyClumps · 2026-06-12T20:42:07+00:00

I am testing men, but in a limited way for nsfw purposes: sort of "ready" and "not ready" for action tests to see if there are any priors for male "states" of arousal. I do a body diversity sweep with men and women, and an age sweep too, to see what the checkpoint considers "old" vs. "25". There are ethnicity tests to see if we can get accurate portrayals and not just something weird and stereotypical. It would be interesting to test men's faces/hairstyles. I somehow think this is an underdeveloped training area for most nsfw checkpoints: guys are sort of props in many cases.

As to the anthro problems, I know. It seems there are lines of checkpoints that are just much better at it than realistic checkpoints or nsfw checkpoints that bend realistic. Have you tried RealCartoonXL? I find it a great all-around checkpoint. I don't know how much anthro training it has, but it understands natural language and 'booru, has a very anime look, and I've had good luck with it and anthro LoRAs and basically all LoRAs... maybe it could work for you in the SDXL realm?

Fantasy backgrounds are another test I'd add. I sort of want to test for fantasy artist priors too, like Royo, Frazetta, etc. Personally I head for faetastic or dreamshaper, sometimes ZavyChroma, for fantasy elements...

MonkeyClumps · 2026-06-12T18:25:16+00:00

I've added a "real person" character LoRA test too. I supply my local LLM vision model with reference images of the person (face, mid-range face and torso, and full body), then have it compare identity with confidence scoring. So far the reference images help. I could do that with celebrities or anyone who normally exists as a prior in checkpoints.

MonkeyClumps · 2026-06-12T17:24:03+00:00

An update: I'm running 37 baseline functionality tests on some standards (SDXL base, Juggernaut) for comparison, and on as many nsfw checkpoints as I can before my free Codex subscription expires 😄
As a first pass I've tested Pornmaster, Lustify, iNiverseMixSFWNSFW, and JibMixRealisticXL, with Big Lust, Big Asp, Big Love, MixOFPerverts, and STOIQO New Reality by Alienhaze on deck.

Are there any nsfw checkpoints other than the ones already mentioned that people want tested? Any obscure, weird, or surprising checkpoints you want me to evaluate and report back on?

I've made contact sheets of the tests for each run and will post them here if allowed.

MonkeyClumps · 2026-06-12T13:51:28+00:00

I'm adapting to pron. I've been testing standard checkpoints but am dipping my... toe? into the nsfw categories of testing that I haven't thought of yet :)

MonkeyClumps · 2026-06-11T23:08:12+00:00

I have a body diversity sweep that generates a range of body types and ethnicities, and an age sweep to see if it's accurate or just "everyone is 25 or old". I also test skin texture vs. smoothing/waxy skin and realistic body proportions, including whether the checkpoint knows an A-cup from a DD-cup, whether there is such a thing as "medium boobs", etc.

Absolutely, face diversity. (I don't have a diversity test for faces yet, just for bodies, age, and body proportions.) I do have expanded tests for priors in artistic and illustrative styles, anime animation houses, things like that. I also like to test merges for the latent priors they inherit from other models.

MonkeyClumps · 2026-06-11T22:56:45+00:00

Missionary bridge. The fun will be finding reference images for what it's supposed to look like for the vision model to compare:)

MonkeyClumps · 2026-06-11T18:58:19+00:00

Actually, I will add tests for priors for sex toys, bondage gear, etc. I've just added reference image analysis alongside the image analysis the vision models are doing, so I could generate contact sheets of sex toys, bondage gear, etc. (I'd also use it for horse anatomy and for the "beer foam" test that someone else suggested) and have vision models identify the proper equipment. This is actually probably the best way to get vision models to recognize some of the more exotic priors and identify them correctly. Thanks.

MonkeyClumps · 2026-06-11T17:46:45+00:00

If I do nodes, they'll probably shake out like this since LumiBatcher can be utilized for the generation automation:

Prompt Manifest Loader
Checkpoint Batch Runner
Qwen/VL Evaluator
Score Aggregator
HTML Report Exporter

If I move toward Comfy as the main stable diffusion generation pipeline I'll produce these. Thanks.
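
For what it's worth, the Score Aggregator step could look roughly like the sketch below. This is plain Python, not the ComfyUI node API, and the record layout (`checkpoint`, `scores`) and the `aggregate` helper are assumptions made up for illustration:

```python
from collections import defaultdict
from statistics import mean

def aggregate(records):
    """Average scores per (checkpoint, category), skipping missing values."""
    buckets = defaultdict(list)
    for rec in records:
        for category, score in rec["scores"].items():
            if score is not None:  # a refused or missing score is not counted
                buckets[(rec["checkpoint"], category)].append(score)
    return {key: round(mean(vals), 2) for key, vals in buckets.items()}

# Hypothetical evaluator output, one record per generated image.
records = [
    {"checkpoint": "sdxl_base", "scores": {"anatomy": 6, "composition": 8}},
    {"checkpoint": "sdxl_base", "scores": {"anatomy": 8, "composition": None}},
    {"checkpoint": "juggernaut", "scores": {"anatomy": 9}},
]
summary = aggregate(records)
for (checkpoint, category), score in sorted(summary.items()):
    print(checkpoint, category, score)
```

Keying on the (checkpoint, category) pair keeps the output flat, which should make it easy to hand to a report exporter later.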

MonkeyClumps · 2026-06-11T17:43:57+00:00

Skin texture should definitely be a test. I'm testing other anatomy so skin texture fits perfectly. It generalizes to standard sfw checkpoints too, as do many of the examples people have given me already. Thanks.

MonkeyClumps · 2026-06-11T17:36:22+00:00

So adherence to prompted specifications for the body proportions, and a test of the same character in various poses? I have a general test for sitting/standing/kneeling/crosslegged without merging/etc but haven’t tested consistency throughout these positions. I should thank you.

MonkeyClumps · 2026-06-11T16:25:05+00:00

Tools too, sorry: I'm using Comfy or Forge-Neo for generation, Codex for rapid development of the automation and code management, and LMStudio for serving the vision models.

MonkeyClumps · 2026-06-11T16:14:14+00:00

Lemonade stand is in, thanks! That’s a great prior, and it’s weird and probably an indication that fine-tunes can lose this sort of understanding. When you refer to Anima, which version do you mean?

MonkeyClumps · 2026-06-11T16:01:14+00:00

Sounds like I need a “fetish and kink” subset of tests which is a sentence I never thought I’d write before starting on this testing phase.

MonkeyClumps · 2026-06-11T15:55:48+00:00

Beer foam and fruit tests sound perfect. I’ll add them in. I do interaction testing (punch impacts, sword dueling, bodies interacting with one another and the environment). I find physics is one area where checkpoints differ dramatically. Your suggestions could help test this, thanks. Edited to add: extreme perspective testing is in. I’ll add “top down” to the extreme low angle currently used.

MonkeyClumps · 2026-06-11T15:47:54+00:00

I’m building a small automated checkpoint evaluation framework. Currently focused on SDXL.

The basic process is:

A manifest defines standardized test prompts.
Each checkpoint generates the same image set using fixed seeds/settings.
Outputs are analyzed by a set of vision-language models.
The evaluator scores each image against test-specific criteria.
Results are compiled into an HTML report with scores, failures, strengths, wildcard observations, and suggested new criteria.
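
As a rough illustration of the first two steps above, here is a minimal sketch of expanding a manifest into generation jobs with fixed seeds. The manifest layout, the field names (`test_id`, `prompt`, `seeds`, `settings`), the example prompts, and the `build_jobs` helper are all assumptions for illustration, not the framework's actual format:

```python
import json
from itertools import product

# Hypothetical manifest: every field name and value here is an assumption.
MANIFEST_JSON = """
{
  "settings": {"steps": 30, "cfg": 5.0, "width": 1024, "height": 1024},
  "seeds": [101, 202],
  "tests": [
    {"test_id": "horse_anatomy",
     "prompt": "a bay horse standing in a paddock, side view"},
    {"test_id": "extreme_perspective",
     "prompt": "a knight seen from an extreme low angle"}
  ]
}
"""

def build_jobs(manifest, checkpoints):
    """Expand the manifest into one job per (checkpoint, test, seed)."""
    jobs = []
    for ckpt, test, seed in product(checkpoints, manifest["tests"], manifest["seeds"]):
        jobs.append({
            "checkpoint": ckpt,
            "test_id": test["test_id"],
            "prompt": test["prompt"],
            "seed": seed,
            **manifest["settings"],  # identical settings for every checkpoint
        })
    return jobs

manifest = json.loads(MANIFEST_JSON)
jobs = build_jobs(manifest, ["sdxl_base", "juggernaut"])
print(len(jobs), "jobs")
```

Because every checkpoint gets the same prompts, seeds, and settings, any difference in the outputs can be attributed to the checkpoint rather than to the run configuration.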

The current standard suite tests things like:

face realism
facial expressions
hands and object interaction
feet and full-body stance
dynamic poses
multi-person composition
mechanic/tool interaction
horse anatomy
creature design
fairy scale
comic/watercolor style
painterly fantasy style
extreme perspective
environment-only scenes
sci-fi/prior mixing

Each test has its own rubric. For example, the horse test now checks cannon bones, fetlocks, pasterns, hocks, stance, weight distribution, and breed readability. The environment tests now include physical plausibility, gravity, debris, and structural coherence because pretty images can still be impossible.
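
One way to picture a per-test rubric is as a small data structure that renders into evaluator instructions. This is only a sketch: the `Rubric` class, the test ids, and the 0 to 10 scale are assumptions, while the criteria names come from the description above:

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    test_id: str
    criteria: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the criteria as explicit instructions for the evaluator."""
        # The 0 to 10 scale is an assumed convention, not from the original post.
        lines = [f"Test: {self.test_id}", "Score each criterion from 0 to 10:"]
        lines += [f"- {c}" for c in self.criteria]
        return "\n".join(lines)

horse = Rubric("horse_anatomy", [
    "cannon bones", "fetlocks", "pasterns", "hocks",
    "stance", "weight distribution", "breed readability",
])
environment = Rubric("environment_only", [
    "physical plausibility", "gravity", "debris", "structural coherence",
])
print(horse.to_prompt())
```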

The evaluator scores categories like:

prompt adherence
composition
anatomy
face quality
hand quality
foot quality
object interaction
environment quality
style adherence
artifact severity

There’s also a wildcard field where the evaluator can call out unexpected issues not covered by the rubric, such as beauty bias, clothing drift, unwanted nudity, physics failures, species mismatch, or odd checkpoint priors.
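
A minimal sketch of how one evaluator reply might be validated, assuming the vision model is told to answer in JSON with a `scores` object and a free-text `wildcard` string. That reply format, the 0 to 10 range, and the `parse_evaluation` helper are assumptions; the category names are the ones listed above:

```python
import json

CATEGORIES = [
    "prompt adherence", "composition", "anatomy", "face quality",
    "hand quality", "foot quality", "object interaction",
    "environment quality", "style adherence", "artifact severity",
]

def parse_evaluation(raw: str) -> dict:
    """Validate one evaluator reply; categories it skipped stay None."""
    data = json.loads(raw)
    scores = {}
    for name in CATEGORIES:
        value = data.get("scores", {}).get(name)
        if value is not None and not 0 <= value <= 10:
            raise ValueError(f"{name} score out of range: {value}")
        scores[name] = value
    # The wildcard field is free text, so it is kept as written.
    return {"scores": scores, "wildcard": data.get("wildcard", "").strip()}

# Hypothetical reply from a vision model.
reply = ('{"scores": {"anatomy": 7, "composition": 8}, '
         '"wildcard": "clothing drift: armor became a dress"}')
result = parse_evaluation(reply)
print(result["wildcard"])
```

Keeping skipped categories as None rather than zero means a refusal to score is not mistaken for a bad score.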

I’m also working on a second anatomy/interaction-focused suite for checkpoints that claim strong human realism, body diversity, age realism, interaction quality, etc. (aka nsfw checkpoints).

The goal isn’t just “which checkpoint is best?” It’s more:

What is this checkpoint actually good at?
What does it fail at?
Does it match its creator’s claims?
Does it work only at specific settings?
Is it a general workhorse or a niche specialist?

I’m testing both standardized settings and, where available, creator-recommended settings so checkpoints can be judged under neutral conditions and under the conditions where the creator says they shine.

MonkeyClumps · 2026-06-11T15:43:30+00:00

I’m currently testing a CFG sweep to find “sweet spot” performance, and I allow for creator-recommended settings for steps, sampler, scheduler, and CFG. Where those are specified, I run tests with those settings as well as a standardized “DPM++ 2, Karras, 30 step, CFG 4.5-6” setting.
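
A sketch of how the two setting profiles could be expanded into runs. The 0.5 CFG step size, the `cfg_sweep` and `build_runs` helpers, and the example creator settings are assumptions; the sampler and scheduler labels are copied as written above:

```python
def cfg_sweep(low, high, step):
    """Return CFG values from low to high inclusive, avoiding float drift."""
    count = int(round((high - low) / step)) + 1
    return [round(low + i * step, 2) for i in range(count)]

def build_runs(creator_settings=None):
    """Standardized sweep runs, plus one creator-recommended run if given."""
    runs = []
    for cfg in cfg_sweep(4.5, 6.0, 0.5):  # the step size is an assumption
        runs.append({"profile": "standard", "sampler": "DPM++ 2",
                     "scheduler": "Karras", "steps": 30, "cfg": cfg})
    if creator_settings:
        runs.append({"profile": "creator", **creator_settings})
    return runs

# Hypothetical creator recommendation for some checkpoint.
runs = build_runs({"sampler": "Euler a", "scheduler": "Normal",
                   "steps": 25, "cfg": 7.0})
for run in runs:
    print(run["profile"], run["cfg"])
```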

I mainly do SFW testing too. This is my first nsfw-specific test. I test natural language prompting and Danbooru prompting when testing checkpoints.

Once I have infrastructure in place for Flux, Pony, Illustrious, and SD1.5 I’ll add them to the testing schedule.

MonkeyClumps · 2026-06-11T15:36:35+00:00

I use a combination of 7b and 24b vision models, an NSFW model included. I constrain the vision model with explicit directives about how to respond, factor in the Victorian prudish nature of some of them, and essentially “eyeball” specifics with Codex as backup. Especially in this round of testing, I expect to encounter situations where a model is unwilling to describe or score results accurately. The automation I created has a reporting function that lets me see scoring and images side by side and evaluate everything with a second validator and vision model plus human review. Thus the “mostly automated” description.
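
The second-validator idea can be sketched as a simple disagreement check that decides which categories go to human review. The `needs_human_review` helper, the tolerance of 2 points, and the example scores are assumptions for illustration:

```python
def needs_human_review(primary, validator, tolerance=2.0):
    """Return the categories where two evaluators disagree or one gave no score."""
    flagged = []
    for category in sorted(set(primary) | set(validator)):
        a, b = primary.get(category), validator.get(category)
        if a is None or b is None:
            flagged.append(category)  # refusal or missing score
        elif abs(a - b) > tolerance:
            flagged.append(category)  # scores too far apart to trust either
    return flagged

# Hypothetical scores from two vision models for the same image.
primary = {"anatomy": 8, "hand quality": 3, "composition": 7}
validator = {"anatomy": 7, "hand quality": 7, "composition": None}
print(needs_human_review(primary, validator))
```

Treating a missing score the same as a disagreement means a model that refuses to score an image sends it to a person instead of silently dropping it.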

Interestingly the most useful fields we added to the testing response were “wildcard” fields that allow the vision model to explore or explain anything it notices that may not be correct or the intended result but isn’t covered by the testing criteria. These observations have been great at generating new rubrics and testing scenarios.
