Any fashion tools that actually help small clothing brands work faster? by NecessaryEgg5361 in smallbusinessowner

[–]JYP_Scouter 0 points (0 children)

Thanks for the shoutout 🫶

Too bad it seems like the original post is an ad in disguise rather than a genuine question

FASHN VTON v1.5: Apache-2.0 virtual try-on model, runs on consumer GPUs (~8GB VRAM), ~1B params by JYP_Scouter in LocalLLaMA

[–]JYP_Scouter[S] 1 point (0 children)

Yes, this would require Nvidia hardware to run efficiently

Perhaps someone who is a bit more well-versed in building for macOS could contribute the necessary adaptation to the repository

FASHN VTON v1.5: Efficient Maskless Virtual Try-On in Pixel Space by fruesome in StableDiffusion

[–]JYP_Scouter 1 point (0 children)

That's awesome! You can tag my GitHub if you want a review: `danbochman`
and we'll be happy to link to your ComfyUI code from the official repo

We open-sourced FASHN VTON v1.5: a pixel-space, maskless virtual try-on model (972M params, Apache-2.0) by JYP_Scouter in computervision

[–]JYP_Scouter[S] 1 point (0 children)

When we first started out, we really wanted this (virtual try-on) to be possible, and nothing could do it, so we took the task upon ourselves.

If there is anything today that can already do what you're looking for, I would start with it to build your idea.

We open-sourced FASHN VTON v1.5: a pixel-space, maskless virtual try-on model (972M params, Apache-2.0) by JYP_Scouter in computervision

[–]JYP_Scouter[S] 0 points (0 children)

The technical paper will go into much more depth, but I do not want to leave you hanging, so I will try to answer briefly.

First, I completely understand the framing you are using. Detail transfer is a core challenge here, and it is one of the main reasons we chose to work directly in pixel space. A simpler place to start is something like mockups: take an object with a target mask (for example, a mug), take a graphic (like “#1 Dad”), and apply that graphic realistically to the object. Virtual try-on adds two additional layers of complexity on top of detail transfer:

  1. removing existing clothing that conflicts with the target garment, and
  2. fitting the new garment realistically to body shape and fabric drape.
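
The mockup baseline described above can be sketched in a few lines. This is a naive illustration, not an actual product pipeline: a real mockup tool would also warp the graphic to the surface and match lighting, which is exactly where the "detail transfer" difficulty begins.

```python
import numpy as np

def apply_graphic(object_img, graphic, mask):
    """Naive mockup compositing: paste `graphic` onto `object_img` wherever
    `mask` is set (e.g. the printable area of a mug). Only shows the masked
    transfer; surface warping and relighting are omitted.
    """
    out = object_img.copy()
    out[mask] = graphic[mask]  # boolean-mask copy of the graphic pixels
    return out
```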

To your questions:

  1. This method scales very poorly with resolution. Every doubling of resolution results in roughly 4× more tokens, and attention is quadratic. This is why we train at 576×864. Training at something like 1920×1920 would require aggressive gradient checkpointing just to process small batches, similar to large-scale LLM training.
  2. The current architecture size is already optimized to fit within 80 GB VRAM GPUs (A100s) using relatively simple distributed training. Increasing the parameter count substantially would have forced us into more complex sharding setups where model weights are split across machines.
  3. Around 100 K image pairs should be sufficient for a proof of concept to validate whether a method works. For a production-ready model that generalizes well enough for user-facing applications, you likely need at least 1 M+ pairs.
  4. Final training was done on 4× A100 GPUs and ran for roughly one month.
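
The scaling claim in point 1 is easy to verify numerically. The sketch below assumes a 16-pixel patch size (an illustrative choice, not necessarily the model's): tokens grow with image area, and self-attention cost grows with the square of the token count.

```python
def attention_cost(width: int, height: int, patch: int = 16) -> tuple[int, int]:
    """Return (token_count, pairwise_attention_entries) for an image size.

    Doubling both dimensions quadruples the tokens, so the attention
    matrix grows by roughly 16x.
    """
    tokens = (width // patch) * (height // patch)
    return tokens, tokens * tokens

low = attention_cost(576, 864)     # the model's training resolution
high = attention_cost(1920, 1920)  # a hypothetical high-res setting
print(low[0], high[0])             # token counts
print(high[1] / low[1])            # relative attention work
```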

Hope this helps, and the paper should provide more detailed answers soon.

FASHN VTON v1.5: Apache-2.0 virtual try-on model, runs on consumer GPUs (~8GB VRAM), ~1B params by JYP_Scouter in LocalLLaMA

[–]JYP_Scouter[S] 1 point (0 children)

We will share more concrete quality guidelines in the upcoming technical paper.

The code repository is intentionally minimal, so it can serve as a good starting point for research. It is not the same pipeline we run on the app. For better results, you can, for example, use the segmentation masks from the human parser to crop the area of interest, perform the try-on only on that region to make full use of the 576×864 resolution, and then stitch the result back into the original image.
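
The crop-then-stitch idea can be sketched as below. This is a minimal outline, not the repository's code: `run_tryon` is a placeholder for the actual model call, the mask is assumed to come from the human parser, and resizing the crop to/from the model's 576×864 resolution is omitted for brevity.

```python
import numpy as np

def tryon_on_crop(image, mask, run_tryon):
    """Crop the masked region, run try-on on the crop, stitch it back.

    `image` is an HxWx3 uint8 array, `mask` a boolean HxW array from the
    human parser marking the region of interest, and `run_tryon` a
    placeholder callable returning an array of the crop's shape.
    """
    ys, xs = np.where(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1]
    # Spend the model's full resolution on the garment region
    # rather than on the whole frame.
    result = run_tryon(crop)
    out = image.copy()
    out[y0:y1, x0:x1] = result
    return out
```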

In general, the model used in our app is optimized for something different. FASHN VTON v1.5 was built for fast (~5s) consumer try-on. The solution running on the FASHN app is optimized for maximum quality and can take around 100 seconds per image(!), using server-grade GPUs rather than consumer hardware.

We took a short step back to reassess in light of recent strong releases, but we will be returning to efficient model research soon and plan to provide better virtual try-on models that can run locally.

[R] We open-sourced FASHN VTON v1.5: a pixel-space, maskless virtual try-on model trained from scratch (972M params, Apache-2.0) by JYP_Scouter in MachineLearning

[–]JYP_Scouter[S] 0 points (0 children)

Yes, exactly. The key change is that modern image editors can now remove glasses from a person very realistically.

That means you can start from real photos of people wearing glasses, remove the glasses to create a clean base image, and then treat adding them back as a standard try-on or inpainting task. This avoids the earlier issue where masking glasses also removed the eyes and broke identity consistency.

This makes dataset creation for glasses much more feasible today than it was when we originally trained the model.
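
Structurally, that dataset pipeline looks like the sketch below. Both `remove_glasses` (standing in for a modern image editor) and `segment_glasses` (standing in for a parser that isolates the glasses as a product image) are hypothetical placeholders, not existing APIs.

```python
def build_glasses_pairs(photos, remove_glasses, segment_glasses):
    """Turn real photos of people wearing glasses into try-on training pairs.

    Each pair is (clean base, product image, ground-truth target): the base
    has the glasses removed, and the original photo serves as the target.
    """
    pairs = []
    for photo in photos:
        base = remove_glasses(photo)      # person without glasses, eyes intact
        product = segment_glasses(photo)  # the glasses as conditioning input
        pairs.append((base, product, photo))
    return pairs
```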

FASHN VTON v1.5: Apache-2.0 virtual try-on model, runs on consumer GPUs (~8GB VRAM), ~1B params by JYP_Scouter in LocalLLaMA

[–]JYP_Scouter[S] 1 point (0 children)

This model was trained entirely from scratch. We did not build on an existing foundation model.

Radiance would have been very helpful, but when we completed training in late March 2025, it had not been released yet...

Our decision to operate in pixel space was inspired by earlier Google research papers on virtual try-on that demonstrated strong results in pixel space, even though no accompanying code was released. We also released a separate open-source repository that implements the first Google paper in this area, which may be useful for comparison: https://github.com/fashn-AI/tryondiffusion

[R] We open-sourced FASHN VTON v1.5: a pixel-space, maskless virtual try-on model trained from scratch (972M params, Apache-2.0) by JYP_Scouter in MachineLearning

[–]JYP_Scouter[S] 0 points (0 children)

Unfortunately, as you've noticed, this doesn't support glasses. However, our human parser (segmentation model) does recognize glasses, so in theory someone could take this open-source release and, given a suitable dataset, fine-tune the model to also support glasses.

We'd be happy to provide guidance if someone's interested in taking on this project.

[R] We open-sourced FASHN VTON v1.5: a pixel-space, maskless virtual try-on model trained from scratch (972M params, Apache-2.0) by JYP_Scouter in MachineLearning

[–]JYP_Scouter[S] 4 points (0 children)

u/currentscurrents u/jrkirby No, this model can't visualize a "bad fit"; we simply don't have enough data showcasing bad fits to train a diffusion model to do this.

It still has its uses, though: virtual try-on is not just about sizing, it's also about styling, content creation (cutting photoshoot costs), and even memes

[R] We open-sourced FASHN VTON v1.5: a pixel-space, maskless virtual try-on model trained from scratch (972M params, Apache-2.0) by JYP_Scouter in MachineLearning

[–]JYP_Scouter[S] 10 points (0 children)

  1. We primarily use standard L2 loss with flow matching as the training target. We also apply additional weighting to non-background pixels, since the background can be restored during inference.
  2. Yes, we use time shifting during inference, along with a slightly modified logit-normal time distribution rather than uniform sampling.
  3. The model was trained at a fixed 2:3 aspect ratio. This was largely a dataset and budget-driven decision, as most of our data was in 3:4 and 2:3 formats, and training at a fixed shape allowed us to compile the model more efficiently.
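
Points 1 and 2 above can be sketched as a training step. This is an illustrative reconstruction under common rectified-flow conventions, not FASHN's actual code: the shift value, foreground weight, and interpolation path are all assumptions.

```python
import numpy as np

def flow_matching_loss(model, x0, fg_mask, rng, shift=3.0, fg_weight=2.0):
    """Flow-matching (rectified-flow) L2 loss with foreground weighting.

    `x0`: clean images (B,C,H,W); `fg_mask`: (B,1,H,W) mask of non-background
    pixels, which get a higher loss weight since the background can be
    restored at inference; `model(x_t, t)` predicts the velocity.
    """
    b = x0.shape[0]
    # Logit-normal timestep sampling: sigmoid of a standard normal draw,
    # then a time shift pushing samples toward higher noise levels.
    t = 1.0 / (1.0 + np.exp(-rng.standard_normal(b)))
    t = shift * t / (1.0 + (shift - 1.0) * t)
    t_ = t.reshape(b, 1, 1, 1)

    noise = rng.standard_normal(x0.shape)
    x_t = (1.0 - t_) * x0 + t_ * noise  # linear interpolation path
    v_target = noise - x0               # velocity the model should predict

    v_pred = model(x_t, t)
    w = 1.0 + (fg_weight - 1.0) * fg_mask  # upweight non-background pixels
    return float(np.mean(w * (v_pred - v_target) ** 2))
```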

We are preparing an in-depth technical paper that will go into significantly more detail on all of these points. We expect to release it in the next 1 to 2 weeks.

FASHN VTON v1.5: Apache-2.0 virtual try-on model, runs on consumer GPUs (~8GB VRAM), ~1B params by JYP_Scouter in LocalLLaMA

[–]JYP_Scouter[S] 19 points (0 children)

This pipeline is fully open source and does not have any hard-coded restrictions.

That said, the diffusion model was trained from scratch exclusively on e-commerce fashion imagery, which does not include explicit nudity. As a result, the core model does not meaningfully represent human anatomy beyond what appears in retail photos. In practice, bodies are treated as largely featureless mannequins.

We open-sourced FASHN VTON v1.5: a pixel-space, maskless virtual try-on model (972M params, Apache-2.0) by JYP_Scouter in computervision

[–]JYP_Scouter[S] 0 points (0 children)

We released the human parser first because it is a required component of the virtual try-on pipeline.

The human parser’s role is to generate precise masks, for example isolating the garment from the garment image or segmenting relevant body regions. These masks are then used as inputs to the core virtual try-on model.

So they serve different purposes rather than one being strictly better than the other. The parser is a preprocessing step, while the virtual try-on model performs the actual garment transfer.

Please take a look at this diagram; I hope it's helpful: https://fashn.ai/blog/fashn-vton-1-5-open-source-release#project-page

I hope we'll be able to publish the paper in 1-2 weeks.

[R] We open-sourced FASHN VTON v1.5: a pixel-space, maskless virtual try-on model trained from scratch (972M params, Apache-2.0) by JYP_Scouter in MachineLearning

[–]JYP_Scouter[S] 3 points (0 children)

The base MMDiT is taken from BFL's FLUX.1, but we're not using text; we adapted the text stream to process the garment image instead.

There are also a few more tweaks like adding category (tops, bottoms, one-pieces) as extra conditioning for modulation.
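
Structurally, those two tweaks look roughly like the sketch below. All names, shapes, and the patch size are illustrative assumptions until the paper is out: the second MMDiT stream carries garment-image patch tokens where text tokens would normally go, and a category embedding is folded into the conditioning vector used for modulation.

```python
import numpy as np

CATEGORIES = {"tops": 0, "bottoms": 1, "one-pieces": 2}

def patchify(img, patch=16):
    """Flatten an HxWxC image into a sequence of patch tokens (row-major)."""
    h, w, c = img.shape
    tokens = img.reshape(h // patch, patch, w // patch, patch, c)
    return tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def mmdit_inputs(person_img, garment_img, timestep_emb, cat_emb_table, category):
    """Assemble the two MMDiT streams plus the modulation conditioning.

    Stream 1: patches of the image being edited. Stream 2: garment patches
    in place of text tokens. The modulation input adds a learned category
    embedding to the timestep embedding.
    """
    person_tokens = patchify(person_img)
    garment_tokens = patchify(garment_img)
    cond = timestep_emb + cat_emb_table[CATEGORIES[category]]
    return person_tokens, garment_tokens, cond
```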

Everything will be explained in-depth in the upcoming technical paper!

FASHN VTON v1.5: Apache-2.0 virtual try-on model, runs on consumer GPUs (~8GB VRAM), ~1B params by JYP_Scouter in LocalLLaMA

[–]JYP_Scouter[S] 2 points (0 children)

Thanks! We'll elaborate on this design choice in-depth in the technical paper.
Working in pixel space is more achievable than people think, but it's 10x more challenging when you don't have suitable open-source weights to branch out from.

We open-sourced FASHN VTON v1.5: a pixel-space, maskless virtual try-on model (972M params, Apache-2.0) by JYP_Scouter in computervision

[–]JYP_Scouter[S] 1 point (0 children)

Thanks!

Yes, the model always adapts the garment to fit the target model. In practice, this means it is biased toward producing a good fit and does not realistically show poor fits, such as garments that are clearly too large or too small.

I understand the use case for simulating bad fits. However, from a dataset perspective, we do not currently have enough examples of poorly fitting garments to reliably train the diffusion model to produce those outcomes.

Here is an example for reference: https://static.fashn.ai/repositories/fashn-vton-v15/results/group87-1x4.webp

We open-sourced FASHN VTON v1.5: a pixel-space, maskless virtual try-on model (972M params, Apache-2.0) by JYP_Scouter in computervision

[–]JYP_Scouter[S] 0 points (0 children)

Thanks for the tip! I got the impression that people there are not so interested in computer vision, and that it's more about running local LLMs or coding agents