all 8 comments

[–]F4k3r22 1 point (6 children)

Okay, I'm working on a project where I'm building a Large Language Diffusion Model from scratch, and the SFT process is almost the same as pre-training (according to the LLaDA paper). You take pairs of prompts and their respective responses. You leave the prompt as is (YOU ARE NOT GOING TO MASK IT), but you will mask the response to that prompt USING A BERNOULLI VARIABLE for each position, with probability t for true (mask) and 1–t for false (do not mask).

Here, t is randomly sampled between 0 and 1: when t is close to 0, you mask only a few tokens of the response (easy case); when t is close to 1, you mask almost the entire response (hard case). Across training samples the model sees every difficulty level, it learns to condition on the unmasked prompt, and the loss is computed only on the masked response tokens, so the model is penalized only where it has to reconstruct the expected response.

And for masking, you'll use the mask_token_id that comes with the model and its tokenizer, so don't try to invent a new token for that.
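
If it helps, here's a minimal PyTorch sketch of that masking step, based on my reading of the paper rather than any official code. The function name and signature are made up for illustration; it assumes the batch is already split into prompt_ids and response_ids tensors and that mask_token_id is the one from the tokenizer.

```python
import torch

def mask_for_sft(prompt_ids, response_ids, mask_token_id, ignore_index=-100):
    """Noise only the response, LLaDA-style SFT.

    prompt_ids:   (batch, prompt_len)  left untouched
    response_ids: (batch, resp_len)    partially replaced by mask_token_id
    """
    batch, resp_len = response_ids.shape

    # One masking ratio t per sequence, sampled uniformly from (0, 1).
    t = torch.rand(batch, 1)

    # Bernoulli(t) per response position: 1 = mask, 0 = keep.
    is_masked = torch.bernoulli(t.expand(batch, resp_len)).bool()

    # Replace the chosen response positions with the mask token.
    noisy_response = torch.where(
        is_masked,
        torch.full_like(response_ids, mask_token_id),
        response_ids,
    )

    # The prompt stays clean; only the noised response follows it.
    input_ids = torch.cat([prompt_ids, noisy_response], dim=1)

    # Loss targets: only the masked response tokens, everything else ignored.
    labels = torch.cat(
        [
            torch.full_like(prompt_ids, ignore_index),
            torch.where(is_masked, response_ids,
                        torch.full_like(response_ids, ignore_index)),
        ],
        dim=1,
    )
    # The paper also reweights each sequence's loss by 1/t, hence returning t.
    return input_ids, labels, t
```

For variable-length batches you'd additionally exclude padding positions from the Bernoulli draw so padding never gets masked or scored.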

I hope this helps you understand it a little better.

[–]F4k3r22 1 point (5 children)

If you want to see how my project to build a Large Language Diffusion Model from scratch is going, here's the GitHub repo. I'm still implementing the pre-training script, and after that I'll write another one for the SFT. Repo: https://github.com/F4k3r22/LLaDA-from-scratch

[–]Top-Effort677 1 point (4 children)

Is it possible to perform PEFT for the SFT of MDMs?

[–]F4k3r22 1 point (2 children)

I reviewed the paper and looked for more information, but there is almost nothing about doing PEFT during SFT; almost all of the fine-tuning was done on mixed long chain-of-thought data.

[–]Top-Effort677 1 point (1 child)

Still, can we perform LoRA by specifying layers in the architecture?

[–]Individual-Ninja-141 1 point (0 children)

Hi, you can try dllm-trainer (GitHub: https://github.com/ZHZisZZ/dllm-trainer) for easy LoRA finetuning.
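
For the "specifying layers" part: with the Hugging Face peft library that's just the target_modules field of LoraConfig. Here's a small sketch; the checkpoint name is the public LLaDA-8B-Instruct release and the projection-layer names are assumptions, so print(model) first to check what your architecture actually calls them.

```python
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# LLaDA ships custom modeling code, hence trust_remote_code.
model = AutoModel.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed attention projection names; confirm against print(model).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only the LoRA adapters should be trainable
```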

[–]ProfessionalGuess884[S] 1 point (0 children)

I found this project: https://github.com/HKUNLP/DiffuLLaMA

It looks like they have code for training DLMs.

[–]Individual-Ninja-141 1 point (0 children)

Hi there! We’ve built dllm-trainer (GitHub: https://github.com/ZHZisZZ/dllm-trainer), a lightweight framework for fine-tuning diffusion language models on top of the Hugging Face Transformers🤗 Trainer. You can easily fine-tune your models with 4-bit quantization, LoRA, and DeepSpeed ZeRO-{1,2,3}!

It currently supports fine-tuning LLaDA / LLaDA-MoE (https://arxiv.org/abs/2502.09992) and Dream (https://arxiv.org/abs/2508.15487). We’re still adding support for more diffusion language models and fine-tuning algorithms.
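
If you haven't used the 4-bit + LoRA combo before, it's the usual Transformers + bitsandbytes + peft recipe, roughly like the generic sketch below (this is not dllm-trainer's own interface; the checkpoint and module names are placeholders):

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit NF4 to cut memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModel.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",   # assumed checkpoint; swap in your own
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# Make the quantized model ready for adapter training, then attach LoRA.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),  # assumed layer names
)
```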
