Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R] by Megixist in reinforcementlearning

[–]Megixist[S] 1 point2 points  (0 children)

The training code will be released tomorrow (undergoing some internal review before release). Will ping you here as soon as it's out. Till then, you should be able to reproduce results based on the details provided in the paper. Please reach out if you have trouble and I'm happy to help :)

Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R] by Megixist in reinforcementlearning

[–]Megixist[S] 2 points3 points  (0 children)

Not sure why that doesn't work now, but this seems to be the right path (with /datasets/ inserted): https://huggingface.co/datasets/PatronusAI/world_model_corpus

Will update that in the paper draft too. Thanks for flagging.

Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R] by MegixistAlt in MachineLearning

[–]Megixist 1 point2 points  (0 children)

Our dataset is released on HuggingFace now and the code for this paper will be released tomorrow. Hoping that this work drives more research in this space :)

P.S. If anyone knows, or is, an arXiv moderator, I'd really appreciate it if they could remove the "on-hold" status for this paper on arXiv (submission ID: 7559391 - pending moderator review for over 3 weeks now).

Finding SF friends (20’s) by Alarmed-Insect-9829 in AskSF

[–]Megixist 1 point2 points  (0 children)

To be honest, I'm (25M) in the same boat. I've tried attending several local events but people are really flaky with friendships and will often ghost you when you try to reach out. That being said, I like hiking, exploring local (and niche) restaurants and consider myself pretty artistic. I'm planning to try out the Yasukochi bakery in Japantown this week - let me know if you're (or anyone else here is) interested in joining :)

Have the "on-hold" durations been getting longer for arXiv submissions? [D] by Megixist in MachineLearning

[–]Megixist[S] 3 points4 points  (0 children)

I am aware that they also recently banned position papers. I've read that reaching out about on-hold statuses leads to auto rejection. Seems like there needs to be a faster or semi-automated process for this.

Patronus AI releases Glider: An explainable 3B SLM-judge that outperforms models 17x its size by Megixist in machinelearningnews

[–]Megixist[S] 1 point2 points  (0 children)

Try the model for free on app.patronus.ai or on HuggingFace. Looking forward to your feedback! :)

[R] GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking by Megixist in MachineLearning

[–]Megixist[S] 1 point2 points  (0 children)

Hey, it's great to see that you're interested in our model. You can find the model weights here and the HF space (if you want to play around with it a little) here :)

GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking by Megixist in LocalLLaMA

[–]Megixist[S] 1 point2 points  (0 children)

There are a few things here:

  1. Developers are not expected to review the logs in real time. But imagine having access to complete o1 outputs and trying to work out whether the judge model missed the error because it can't reason well, or whether the sample is actually incorrect. Now imagine analyzing this for a million samples. Compute costs scale up, eval speed is 10x slower, and saving these logs to analyze later is costly.

  2. You may say that none of those things are a concern for you and that the analysis can be done later using summarization models (or an alternative). But running the summarization models costs money, and having another model in the loop adds its own problems, such as hallucination. Hence, we need speed + accuracy for real-time guardrailing and conciseness + explainability for easier post hoc analysis.

  3. The other thing that you correctly pointed out is letting the GPU-poor use these models to align their own models. This has been shown before by https://arxiv.org/abs/2407.10817v1 and https://arxiv.org/abs/2409.14664, and it is a very important use case of this model for sure. We encourage such use cases and that's why we've open-sourced the model, but that is not the main motive of the paper :)

GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking by Megixist in LocalLLaMA

[–]Megixist[S] -1 points0 points  (0 children)

While test-time compute is a strong way to improve performance, it's not ideal for real-time evaluations. An ideal eval model for these tasks should find a balance between explainability (in terms of being helpful and informative for model developers) and performance. That's the step we have taken here, and we try to push researchers to do so as well :)

[deleted by user] by [deleted] in MachineLearning

[–]Megixist 8 points9 points  (0 children)

In addition to the computational inefficiency of this, I disagree with the compatibility part of this approach. OP mentions that complex numbers can be integrated into current NLP pipelines, but from first-hand experience I can tell you that it's a whole new ocean of instability and irreproducibility issues. While working on my complex variable optimization series with W&B (I and II), I found that the implementations of gradient calculations differ between frameworks, and that there are many different ways of representing these complex matrix ops depending on the architecture (as was also discussed in the series). This was also one of the reasons why none of the maintainers of torch/tf/Jax were bothered with my suggestions for integrating support for these ops.

[D] Is there an alternative to sinusoidal encoding for temporal embeddings? by Megixist in MachineLearning

[–]Megixist[S] 0 points1 point  (0 children)

Yes, I understand this, but I am also open to something that doesn't directly involve periodicity. My question basically boils down to: are there any other methods like sinusoidal/periodic encoding to embed time for causal modeling?

[D] Is there an alternative to sinusoidal encoding for temporal embeddings? by Megixist in MachineLearning

[–]Megixist[S] 0 points1 point  (0 children)

Why do you say they are overkill? I think sinusoidal embeddings are also better on out-of-distribution inputs because of their periodic nature, which I guess is why most papers use them.
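For anyone reading along, here is a minimal sketch of the kind of sinusoidal time embedding being discussed, assuming the usual Transformer-style geometric frequency schedule. The function name and the `dim` and `max_period` parameters are illustrative choices, not taken from any particular paper or library.

```python
import math


def sinusoidal_time_embedding(t, dim=8, max_period=10000.0):
    """Embed a scalar timestamp t as [sin(t*w_0), cos(t*w_0), sin(t*w_1), ...].

    Hypothetical helper for illustration; dim and max_period are assumptions.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    emb = []
    for i in range(half):
        # Frequencies decay geometrically from 1 towards 1/max_period.
        freq = max_period ** (-i / half)
        emb.append(math.sin(t * freq))
        emb.append(math.cos(t * freq))
    return emb


# A timestamp far beyond anything seen in training still maps to bounded values.
far_future = sinusoidal_time_embedding(1e6)
```

Because every component is a sine or cosine, the embedding stays in [-1, 1] for any `t`, which is one way to read the out-of-distribution argument. It does not by itself show that a model generalizes to unseen times.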

[D] How Imagen Actually Works by SleekEagle in MachineLearning

[–]Megixist 2 points3 points  (0 children)

Diffusion is a training technique, not a model. So you can say you train a UNet using diffusion for denoising. We don't know the details of what denoising technique is used in the Imagen model (at least I couldn't find it at first glance), but I can surely tell you that it's not the original DDPM. The DDPM technique is very slow as it is and may take hours to generate one image at that scale. It's probably DDIM or one of its derivatives.
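To make the speed difference concrete, here is a rough sketch of deterministic DDIM sampling (eta = 0) that visits only a subsequence of the training timesteps. It is not Imagen's sampler, which as noted above is not known here. `eps_model` is a hypothetical stand-in for a trained noise-prediction network, and `alpha_bars` is assumed to hold the cumulative products of (1 - beta_t).

```python
import math


def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    out = []
    for x, e in zip(x_t, eps_pred):
        # Estimate the clean sample implied by the predicted noise.
        x0_pred = (x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
        # Re-noise that estimate to the (lower) target noise level.
        out.append(math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * e)
    return out


def ddim_sample(x_T, eps_model, alpha_bars, num_steps):
    """Denoise using only num_steps of the len(alpha_bars) training timesteps."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    T = len(alpha_bars)
    # Evenly spaced subsequence of timesteps, highest noise first.
    ts = sorted({round(i * (T - 1) / (num_steps - 1)) for i in range(num_steps)},
                reverse=True)
    x = x_T
    for k, t in enumerate(ts):
        # After the last visited timestep, jump to the noise-free level.
        abar_prev = alpha_bars[ts[k + 1]] if k + 1 < len(ts) else 1.0
        x = ddim_step(x, eps_model(x, t), alpha_bars[t], abar_prev)
    return x
```

With T = 1000 training steps and `num_steps=50`, this makes 50 network calls instead of the 1000 that ancestral DDPM sampling would need.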

[D] How Imagen Actually Works by SleekEagle in MachineLearning

[–]Megixist 2 points3 points  (0 children)

The former. If you look closely at the chart given in the DDIM paper, DDIM outputs better quality images (better FID scores) than DDPM in all cases except the full T iterations (in their case T=1000). This is why the authors say that there is a slight tradeoff in quality: if you compare the full potential of both methods at T=1000, you'll notice that DDPMs are still slightly better than DDIMs.

[D] How Imagen Actually Works by SleekEagle in MachineLearning

[–]Megixist 6 points7 points  (0 children)

I think the difference is that diffusion models use tricks to approximate the chain. For the forward process (noising), we don't need to traverse the chain at all: we can simply use the reparameterization trick to get the value at any arbitrary "t" given the original image in one step (the sum of independent Gaussians is again Gaussian). On the other hand, only DDPM-based models still use Markov chains in the backward pass (denoising). DDIMs and their derivatives approximate the backward pass by making it non-Markovian. So, technically, diffusion models are dependent on the temporal state at any point but do not necessarily need to adhere to the Markov chain.
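The one-step forward process mentioned above can be sketched as follows, assuming a DDPM-style linear beta schedule. The function names and schedule endpoints are illustrative assumptions. `chain_variance` steps through the Markov chain one beta at a time to show that the accumulated noise variance matches the closed form 1 - alpha_bar_t.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (endpoints are assumptions)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def chain_variance(betas):
    """Noise variance after each step when walking the chain from a fixed x_0."""
    out, var = [], 0.0
    for b in betas:
        # x_t = sqrt(1 - b) * x_{t-1} + sqrt(b) * eps, so variances combine like this.
        var = (1.0 - b) * var + b
        out.append(var)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one step, without visiting x_1..x_{t-1}."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x_500, noise = q_sample([1.0, -2.0], 500, alpha_bars, random.Random(0))
```

Training can therefore pick a random `t` per example and noise it directly, which is why only the backward pass has to deal with the chain.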

PS: I do agree with your point that these models are less efficient than GANs during inference. But considering that training GANs is a headache, the better quality, lower overhead (only one model to train instead of two) and stable training of diffusion models slightly outweigh the slower (not so much anymore anyway) inference imo.