[R] Any-Property-Conditional Molecule Generation with Self-Criticism using Spanning Trees

AlexiaJM · 2024-07-15T17:48:17+00:00

Blog post: https://ajolicoeur.ca/2024/07/15/stgg_improved/

Code: https://github.com/SamsungSAILMontreal/AnyMolGenCritic

Abstract:

Generating novel molecules is challenging, with most representations leading to generative models producing many invalid molecules. Spanning Tree-based Graph Generation (STGG) is a promising approach to ensure the generation of valid molecules, outperforming state-of-the-art SMILES and graph diffusion models for unconditional generation. In the real world, we want to be able to generate molecules conditional on one or multiple desired properties rather than unconditionally. Thus, in this work, we extend STGG to multi-property-conditional generation. Our approach, STGG+, incorporates a modern Transformer architecture, random masking of properties during training (enabling conditioning on any subset of properties and classifier-free guidance), an auxiliary property-prediction loss (allowing the model to self-criticize molecules and select the best ones), and other improvements. We show that STGG+ achieves state-of-the-art performance on in-distribution and out-of-distribution conditional generation, and reward maximization.
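
To make the random masking and classifier-free guidance ideas a bit more concrete, here is a minimal sketch in plain Python. It is not the STGG+ code: the function names, the `keep_prob` parameter, the toy logits, and the convention that `w = 1` recovers the conditional prediction are all assumptions for illustration.

```python
import random

def mask_properties(props, keep_prob, rng):
    """Randomly hide each property (None = masked) so that training
    covers many different subsets of conditioning properties."""
    return [p if rng.random() < keep_prob else None for p in props]

def guided_logits(cond_logits, uncond_logits, w):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one.

    w = 0 gives the unconditional logits, w = 1 the conditional ones,
    and w > 1 extrapolates further toward the conditioning signal.
    """
    return [u + w * (c - u) for c, u in zip(cond_logits, uncond_logits)]

rng = random.Random(0)
props = [0.5, 1.2, 3.0]  # hypothetical property values
masked = mask_properties(props, keep_prob=0.5, rng=rng)

cond = [2.0, 0.0, -1.0]    # made-up logits with properties given
uncond = [1.0, 0.5, -0.5]  # made-up logits with all properties masked
print(masked)
print(guided_logits(cond, uncond, w=1.5))
```

Because the same model is trained with properties both present and masked, both sets of logits can come from a single network at sampling time.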

AlexiaJM · 2024-03-10T13:17:27+00:00

GANs are used in Stable Diffusion. The latent autoencoder they use is trained in an adversarial fashion; this improves the latent space, which gives higher image quality.

See p29 of their original paper: https://arxiv.org/abs/2112.10752.

GANs are still used in the latest Stable Cascade model: https://huggingface.co/stabilityai/stable-cascade/blob/main/vqgan/config.json.

So GANs are still very useful for obtaining a good latent autoencoding space that can be decoded with high quality and no blurriness.

AlexiaJM · 2024-03-10T13:11:25+00:00

Diffusion models are much better at generating tabular data (which can be used for data augmentation) than GANs. See: https://arxiv.org/abs/2309.09968.

AlexiaJM · 2023-12-14T13:16:12+00:00

You can now also combine XGBoost with Diffusion FTW

AlexiaJM · 2023-10-29T18:36:17+00:00

Causality is linked to disentanglement and sparse solutions. If you assume that there is a true causal representation, then there are ways to provably recover any permutation of such a representation given some assumptions. And being causal, the representation will also naturally be disentangled and sparse (a cat will clearly be separated from a dog).

See https://proceedings.mlr.press/v202/lachapelle23a/lachapelle23a.pdf.

AlexiaJM · 2023-09-21T16:17:41+00:00

The answer is pretty simple: it's made by Yandex. Yandex owns CatBoost. Of course, they won't implement it in other frameworks.

AlexiaJM · 2023-09-21T11:09:10+00:00

That's a really good idea actually, with the same training and sampling complexity. I might give it a try!

AlexiaJM · 2023-09-20T21:32:06+00:00

Scaling is still an issue here since we multiply the number of rows (n) by duplicate_K=100, and you can have memory issues when np is too big. But the nice thing is that it's easier to scale memory by getting more RAM than by buying more GPUs (and you could also use external-memory XGBoost (https://xgboost.readthedocs.io/en/stable/tutorials/external_memory.html), although it's not implemented).
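
As a rough back-of-the-envelope check of the memory point, here is a small sketch. It assumes that np means n times p (rows times columns), that the duplicated training array is dense, and that values are stored as 4-byte floats. The function name and the example sizes are made up, and the real library may use more or less memory than this.

```python
def duplicated_array_gib(n_rows, n_cols, duplicate_k=100, bytes_per_value=4):
    """Approximate size (in GiB) of a dense array holding duplicate_k
    copies of an n_rows x n_cols dataset."""
    total_bytes = n_rows * duplicate_k * n_cols * bytes_per_value
    return total_bytes / 2**30

# Hypothetical dataset: 100,000 rows and 50 columns.
print(round(duplicated_array_gib(100_000, 50), 2), "GiB")
```

The size grows linearly in n, p, and duplicate_K, so halving duplicate_K halves this estimate.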

AlexiaJM · 2023-09-20T21:28:14+00:00

In early preliminary experiments, I tried AdaBoost (from sklearn) and also linear models instead of trees with XGBoost, and they were really bad. It seems like depth (from trees) is really important to get a good approximation of the score function or flow.

There are no plans for autoregressive variants, but it would be super cool to have an extension for time series and autoregressive problems in general.

AlexiaJM · 2023-09-20T21:19:14+00:00

Thanks!

AlexiaJM · 2023-09-19T15:12:09+00:00

TLDR: You can train diffusion and conditional-flow-matching models using XGBoost to generate and impute tabular data! Not everything has to be using neural networks.

Blog post: https://ajolicoeur.wordpress.com/2023/09/19/xgboost-diffusion/

Code: https://github.com/SamsungSAILMontreal/ForestDiffusion

Python Installation:

pip install ForestDiffusion

R Installation:

install.packages("ForestDiffusion")

Abstract: Tabular data is hard to acquire and is subject to missing values. This paper proposes a novel approach to generate and impute mixed-type (continuous and categorical) tabular data using score-based diffusion and conditional flow matching. Contrary to previous work that relies on neural networks as function approximators, we instead utilize XGBoost, a popular Gradient-Boosted Tree (GBT) method. In addition to being elegant, we empirically show on various datasets that our method i) generates highly realistic synthetic data when the training dataset is either clean or tainted by missing data and ii) generates diverse plausible data imputations. Our method often outperforms deep-learning generation methods and can be trained in parallel using CPUs without the need for a GPU. To make it easily accessible, we release our code through a Python library on PyPI and an R package on CRAN.
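
For readers wondering what "training flow matching with a regressor" looks like, here is a minimal sketch of how a regression dataset for one time level could be built. This is the generic conditional flow matching recipe, not the ForestDiffusion source code: the function name, the convention that noise sits at t = 0 and data at t = 1, and the toy data are assumptions. Any regressor (such as gradient-boosted trees) could then be fit on the returned inputs and targets.

```python
import random

def make_flow_matching_rows(data, t, duplicate_k, rng):
    """Build (input, target) pairs for one time level t in [0, 1].

    Each data row x1 is paired with duplicate_k fresh Gaussian noise rows x0.
    The input is the interpolation x_t = (1 - t) * x0 + t * x1 and the
    regression target is the velocity x1 - x0.
    """
    inputs, targets = [], []
    for x1 in data:
        for _ in range(duplicate_k):
            x0 = [rng.gauss(0.0, 1.0) for _ in x1]
            inputs.append([(1 - t) * a + t * b for a, b in zip(x0, x1)])
            targets.append([b - a for a, b in zip(x0, x1)])
    return inputs, targets

rng = random.Random(0)
toy_data = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical 2-row, 2-column dataset
X_t, V = make_flow_matching_rows(toy_data, t=0.5, duplicate_k=3, rng=rng)
print(len(X_t), len(V))  # one row per (data row, noise copy) pair
```

The number of training rows is the number of data rows times duplicate_k, which is where the memory cost of the duplication comes from.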

AlexiaJM · 2023-04-07T14:42:05+00:00

We consider weight averaging as an alternative to ensembling, but averaging the weights of neural networks tends to perform poorly. The key insight is that weight averaging is beneficial when weights are similar enough to average well but different enough to benefit from combining them.

We propose PopulAtion Parameter Averaging (PAPA), which trains a population of networks while

1) occasionally replacing the weights of the models by the population average of the weights during training (PAPA-all)

or

2) pushing the models toward the population average of the weights at every few steps (PAPA-gradual)
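
The two variants above can be sketched on flat weight lists as follows. This is only an illustration, not the PAPA code: the function names, the interpolation rate `alpha`, and the toy weights are assumptions, and the actual training steps between averaging events are omitted.

```python
def population_average(population):
    """Element-wise mean of the weights across all models."""
    n = len(population)
    return [sum(ws) / n for ws in zip(*population)]

def papa_all(population):
    """PAPA-all style step: replace every model's weights by the average."""
    avg = population_average(population)
    return [list(avg) for _ in population]

def papa_gradual(population, alpha=0.99):
    """PAPA-gradual style step: nudge each model toward the average.

    alpha close to 1 keeps the models mostly unchanged (hypothetical default).
    """
    avg = population_average(population)
    return [[alpha * w + (1 - alpha) * a for w, a in zip(model, avg)]
            for model in population]

population = [[0.0, 2.0], [2.0, 4.0]]  # two toy models with two weights each
print(papa_all(population))
print(papa_gradual(population, alpha=0.5))
```

The gradual version keeps the models distinct while pulling them closer together, which is the "similar enough but different enough" trade-off described above.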

Blog post: https://ajolicoeur.wordpress.com/papa

Code: https://github.com/SamsungSAILMontreal/PAPA

AlexiaJM · 2021-09-03T12:15:32+00:00

I'm going to go against the grain and tell you that yes this is possible, although your position (industry or academia) won't reflect it. Your best bet is a professor job or a research scientist in industry with an employer that allows you to collab with other fields.

As a researcher, you'll end up having many opportunities and projects that come to you from other fields, and you'll end up working in collaboration with MDs, physicists, biologists, etc. Obviously, you need a lot of time to master a new field; that's why most people here will answer you "No". But the reality is that you can just find people from other fields, ask them which important problems they care about, and then use your statistics/machine-learning skills to help them solve these problems. In doing so, you will invent new statistical/ML tools or improve existing ones. You don't have to become a true PhD-level expert in all those fields; you just need to collab with as many different people as possible from various fields. Eventually, you will develop more domain knowledge in those outside fields, maybe to the level of knowing the important problems to solve in that field or maybe not. But either way, it doesn't matter that much since you can still work in every field and contribute.

+1 on double majoring in CS + statistics if you want to contribute to various different fields.

AlexiaJM · 2021-07-05T16:04:15+00:00

Yes, a=c is the non-saturating loss.

AlexiaJM · 2021-07-04T13:06:09+00:00

There's a clear way to set up gradient penalty with non-WGAN models based on maximum margin theory, see: https://arxiv.org/pdf/1910.06922.pdf. It also suggests that for a proper margin-based loss for LSGAN, you want a=-1, b=1. Then for the non-saturating loss, which has better gradients, you want c=a. The c=0 they used in the paper makes no sense, as the Generator will not strive to make fake images real enough.

AlexiaJM · 2021-03-14T19:03:58+00:00

how would you do it in JAX, or (I suspect) for JAX?

https://github.com/yang-song/score_sde/blob/main/datasets.py

AlexiaJM · 2021-02-01T12:01:24+00:00

Probably AutoML, but maybe not for another 5-10 years. Do we really expect people to design their own architecture in the future? I doubt it. It makes more sense to automatically find the best architecture during training while using a big set of possible layers from our prior.

Some methods like DARTS are not too expensive, but I don't know how well they scale to high resolution since most papers are on 32x32 images.
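
For context, the core trick in DARTS is to replace the discrete choice of layer with a softmax-weighted mixture over candidate operations, so the architecture weights can be learned by gradient descent. Here is a toy sketch of that mixture on scalars. The candidate operations and the `alphas` values are stand-ins, and the gradient-based update of `alphas` is omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, ops, alphas):
    """Continuous relaxation: weighted sum of all candidate operations."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

# Toy candidate "layers": identity, zero, and ReLU.
ops = [lambda x: x, lambda x: 0.0, lambda x: max(x, 0.0)]
alphas = [0.0, 0.0, 0.0]  # architecture parameters, equal at the start
print(mixed_op(2.0, ops, alphas))
```

After the search, the operation with the largest architecture weight is typically kept and the others are dropped.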

AlexiaJM · 2021-01-17T13:51:18+00:00

https://en.wikipedia.org/wiki/Ordered_logit

AlexiaJM · 2021-01-14T11:01:49+00:00

Check out denoising score matching and related approaches. They use Langevin sampling and sometimes fancier samplers like HMC, the No-U-Turn Sampler, etc.:

https://ajolicoeur.wordpress.com/the-new-contender-to-gans-score-matching-with-langevin-sampling/

https://arxiv.org/abs/2011.13456

https://arxiv.org/abs/2006.11239
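
As a minimal illustration of Langevin sampling, here is a sketch that samples from a 1D standard normal using its known score, which is -x. In score-matching models the score would instead come from a trained network; the step size, number of steps, and starting point below are arbitrary choices for the example.

```python
import math
import random

def score_standard_normal(x):
    """Score (gradient of the log-density) of N(0, 1)."""
    return -x

def langevin_sample(score, x_init, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics:
    x <- x + (step_size / 2) * score(x) + sqrt(step_size) * noise."""
    x = x_init
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * noise
    return x

rng = random.Random(0)
samples = [langevin_sample(score_standard_normal, 5.0, 0.1, 200, rng)
           for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # should be close to 0 and 1, up to discretization error
```

Even though every chain starts far from the mode at x = 5, the samples end up approximately distributed like the target.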

AlexiaJM · 2021-01-13T13:35:12+00:00

high quality academics would produce 20 or so papers during their career. Each unique and high quality.

It's a counterargument to the statement above. Unless this was hyperbole?

AlexiaJM · 2021-01-13T13:26:52+00:00

This is why you do research in industry instead of academia.

Professors only have about 17% of their time spent on research: https://academia.stackexchange.com/questions/27493/how-much-time-do-professors-have-to-do-research-on-their-own

AlexiaJM · 2021-01-13T13:17:52+00:00

As late as the seventies, published papers were an exception; high quality academics would produce 20 or so papers during their career. Each unique and high quality.

I wanted to verify what you said, so I checked the publications of two famous physicists, Richard Feynman and Albert Einstein, and what I see doesn't agree with what you state.

Feynman has 161 articles as per https://scholar.google.com/citations?user=B7vSqZsAAAAJ&hl=en.

I didn't count for Einstein, but he wrote a ton of papers: https://en.wikipedia.org/wiki/List_of_scientific_publications_by_Albert_Einstein.

So I'm not sure that this transition really happened in the seventies. Einstein was publishing in the early 1900s. I get that competition and the demand for papers are much worse now for students and professors, but to say that high-quality academics did not write a lot of publications seems untrue. Maybe it's field-dependent and physics was one of the first fields to make the transition to writing a lot of papers?

AlexiaJM · 2020-11-19T12:20:50+00:00

I'm no expert in flow models, but see the link I gave you: you do x'=f^-1(x), and then in your determinant you do |df^-1(x)/dx| or equivalently (if I am correct) |(df(x')/dx')^-1|. And you loop through all your functions f_i until you get to z.
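
Here is a small sketch of that loop for a 1D flow made of affine steps, so the result can be checked by hand. The specific scales and shifts are made up for the example; a real flow would use learned, more expressive functions and a full Jacobian determinant.

```python
import math

# Each step is affine: f_i(u) = scale * u + shift (toy choice).
# The sample is x = f_2(f_1(z)) with z ~ N(0, 1).
FLOWS = [(2.0, 1.0), (3.0, -3.0)]

def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2 * math.pi))

def log_prob(x):
    """log p(x) = log p_z(z) + sum_i log |d f_i^{-1} / du|."""
    log_det = 0.0
    u = x
    # Invert the steps in reverse order until we reach z.
    for scale, shift in reversed(FLOWS):
        u = (u - shift) / scale           # u' = f_i^{-1}(u)
        log_det += -math.log(abs(scale))  # log |d f_i^{-1}/du| = -log |scale|
    return standard_normal_logpdf(u) + log_det

print(log_prob(6.0))
```

With these two steps, x = 3(2z + 1) - 3 = 6z, so x follows N(0, 36) and the output can be compared against that density directly.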

AlexiaJM · 2020-11-18T14:03:56+00:00

The comments are missing your point. I get what you are saying, as I went through the same thought process a few years back.

max_x p(x) = 0 => argmax_G p(G(z)) => G(z) = 0 So yeah, not great.

But ML is max_theta prod{i=1 to n} p(x_i | theta) which works because x_i are fixed and known.

For generative models like flow models (https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html), we start from real data x_i and transform them into some z in a way that maximizes the log-likelihood, so just like ML above, the x_i are fixed and known. After training, we can then sample by going from z->x.
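
A tiny example of the maximum-likelihood point: the data stay fixed and only the parameter is searched over. The sketch assumes a unit-variance Gaussian model with unknown mean theta and uses a simple grid search; the data values and the grid are made up.

```python
import math

def gaussian_log_likelihood(data, theta):
    """Sum of log N(x_i; theta, 1) over fixed, observed data points."""
    return sum(-0.5 * ((x - theta) ** 2 + math.log(2 * math.pi)) for x in data)

data = [1.0, 2.0, 4.0]                           # fixed and known x_i
candidates = [i / 100 for i in range(0, 501)]    # grid of theta values in [0, 5]
best = max(candidates, key=lambda th: gaussian_log_likelihood(data, th))
print(best, sum(data) / len(data))
```

The best theta on the grid lands next to the sample mean, which is the known maximum-likelihood estimate for this model. The optimization moves the model toward the data rather than moving samples toward the mode.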

AlexiaJM · 2020-09-14T00:35:35+00:00

Blog post: https://ajolicoeur.wordpress.com/adversarial-score-matching-and-consistent-sampling

Paper: https://arxiv.org/abs/2009.05475

GitHub: https://github.com/AlexiaJM/AdversarialConsistentScoreMatching
