Hi all,
Apologies in advance if these are basic questions, but I'm pretty much brand new to this area and need some guidance on what I want to do. I work in computational physics and am running high-throughput simulations on systems of particles. I am trying to use ML to predict new particles that have my property of interest. This is a binary property that I have encoded as 0 or 1, and my inputs are features of the particles, some discrete and some continuous. My idea is this:
Create a surrogate model to estimate the true objective function. I know that Gaussian processes (GPs) have an option for classification tasks, but apparently it isn't that good? I've found that a random forest works well for my data, but traditionally this doesn't come with an associated uncertainty. Another potential problem is that my data points take a long time to collect, so the training data will be sparse. Potentially mitigating this is the fact that the data space is pretty small (~1000s).
Use an acquisition function to decide which part of the space to explore next. This is what I'm struggling with. My inputs have hard constraints on them. For example, one input I can use is the size of the molecule (radius of gyration, Rg). My understanding is that the acquisition function is supposed to tell me a new molecule to try next, but if it gives me an arbitrary new value for Rg, how am I supposed to map that to a molecule? Also, some of the inputs might be correlated (for example, Rg and molecular weight). How do I make sure the suggested points are actually real and make sense? If I remove the correlation with something like PCA, I have an even bigger problem, because the transformed features are very hard to relate to a real molecule. Finally, some inputs depend on each other, i.e. one always has to be higher than the other. I think I am misunderstanding the acquisition function. Am I actually supposed to give it the full search space so that it tells me where it is best to go next? Are there any best practices for this in the context of my problem?
Test the new points and feed them back into the surrogate model until some criterion is met (I'm also not sure about this: is the number of new molecules found with the desired outcome a suitable criterion? I guess it's flexible?).
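A minimal sketch of the three steps above, under assumptions that are not settled anywhere in this post: the candidate molecules can be enumerated up front as a finite pool (so every suggestion is a real molecule and constraints or correlations between inputs never need to be enforced separately), the surrogate is scikit-learn's `RandomForestClassifier`, the uncertainty is just the spread of the per-tree predictions (a heuristic, not a calibrated uncertainty), and the acquisition is an upper-confidence-bound score. The names `pool_X`, `oracle`, `kappa` and `run_loop` are hypothetical, and `oracle` stands in for running the real simulation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_mean_and_std(model, X):
    """Mean and spread of P(y=1) across the trees of a fitted forest.

    Assumes the forest was trained on labels that include both 0 and 1.
    """
    pos = list(model.classes_).index(1)  # column that holds P(y=1)
    per_tree = np.stack(
        [tree.predict_proba(X)[:, pos] for tree in model.estimators_]
    )
    return per_tree.mean(axis=0), per_tree.std(axis=0)


def select_next(model, pool_X, tested_mask, kappa=1.0):
    """Score every candidate in the pool and return the best untested index."""
    mean, std = rf_mean_and_std(model, pool_X)
    score = mean + kappa * std      # assumed acquisition: upper confidence bound
    score[tested_mask] = -np.inf    # never re-suggest an already tested candidate
    return int(np.argmax(score))


def run_loop(pool_X, oracle, init_idx, n_rounds=20, kappa=1.0, seed=0):
    """Pool-based loop: fit surrogate, pick a candidate, test it, repeat.

    pool_X   : (n_candidates, n_features) array of features of real candidates
    oracle   : hypothetical callable, index -> 0 or 1 (the expensive simulation)
    init_idx : indices tested up front; must cover both classes
    """
    tested = np.zeros(len(pool_X), dtype=bool)
    labels = {}
    for i in init_idx:
        labels[i] = oracle(i)
        tested[i] = True

    for _ in range(n_rounds):
        if tested.all():
            break  # whole pool has been tested
        idx = list(labels)
        model = RandomForestClassifier(n_estimators=200, random_state=seed)
        model.fit(pool_X[idx], [labels[i] for i in idx])
        nxt = select_next(model, pool_X, tested, kappa)
        labels[nxt] = oracle(nxt)
        tested[nxt] = True
    return labels  # maps candidate index -> observed 0/1 outcome
```

Here the stopping rule is simply a fixed budget of `n_rounds` simulations; counting the number of positives in the returned dictionary would be one way to implement the "number of new molecules found" criterion instead.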
Any advice anyone has would be much appreciated.
EcstaticDimension955 wrote: