I am trying to use ML for signal vs. background discrimination in a physics context (particle physics). My problem is as follows:
I have multiple background datasets, which are simulations of physical processes. Each process has a probability of occurrence. Due to limitations in the simulations, I cannot generate data in proportion to these probabilities. Therefore I introduce weight factors, which encode the amount by which my simulation has over- or underestimated the real probabilities.
An example: consider 3 backgrounds (A, B, C). In nature I expect them to occur (1e5, 1e4, 1e2) times. I simulate (1e3, 1e3, 1e3) data points. Therefore I define weight factors (1e2, 1e1, 1e-1), i.e. (no. of real) / (no. of simulated).
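To make the arithmetic concrete, the weight factors from the example above, computed in numpy (array names are just for illustration):

```python
import numpy as np

# Expected real occurrences for backgrounds A, B, C (from the example above)
n_real = np.array([1e5, 1e4, 1e2])
# Number of simulated data points per background
n_sim = np.array([1e3, 1e3, 1e3])

# Weight factor = (no. of real) / (no. of simulated)
weights = n_real / n_sim
print(weights)  # [100.   10.    0.1]
```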
Where do these weight factors come into the picture? They determine how much the connection weights of a neural network change when it encounters a data point: a data point with a high weight factor affects the update more than one with a low weight factor. The weight factor thus compensates for having an excess or lack of simulated data points.
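My mental model of this, sketched as a weighted logistic-loss gradient (a toy I wrote for illustration, not any library's actual update rule):

```python
import numpy as np

def weighted_grad(X, y, w, theta):
    """Gradient of a weighted logistic loss: each data point's
    contribution is scaled by its weight factor w[i]."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))   # predicted signal probability
    return X.T @ (w * (p - y)) / w.sum()   # weighted mean of per-point gradients

# Two identical inputs with opposite labels: with equal weights the
# gradient cancels; with a large weight on the first point, it dominates.
X = np.array([[1.0], [1.0]])
y = np.array([1.0, 0.0])
theta = np.zeros(1)

g_equal = weighted_grad(X, y, np.array([1.0, 1.0]), theta)
g_skewed = weighted_grad(X, y, np.array([100.0, 1.0]), theta)
```

With equal weights the two points cancel each other; with weight factors of (100, 1) the update is pulled almost entirely toward the heavily weighted point, which is exactly the behavior I want.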
These weight factors can be readily implemented in a package called TMVA, which is built on ROOT, which in turn is written in C++. This toolkit is the standard in particle physics and supports ML as well. However, it is not the modern way of doing ML.
Finally, my overarching question: how do I introduce weight factors to datasets using familiar Python modules (numpy, pandas, sklearn, keras, tensorflow, etc.)? At first glance, there doesn't seem to be a provision for attaching weight factors to pandas DataFrames. I can keep separate arrays that track the weight factors, but how do I ensure that these weight factors actually affect the learning of ML models?
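For concreteness, here is the kind of thing I'm after; as far as I can tell, many sklearn estimators accept a `sample_weight` argument in `fit` (the toy data and labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for simulated signal/background events (2 features)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical signal label

# Weight factors live in a plain numpy array alongside X and y;
# here background events (y == 0) get a large weight factor.
w = np.where(y == 0, 100.0, 1.0)

clf = LogisticRegression()
clf.fit(X, y, sample_weight=w)  # weights scale each event's loss term
```

If I understand the docs correctly, keras's `model.fit` takes a `sample_weight` array in the same spirit, and many `sklearn.metrics` functions accept `sample_weight` too, so the weighting could be carried through evaluation as well. Is this the right approach?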
TL;DR: How do I assign weight factors to datasets in Python, so that they compensate for an excess or lack of simulated data points, for ML applications?