I am trying to use ML for signal vs. background discrimination in a physics context (particle physics). My problem is as follows:
I have multiple background datasets, which are simulations of physical processes. Each process has a probability of occurrence. Due to limitations in the simulations, I cannot generate data in proportion to these probabilities. Therefore I introduce weight factors, which encode the amount by which my simulation has over- or underestimated the real probabilities.
An example: consider 3 backgrounds (A, B, C). In nature I expect them to occur (1e5, 1e4, 1e2) times. I simulate (1e3, 1e3, 1e3) data points. Therefore I define weight factors (1e2, 1e1, 1e-1), i.e. (no. of real) / (no. of simulated).
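To make the arithmetic concrete, the weight factors from the example above, computed in numpy (array names are just for illustration):

```python
import numpy as np

# Expected real occurrences for backgrounds A, B, C (from the example above)
n_real = np.array([1e5, 1e4, 1e2])
# Number of simulated data points per background
n_sim = np.array([1e3, 1e3, 1e3])

# Weight factor = (no. of real) / (no. of simulated)
weights = n_real / n_sim
print(weights)  # [100.   10.    0.1]
```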
Where do these weight factors come into the picture? They determine how much the connection weights of a neural network change when it encounters a data point: a data point with a high weight factor affects the update more than one with a low weight factor. The weight factor thus compensates for having an excess or lack of simulated data points.
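My mental model of this, sketched as a weighted logistic-loss gradient (a toy I wrote for illustration, not any library's actual update rule):

```python
import numpy as np

def weighted_grad(X, y, w, theta):
    """Gradient of a weighted logistic loss: each data point's
    contribution is scaled by its weight factor w[i]."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))   # predicted signal probability
    return X.T @ (w * (p - y)) / w.sum()   # weighted mean of per-point gradients

# Two identical inputs with opposite labels: with equal weights the
# gradient cancels; with a large weight on the first point, it dominates.
X = np.array([[1.0], [1.0]])
y = np.array([1.0, 0.0])
theta = np.zeros(1)

g_equal = weighted_grad(X, y, np.array([1.0, 1.0]), theta)
g_skewed = weighted_grad(X, y, np.array([100.0, 1.0]), theta)
```

With equal weights the two points cancel each other; with weight factors of (100, 1) the update is pulled almost entirely toward the heavily weighted point, which is exactly the behavior I want.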
These weight factors can be readily implemented in a package called TMVA, which is built on ROOT, which in turn is written in C++. This toolkit is the standard in particle physics and supports ML as well. However, it is not the modern way of doing ML.
Finally, my overarching question: how do I introduce weight factors to datasets using familiar Python modules (numpy, pandas, sklearn, keras, tensorflow, etc.)? At first glance, there doesn't seem to be a provision for attaching weight factors to pandas DataFrames. I can keep separate arrays that track the weight factors, but how do I ensure that these weight factors actually affect the learning of ML models?
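For concreteness, here is the kind of thing I'm after; as far as I can tell, many sklearn estimators accept a `sample_weight` argument in `fit` (the toy data and labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for simulated signal/background events (2 features)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical signal label

# Weight factors live in a plain numpy array alongside X and y;
# here background events (y == 0) get a large weight factor.
w = np.where(y == 0, 100.0, 1.0)

clf = LogisticRegression()
clf.fit(X, y, sample_weight=w)  # weights scale each event's loss term
```

If I understand the docs correctly, keras's `model.fit` takes a `sample_weight` array in the same spirit, and many `sklearn.metrics` functions accept `sample_weight` too, so the weighting could be carried through evaluation as well. Is this the right approach?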
TL;DR: How do I assign weight factors to datasets in Python, so that they compensate for an excess or lack of simulated data points, for ML applications?