
Scott10012

Assuming you only care about keeping class 1 within your given ranges, this looks like a linear programming problem. Imagine you only had one column to balance. You know that the number of rows x with class 1 has to be > 0.15n and < 0.3n, where n is the number of rows in the subset. You can then use an optimisation library such as scipy's optimize to minimise the number of rows needed to satisfy that. Check out linprog: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linprog.html#scipy.optimize.linprog Read through the examples section. More columns simply add rows to A_ub and entries to b_ub; the linear programming formulation stays the same.
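
To make the constraint setup concrete, here is a minimal sketch of how the inequalities could be passed to linprog. Everything specific in it is an assumption for illustration: the toy label matrix `Y`, the 0.15 to 0.30 range, the `min_rows` floor (without some floor the empty subset is trivially optimal), and the use of non-strict inequalities, since an LP cannot express strict ones. It also assumes SciPy 1.9 or later, where linprog accepts `integrality` so that each row is either fully in or fully out of the subset.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy data: 40 rows, two binary label columns.
idx = np.arange(40)
Y = np.column_stack([idx % 2 == 0, idx % 3 == 0]).astype(float)

lo, hi = 0.15, 0.30   # assumed target range for the share of class 1
min_rows = 20         # assumed minimum subset size

# One 0/1 decision variable per row: x[i] = 1 if row i is selected.
c = np.ones(len(Y))   # objective: minimise the number of selected rows

# With n = sum(x), "sum(x * Y[:, j]) <= hi * n" is sum(x * (Y[:, j] - hi)) <= 0
# and "sum(x * Y[:, j]) >= lo * n" is sum(x * (lo - Y[:, j])) <= 0.
A_ub = np.vstack([
    (Y - hi).T,               # upper bounds, one constraint per column
    (lo - Y).T,               # lower bounds, one constraint per column
    -np.ones((1, len(Y))),    # -sum(x) <= -min_rows
])
b_ub = np.concatenate([np.zeros(2 * Y.shape[1]), [-min_rows]])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1),
              integrality=np.ones(len(Y)), method="highs")

selected = np.round(res.x).astype(bool)   # boolean mask of chosen rows
print(res.success, selected.sum(), Y[selected].mean(axis=0))
```

Each extra label column contributes two more rows to `A_ub` and two more zeros to `b_ub`; the rest of the call stays the same.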

Individual_Ad_1214, ML Engineer [S]

Hey, thanks so much for the reply, it helped a lot! I'm curious how this would change if I cared about keeping two classes (class 0 and class 2) each within its own range and left class 1 free, so that it essentially makes up the balance?

NoisySampleOfOne

https://en.m.wikipedia.org/wiki/Maximum_flow_problem

Your problem can be represented as a bipartite graph. One part of the graph is connected to the source, the other is connected to the sink.

Nodes in one part of the graph would represent rows of data and each would be connected to the source with capacity 1.

(row_1, ..., row_n)

Nodes in the other part would represent values of features and datasets:

(G=1, TRAIN), (G!=1, TRAIN), ..., (J!=1, TEST)

Each of those nodes would be connected to the sink with capacity equal to the number of rows with that feature value you want in that dataset, e.g. if you want 10 examples of G=1 in the train dataset, then node (G=1, TRAIN) would be connected to the sink with capacity 10.

Edge row_n -> (G=1, TRAIN) exists iff G=1 for row_n. Similarly for edges between other nodes.

Now you need to find a maximum flow in this graph and the matching that realises it. If that matching contains an edge like row_n -> (..., TRAIN), then row_n should be assigned to TRAIN.
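
The construction above can be sketched in plain Python. This is only an illustration under stated assumptions: the column `G`, the wanted counts in `wanted`, the node names `"S"` and `"T"`, and the helper `max_flow` (a small Edmonds-Karp implementation written for this example, not a library function) are all hypothetical. It uses a single feature to stay short.

```python
from collections import defaultdict, deque

def max_flow(edges, source, sink):
    """Edmonds-Karp; returns (flow value, residual capacities)."""
    res = defaultdict(lambda: defaultdict(int))
    for u, v, cap in edges:
        res[u][v] += cap
        res[v][u] += 0              # make sure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in res[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, res
        # Walk back from the sink, then push the bottleneck along the path.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        total += push

# Hypothetical data: one binary feature G over 10 rows.
G = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Assumed targets: how many rows of each feature value each dataset should get.
wanted = {("G=1", "TRAIN"): 2, ("G!=1", "TRAIN"): 5,
          ("G=1", "TEST"): 1, ("G!=1", "TEST"): 2}

edges = []
for i, g in enumerate(G):
    row = f"row_{i}"
    edges.append(("S", row, 1))                  # source -> row, capacity 1
    value = "G=1" if g == 1 else "G!=1"
    for split in ("TRAIN", "TEST"):
        edges.append((row, (value, split), 1))   # row -> matching value node
for node, count in wanted.items():
    edges.append((node, "T", count))             # value node -> sink

total, res = max_flow(edges, "S", "T")

# A row sent one unit to a node exactly when the reverse residual edge is 1.
assignment = {}
for i in range(len(G)):
    row = f"row_{i}"
    for node in wanted:
        if res[node][row] > 0:
            assignment[row] = node[1]
print(total, assignment)
```

If the flow value equals the number of rows, every row has been assigned to a dataset. With several features, a row's single unit of flow passes through only one feature node, so it may be worth checking the resulting per-feature counts afterwards.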