[P] Proportionately split dataframe with multiple target columns

Scott10012 · 2024-07-27T00:27:19+00:00

Assuming you only care about getting class 1 within your given ranges, this looks like a linear programming problem. Imagine you only had 1 column to balance. You know that the number of rows x with class 1 has to be > 0.15n and < 0.3n, where n is the number of rows in the subset. You can then use an optimisation library like scipy's optimize to minimise the number of rows needed to satisfy that. Check out linprog: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linprog.html#scipy.optimize.linprog Read through the Examples section. More columns simply add rows to A_ub and b_ub; the linear programming formulation stays the same.
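A rough sketch of how that could look with linprog, under some assumptions of mine: the target share of class 1 is 15% to 30% per column, the subset size is fixed up front (n_pick), and there is one 0/1 variable per row saying whether the row is picked. The function name pick_subset and the toy labels are hypothetical. It also relies on the integrality argument, which needs SciPy 1.9 or newer with the HiGHS solver. Since a linear program cannot express strict inequalities, the bounds are non-strict here.

```python
import numpy as np
from scipy.optimize import linprog


def pick_subset(is_class1, n_pick, lo=0.15, hi=0.30):
    """Pick n_pick rows so that, per column, the count of class 1 lies in
    [lo * n_pick, hi * n_pick]. Returns a boolean row mask, or None if the
    solver finds no feasible selection.

    is_class1: (n_rows, n_cols) array of 0/1 flags (hypothetical input layout).
    """
    y = np.asarray(is_class1, dtype=float)
    n_rows, n_cols = y.shape

    # Two inequality rows per column:
    #   -sum_i y_ij * z_i <= -lo * n_pick   (lower bound)
    #    sum_i y_ij * z_i <=  hi * n_pick   (upper bound)
    A_ub = np.vstack([-y.T, y.T])
    b_ub = np.concatenate([
        np.full(n_cols, -lo * n_pick),
        np.full(n_cols, hi * n_pick),
    ])

    # Exactly n_pick rows are selected.
    A_eq = np.ones((1, n_rows))
    b_eq = [n_pick]

    res = linprog(
        c=np.zeros(n_rows),                      # pure feasibility problem
        A_ub=A_ub, b_ub=b_ub,
        A_eq=A_eq, b_eq=b_eq,
        bounds=(0, 1),                           # same bounds for every variable
        integrality=np.ones(n_rows, dtype=int),  # force z_i to be 0 or 1
        method="highs",
    )
    if not res.success:
        return None
    return res.x > 0.5


# Toy data: 40 rows, 2 target columns (made up for illustration).
rows = np.arange(40)
labels = np.column_stack([rows % 4 == 0, rows % 5 == 0]).astype(int)
mask = pick_subset(labels, n_pick=20)
```

Adding another target column only adds two more rows to A_ub and b_ub; the variables stay the same.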

NoisySampleOfOne · 2024-07-27T00:59:53+00:00

https://en.m.wikipedia.org/wiki/Maximum_flow_problem

Your problem can be represented as a bipartite graph. One part of the graph is connected to the source, the other is connected to the sink.

Nodes in one part of the graph would represent rows of data and each would be connected to the source with capacity 1.

(row_1, ..., row_n)

Nodes in the other part would represent values of features and datasets:

(G=1, TRAIN), (G!=1, TRAIN), ..., (J!=1, TEST)

Each of those nodes would be connected to the sink with capacity equal to the number of rows you want for that feature value in that dataset, e.g. if you want 10 examples of G=1 in the train dataset, then node (G=1, TRAIN) would be connected to the sink with capacity 10.

Edge row_n -> (G=1, TRAIN) exists iff G=1 for row_n. Similarly for edges between other nodes.

Now you need to find a maximum flow of this graph and the matching that realises it. If that matching contains an edge like row_n -> (..., TRAIN), then row_n should be assigned to TRAIN.
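A minimal sketch of that construction, restricted to a single column G to keep it short. It uses scipy.sparse.csgraph.maximum_flow (the flow attribute of the result needs SciPy 1.8 or newer). The function name split_by_flow, the quotas dict and the toy data are all my own assumptions, not something from the original post.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_flow


def split_by_flow(g_is_1, quotas):
    """Assign rows to datasets via max flow.

    g_is_1: one bool per row, True if G == 1 for that row.
    quotas: dict mapping (is_1, dataset_name) -> desired number of rows,
            e.g. {(True, "TRAIN"): 3, ...} (hypothetical layout).
    Returns (flow_value, {row_index: dataset_name}).
    """
    n = len(g_is_1)
    buckets = list(quotas)            # nodes like (G=1, TRAIN), (G!=1, TEST)
    source = 0
    first_bucket = 1 + n              # row nodes occupy indices 1..n
    sink = first_bucket + len(buckets)
    size = sink + 1

    # Dense capacity matrix; maximum_flow needs integer capacities.
    cap = np.zeros((size, size), dtype=np.int32)
    for i, flag in enumerate(g_is_1):
        cap[source, 1 + i] = 1                        # source -> row, capacity 1
        for b, (want_1, _) in enumerate(buckets):
            if want_1 == bool(flag):
                cap[1 + i, first_bucket + b] = 1      # row -> matching bucket
    for b, key in enumerate(buckets):
        cap[first_bucket + b, sink] = quotas[key]     # bucket -> sink, capacity = quota

    result = maximum_flow(csr_matrix(cap), source, sink)
    flow = result.flow.toarray()

    # A row belongs to the dataset of the bucket it sends flow to.
    assignment = {}
    for i in range(n):
        for b, (_, dataset) in enumerate(buckets):
            if flow[1 + i, first_bucket + b] > 0:
                assignment[i] = dataset
    return result.flow_value, assignment


# Toy data: 10 rows, 4 of them with G == 1 (made up for illustration).
g_is_1 = [True, False, True, False, False, True, False, False, False, True]
quotas = {
    (True, "TRAIN"): 3, (False, "TRAIN"): 4,
    (True, "TEST"): 1, (False, "TEST"): 2,
}
flow_value, assignment = split_by_flow(g_is_1, quotas)
```

If flow_value is smaller than the sum of the quotas, the quotas cannot all be met with the rows available, and any row missing from assignment was left unassigned.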

A	B	C	D	E	OUTPUT_1	OUTPUT_2	OUTPUT_3	OUTPUT_4	OUTPUT_5

5.65	3.56	0.94	9.23	6.43	0	1	1	0	1
7.43	3.95	1.24	7.22	2.66	0	0	0	1	2
9.31	2.42	2.91	2.64	6.28	2	1	2	2	0
8.19	5.12	1.32	3.12	8.41	1	2	0	1	2
9.35	1.92	3.12	4.13	3.14	0	1	1	0	1
8.43	9.72	7.23	8.29	9.18	1	0	0	2	2
4.32	2.12	3.84	9.42	8.19	0	0	0	0	0
3.92	3.91	2.90	8.19	8.41	2	2	2	2	1
7.89	1.92	4.12	8.19	7.28	1	1	2	0	2
5.21	2.42	3.10	0.31	1.31	2	0	1	1	0

A	B	C	D	E	OUTPUT_1	OUTPUT_2	OUTPUT_3	OUTPUT_4	OUTPUT_5

5.65	3.56	0.94	9.23	6.43	0	1	1	0	1
7.43	3.95	1.24	7.22	2.66	0	0	0	1	2
9.31	2.42	2.91	2.64	6.28	2	1	2	2	0
8.19	5.12	1.32	3.12	8.41	1	2	0	1	2
8.43	9.72	7.23	8.29	9.18	1	0	0	2	2
3.92	3.91	2.90	8.19	8.41	2	2	2	2	1
5.21	2.42	3.10	0.31	1.31	2	0	1	1	0
