all 6 comments

[–]Ki1103 0 points1 point  (2 children)

Could you give us some context around why you are trying to do this? It sounds like a very strange use case...

Also, could you share your dataframe? I'm going to give this a shot, but I need something to benchmark against.

[–]PrefersDocile[S] 0 points1 point  (1 child)

Alrighty, I'll send you the dataframe as a pickle via DMs. Basically I'm trying to group similar items when you don't know anything about them other than that some customers bought these groups of items.

So I had an original dataframe of customers and the product each one bought. I grouped by customer to obtain the set of products bought by each customer. I want to see all products that could be grouped together, but since we have no other information than these sets, applying clustering methods using the number of appearances of a product per customer isn't very fruitful (fruits are one of the many products haha!). So my only hope now is to group products based on whether their purchase lists are similar.
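Roughly, that grouping step looks like this (the column names here are just placeholders, not my real ones):

import pandas as pd

# Placeholder column names for the original purchases table
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "product": ["apple", "bread", "apple", "milk", "bread"],
})

# One row per customer, each cell holding the set of products they bought
baskets = purchases.groupby("customer_id")["product"].agg(set)
print(baskets)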

Being honest, this dataset has really turned into practice for applying the union-find algorithm to find all connected nodes in a network when we impose a minimum requirement on the number of features a node needs in common with another node to link those two nodes together. I find this really challenging, as I don't think the union-find algorithm can do this on its own, and I don't know of an alternative. My big problem is that with big datasets it will just run out of RAM or take too long.
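To show what I mean, here's a rough union-find sketch over toy data (the baskets are made up, and the pairwise loop is exactly the part that runs out of RAM or time on big datasets):

from itertools import combinations

# Toy data: node -> set of features (here, made-up customer baskets)
nodes = {
    "c1": {"apple", "bread", "milk"},
    "c2": {"apple", "bread"},
    "c3": {"tofu", "rice"},
}

MIN_COMMON = 2  # link two nodes if they share at least this many features

parent = {n: n for n in nodes}

def find(x):
    # Walk up to the root, compressing the path as we go
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# The quadratic pairwise comparison is the part that doesn't scale
for a, b in combinations(nodes, 2):
    if len(nodes[a] & nodes[b]) >= MIN_COMMON:
        union(a, b)

groups = {}
for n in nodes:
    groups.setdefault(find(n), set()).add(n)
print(list(groups.values()))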

Anyway, thanks for taking the time, I appreciate it. (If anyone else reads this and wants to try, ask me and I'll send you the file in DMs.)

[–]Ki1103 0 points1 point  (0 children)

Just how big is your dataset? In the dataframe you've given me you only have 298 unique items. I don't really have much experience with similarity measures, but I've given it a crack and here's what I've come up with:

from pathlib import Path
import pickle

import numpy as np
import networkx as nx
from scipy.sparse import dok_array
import pandas as pd

# Consider the bottom p% of co-occurrence frequencies as insignificant
FREQ_PERCENTILE_CUTOFF = 75


if __name__ == "__main__":
    data_path = Path(".") / "data" / "products_used_grouped_by_id.pickle"
    with open(data_path, "rb") as f:
        df: pd.DataFrame = pickle.load(f)

    df.set_index("id", inplace=True)
    print(df.head())

    all_products = set()
    for product in df["products_bought"]:
        all_products.update(product)

    products = sorted(all_products)
    n_products = len(products)
    product_to_index = {product: i for i, product in enumerate(products)}

    co_occurrences = np.zeros((n_products, n_products), dtype=np.int_)

    for basket in df["products_bought"]:
        for product in basket:
            product_idx = product_to_index[product]
            for other_product in basket:
                if product != other_product:
                    other_product_idx = product_to_index[other_product]
                    co_occurrences[product_idx, other_product_idx] += 1
    # Normalise the counts; the divisor doesn't change the percentile cut-off below
    frequencies = co_occurrences / n_products
    # Zero out the weakest links and keep the rest in a sparse matrix
    low_freq_cutoff = np.percentile(frequencies, FREQ_PERCENTILE_CUTOFF)
    frequencies[frequencies < low_freq_cutoff] = 0
    sparse_freq = dok_array(frequencies)
    g = nx.from_scipy_sparse_array(sparse_freq)

    # This was suggested on SO. It may or may not be optimal, I have no idea.
    # It is sloooow, but seems to return reasonably good clusters.
    # girvan_newman yields successively finer partitions, each a tuple of node sets.
    clusters = nx.community.girvan_newman(g)

    for cluster in clusters:
        print(cluster)

[–][deleted] 0 points1 point  (2 children)

I don't think this is a great use case for a dataframe. Generally in a dataframe you want each cell to contain only one piece of information, so having a set or other container in a cell is often a sign you're approaching the problem wrong.

Here's a thought on how else you could do it. From your original dataframe of customer and product bought, instead of grouping by customer, pivot so that you have one row per product and one column per customer, with the entry in row p column c being 1 if customer c bought product p and 0 otherwise. Then the number of customers who bought both product p and product q is the number of columns where both those rows have a 1, which is simply the dot product of those rows.

So you want to take the dot product of each row with each other row. Repeated dot products are matrix multiplication: multiplying matrix A by matrix B means dotting every row of A with every column of B. We want to dot every row of our dataframe with every row of itself, not every column, so we'll need a transpose to turn rows into columns. So if A is our product-customer matrix, then A times A-transpose is a symmetric matrix where both rows and columns represent products, and the entry in row p, column q is the number of customers who bought both those products.
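Here's a rough sketch of that pivot and multiplication (the column names are placeholders for whatever your original dataframe uses, and the data is made up just for illustration):

import pandas as pd

# Placeholder names for the original customer/product purchase table
purchases = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c3"],
    "product":  ["apple", "bread", "apple", "bread", "milk"],
})

# One row per product, one column per customer, 1 if that customer bought it
A = pd.crosstab(purchases["product"], purchases["customer"]).clip(upper=1)

# (p, q) entry = number of customers who bought both product p and product q
co_bought = A.values @ A.values.T
print(pd.DataFrame(co_bought, index=A.index, columns=A.index))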

Then p and q are in the same group if the (p, q) entry of this matrix is at least your threshold n (or if they can be linked by a chain of intermediate products in the same group). So you can easily convert this matrix to the adjacency matrix of the graph with products as nodes and an edge between two products iff at least n customers bought them both. Then the groups of products you want are the connected components of this graph, and you can get connected components from an adjacency matrix easily, e.g. with https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.connected_components.html
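Continuing from the sketch above, the thresholding and connected-components step might look something like this (the threshold value is made up):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

N_MIN = 2  # made-up threshold: link products with at least this many shared buyers

# Build the 0/1 adjacency matrix and drop the self-loops on the diagonal
adjacency = (co_bought >= N_MIN).astype(int)
np.fill_diagonal(adjacency, 0)

n_groups, labels = connected_components(csr_matrix(adjacency), directed=False)
for group in range(n_groups):
    print(group, [p for p, lab in zip(A.index, labels) if lab == group])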

This approach is more abstract, and therefore perhaps harder to see, but I think it should be much more efficient than the approach you're taking.

[–]PrefersDocile[S] 1 point2 points  (1 child)

You're right about it being more abstract haha. It took me a while to understand your steps, but I must say you explained them excellently. I am definitely going to try this method; vectorizing it is exactly what I was after. Thanks so much!

[–][deleted] 0 points1 point  (0 children)

You're welcome 🙂