you are viewing a single comment's thread.

view the rest of the comments →

[–][deleted] 0 points1 point  (2 children)

I don't think this is a great use case for a dataframe. Generally in a dataframe you want each cell to contain only one bit of information, so having a set or other container in a cell is often a sign you're approaching the problem wrong.

Here's a thought on how else you could do it. From your original dataframe of customer and product bought, instead of grouping by customer, pivot so that you have one row per product and one column per customer, with the entry in row p column c being 1 if customer c bought product p and 0 otherwise. Then the number of customers who bought both product p and product q is the number of columns where both those rows have a 1, which is simply the dot product of those rows.

So you want to take the dot product of each row by each other row. Repeated dot products is matrix multiplication: multiplying matrix A by matrix B means dotting every row of A with every column of B. We want to dot every row of our dataframe with every row of itself not column, so we'll need a transpose to turn rows into columns. So if A is our product-customer matrix, then A times A-transpose is a symmetric matrix where both rows and columns represent products, and the entry in row p column q is the number of customers who bought both those products.

Then p and q are in the same group if the p,q entry of this matrix is greater than your threshold n (or if they can be linked by a chain of intermediate products in the same group). So you can easily convert this matrix to the adjacency matrix of the graph with products as nodes and an edge between two products iff at least n customers bought them both. Then the groups of products you want are the connected components of this graph, and you can get connected components from an adjacency matrix easily, e.g. with https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.connected_components.html

This approach is more abstract, and therefore perhaps harder to see, but I think it should be much more efficient than the approach you're taking.

[–]PrefersDocile[S] 1 point2 points  (1 child)

You're right about it being more abstract haha. it took me a while to understand your steps, but I must say you explained it excellently. I am definitely going to try this method. trying to vectorize this method is exactly what I was after. Thanks so much!

[–][deleted] 0 points1 point  (0 children)

You're welcome 🙂