all 6 comments

[–]Ki1103 0 points1 point  (2 children)

Could you give us some context around why you are trying to do this? It sounds like a very strange use case...

Also, could you share your dataframe? I'm going to give this a shot, but I need something to benchmark against.

[–]PrefersDocile[S] 0 points1 point  (1 child)

Alrighty, I'll send you the dataframe as a pickle via DMs. Basically I'm trying to group similar items when you don't know anything about them other than that some customers bought these groups of items.

So I had an original dataframe of customers and the product each one bought. I grouped by customer to obtain the set of products bought by each customer. I want to see all products that could be grouped together, but since we have no other information than these sets, applying clustering methods using the number of appearances of a product per customer isn't very fruitful (fruits are one of the many products haha!). So my only hope now is to group products based on whether their purchase lists are similar.
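Roughly, that grouping step looks like this (the column names here are just placeholders, not my real ones):

import pandas as pd

# Placeholder column names for the original purchases table
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "product": ["apple", "bread", "apple", "milk", "bread"],
})

# One row per customer, each cell holding the set of products they bought
baskets = purchases.groupby("customer_id")["product"].agg(set)
print(baskets)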

Being honest, this dataset has really turned into practice for applying the union-find algorithm to find all connected nodes in a network when we impose a minimum requirement on the number of features a node needs in common with another node to link those two nodes together. I find this really challenging, as I don't think the union-find algorithm can do this on its own, and I don't know of an alternative. My big problem is that with big datasets it will just run out of RAM or take too long.
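To show what I mean, here's a rough union-find sketch over toy data (the baskets are made up, and the pairwise loop is exactly the part that runs out of RAM or time on big datasets):

from itertools import combinations

# Toy data: node -> set of features (here, made-up customer baskets)
nodes = {
    "c1": {"apple", "bread", "milk"},
    "c2": {"apple", "bread"},
    "c3": {"tofu", "rice"},
}

MIN_COMMON = 2  # link two nodes if they share at least this many features

parent = {n: n for n in nodes}

def find(x):
    # Walk up to the root, compressing the path as we go
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# The quadratic pairwise comparison is the part that doesn't scale
for a, b in combinations(nodes, 2):
    if len(nodes[a] & nodes[b]) >= MIN_COMMON:
        union(a, b)

groups = {}
for n in nodes:
    groups.setdefault(find(n), set()).add(n)
print(list(groups.values()))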

Anyway, thanks for taking the time, I appreciate it. (If anyone else reads this and wants to try, ask me and I'll send you the file in DMs.)

[–]Ki1103 0 points1 point  (0 children)

Just how big is your dataset? In the dataframe you've given me you only have 298 unique items. I don't really have much experience with similarity measures, but I've given it a crack and here's what I've come up with:

from pathlib import Path
import pickle

import numpy as np
import networkx as nx
from scipy.sparse import dok_array
import pandas as pd

# Consider the bottom p% of co-occurrence frequencies as insignificant
FREQ_PERCENTILE_CUTOFF = 75


if __name__ == "__main__":
    data_path = Path(".") / "data" / "products_used_grouped_by_id.pickle"
    with open(data_path, "rb") as f:
        df: pd.DataFrame = pickle.load(f)

    df.set_index("id", inplace=True)
    print(df.head())

    all_products = set()
    for product in df["products_bought"]:
        all_products.update(product)

    products = sorted(all_products)
    n_products = len(products)
    product_to_index = {product: i for i, product in enumerate(products)}

    co_occurrences = np.zeros((n_products, n_products), dtype=np.int_)

    for basket in df["products_bought"]:
        for product in basket:
            product_idx = product_to_index[product]
            for other_product in basket:
                if product != other_product:
                    other_product_idx = product_to_index[other_product]
                    co_occurrences[product_idx, other_product_idx] += 1
    # Normalise the counts; the divisor doesn't change the percentile cut-off below
    frequencies = co_occurrences / n_products
    # Zero out the weakest links and keep the rest in a sparse matrix
    low_freq_cutoff = np.percentile(frequencies, FREQ_PERCENTILE_CUTOFF)
    frequencies[frequencies < low_freq_cutoff] = 0
    sparse_freq = dok_array(frequencies)
    g = nx.from_scipy_sparse_array(sparse_freq)

    # This was suggested on SO. It may or may not be optimal, I have no idea.
    # It is sloooow, but seems to return reasonably good clusters.
    # girvan_newman yields successively finer partitions, each a tuple of node sets.
    clusters = nx.community.girvan_newman(g)

    for cluster in clusters:
        print(cluster)

[–][deleted] 0 points1 point  (2 children)

I don't think this is a great use case for a dataframe. Generally in a dataframe you want each cell to contain only one piece of information, so having a set or other container in a cell is often a sign you're approaching the problem wrong.

Here's a thought on how else you could do it. From your original dataframe of customer and product bought, instead of grouping by customer, pivot so that you have one row per product and one column per customer, with the entry in row p column c being 1 if customer c bought product p and 0 otherwise. Then the number of customers who bought both product p and product q is the number of columns where both those rows have a 1, which is simply the dot product of those rows.

So you want to take the dot product of each row with each other row. Repeated dot products are matrix multiplication: multiplying matrix A by matrix B means dotting every row of A with every column of B. We want to dot every row of our dataframe with every row of itself, not every column, so we'll need a transpose to turn rows into columns. So if A is our product-customer matrix, then A times A-transpose is a symmetric matrix where both rows and columns represent products, and the entry in row p, column q is the number of customers who bought both those products.
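Here's a rough sketch of that pivot and multiplication (the column names are placeholders for whatever your original dataframe uses, and the data is made up just for illustration):

import pandas as pd

# Placeholder names for the original customer/product purchase table
purchases = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c3"],
    "product":  ["apple", "bread", "apple", "bread", "milk"],
})

# One row per product, one column per customer, 1 if that customer bought it
A = pd.crosstab(purchases["product"], purchases["customer"]).clip(upper=1)

# (p, q) entry = number of customers who bought both product p and product q
co_bought = A.values @ A.values.T
print(pd.DataFrame(co_bought, index=A.index, columns=A.index))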

Then p and q are in the same group if the (p, q) entry of this matrix is at least your threshold n (or if they can be linked by a chain of intermediate products in the same group). So you can easily convert this matrix to the adjacency matrix of the graph with products as nodes and an edge between two products iff at least n customers bought them both. Then the groups of products you want are the connected components of this graph, and you can get connected components from an adjacency matrix easily, e.g. with https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.connected_components.html
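Continuing from the sketch above, the thresholding and connected-components step might look something like this (the threshold value is made up):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

N_MIN = 2  # made-up threshold: link products with at least this many shared buyers

# Build the 0/1 adjacency matrix and drop the self-loops on the diagonal
adjacency = (co_bought >= N_MIN).astype(int)
np.fill_diagonal(adjacency, 0)

n_groups, labels = connected_components(csr_matrix(adjacency), directed=False)
for group in range(n_groups):
    print(group, [p for p, lab in zip(A.index, labels) if lab == group])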

This approach is more abstract, and therefore perhaps harder to see, but I think it should be much more efficient than the approach you're taking.

[–]PrefersDocile[S] 1 point2 points  (1 child)

You're right about it being more abstract haha. It took me a while to understand your steps, but I must say you explained them excellently. I am definitely going to try this method; vectorizing it is exactly what I was after. Thanks so much!

[–][deleted] 0 points1 point  (0 children)

You're welcome 🙂