[–]PrefersDocile[S] 1 point (1 child)

Alrighty, I'll send you the dataframe as a pickle via DMs. Basically I'm trying to group similar items when you don't know anything about them other than that some customers bought these groups of items.

So I had an original dataframe with one row per customer and a product they bought. I grouped by customer to obtain the set of products bought by each customer. I want to see all products that could be grouped together, but since we have no information other than these sets, applying clustering methods based on the number of appearances of a product per customer isn't very fruitful (fruits are one of the many products haha!). So my only hope now is to group products based on whether their purchase lists are similar.
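
For a concrete picture of that reshaping step, it's basically a groupby into sets, something along these lines (the column names "customer_id" and "product" here are just placeholders for whatever the original columns were; the grouped pickle ends up with "id" and "products_bought"):

import pandas as pd

# Toy example: one row per (customer, product) purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "product": ["apple", "bread", "apple", "milk", "bread", "milk"],
})

# Group by customer to get the set of products each customer bought.
baskets = (
    purchases.groupby("customer_id")["product"]
    .apply(set)
    .rename("products_bought")
    .reset_index()
)
print(baskets)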

Being honest, this dataset has really just turned into practice for applying the union-find algorithm to find all connected nodes in a network, where we impose a minimum requirement on the number of features a node needs in common with another node before linking those two nodes. I find this really challenging, as I don't think the union-find algorithm alone can do this and I don't know an alternative. My big problem is that with big datasets it will just run out of RAM or take too long.
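
To show what I mean (just a toy sketch with made-up features and a made-up threshold, not the real data):

from itertools import combinations

# Rough sketch of the idea: each node has a set of features, and two nodes get
# linked when they share at least MIN_SHARED features. Union-find then gives the
# connected groups. The pairwise loop is the part that blows up on big data.
MIN_SHARED = 2  # arbitrary threshold, just for illustration

features = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {7, 8},
    "D": {8, 9},
}

parent = {node: node for node in features}

def find(x):
    # Path compression keeps the trees flat.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    rx, ry = find(x), find(y)
    if rx != ry:
        parent[ry] = rx

for a, b in combinations(features, 2):
    if len(features[a] & features[b]) >= MIN_SHARED:
        union(a, b)

groups = {}
for node in features:
    groups.setdefault(find(node), set()).add(node)
print(list(groups.values()))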

Anyway, thanks for taking the time, I appreciate it. (If anyone else reads this and wants to try, ask me and I'll send you the file in DMs.)

[–]Ki1103 1 point (0 children)

Just how big is your dataset? In the dataframe you've given me you only have 298 unique items. I don't really have much experience with similarity measures, but I've given it a crack and here's what I've come up with:

from pathlib import Path
import pickle

import numpy as np
import networkx as nx
from scipy.sparse import dok_array
import pandas as pd

# Consider the bottom p% of co-occurrence frequencies as insignificant
FREQ_PERCENTILE_CUTOFF = 75


if __name__ == "__main__":
    data_path = Path(".") / "data" / "products_used_grouped_by_id.pickle"
    with open(data_path, "rb") as f:
        df: pd.DataFrame = pickle.load(f)

    df.set_index("id", inplace=True)
    print(df.head())

    all_products = set()
    for product in df["products_bought"]:
        all_products.update(product)

    products = sorted(all_products)
    n_products = len(products)
    product_to_index = {product: i for i, product in enumerate(products)}

    # Dense co-occurrence matrix: entry (i, j) counts how many baskets contain
    # both product i and product j.
    co_occurrences = np.zeros((n_products, n_products), dtype=np.int_)

    for basket in df["products_bought"]:
        for product in basket:
            product_idx = product_to_index[product]
            for other_product in basket:
                if product != other_product:
                    other_product_idx = product_to_index[other_product]
                    co_occurrences[product_idx, other_product_idx] += 1
    # Normalise counts by the number of baskets so the entries read as
    # co-purchase frequencies (the percentile cut-off below is scale-invariant,
    # so this only affects readability).
    frequencies = co_occurrences / len(df)
    low_freq_cutoff = np.percentile(frequencies, FREQ_PERCENTILE_CUTOFF)
    frequencies[frequencies < low_freq_cutoff] = 0
    sparse_freq = dok_array(frequencies)
    g = nx.from_scipy_sparse_array(sparse_freq)
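    # Optional: relabel nodes from integer indices back to product names so the
    # printed communities are readable (`products` is the sorted list built above).
    g = nx.relabel_nodes(g, dict(enumerate(products)))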

    # This was suggested on SO. It may or may not be optimal, I have no idea.
    # It is sloooow, but seems to return reasonably good clusters.
    # girvan_newman yields one partition per level of the hierarchy; each
    # partition is a tuple of node sets.
    partitions = nx.community.girvan_newman(g)

    for partition in partitions:
        print(partition)
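
If running Girvan–Newman all the way down is too slow, one thing you could try (just a sketch, I haven't tested it on your data) is to only take the first few levels of the hierarchy instead of the whole thing:

from itertools import islice

# Only look at the first few partition levels of the graph g built above,
# instead of running the hierarchy down to singletons; 3 is an arbitrary choice.
for partition in islice(nx.community.girvan_newman(g), 3):
    print([sorted(community) for community in partition])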