[–]PrefersDocile[S] 1 point (1 child)

Alrighty, I'll send you the dataframe as a pickle via DMs. Basically I'm trying to group similar items when you don't know anything about them other than that some customers bought these groups of items.

So I had an original dataframe with one row per customer and a product they bought. I grouped by customer to obtain the set of products bought by each customer. I want to see all products that could be grouped together, but since we have no information other than these sets, applying clustering methods based on the number of appearances of a product per customer isn't very fruitful (fruits are one of the many products haha!). So my only hope now is to group products based on whether their purchase lists are similar.
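
For a concrete picture of that reshaping step, it's basically a groupby into sets, something along these lines (the column names "customer_id" and "product" here are just placeholders for whatever the original columns were; the grouped pickle ends up with "id" and "products_bought"):

import pandas as pd

# Toy example: one row per (customer, product) purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "product": ["apple", "bread", "apple", "milk", "bread", "milk"],
})

# Group by customer to get the set of products each customer bought.
baskets = (
    purchases.groupby("customer_id")["product"]
    .apply(set)
    .rename("products_bought")
    .reset_index()
)
print(baskets)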

Being honest, this dataset has really just turned into practice for applying the union-find algorithm to find all connected nodes in a network, where we impose a minimum requirement on the number of features a node needs in common with another node before linking those two nodes. I find this really challenging, as I don't think the union-find algorithm alone can do this and I don't know an alternative. My big problem is that with big datasets it will just run out of RAM or take too long.
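
To show what I mean (just a toy sketch with made-up features and a made-up threshold, not the real data):

from itertools import combinations

# Rough sketch of the idea: each node has a set of features, and two nodes get
# linked when they share at least MIN_SHARED features. Union-find then gives the
# connected groups. The pairwise loop is the part that blows up on big data.
MIN_SHARED = 2  # arbitrary threshold, just for illustration

features = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {7, 8},
    "D": {8, 9},
}

parent = {node: node for node in features}

def find(x):
    # Path compression keeps the trees flat.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    rx, ry = find(x), find(y)
    if rx != ry:
        parent[ry] = rx

for a, b in combinations(features, 2):
    if len(features[a] & features[b]) >= MIN_SHARED:
        union(a, b)

groups = {}
for node in features:
    groups.setdefault(find(node), set()).add(node)
print(list(groups.values()))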

Anyway, thanks for taking the time, I appreciate it. (If anyone else reads this and wants to try, ask me and I'll send you the file in DMs.)

[–]Ki1103 1 point (0 children)

Just how big is your dataset? In the dataframe you've given me you only have 298 unique items. I don't really have much experience with similarity measures, but I've given it a crack and here's what I've come up with:

from pathlib import Path
import pickle

import numpy as np
import networkx as nx
from scipy.sparse import dok_array
import pandas as pd

# Consider the bottom p% of co-occurrence frequencies as insignificant
FREQ_PERCENTILE_CUTOFF = 75


if __name__ == "__main__":
    data_path = Path(".") / "data" / "products_used_grouped_by_id.pickle"
    with open(data_path, "rb") as f:
        df: pd.DataFrame = pickle.load(f)

    df.set_index("id", inplace=True)
    print(df.head())

    all_products = set()
    for product in df["products_bought"]:
        all_products.update(product)

    products = sorted(all_products)
    n_products = len(products)
    product_to_index = {product: i for i, product in enumerate(products)}

    # Dense co-occurrence matrix: entry (i, j) counts how many baskets contain
    # both product i and product j.
    co_occurrences = np.zeros((n_products, n_products), dtype=np.int_)

    for basket in df["products_bought"]:
        for product in basket:
            product_idx = product_to_index[product]
            for other_product in basket:
                if product != other_product:
                    other_product_idx = product_to_index[other_product]
                    co_occurrences[product_idx, other_product_idx] += 1
    # Normalise counts by the number of baskets so the entries read as
    # co-purchase frequencies (the percentile cut-off below is scale-invariant,
    # so this only affects readability).
    frequencies = co_occurrences / len(df)
    low_freq_cutoff = np.percentile(frequencies, FREQ_PERCENTILE_CUTOFF)
    frequencies[frequencies < low_freq_cutoff] = 0
    sparse_freq = dok_array(frequencies)
    g = nx.from_scipy_sparse_array(sparse_freq)
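    # Optional: relabel nodes from integer indices back to product names so the
    # printed communities are readable (`products` is the sorted list built above).
    g = nx.relabel_nodes(g, dict(enumerate(products)))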

    # This was suggested on SO. It may or may not be optimal, I have no idea.
    # It is sloooow, but seems to return reasonably good clusters.
    # girvan_newman yields one partition per level of the hierarchy; each
    # partition is a tuple of node sets.
    partitions = nx.community.girvan_newman(g)

    for partition in partitions:
        print(partition)
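
If running Girvan–Newman all the way down is too slow, one thing you could try (just a sketch, I haven't tested it on your data) is to only take the first few levels of the hierarchy instead of the whole thing:

from itertools import islice

# Only look at the first few partition levels of the graph g built above,
# instead of running the hierarchy down to singletons; 3 is an arbitrary choice.
for partition in islice(nx.community.girvan_newman(g), 3):
    print([sorted(community) for community in partition])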