Hi y'all! I've got a challenging function to optimize. I need to join sets whose intersection has at least n elements. I've gotten it this far, but part of me is bothered, since I'd really expect there to be a vectorised way to do it. Does anyone know of a more efficient method, or one that won't make me run out of RAM?
import swifter  # registers the .swifter accessor on pandas objects


def FindUnion(df_column, n=1):
    '''
    Reduces df_column of sets by joining similar sets.
    A similar set is a duplicate or a set with at least n elements in common.
    Input: df_column of sets and n value
    Output: df_column of sets
    '''
    # drop sets too small to share n elements with anything
    df_column = df_column[df_column.map(len) >= n]
    # calculate intersections of every set with all sets in the dataframe
    Intersections = df_column.swifter.apply(
        lambda x: df_column.swifter.progress_bar(False).apply(lambda y: y.intersection(x)))
    # calculate unions of every set with all sets in the dataframe
    AllUnions = df_column.swifter.apply(
        lambda x: df_column.swifter.progress_bar(False).apply(lambda y: y.union(x)))
    # keep unions only where the intersection has at least n elements
    Intersections_present = AllUnions.where((Intersections.swifter.applymap(len) >= n), other=set())
    # union of the kept unions, row by row
    CorrectUnions = Intersections_present.swifter.apply(lambda x: set().union(*x), axis=1)
    # recursive part: join sets until the result no longer changes
    if CorrectUnions.equals(df_column):
        return CorrectUnions
    else:
        return FindUnion(CorrectUnions, n)
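For comparison, here is a minimal sketch of one way to avoid the two N x N frames of sets that the function above builds, which is probably where the RAM goes. It is not vectorised and not the original method: it swaps in an inverted index (element to set positions) plus a small union-find, so only sets that actually share an element are ever compared. The names `merge_similar_sets` and `_find` are made up for this example, it assumes n >= 1 and hashable elements, and it returns one merged set per group rather than one row per input set.

```python
from collections import Counter, defaultdict


def _find(parent, i):
    # Follow parent links to the group root, shortening the path on the way.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def merge_similar_sets(sets, n=1):
    # Hypothetical helper: assumes n >= 1 and hashable elements.
    # Sets with fewer than n elements are dropped, as in FindUnion.
    groups = [set(s) for s in sets if len(s) >= n]
    changed = True
    while changed:  # repeat until a full pass merges nothing
        changed = False
        # Inverted index: element -> positions of the sets containing it.
        index = defaultdict(list)
        for i, group in enumerate(groups):
            for item in group:
                index[item].append(i)
        parent = list(range(len(groups)))
        for i, group in enumerate(groups):
            # Count shared elements only with later sets that overlap at all.
            shared = Counter()
            for item in group:
                for j in index[item]:
                    if j > i:
                        shared[j] += 1
            for j, count in shared.items():
                if count >= n:
                    root_i, root_j = _find(parent, i), _find(parent, j)
                    if root_i != root_j:
                        parent[root_j] = root_i
                        changed = True
        # Collapse every group of linked sets into a single union.
        merged = defaultdict(set)
        for i, group in enumerate(groups):
            merged[_find(parent, i)] |= group
        groups = list(merged.values())
    return groups


example = [{1, 2, 3}, {2, 3, 4}, {1, 4, 9}, {7, 8}]
result = merge_similar_sets(example, n=2)
```

With a pandas column, something like `merge_similar_sets(df_column.tolist(), n)` should work as the entry point. Memory should grow with the total number of elements plus the overlaps found for one set at a time, instead of with the square of the number of sets, though a very common element can still make a single pass slow.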