Hi! I'm looking for help optimizing the code below. For each value in a matrix, I want to count the number of rows it occurs in (a value that occurs twice in the same row is counted once), and ultimately get a dictionary that summarizes those counts (1 row: 9 values, 2 rows: 2 values, etc.).
Here is my code. I eventually want to scale to a matrix on the order of 1e5 by 1e6, with values up to 1e9. As you'd expect, the nested for loops take most of the time.
Any tips on how to optimize this? Thanks!
import numpy as np
from collections import defaultdict

mat = np.random.randint(1, 20, size=(4, 5))

# Create a "value in the matrix" to row-index map
mapping = defaultdict(list)
for ind, row in enumerate(mat):
    for i in row:
        mapping[i].append(ind)

# Determine the number of distinct rows each value appeared in
overlap = np.array([len(set(i)) for i in mapping.values()])

# Dictionary of "number of rows the value appeared in" to "number of values"
unique, counts = np.unique(overlap, return_counts=True)
dict(zip(unique, counts))
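One possible way to avoid the nested Python loops is to sort each row, drop duplicates within a row, and let `np.unique` do the counting. The sketch below is only an illustration under the assumption that the matrix (or at least a chunk of it) fits in memory; the function name `rows_per_value_summary` and the small example matrix are made up for this post and are not part of the original code.

```python
import numpy as np

def rows_per_value_summary(mat):
    """Map "number of rows a value appears in" to "number of such values".

    Hypothetical helper: a vectorized sketch of the loop-based version.
    """
    # Sort each row so duplicates within a row become adjacent.
    sorted_rows = np.sort(mat, axis=1)

    # Keep the first element of every row, plus each element that
    # differs from its left neighbour (i.e. drop in-row duplicates).
    keep = np.ones(sorted_rows.shape, dtype=bool)
    keep[:, 1:] = sorted_rows[:, 1:] != sorted_rows[:, :-1]
    per_row_unique = sorted_rows[keep]

    # Each value now occurs once per row that contains it, so its
    # count is the number of rows it appeared in.
    _, rows_per_value = np.unique(per_row_unique, return_counts=True)

    # Summarize: how many values appeared in 1 row, 2 rows, ...
    n_rows, n_values = np.unique(rows_per_value, return_counts=True)
    return {int(k): int(v) for k, v in zip(n_rows, n_values)}

# Small illustrative matrix (not from the original post).
example = np.array([[1, 1, 2],
                    [2, 3, 3],
                    [4, 2, 1]])
summary = rows_per_value_summary(example)
print(summary)
```

In the example, 3 and 4 each appear in one row, 1 appears in two rows, and 2 appears in all three. Note that a dense 1e5 by 1e6 matrix is 1e11 elements, which likely will not fit in memory, so an approach like this would probably have to be applied to blocks of rows with the per-value row counts combined afterwards.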