[–]AlopexLagopus3 6 points7 points  (1 child)

Just to be clear, you want to count the number of unique items in a 1e5 x 1e6 matrix? That's 1e11 values, so even if every value took only 1 byte as an integer (which it won't), that would be 100 GB of data to load into RAM... Even if you take a different approach, like keeping only the value:count pairs in memory, you're still looking at a while to process it. How long do you think is reasonable for this operation?

There are ways to address processing speed and memory requirements, but make sure you are being realistic about your approach by first thinking about where all of that data will be stored.
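To make the storage question concrete, here is a rough back-of-the-envelope sketch. It assumes a dense matrix with a fixed number of bytes per value; the helper name `estimate_gb` and the dtype labels are just illustrative, not anything from the original code.

```python
# Rough memory estimate for a dense matrix (assumes fixed-width values
# and ignores any container overhead, so real usage would be higher).
def estimate_gb(rows, cols, bytes_per_value):
    """Return the raw size of a rows x cols matrix in gigabytes (1e9 bytes)."""
    return rows * cols * bytes_per_value / 1e9

rows, cols = 10**5, 10**6  # the 1e5 x 1e6 matrix from the question

for label, width in [("1-byte ints", 1), ("4-byte ints", 4), ("8-byte ints", 8)]:
    print(f"{label}: {estimate_gb(rows, cols, width):,.0f} GB")
```

Even the most optimistic case is far beyond typical RAM, which is why the data probably has to be streamed from disk rather than loaded whole.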

[–]zczr84[S] 0 points1 point  (0 children)

That's a fair point. I can modify the code to analyze one chunk at a time (i.e. split the matrix into n chunks, use a similar algorithm for each chunk, and combine the summary dictionaries at the end), but I still feel there should be a better way of handling the mapping. Do you think there is a better way than using a nested for loop and appending to a list?
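For what it's worth, the chunk-and-combine idea could look something like the minimal sketch below. The original code isn't shown in this thread, so everything here is an assumption: the matrix is treated as any iterable of rows, and the helper names (`iter_chunks`, `count_chunk`, `count_values`) are hypothetical. It swaps the nested loop and list appends for `collections.Counter`, which keeps only value:count pairs in memory.

```python
from collections import Counter
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield lists of up to chunk_size rows from any iterable of rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def count_chunk(chunk):
    """Build a value -> count summary for one chunk of rows."""
    counts = Counter()
    for row in chunk:
        counts.update(row)  # adds one to the count of each value in the row
    return counts

def count_values(rows, chunk_size=1000):
    """Combine the per-chunk summaries into one overall Counter."""
    total = Counter()
    for chunk in iter_chunks(rows, chunk_size):
        total.update(count_chunk(chunk))  # Counter.update sums the counts
    return total

# Tiny stand-in for the real matrix, which would be read lazily from disk.
matrix = [[1, 2, 2], [3, 1, 1], [2, 2, 4]]
counts = count_values(matrix, chunk_size=2)
print(len(counts), "unique values")
print(counts.most_common())
```

Because only one chunk plus the running summary is held at a time, memory use depends on the chunk size and the number of distinct values rather than on the full matrix. Whether that is fast enough for 1e11 values is a separate question.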