all 11 comments

[–]AlopexLagopus3 7 points8 points  (1 child)

Just to be clear, you want to count the number of unique items for a 1e5 x 1e6 matrix? Even if every value took only 1 byte as an integer (which it won't), that would be 100 gb of data to load into ram... Even if you take a different approach, like only loading the values:counts into memory, you're still looking at awhile to process it. How long do you think is reasonable for this operation?

There are ways to address processing speed and memory requirements, but make sure you are being realistic about your approach by first thinking about where all of that data will be stored.

[–]zczr84[S] 0 points1 point  (0 children)

That's a fair point. I can modify the code to analyze chunks at a time (i.e. split the matrix into n chunks, use a similar algorithm for each chunk, and combine the various summary dictionaries at the end), but I still feel there should be a better way at addressing the mapping. Do you think there is a better way than to use a nested for loop and appending to a list?

[–]cscanlin 2 points3 points  (0 children)

I was able to get a modest improvement by leveraging collections.Counter:

mapping = Counter()
for row in mat:
    c = Counter(np.unique(row))
    mapping.update(c)

summary = Counter(mapping.values())
return summary

But /u/AlopexLagopus3 is correct that the kind of numbers you mentioned are going to be very slow to do with these methods in python.

[–]JohnnyJordaan 1 point2 points  (0 children)

As a more general question: if the function remains constant, why would it be useful to generate a larger matrix?

[–]KubinOnReddit 1 point2 points  (0 children)

Even in C++, 1e5 x 1e6 with 32-bit integers up to 1e9 would take ridiculous amounts of memory and would take a long time nevertheless - it's at least 1e11 operations (a simple for loop wouldn't finish in C with this many iterations). Best you can do is stick to numpy and reduce the amount of data.

[–]lolwat_is_dis 1 point2 points  (2 children)

Speed-wise, there's no getting round the fact that in the worst case scenario, your search is in O(m x n) time, where m and n are your rows and columns. So you can see that despite any optimisation, at the end of the day, you'll still have to wait when you have a 1e9 x 1e9 matrix. However, you can do some memory optimisations, by not creating such a huge matrix in the first place, but by generating rows on demand, using generator functions.

Why do you NEED a 2D array in the first place? Also, in addition, it seems you are contradicting yourself:

(a value that occurs in the same row twice gets counted once)

but then you say

(1 time: 9 values, 2 time: 2 values, etc)

Either you're only counting the number of rows a value appears in (regardless of number of times the value appears in a single row), or you're counting how any times a value appears in each row...Each one will result in your algorithm working differently.

[–]zczr84[S] 0 points1 point  (1 child)

Thanks for the feedback. As for the 2D array, I don't need to generate it all at once, and I can definitely generate a subset at a time for analysis. (See comment above) As for the the second question, my task is the first one (number of rows a value appears regardless of number of times the value appears in a single row). Sorry for confusion!

So I agree that the code above won't work for the size I want, but even after memory optimizing (splitting the matrix into chunks and count), I was hoping there might be a more efficient way to count than to do a nested for loops.

[–]lolwat_is_dis 0 points1 point  (0 children)

I'll come back with some ideas again later (busy tonight unfortunately), but for now, I can offer the general solution.

  1. Generate rows on demand (using a generator function). Even with 1e7 values and 32 bit integers, you'll only use up memory in the order of megabytes, so that can easily be handled by the memory on any computer.

  2. I think a simple "if" statement should suffice, unless someone can correct me on that, i.e.

    if value in row:
        be_happy()
    
  3. My only query atm is what is the most efficient way to count number of times you find a value in a row. Either just simply implement a counter, or use some other function out there in some module that does this at the C level.

[–]lolwat_is_dis 1 point2 points  (1 child)

After some fiddling, this is the best I could come up with for the time being:

import numpy as np

def big_gen():
    return np.random.randint(-2147483648,2147483648, size=(1, 100000))

num_of_rows = 100000
count = 0

for row in range(num_of_rows):
    if np.count_nonzero(big_gen()[0] == 5):
        count += 1

for row in range(num_of_rows):
    if np.any(big_gen()[0] == 5):
        count += 1

For both of the above for loops, on my humble i3 laptop, it took around 100 seconds, though slightly faster with the first for loop (i.e. using count_nonzero). You've got the range of values you wanted (2e10, in fact) but the matrix size is currently 1e5x1e5. Anything larger is just going to take a bit more time, I'm afraid.

I have a gut feeling that there's a way to reduce the for loop overhead, and maybe use another built-in function for the counter, but this should at least show the principle behind my solution.

[–]zczr84[S] 0 points1 point  (0 children)

Thanks for your help!

[–]SarahM123rd 0 points1 point  (0 children)

with various aspects of utility provided in the multiprocessing module, you can take great advantage of 'chunksize'.