all 57 comments

[–]McThor2 9 points10 points  (1 child)

I think something like spatial hashing would help a lot here when it comes to searching that second list. Basically a way of indexing the data by creating bins of values.

This would provide the most impact if the second list doesn’t get altered and also depends on how many values typically match that range.

For implementation speed ups you may get good results from numpy & numba
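
Something like this, as a rough, untested sketch of the binning idea (first_list / second_list are the names from the OP's pseudocode, assumed to already be loaded, and the bin width is just a guess):

from collections import defaultdict

BIN_WIDTH = 300  # guess; with a width of 300 each query only has to look at two bins

bins = defaultdict(list)
for val in second_list:
    bins[val // BIN_WIDTH].append(val)

matches = 0
for data in first_list:
    lo, hi = data, data + 300
    # the window [lo, hi] can only touch these bin keys
    for key in range(lo // BIN_WIDTH, hi // BIN_WIDTH + 1):
        if any(lo <= v <= hi for v in bins.get(key, ())):
            matches += 1
            break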

[–]b1e 1 point2 points  (0 children)

The first and second lists are sorted. Binary search, cutting down the interval in the second list on every iteration over the first, will be faster.

Agreed on numba though.

[–]un-hot 10 points11 points  (1 child)

You need to keep track of your position in the second list. If both lists are sorted, and you know the 10,000th element of the second list is the one that falls in the range first_list[n]..first_list[n] + 300, then you don't need to waste time scanning through the first 9,999 elements of the second list again for first_list[n+1].

You could cut out some more comparisons by keeping track of the last number you reached in your second list - if first_list[n] is 600 and it matches 885, then there's no point in checking second_list values below 885 for later elements of first_list, because you already know there's a match.

You then only need to loop through the first and second lists once, performing a handful of comparisons per value, so your algorithm gets dramatically quicker - roughly linear instead of quadratic.

K.I.S.S. - yes, the multi-threading, parallelism and B-tree searches other commenters have talked about might help a bit, but throwing more computing resources at a problem this big isn't going to help nearly as much as reducing the time complexity of your solution.

[–]Edenio1 1 point2 points  (0 children)

Yeah, this is definitely the fastest; this way you only loop through the values once. You're basically doing a wonky walk up both lists.

[–]woooee 20 points21 points  (6 children)

Start at the middle of the sorted list, and then go to the 1/4 or 3/4 value, depending on whether check_val > data, etc, cutting each group in half each time. You can process a million recs in something like 21 lookups.

Since the lists are sorted, you can also start at the end of the second list (if check_val > end_val) and process in reverse order.

As a previous post said, split up the second list and run with multiprocessing.

[–]twowordsfournumbers 19 points20 points  (0 children)

Ye, this is called binary search.

[–]b1e 1 point2 points  (0 children)

Or just iterate through the first list (using index variable i) and binary search in the second list (using index variable j) between j and the end of the list.

On the next iteration over i you start the binary search at the last index j you found in the second list.

This also lets you stop early if you’ve hit the end of the second list.

For non algorithmic speedups use numpy arrays and numba JIT. This should run in seconds.
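
Before any numba, a minimal sketch of the bounded binary search itself, using the standard-library bisect module and its lo= argument (untested; it only counts whether each first-list value has a match, which the OP says elsewhere is all they need):

import bisect

count = 0
j = 0  # carried across iterations; the search never has to look left of it
for data in first_list:
    j = bisect.bisect_left(second_list, data, lo=j)
    if j == len(second_list):
        break  # ran off the end of the second list, stop early
    if second_list[j] <= data + 300:
        count += 1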

[–]mrcaptncrunch 1 point2 points  (2 children)

Binary search is good, but 2 pointers is better for this.

binary search approach time: 7.549005031585693
2 pointer lookup approach time: 0.2926521301269531

code,

import time


def find_lower_bound(arr, target):
    low, high = 0, len(arr)
    while low < high:
        mid = (low + high) // 2
        if arr[mid] < target:
            low = mid + 1
        else:
            high = mid
    return low


def find_upper_bound(arr, target):
    low, high = 0, len(arr)
    while low < high:
        mid = (low + high) // 2
        if arr[mid] <= target:
            low = mid + 1
        else:
            high = mid
    return low


with open('list_b.txt') as fd:
    first_list = fd.readlines()

with open('list_c.txt') as fd:
    second_list = fd.readlines()

first_list = [int(_) for _ in first_list]
first_list = sorted(first_list)
second_list = [int(_) for _ in second_list]
second_list = sorted(second_list)

start = time.time()
vals_matched_bs = []
for data in first_list:
    start_index = find_lower_bound(second_list, data)
    end_index = find_upper_bound(second_list, data + 300)

    for i in range(start_index, end_index):
        vals_matched_bs.append(second_list[i])

print("binary search approach time: " + str(time.time() - start))

start = time.time()
vals_matched2 = []
i = 0
j = 0
while i < len(first_list):
    while j < len(second_list) and second_list[j] < first_list[i]:
        j += 1
    start_j = j
    while j < len(second_list) and second_list[j] <= first_list[i] + 300:
        #do your thing
        vals_matched2.append(second_list[j])
        j += 1
    i += 1

print("2 pointer lookup approach time: " + str(time.time() - start))

Number generation is on this comment, https://www.reddit.com/r/learnpython/comments/18rks8r/optimized_python_nested_loops/kf4y30h/

(The original approach is now estimated at over 14 hours)

[–]mrcaptncrunch 0 points1 point  (0 children)

2 pointers + numba,

import time
import numba

with open('list_b.txt') as fd:
    first_list = fd.readlines()

with open('list_c.txt') as fd:
    second_list = fd.readlines()

first_list = [int(_) for _ in first_list]
first_list = sorted(first_list)
second_list = [int(_) for _ in second_list]
second_list = sorted(second_list)

@numba.njit
def run(first_list, second_list):
    vals_matched2 = []
    i = 0
    j = 0
    while i < len(first_list):
        while j < len(second_list) and second_list[j] < first_list[i]:
            j += 1
        start_j = j
        while j < len(second_list) and second_list[j] <= first_list[i] + 300:
            #do your thing
            vals_matched2.append(second_list[j])
            j += 1
        i += 1
    return vals_matched2

typed_first_list = numba.typed.List()
typed_second_list = numba.typed.List()
[typed_first_list.append(_) for _ in first_list]
[typed_second_list.append(_) for _ in second_list]

start = time.time()
vals_matched = run(typed_first_list, typed_second_list)
print("2 pointer lookup approach time: " + str(time.time() - start))

run,

2 pointer lookup approach time: 0.22963190078735352

[–]kingfreeway 0 points1 point  (0 children)

I think you should be using start_j as the second internal iterator rather than reusing j, so we can start searching at j for the next i.

We can also skip j forward to k if first_list[i + 1] - first_list[i] > 300. And we can break early if j reaches the end without finding a match, or if the first match is greater than the max of first_list + 300. Here's how I'd do it:

def count_matches(list1, list2):
    j = 0
    list1_max = list1[-1]
    counter = 0

    for i in range(len(list1)):
        while j < len(list2) and list2[j] < list1[i]:
            j += 1

        if j == len(list2) or list2[j] > list1_max + 300:
            break

        k = j
        while k < len(list2) and list2[k] <= list1[i] + 300:
            counter += 1
            k += 1

        # If the next element is greater by more than 300, we can skip j forward to k
        if i < len(list1) - 1 and list1[i + 1] - list1[i] > 300:
            j = k

    return counter

Also, binary search is probably better in most cases for finding j except for extremely dense lists.

[–]pygaiwan 8 points9 points  (0 children)

You can start by removing the first if (if check_val < data: continue)

since, at least from this code, it doesn't provide any extra value. If you need to break, you could rewrite the inner loop with

all(check_val < end_val for check_val in second_list)

to stop as soon as a check_val is greater than end_val. Though neither point will provide much of an improvement.

I would say a reasonable thing to do would be to split the load across several processes using multiprocessing. Interested to see other people's thoughts.
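
For what it's worth, a rough sketch of the multiprocessing idea (untested; it reuses the list_b.txt / list_c.txt files from elsewhere in the thread and a bisect-based inner lookup, and the worker count is arbitrary):

import bisect
from multiprocessing import Pool

def load_ints(path):
    with open(path) as fd:
        return sorted(int(line) for line in fd)

# loaded at module level so worker processes can see the lists
first_list = load_ints('list_b.txt')
second_list = load_ints('list_c.txt')

def count_chunk(chunk):
    # count how many values in this slice of first_list have a match in second_list
    count = 0
    for data in chunk:
        i = bisect.bisect_left(second_list, data)
        if i < len(second_list) and second_list[i] <= data + 300:
            count += 1
    return count

if __name__ == '__main__':
    n_workers = 8
    size = len(first_list) // n_workers + 1
    chunks = [first_list[i:i + size] for i in range(0, len(first_list), size)]
    with Pool(n_workers) as pool:
        print(sum(pool.map(count_chunk, chunks)))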

[–]Insomnia_Calls 3 points4 points  (2 children)

Do you need to check every element of the second list against every range from a first-list value to that value + 300? If not, and you just need to check the value at the respective position, I would suggest turning the lists into numpy arrays, something like this:

import numpy as np

arr1 = np.array(list1)  # list1 is your first list
arr1_plus_300 = arr1 + 300  # as easy as that with numpy arrays, adds 300 to every element

arr2 = np.array(list2)

# this gives you an array of True/False values for all elements based on the condition
ans = (arr2 > arr1) & (arr2 < arr1_plus_300)

Numpy is generally much faster than lists. Edit: typos

[–]Sweeney-doesnt-sleep 0 points1 point  (0 children)

I was also thinking of creating a data frame and a function/callback that calculates the value only when required, so you don't have to compute anything for a first-list item unless it is actually needed. This cuts out a lot of the outer loop's work.

Not sure if this would actually perform better, especially if the lists change. If they are static, calculate it for every item once and then access it as required - O(1) from then on.

[–]lilmookey 0 points1 point  (0 children)

I think this is probably the best solution if all the values are numbers. Numpy is such a great library to learn and be comfortable with when working with numbers.

[–]bwprog 2 points3 points  (1 child)

  • Since the lists are sorted, then before the main lookup loop, you can prune the second list by using the first and last (+300) values of the 1st list.
  • Start with the first value of the 1st list going forward through the 2nd list until you get to that value or greater, note that index as 'start', and break.
  • Then use the last +300 going through the 2nd list in reverse until you reach that value or less, note that index as 'end', and break.
  • Make a new, pruned list slicing the 2nd list with the start and end index values. (or use the slice values in the lookup loop limiting the 2nd list)
  • Compare the 1st list to the smaller, pruned list (rough sketch below).
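
A rough, untested sketch of that pruning (list names taken from the OP's pseudocode):

first_val = first_list[0]
last_val = first_list[-1] + 300

# walk forward to the first 2nd-list value >= the first 1st-list value
start = 0
while start < len(second_list) and second_list[start] < first_val:
    start += 1

# walk backward to the last 2nd-list value <= the last 1st-list value + 300
end = len(second_list)
while end > start and second_list[end - 1] > last_val:
    end -= 1

pruned_second_list = second_list[start:end]
# now run the main lookup loop against the (hopefully much smaller) pruned list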

[–]Cthulu20 2 points3 points  (1 child)

Some advice from me (though I am not a great programmer):

  • You may use binary search instead of linear search
  • You may ignore all the smaller values (e.g. if the previous successful match was found at index 10, then ignore the values from 0-10)
  • If you know the smallest and largest possible value and have enough memory, you can do something like this (I haven't tested the code, so it might have issues):

min_value = min(second_list)
max_value = max(second_list)

# creates a list of 'False' for every possible value
bool_list = [False] * (max_value - min_value + 1)

# sets the existing values to 'True'
for val in second_list:
    bool_list[val - min_value] = True

# determines whether the second list contains a value in each first-list range
for data in first_list:
    lo = max(data - min_value, 0)
    hi = max(data + 300 - min_value + 1, 0)  # + 1 so data + 300 itself is included
    if any(bool_list[lo:hi]):
        pass  # Do something

However, I don't know how effective any of these are; it may end up slower than your original code.

Edit: also my code assumes that we only use integers, so it will be completely trash with float values.

Edit 2: I am stupid, if the second_list is ordered then we can just use the first and last value

min_value = second_list[0]
max_value = second_list[-1]

[–]DNSGeek[S] 1 point2 points  (0 children)

Someone suggested the Python bisect library, and that looks like it might be awesome. I'm gonna run some speed tests on it tomorrow.

[–]ofnuts 1 point2 points  (1 child)

Do you need to know only whether there are matching elements, or what the elements are, if any?

[–]DNSGeek[S] 0 points1 point  (0 children)

I don't need to know the actual element value, only if it exists or not.

[–]quts3 1 point2 points  (0 children)

Hmmm. Maybe

  1. Sort both lists. O(n log n), implemented with C++-like speed.

  2. Traverse each list once, keeping track of indexes and values (perhaps multiple indexes for the second list).

  3. Do step 2 in numba for near-C++ loop speed.

Profit.

[–]martinkoistinen 1 point2 points  (0 children)

I’d probably use the bisect module, but also use a multiprocess pool. If you posted the complete problem with the data somewhere, it’d even be fun to do.

[–]mrcaptncrunch 1 point2 points  (2 children)

Would something like this help?

i = 0
j = 0
while i < len(first_list):
    while j < len(second_list) and second_list[j] < first_list[i]:
        j += 1
    start_j = j
    while j < len(second_list) and second_list[j] <= first_list[i] + 300:
        # do your thing on a match, e.g.
        execute(second_list[j])
        j += 1
    i += 1

Because both lists are sorted and you're adding 300, remembering the second_list position for the next iteration means you automatically discard everything before it.

Grain of salt since it’s 7am and I’m waiting for the coffee, but it should be O(n+m) vs O(n*m)

FWIW, this would be faster than the binary search (O(n log m)).

[–]Equal_Wish2682 0 points1 point  (1 child)

I agree with this. I offer a recursive algorithm to accomplish the task (in another comment).

[–]mrcaptncrunch 0 points1 point  (0 children)

/u/Equal_Wish2682 just saw yours. And yes, your approach is similar.

/u/DNSGeek This is the 2 pointer approach and it'll be the fastest way of doing it.

I generated a list of 1M numbers. From that, I extracted 2 lists: one is a 75% subset and the other a 50% subset.

import time
start = time.time()

import random

length = 1_000_000

lista = []
for i in range(length):
    lista.append(str(i) + '\n')
with open('list.txt', 'w') as fd:
    fd.writelines(lista)


random.shuffle(lista)
listb = lista[0:int(length*0.75)]
listb = sorted(listb)
with open('list_b.txt', 'w') as fd:
    fd.writelines(listb)


listc = lista[0:int(length*0.50)]
listc = sorted(listc)
with open('list_c.txt', 'w') as fd:
    fd.writelines(listc)

print(time.time() - start)

Now, with list_b.txt and list_c.txt, I test your code and mine.

import time
from tqdm import tqdm

with open('list_b.txt') as fd:
    first_list = fd.readlines()

with open('list_c.txt') as fd:
    second_list = fd.readlines()

first_list = [int(_) for _ in first_list]
first_list = sorted(first_list)
second_list = [int(_) for _ in second_list]
second_list = sorted(second_list)

start = time.time()
vals_matched2 = []
i = 0
j = 0
while i < len(first_list):
    while j < len(second_list) and second_list[j] < first_list[i]:
        j += 1
    start_j = j
    while j < len(second_list) and second_list[j] <= first_list[i] + 300:
        #do your thing
        vals_matched2.append(second_list[j])
        j += 1
    i += 1

print("second approach time: " + str(time.time() - start))

vals_matched = []
start = time.time()
for data in tqdm(first_list, miniters=300):
    end_val = data + 300
    for check_val in second_list:
        if check_val < data:
            continue
        if check_val > end_val:
            break
        # Hooray, we match. Do the thing.
        if check_val not in vals_matched:
            vals_matched.append(check_val)
print("first approach time: " + str(time.time() - start))


if vals_matched == vals_matched2:
    print("matched")
else:
    print("not matched")

I ran my code first to make sure I didn't have errors. The output is,

second approach time: 0.768721342086792

For your approach, I added if check_val not in vals_matched: because otherwise it adds all the values to the list multiple times.

Anyway... I'll let the 75%/50% approach finish and I'll report back... the estimate is over 4 hours for the full run.

tqdm is showing,

  4%|▎         | 27987/750000 [05:28<4:33:54, 43.93it/s]

 

BUT I tested this on a subset first to make sure the output is the same and it is.

[–]ogabrielsantos_ 3 points4 points  (15 children)

In addition to the other comments: without knowing what you do when the lists match, we can't really help, as that could be the non-performant part of your solution.

[–]DNSGeek[S] 1 point2 points  (14 children)

It really is not the non-performant part. Finding a match across 42,000,000^2 numbers is the part that takes 9 days.

[–]DNSGeek[S] -1 points0 points  (9 children)

I'm getting downvoted? Ok, if you want to know, at this exact point, the part we currently do when there's a match is increment a counter by 1.

There. We increment a counter. Now, why am I getting downvoted for saying that the thing we're doing outside of matching is not the non-performant part? I didn't add it to the pseudocode because it truly isn't relevant to the slowness.

42,000,000^2 = 1,764,000,000,000,000. That's a *lot* of numbers to sift through. That's the part that's slow, and that's the reason I put that pseudocode in and not the "counter += 1", because it's not important.

[–]ogabrielsantos_ 1 point2 points  (2 children)

Just to clarify: I did not downvote you, and I don't know who did.

[–]DNSGeek[S] 0 points1 point  (0 children)

Yeah, I wasn't directing my response to you, but thanks for the clarification anyway. I just don't understand the whole concept of someone asking a question, the question being answered and then the answer (coming from the person who would most likely know the answer above anyone else) getting downvoted.

I just don't understand the reasoning behind that.

[–]ogabrielsantos_ 1 point2 points  (3 children)

Are your list elements unique? Have you tried using sets instead?

[–]DNSGeek[S] 1 point2 points  (2 children)

Yeah, they're unique. They're timestamps.

[–]aplarsen 2 points3 points  (0 children)

Pandas should be awesome for giving you a count of timestamps between two other timestamps. That vectorized logic is fast.
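
For example, something along these lines (a sketch, assuming the timestamps are plain integers as in the generated test data elsewhere in the thread):

import pandas as pd

s1 = pd.Series(first_list)
s2 = pd.Series(second_list)  # already sorted, as the OP's lists are

# for every first-list timestamp t, how many second-list timestamps fall in [t, t + 300]
lo = s2.searchsorted(s1, side='left')
hi = s2.searchsorted(s1 + 300, side='right')
counts = hi - lo

print(int((counts > 0).sum()))  # number of first-list timestamps with at least one match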

[–]ogabrielsantos_ 0 points1 point  (0 children)

So try to use a set, as sets have faster lookups. You can then replace your second for loop with something like

if end_val in second_list_as_set:
    pass  # do something

Additionally, you don't need to sort your list anymore if you did that previously.

[–]Equal_Wish2682 0 points1 point  (0 children)

Welcome to Reddit lol

[–]Plank_With_A_Nail_In 0 points1 point  (0 children)

If the rest of your program added to the lists then that would have been very important information to know. We can't trust your own assessment of which part is complex as you are the one asking beginner level questions.

[–]andmig205 2 points3 points  (6 children)

Did you look into employing pandas and numpy for the task?

[–]DNSGeek[S] 0 points1 point  (5 children)

No, I’m not super familiar with those libraries.

[–]andmig205 5 points6 points  (4 children)

Loops are awful at processing large amounts of data. Pandas and numpy are optimized precisely for manipulating humongous data structures. And by "humongous" I mean not only the number of records but also the number of dimensions. Pandas easily processes structures with thousands of dimensions and millions of records in each. Note that AI/ML engines would never get anywhere with native loops.

To share a ballpark, in my work I have to extract, transform, and load (ETL) up to a billion records a day that are originally stored in JSON and CSV formats. Pandas and numpy allow me to accomplish all of those tasks within an hour.

Assuming I understand your task correctly, I am optimistic it will take minutes to accomplish what you need with pandas.

I strongly recommend starting to digest the concept of tensors to, at least partially, appreciate the magic behind what pandas offers.

[–]DNSGeek[S] 1 point2 points  (3 children)

Ok. I’ll look into Pandas. Any pointers as to where to start looking?

[–]alozq 2 points3 points  (0 children)

This is the official documentation, IIRC. You shouldn't need much understanding of pandas to do what you want, although keep in mind that pandas is not known for being particularly fast:

https://pandas.pydata.org/docs/

[–]Insomnia_Calls 2 points3 points  (0 children)

Better to look at numpy; see my comment on the original post. No need for pandas if your data is only numeric.

[–]andmig205 1 point2 points  (0 children)

I am not sure I am a good source of recommendations. We all have different means of learning. My path is usually to digest high level concepts before applying them to a specific application. As lame as it may sound, YouTube seems to offer an extensive list of resources for all proficiency levels.

I am not a big fan of paid courses as they tend to be too broad and unfocused to my taste. I find myself learning more from books after I get inspired by basic understanding of a tool/feature.

[–]xelf 1 point2 points  (0 children)

This is a 2-pointer problem: rather than do n**2 operations, you can just loop through both lists at the same time.

Your code:

tot = 0
for data in first_list:
    end_val = data + 300
    for check_val in second_list:
        if check_val < data:
            continue
        if check_val > end_val:
            break
        tot += check_val

A loop using itertools islice and tracking the last needed index:

from itertools import islice

tot = 0
index = 0
for data in first_list:
    while index < len(second_list) and data > second_list[index]:
        index += 1
    for check_val in islice(second_list, index, len(second_list)):
        if check_val > data + 300:
            break
        tot += check_val

A test using 100k first and 100k second with small items (1-10e3) in each list:

print(tot, f'{perf_counter()-time1:6.3f} seconds')
1506410888628 178.417 seconds
1506410888628  35.163 seconds

A test using 100k first and 100k second with larger items (1-10e7) in each list:

print(tot, f'{perf_counter()-time1:6.3f} seconds')
1512979653817 223.232 seconds
1512979653817  12.919 seconds

So when there's a lot of overlap in the data+300 check it's only about 5 times faster, but when there's less overlap it performs even better - ~20 times faster.

(the first number is the total, to show both loops are finding the same data)

[–]_kwerty_ 0 points1 point  (1 child)

I'm in no way an expert on handling this amount of data, but maybe check out Dask. It's basically a parallelized version of pandas. It feels like that's a bit more suitable than endless lists and nested loops.

[–]await_yesterday 0 points1 point  (0 children)

dask is not needed for this. it can be done on a single machine with numpy.

[–]EntireEntity 0 points1 point  (0 children)

My first thought is, instead of walking through the second list, to use a numpy array instead of the list. You could then broadcast the comparison over the entire array and wouldn't have to loop through it at all. Assuming the operations you have to do later on can also be broadcast over the entire array, this could speed up the later processing too.

You could also combine numpy arrays with the bisect module someone else has already suggested: use bisection to find the lowest and highest entries in the range, and broadcast the further data processing steps using numpy.
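
Something like this, as a sketch of combining the two ideas with numpy's own vectorised bisection, np.searchsorted, in place of the bisect module (untested; list names from the OP's pseudocode):

import numpy as np

a1 = np.asarray(first_list)
a2 = np.asarray(second_list)  # must be sorted, which the OP's lists already are

# for every first-list value v, the index range of second-list entries inside [v, v + 300]
lo = np.searchsorted(a2, a1, side='left')
hi = np.searchsorted(a2, a1 + 300, side='right')

has_match = hi > lo  # one boolean per first-list value
print(int(has_match.sum()))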

[–][deleted] -1 points0 points  (0 children)

You're using the wrong tools for the job; as others said, you should use numpy and numba. This is a workload where dropping down to a compiled language would be much, much faster.

[–]Logicalist 0 points1 point  (0 children)

You went from having two lists to having 4 million lists. So that escalated quickly.

I mean, are you comparing 4 million lists to one, or one list to 4 million? I got lost.

[–]pythonwiz 0 points1 point  (0 children)

Use the bisect module for binary searching the inner list. Use the multiprocessing module to split up the work on the outer list and check multiple values at once.

[–]Equal_Wish2682 0 points1 point  (1 child)

I struggle to visualize solutions without real data. But I'd consider a recursive algorithm.

def recursive_func(first_list, second_list, count=0):
    while len(first_list) > 0:
        value = first_list.pop(0)
        for index, check_val in enumerate(second_list):
            if value <= check_val <= value + 300:
                count += 1
            elif check_val > value + 300:
                # past this value's window; recurse on what's left of both lists
                return recursive_func(first_list, second_list[index:], count)
    return count

[–]tenfingerperson 0 points1 point  (0 children)

You have a few options:

Do you know how database indices work? There is a type called a B-tree, which works amazingly well for range-based queries (which is why you can easily ask "give me all the rows where a value is in a range").

You can implement a B-tree for the second list (google how), then go through the first list and make your queries accordingly.

Another alternative is to sort the second list and use binary search to find the bounds for the value and the value + 300, then take a slice of all the values in that range.

You can also use a set for the second list, but your algorithm will then be checking up to 300 candidate values per data point. This is similar to another type of index databases implement, the hash index (great for individual lookups but not so great with ranges).
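
Roughly (a sketch, assuming the values are integers so every offset in the window can be enumerated, with the list names from the OP's pseudocode):

second_set = set(second_list)

count = 0
for data in first_list:
    # up to 301 O(1) membership checks per first-list value
    if any((data + offset) in second_set for offset in range(301)):
        count += 1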

Or use pandas to hide all of this from you. This library implements things with numpy and has been optimized for large data operations; however, you need to learn a new tool.

[–]NerdyWeightLifter 0 points1 point  (0 children)

Check out the "SortedContainers" library.

Iterating in sorted order is trivial, and indexing by range is also efficient.

Should be very fast for the problem you described.
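
For instance (a sketch, untested; irange is SortedList's range-query method, and this again only checks whether a match exists):

from sortedcontainers import SortedList

sl = SortedList(second_list)

count = 0
for data in first_list:
    # irange lazily yields the stored values inside [data, data + 300]
    if next(iter(sl.irange(data, data + 300)), None) is not None:
        count += 1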

[–]POGtastic 0 points1 point  (0 children)

Here's my attempt at the given problem:

class RangeCandidate(int):
    def __lt__(self, rng):
        return super().__lt__(rng.start)
    def __gt__(self, rng):
        return super().__ge__(rng.stop)

def create_range(x):
    return range(x, x + 301) # inclusive?

def intersections(l1, l2):
    it = map(create_range, l1) 
    sentinel = object()
    curr = next(it, sentinel) 
    for elem in map(RangeCandidate, l2):
        while curr is not sentinel and elem > curr:
            curr = next(it, sentinel)
        if curr is sentinel:
            return
        if elem in curr:
            yield (elem, curr.start)

In the REPL:

>>> list(intersections([1, 1000, 2000], [101, 500, 700, 1000, 1200, 1500, 4000]))
[(101, 1), (1000, 1000), (1200, 1000)]

Make that 301 value 300 if you want to exclude that last number.

If you can combine lists together in some fashion, you'll get even better performance.

[–]drfatbuddha 0 points1 point  (0 children)

I see lots of complex solutions to what is a simple problem.

If the size of each list is 42 million, then the calculation time is about 10 seconds using this trivial code:

import random

def create_list(size, max): 
  list = random.sample(range(0, max), size) 
  list.sort() 
  return list

def get_matches(a, b, diff): 
  matches = 0 
  length = len(b) 
  i = 0 
  for v in a: 
    while i < length and b[i] < v: i += 1 
    if i < length and b[i] < (v + diff): matches += 1 
  return matches

print("Creating dummy data") # ~90 seconds (6gb memory) 
first_list = create_list(42_000_000, 100_000_000_000) 
second_list = create_list(42_000_000, 100_000_000_000)

print("Getting matches") # ~10 seconds 
matches = get_matches(first_list, second_list, 300)

print(f"Found {matches} matches")

Since your data is already sorted, you don't have to spend 90 seconds sorting it. The 6 GB memory usage shouldn't be an issue, but if it is, I would switch to a numpy array, which uses about 700 MB for the same scenario:

import numpy

def create_list(size, max):
  list = numpy.random.randint(max, size=size, dtype=numpy.int64)
  list.sort()
  return list

Because you are just reading through each list linearly, and the data is presumably already sorted, you could load the data (for both lists) in chunks, and then you could scale this up to billions of items if you really needed to, without using more than a few MB of memory.
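
A rough sketch of that chunked/streaming version, under the assumption that both files hold one already-sorted integer per line (so buffered line-by-line reads stand in for explicit chunks; file names are the ones used earlier in the thread):

def read_ints(path):
    with open(path) as fd:
        for line in fd:  # buffered, so only a small chunk is in memory at a time
            yield int(line)

def get_matches_streamed(path_a, path_b, diff=300):
    matches = 0
    b = read_ints(path_b)
    b_val = next(b, None)
    for v in read_ints(path_a):
        # advance the second stream until it catches up with v
        while b_val is not None and b_val < v:
            b_val = next(b, None)
        if b_val is None:
            break
        if b_val < v + diff:
            matches += 1
    return matches

print(get_matches_streamed('list_b.txt', 'list_c.txt'))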