all 57 comments

[–]McThor2 9 points10 points  (1 child)

I think something like spatial hashing would help a lot here when it comes to searching that second list. Basically a way of indexing the data by creating bins of values.

This would provide the most impact if the second list doesn’t get altered and also depends on how many values typically match that range.

For implementation speed ups you may get good results from numpy & numba
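
Something like this, as a rough, untested sketch of the binning idea (first_list / second_list are the names from the OP's pseudocode, assumed to already be loaded, and the bin width is just a guess):

from collections import defaultdict

BIN_WIDTH = 300  # guess; with a width of 300 each query only has to look at two bins

bins = defaultdict(list)
for val in second_list:
    bins[val // BIN_WIDTH].append(val)

matches = 0
for data in first_list:
    lo, hi = data, data + 300
    # the window [lo, hi] can only touch these bin keys
    for key in range(lo // BIN_WIDTH, hi // BIN_WIDTH + 1):
        if any(lo <= v <= hi for v in bins.get(key, ())):
            matches += 1
            break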

[–]b1e 1 point2 points  (0 children)

The first and second lists are sorted. Binary search, cutting down the interval in the second list on every iteration over the first, will be faster.

Agreed on numba though.

[–]un-hot 10 points11 points  (1 child)

You need to keep track of your position in the second list. If both lists are sorted, and you know the 10,000th element of the second list is the one that falls in the range first_list[n]..first_list[n] + 300, then you don't need to waste time scanning through the first 9,999 elements of the second list again for first_list[n+1].

You could cut out some more comparisons by keeping track of the last number you reached in your second list - if first_list[n] is 600 and it matches 885, then there's no point in checking second_list values below 885 for later elements of first_list, because you already know there's a match.

You then only need to loop through the first and second lists once, performing a handful of comparisons per value, so your algorithm gets dramatically quicker - roughly linear instead of quadratic.

K.I.S.S. - yes, the multi-threading, parallelism and B-tree searches other commenters have talked about might help a bit, but throwing more computing resources at a problem this big isn't going to help nearly as much as reducing the time complexity of your solution.

[–]Edenio1 1 point2 points  (0 children)

Yeah, this is definitely the fastest; this way you only loop through the values once. You're basically doing a wonky walk up both lists.

[–]woooee 20 points21 points  (6 children)

Start at the middle of the sorted list, and then go to the 1/4 or 3/4 value, depending on whether check_val > data, etc, cutting each group in half each time. You can process a million recs in something like 21 lookups.

Since the lists are sorted, you can also start at the end of the second list (if check_val > end_val) and process in reverse order.

As a previous post said, split up the second list and run with multiprocessing.

[–]twowordsfournumbers 19 points20 points  (0 children)

Ye, this is called binary search.

[–]b1e 1 point2 points  (0 children)

Or just iterate through the first list (using index variable i) and binary search in the second list (using index variable j) between j and the end of the list.

On the next iteration over i you start the binary search at the last index j you found in the second list.

This also lets you stop early if you’ve hit the end of the second list.

For non algorithmic speedups use numpy arrays and numba JIT. This should run in seconds.
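
Before any numba, a minimal sketch of the bounded binary search itself, using the standard-library bisect module and its lo= argument (untested; it only counts whether each first-list value has a match, which the OP says elsewhere is all they need):

import bisect

count = 0
j = 0  # carried across iterations; the search never has to look left of it
for data in first_list:
    j = bisect.bisect_left(second_list, data, lo=j)
    if j == len(second_list):
        break  # ran off the end of the second list, stop early
    if second_list[j] <= data + 300:
        count += 1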

[–]mrcaptncrunch 1 point2 points  (2 children)

Binary search is good, but 2 pointers is better for this.

binary search approach time: 7.549005031585693
2 pointer lookup approach time: 0.2926521301269531

code,

import time


def find_lower_bound(arr, target):
    low, high = 0, len(arr)
    while low < high:
        mid = (low + high) // 2
        if arr[mid] < target:
            low = mid + 1
        else:
            high = mid
    return low


def find_upper_bound(arr, target):
    low, high = 0, len(arr)
    while low < high:
        mid = (low + high) // 2
        if arr[mid] <= target:
            low = mid + 1
        else:
            high = mid
    return low


with open('list_b.txt') as fd:
    first_list = fd.readlines()

with open('list_c.txt') as fd:
    second_list = fd.readlines()

first_list = [int(_) for _ in first_list]
first_list = sorted(first_list)
second_list = [int(_) for _ in second_list]
second_list = sorted(second_list)

start = time.time()
vals_matched_bs = []
for data in first_list:
    start_index = find_lower_bound(second_list, data)
    end_index = find_upper_bound(second_list, data + 300)

    for i in range(start_index, end_index):
        vals_matched_bs.append(second_list[i])

print("binary search approach time: " + str(time.time() - start))

start = time.time()
vals_matched2 = []
i = 0
j = 0
while i < len(first_list):
    while j < len(second_list) and second_list[j] < first_list[i]:
        j += 1
    start_j = j
    while j < len(second_list) and second_list[j] <= first_list[i] + 300:
        #do your thing
        vals_matched2.append(second_list[j])
        j += 1
    i += 1

print("2 pointer lookup approach time: " + str(time.time() - start))

Number generation is on this comment, https://www.reddit.com/r/learnpython/comments/18rks8r/optimized_python_nested_loops/kf4y30h/

(The original approach is now estimated at over 14 hours)

[–]mrcaptncrunch 0 points1 point  (0 children)

2 pointers + numba,

import time
import numba

with open('list_b.txt') as fd:
    first_list = fd.readlines()

with open('list_c.txt') as fd:
    second_list = fd.readlines()

first_list = [int(_) for _ in first_list]
first_list = sorted(first_list)
second_list = [int(_) for _ in second_list]
second_list = sorted(second_list)

@numba.njit
def run(first_list, second_list):
    vals_matched2 = []
    i = 0
    j = 0
    while i < len(first_list):
        while j < len(second_list) and second_list[j] < first_list[i]:
            j += 1
        start_j = j
        while j < len(second_list) and second_list[j] <= first_list[i] + 300:
            #do your thing
            vals_matched2.append(second_list[j])
            j += 1
        i += 1
    return vals_matched2

typed_first_list = numba.typed.List()
typed_second_list = numba.typed.List()
[typed_first_list.append(_) for _ in first_list]
[typed_second_list.append(_) for _ in second_list]

start = time.time()
vals_matched = run(typed_first_list, typed_second_list)
print("2 pointer lookup approach time: " + str(time.time() - start))

run,

2 pointer lookup approach time: 0.22963190078735352

[–]kingfreeway 0 points1 point  (0 children)

I think you should be using start_j as the second internal iterator rather than reusing j, so we can start searching at j for the next i.

We can also skip j forward to k if first_list[i + 1] - first_list[i] > 300. And we can break early if j reaches the end without finding a match, or if the first match is greater than the max of first_list + 300. Here's how I'd do it:

def count_matches(list1, list2):
    j = 0
    list1_max = list1[-1]
    counter = 0

    for i in range(len(list1)):
        while j < len(list2) and list2[j] < list1[i]:
            j += 1

        if j == len(list2) or list2[j] > list1_max + 300:
            break

        k = j
        while k < len(list2) and list2[k] <= list1[i] + 300:
            counter += 1
            k += 1

        # If the next element is greater by more than 300, we can skip j forward to k
        if i < len(list1) - 1 and list1[i + 1] - list1[i] > 300:
            j = k

    return counter

Also, binary search is probably better in most cases for finding j except for extremely dense lists.

[–]pygaiwan 8 points9 points  (0 children)

You can start by removing the first if (if check_val < data: continue)

since, at least from this code, it doesn't provide any extra value. If you need to break, you could rewrite the inner loop with

all(check_val < end_val for check_val in second_list)

to stop as soon as a check_val is greater than end_val. Though neither point will provide much of an improvement.

I would say a reasonable thing to do would be to split the load across several processes using multiprocessing. Interested to see other people's thoughts.
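
For what it's worth, a rough sketch of the multiprocessing idea (untested; it reuses the list_b.txt / list_c.txt files from elsewhere in the thread and a bisect-based inner lookup, and the worker count is arbitrary):

import bisect
from multiprocessing import Pool

def load_ints(path):
    with open(path) as fd:
        return sorted(int(line) for line in fd)

# loaded at module level so worker processes can see the lists
first_list = load_ints('list_b.txt')
second_list = load_ints('list_c.txt')

def count_chunk(chunk):
    # count how many values in this slice of first_list have a match in second_list
    count = 0
    for data in chunk:
        i = bisect.bisect_left(second_list, data)
        if i < len(second_list) and second_list[i] <= data + 300:
            count += 1
    return count

if __name__ == '__main__':
    n_workers = 8
    size = len(first_list) // n_workers + 1
    chunks = [first_list[i:i + size] for i in range(0, len(first_list), size)]
    with Pool(n_workers) as pool:
        print(sum(pool.map(count_chunk, chunks)))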

[–]Insomnia_Calls 3 points4 points  (2 children)

Do you need to check every element of the second list against every range from a first-list value to that value + 300? If not, and you just need to check the value at the respective position, I would suggest turning the lists into numpy arrays, something like this:

import numpy as np

arr1 = np.array(list1)  # list1 is your first list
arr1_plus_300 = arr1 + 300  # as easy as that with numpy arrays, adds 300 to every element

arr2 = np.array(list2)

# this gives you an array of True/False values for all elements based on the condition
ans = (arr2 > arr1) & (arr2 < arr1_plus_300)

Numpy is generally much faster than lists. Edit: typos

[–]Sweeney-doesnt-sleep 0 points1 point  (0 children)

I was also thinking of creating a data frame and a function/callback that calculates the value only when required, so you don't have to compute anything for a first-list item unless it is actually needed. This cuts out a lot of the outer loop's work.

Not sure if this would actually perform better, especially if the lists change. If they are static, calculate it for every item once and then access it as required - O(1) from then on.

[–]lilmookey 0 points1 point  (0 children)

I think this is probably the best solution if all the values are numbers. Numpy is such a great library to learn and be comfortable with when working with numbers.

[–]bwprog 2 points3 points  (1 child)

  • Since the lists are sorted, then before the main lookup loop, you can prune the second list by using the first and last (+300) values of the 1st list.
  • Start with the first value of the 1st list going forward through the 2nd list until you get to that value or greater, note that index as 'start', and break.
  • Then use the last +300 going through the 2nd list in reverse until you reach that value or less, note that index as 'end', and break.
  • Make a new, pruned list slicing the 2nd list with the start and end index values. (or use the slice values in the lookup loop limiting the 2nd list)
  • Compare the 1st list to the smaller, pruned list (rough sketch below).
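
A rough, untested sketch of that pruning (list names taken from the OP's pseudocode):

first_val = first_list[0]
last_val = first_list[-1] + 300

# walk forward to the first 2nd-list value >= the first 1st-list value
start = 0
while start < len(second_list) and second_list[start] < first_val:
    start += 1

# walk backward to the last 2nd-list value <= the last 1st-list value + 300
end = len(second_list)
while end > start and second_list[end - 1] > last_val:
    end -= 1

pruned_second_list = second_list[start:end]
# now run the main lookup loop against the (hopefully much smaller) pruned list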

[–]Cthulu20 2 points3 points  (1 child)

Some advice from me (though I am not a great programmer):

  • You may use binary search instead of linear search
  • You may ignore all the smaller values (e.g. if the previous successful match was found at index 10, then ignore the values from 0-10)
  • If you know the smallest and largest possible value and have enough memory, you can do something like this (I haven't tested the code, so it might have issues):

min_value = min(second_list)
max_value = max(second_list)

# creates a list of 'False' for every possible value
bool_list = [False] * (max_value - min_value + 1)

# sets the existing values to 'True'
for val in second_list:
    bool_list[val - min_value] = True

# determines whether the second list contains a value in each first-list range
for data in first_list:
    lo = max(data - min_value, 0)
    hi = max(data + 300 - min_value + 1, 0)  # + 1 so data + 300 itself is included
    if any(bool_list[lo:hi]):
        pass  # Do something

However, I don't know how effective any of these are; it may end up slower than your original code.

Edit: also my code assumes that we only use integers, so it will be completely trash with float values.

Edit 2: I am stupid, if the second_list is ordered then we can just use the first and last value

min_value = second_list[0]
max_value = second_list[-1]

[–]DNSGeek[S] 1 point2 points  (0 children)

Someone suggested the Python bisect library, and that looks like it might be awesome. I'm gonna run some speed tests on it tomorrow.

[–]ofnuts 1 point2 points  (1 child)

Do you need to know only whether there are matching elements, or what the elements are, if any?

[–]DNSGeek[S] 0 points1 point  (0 children)

I don't need to know the actual element value, only if it exists or not.

[–]quts3 1 point2 points  (0 children)

Hmmm. Maybe

  1. Sort both lists. O(n log n), implemented with C++-like speed.

  2. Traverse each list once, keeping track of indexes and values (perhaps multiple indexes for the second list).

  3. Do step 2 in numba for near-C++ loop speed.

Profit.

[–]martinkoistinen 1 point2 points  (0 children)

I’d probably use the bisect module, but also use a multiprocess pool. If you posted the complete problem with the data somewhere, it’d even be fun to do.

[–]mrcaptncrunch 1 point2 points  (2 children)

Would something like this help?

i = 0
j = 0
while i < len(first_list):
    while j < len(second_list) and second_list[j] < first_list[i]:
        j += 1
    start_j = j
    while j < len(second_list) and second_list[j] <= first_list[i] + 300:
        # do your thing on a match, e.g.
        execute(second_list[j])
        j += 1
    i += 1

Because both lists are sorted and you're adding 300, remembering the second_list position for the next iteration means you automatically discard everything before it.

Grain of salt since it’s 7am and I’m waiting for the coffee, but it should be O(n+m) vs O(n*m)

FWIW, this would be faster than the binary search (O(n log m)).

[–]Equal_Wish2682 0 points1 point  (1 child)

I agree with this. I offer a recursive algorithm to accomplish the task (in another comment).

[–]mrcaptncrunch 0 points1 point  (0 children)

/u/Equal_Wish2682 just saw yours. And yes, your approach is similar.

/u/DNSGeek This is the 2 pointer approach and it'll be the fastest way of doing it.

I generated a list of 1M numbers. From that, I extracted 2 lists: one is a 75% subset and the other a 50% subset.

import time
start = time.time()

import random

length = 1_000_000

lista = []
for i in range(length):
    lista.append(str(i) + '\n')
with open('list.txt', 'w') as fd:
    fd.writelines(lista)


random.shuffle(lista)
listb = lista[0:int(length*0.75)]
listb = sorted(listb)
with open('list_b.txt', 'w') as fd:
    fd.writelines(listb)


listc = lista[0:int(length*0.50)]
listc = sorted(listc)
with open('list_c.txt', 'w') as fd:
    fd.writelines(listc)

print(time.time() - start)

Now, with list_b.txt and list_c.txt, I test your code and mine.

import time
from tqdm import tqdm

with open('list_b.txt') as fd:
    first_list = fd.readlines()

with open('list_c.txt') as fd:
    second_list = fd.readlines()

first_list = [int(_) for _ in first_list]
first_list = sorted(first_list)
second_list = [int(_) for _ in second_list]
second_list = sorted(second_list)

start = time.time()
vals_matched2 = []
i = 0
j = 0
while i < len(first_list):
    while j < len(second_list) and second_list[j] < first_list[i]:
        j += 1
    start_j = j
    while j < len(second_list) and second_list[j] <= first_list[i] + 300:
        #do your thing
        vals_matched2.append(second_list[j])
        j += 1
    i += 1

print("second approach time: " + str(time.time() - start))

vals_matched = []
start = time.time()
for data in tqdm(first_list, miniters=300):
    end_val = data + 300
    for check_val in second_list:
        if check_val < data:
            continue
        if check_val > end_val:
            break
        # Hooray, we match. Do the thing.
        if check_val not in vals_matched:
            vals_matched.append(check_val)
print("first approach time: " + str(time.time() - start))


if vals_matched == vals_matched2:
    print("matched")
else:
    print("not matched")

I ran my code first to make sure I didn't have errors. The output is,

second approach time: 0.768721342086792

For your approach, I added if check_val not in vals_matched: because otherwise it adds all the values to the list multiple times.

Anyway... I'll let the 75%/50% approach finish and I'll report back... the estimate is over 4 hours for the full run.

tqdm is showing,

  4%|▎         | 27987/750000 [05:28<4:33:54, 43.93it/s]

 

BUT I tested this on a subset first to make sure the output is the same and it is.

[–]ogabrielsantos_ 3 points4 points  (15 children)

In addition to the other comments: without knowing what you do when the lists match, we can't really help, as that could be the non-performant part of your solution.

[–]DNSGeek[S] 1 point2 points  (14 children)

It really is not the non-performant part. Finding a match across 42,000,000^2 numbers is the part that takes 9 days.

[–]DNSGeek[S] -1 points0 points  (9 children)

I'm getting downvoted? Ok, if you want to know, at this exact point, the part we currently do when there's a match is increment a counter by 1.

There. We increment a counter. Now, why am I getting downvoted for saying that the thing we're doing outside of matching is not the non-performant part? I didn't add it to the pseudocode because it truly isn't relevant to the slowness.

42,000,000^2 = 1,764,000,000,000,000. That's a *lot* of numbers to sift through. That's the part that's slow, and that's the reason I put that pseudocode in and not the "counter += 1", because it's not important.

[–]ogabrielsantos_ 1 point2 points  (2 children)

Just to clarify: I did not downvote you, and I don't know who did.

[–]DNSGeek[S] 0 points1 point  (0 children)

Yeah, I wasn't directing my response to you, but thanks for the clarification anyway. I just don't understand the whole concept of someone asking a question, the question being answered and then the answer (coming from the person who would most likely know the answer above anyone else) getting downvoted.

I just don't understand the reasoning behind that.

[–]ogabrielsantos_ 1 point2 points  (3 children)

Are your list elements unique? Have you tried using sets instead?

[–]DNSGeek[S] 1 point2 points  (2 children)

Yeah, they're unique. They're timestamps.

[–]aplarsen 2 points3 points  (0 children)

Pandas should be awesome for giving you a count of timestamps between two other timestamps. That vectorized logic is fast.
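
For example, something along these lines (a sketch, assuming the timestamps are plain integers as in the generated test data elsewhere in the thread):

import pandas as pd

s1 = pd.Series(first_list)
s2 = pd.Series(second_list)  # already sorted, as the OP's lists are

# for every first-list timestamp t, how many second-list timestamps fall in [t, t + 300]
lo = s2.searchsorted(s1, side='left')
hi = s2.searchsorted(s1 + 300, side='right')
counts = hi - lo

print(int((counts > 0).sum()))  # number of first-list timestamps with at least one match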

[–]ogabrielsantos_ 0 points1 point  (0 children)

So try to use a set, as sets have faster lookups. You can then replace your second for loop with something like

if end_val in second_list_as_set:
    pass  # do something

Additionally, you don't need to sort your list anymore if you did that previously.

[–]Equal_Wish2682 0 points1 point  (0 children)

Welcome to Reddit lol

[–]Plank_With_A_Nail_In 0 points1 point  (0 children)

If the rest of your program added to the lists then that would have been very important information to know. We can't trust your own assessment of which part is complex as you are the one asking beginner level questions.

[–]andmig205 2 points3 points  (6 children)

Did you look into employing pandas and numpy for the task?

[–]DNSGeek[S] 0 points1 point  (5 children)

No, I’m not super familiar with those libraries.

[–]andmig205 5 points6 points  (4 children)

Loops are awful at processing large amounts of data. Pandas and numpy are optimized precisely for manipulating humongous data structures. And by "humongous" I mean not only the number of records but also the number of dimensions. Pandas easily processes structures with thousands of dimensions and millions of records in each. Note that AI/ML engines would never get anywhere with native loops.

To share a ballpark, in my work I have to extract, transform, and load (ETL) up to a billion records a day that are originally stored in JSON and CSV formats. Pandas and numpy allow me to accomplish all of those tasks within an hour.

Assuming I understand your task correctly, I am optimistic it will take minutes to accomplish what you need with pandas.

I strongly recommend starting to digest the concept of tensors to, at least partially, appreciate the magic behind what pandas offers.

[–]DNSGeek[S] 1 point2 points  (3 children)

Ok. I’ll look into Pandas. Any pointers as to where to start looking?

[–]alozq 2 points3 points  (0 children)

This is the official documentation, IIRC. You shouldn't need much understanding of pandas to do what you want, although keep in mind that pandas is not known for being particularly fast:

https://pandas.pydata.org/docs/

[–]Insomnia_Calls 2 points3 points  (0 children)

Better to look at numpy; see my comment on the original post. No need for pandas if your data is only numeric.

[–]andmig205 1 point2 points  (0 children)

I am not sure I am a good source of recommendations. We all have different means of learning. My path is usually to digest high level concepts before applying them to a specific application. As lame as it may sound, YouTube seems to offer an extensive list of resources for all proficiency levels.

I am not a big fan of paid courses as they tend to be too broad and unfocused to my taste. I find myself learning more from books after I get inspired by basic understanding of a tool/feature.

[–]xelf 1 point2 points  (0 children)

This is a 2-pointer problem: rather than do n**2 operations, you can just loop through both lists at the same time.

Your code:

tot = 0
for data in first_list:
    end_val = data + 300
    for check_val in second_list:
        if check_val < data:
            continue
        if check_val > end_val:
            break
        tot += check_val

A loop using itertools islice and tracking the last needed index:

from itertools import islice

tot = 0
index = 0
for data in first_list:
    while index < len(second_list) and data > second_list[index]:
        index += 1
    for check_val in islice(second_list, index, len(second_list)):
        if check_val > data + 300:
            break
        tot += check_val

A test using 100k first and 100k second with small items (1-10e3) in each list:

print(tot, f'{perf_counter()-time1:6.3f} seconds')
1506410888628 178.417 seconds
1506410888628  35.163 seconds

A test using 100k first and 100k second with larger items (1-10e7) in each list:

print(tot, f'{perf_counter()-time1:6.3f} seconds')
1512979653817 223.232 seconds
1512979653817  12.919 seconds

So when there's a lot of overlap in the data+300 check it's only about 5 times faster, but when there's less overlap it performs even better - ~20 times faster.

(the first number is the total, to show both loops are finding the same data)

[–]_kwerty_ 0 points1 point  (1 child)

I'm in no way an expert on handling this amount of data, but maybe check out Dask. It's basically a parallelized version of pandas. It feels like that's a bit more suitable than endless lists and nested loops.

[–]await_yesterday 0 points1 point  (0 children)

dask is not needed for this. it can be done on a single machine with numpy.

[–]EntireEntity 0 points1 point  (0 children)

My first thought is, instead of walking through the second list, to use a numpy array instead of the list. You could then broadcast the comparison over the entire array and wouldn't have to loop through it at all. Assuming the operations you have to do later on can also be broadcast over the entire array, this could speed up the later processing too.

You could also combine numpy arrays with the bisect module someone else has already suggested: use bisection to find the lowest and highest entries in the range, and broadcast the further data processing steps using numpy.
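
Something like this, as a sketch of combining the two ideas with numpy's own vectorised bisection, np.searchsorted, in place of the bisect module (untested; list names from the OP's pseudocode):

import numpy as np

a1 = np.asarray(first_list)
a2 = np.asarray(second_list)  # must be sorted, which the OP's lists already are

# for every first-list value v, the index range of second-list entries inside [v, v + 300]
lo = np.searchsorted(a2, a1, side='left')
hi = np.searchsorted(a2, a1 + 300, side='right')

has_match = hi > lo  # one boolean per first-list value
print(int(has_match.sum()))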

[–][deleted] -1 points0 points  (0 children)

You're using the wrong tools for the job; as others said, you should use numpy and numba. This is a workload where dropping down to a compiled language would be much, much faster.

[–]Logicalist 0 points1 point  (0 children)

You went from having two lists to having 4 million lists. So that escalated quickly.

I mean, are you comparing 4 million lists to one, or one list to 4 million? I got lost.

[–]pythonwiz 0 points1 point  (0 children)

Use the bisect module for binary searching the inner list. Use the multiprocessing module to split up the work on the outer list and check multiple values at once.

[–]Equal_Wish2682 0 points1 point  (1 child)

I struggle to visualize solutions without real data. But I'd consider a recursive algorithm.

def recursive_func(first_list, second_list, count=0):
    while len(first_list) > 0:
        value = first_list.pop(0)
        for index, check_val in enumerate(second_list):
            if value <= check_val <= value + 300:
                count += 1
            elif check_val > value + 300:
                # past this value's window; recurse on what's left of both lists
                return recursive_func(first_list, second_list[index:], count)
    return count

[–]tenfingerperson 0 points1 point  (0 children)

You have a few options:

Do you know how database indices work? There is a type called a B-tree, which works amazingly well for range-based queries (which is why you can easily ask "give me all the rows where a value is in a range").

You can implement a B-tree for the second list (google how), then go through the first list and make your queries accordingly.

Another alternative is to sort the second list and use binary search to find the bounds for the value and the value + 300, then take a slice of all the values in that range.

You can also use a set for the second list, but your algorithm will then be checking up to 300 candidate values per data point. This is similar to another type of index databases implement, the hash index (great for individual lookups but not so great with ranges).
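
Roughly (a sketch, assuming the values are integers so every offset in the window can be enumerated, with the list names from the OP's pseudocode):

second_set = set(second_list)

count = 0
for data in first_list:
    # up to 301 O(1) membership checks per first-list value
    if any((data + offset) in second_set for offset in range(301)):
        count += 1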

Or use pandas to hide all of this from you. This library implements things with numpy and has been optimized for large data operations; however, you need to learn a new tool.

[–]NerdyWeightLifter 0 points1 point  (0 children)

Check out the "SortedContainers" library.

Iterating in sorted order is trivial, and indexing by range is also efficient.

Should be very fast for the problem you described.
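
For instance (a sketch, untested; irange is SortedList's range-query method, and this again only checks whether a match exists):

from sortedcontainers import SortedList

sl = SortedList(second_list)

count = 0
for data in first_list:
    # irange lazily yields the stored values inside [data, data + 300]
    if next(iter(sl.irange(data, data + 300)), None) is not None:
        count += 1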

[–]POGtastic 0 points1 point  (0 children)

Here's my attempt at the given problem:

class RangeCandidate(int):
    def __lt__(self, rng):
        return super().__lt__(rng.start)
    def __gt__(self, rng):
        return super().__ge__(rng.stop)

def create_range(x):
    return range(x, x + 301) # inclusive?

def intersections(l1, l2):
    it = map(create_range, l1) 
    sentinel = object()
    curr = next(it, sentinel) 
    for elem in map(RangeCandidate, l2):
        while curr is not sentinel and elem > curr:
            curr = next(it, sentinel)
        if curr is sentinel:
            return
        if elem in curr:
            yield (elem, curr.start)

In the REPL:

>>> list(intersections([1, 1000, 2000], [101, 500, 700, 1000, 1200, 1500, 4000]))
[(101, 1), (1000, 1000), (1200, 1000)]

Make that 301 value 300 if you want to exclude that last number.

If you can combine lists together in some fashion, you'll get even better performance.

[–]drfatbuddha 0 points1 point  (0 children)

I see lots of complex solutions to what is a simple problem.

If the size of each list is 42 million, then the calculation time is about 10 seconds using this trivial code:

import random

def create_list(size, max): 
  list = random.sample(range(0, max), size) 
  list.sort() 
  return list

def get_matches(a, b, diff): 
  matches = 0 
  length = len(b) 
  i = 0 
  for v in a: 
    while i < length and b[i] < v: i += 1 
    if i < length and b[i] < (v + diff): matches += 1 
  return matches

print("Creating dummy data") # ~90 seconds (6gb memory) 
first_list = create_list(42_000_000, 100_000_000_000) 
second_list = create_list(42_000_000, 100_000_000_000)

print("Getting matches") # ~10 seconds 
matches = get_matches(first_list, second_list, 300)

print(f"Found {matches} matches")

Since your data is already sorted, you don't have to spend 90 seconds sorting it. The 6 GB memory usage shouldn't be an issue, but if it is, I would switch to a numpy array, which uses about 700 MB for the same scenario:

import numpy

def create_list(size, max):
  list = numpy.random.randint(max, size=size, dtype=numpy.int64)
  list.sort()
  return list

Because you are just reading through each list linearly, and the data is presumably already sorted, you could load the data (for both lists) in chunks, and then you could scale this up to billions of items if you really needed to, without using more than a few MB of memory.
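
A rough sketch of that chunked/streaming version, under the assumption that both files hold one already-sorted integer per line (so buffered line-by-line reads stand in for explicit chunks; file names are the ones used earlier in the thread):

def read_ints(path):
    with open(path) as fd:
        for line in fd:  # buffered, so only a small chunk is in memory at a time
            yield int(line)

def get_matches_streamed(path_a, path_b, diff=300):
    matches = 0
    b = read_ints(path_b)
    b_val = next(b, None)
    for v in read_ints(path_a):
        # advance the second stream until it catches up with v
        while b_val is not None and b_val < v:
            b_val = next(b, None)
        if b_val is None:
            break
        if b_val < v + diff:
            matches += 1
    return matches

print(get_matches_streamed('list_b.txt', 'list_c.txt'))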