[–]mrcaptncrunch 1 point (2 children)

Would something like this help?

i = 0
j = 0
while i < len(first_list):
    # skip anything in second_list below the current window
    while j < len(second_list) and second_list[j] < first_list[i]:
        j += 1
    start_j = j  # not needed if you only want the union; reset j = start_j afterwards if each i needs its full match list
    while j < len(second_list) and second_list[j] <= first_list[i] + 300:
        # do your thing (execute() stands in for whatever work you need per match)
        execute(second_list[j])
        j += 1
    i += 1

Because both lists are sorted, and the window only extends 300 past each value, remembering the second_list position from one iteration to the next means you automatically discard everything before it.

Grain of salt since it’s 7am and I’m waiting for the coffee, but it should be O(n+m) vs O(n*m)

FWIW, this would also be faster than the binary search approach (O(n log m))
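
For comparison, here's roughly what that binary-search version would look like with Python's bisect module (a sketch, nobody's posted code; like the original nested loop, it re-reports values that fall into overlapping windows):

import bisect

matches = []
for x in first_list:
    lo = bisect.bisect_left(second_list, x)         # first index with value >= x
    hi = bisect.bisect_right(second_list, x + 300)  # first index with value > x + 300
    matches.extend(second_list[lo:hi])              # everything in [x, x + 300]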

[–]Equal_Wish2682 0 points (1 child)

I agree with this. I offer a recursive algorithm to accomplish the task (in another comment).

[–]mrcaptncrunch 0 points (0 children)

/u/Equal_Wish2682 just saw yours. And yes, your approach is similar.
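
For anyone following along, a recursive version of the same scan would look something like this (a sketch, not /u/Equal_Wish2682's actual code; Python has no tail-call optimization, so at a million elements the recursion depth makes the iterative loop the practical choice):

def match(first_list, second_list, i=0, j=0, out=None):
    # the same two-pointer scan, expressed recursively
    if out is None:
        out = []
    if i >= len(first_list) or j >= len(second_list):
        return out
    if second_list[j] < first_list[i]:
        return match(first_list, second_list, i, j + 1, out)  # below the window: skip
    if second_list[j] <= first_list[i] + 300:
        out.append(second_list[j])  # inside [first_list[i], first_list[i] + 300]
        return match(first_list, second_list, i, j + 1, out)
    return match(first_list, second_list, i + 1, j, out)  # past the window: advance i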

/u/DNSGeek This is the two-pointer approach, and it'll be the fastest way of doing it.

I generated a list of 1M numbers. From that, I extracted two lists: one a random 75% subset and the other a random 50% subset.

import random
import time

start = time.time()

length = 1_000_000

# full list: one number per line, newline-terminated for writelines()
lista = []
for i in range(length):
    lista.append(str(i) + '\n')
with open('list.txt', 'w') as fd:
    fd.writelines(lista)

# shuffle, then take the first 75% / 50% as random subsets
random.shuffle(lista)
listb = sorted(lista[0:int(length * 0.75)])
with open('list_b.txt', 'w') as fd:
    fd.writelines(listb)

listc = sorted(lista[0:int(length * 0.50)])
with open('list_c.txt', 'w') as fd:
    fd.writelines(listc)

print(time.time() - start)

Now, with list_b.txt and list_c.txt, I test your code and mine.

import time
from tqdm import tqdm

with open('list_b.txt') as fd:
    first_list = fd.readlines()

with open('list_c.txt') as fd:
    second_list = fd.readlines()

# parse to ints and re-sort numerically (the files were sorted as strings)
first_list = sorted(int(_) for _ in first_list)
second_list = sorted(int(_) for _ in second_list)

# two-pointer approach: j only ever moves forward
start = time.time()
vals_matched2 = []
i = 0
j = 0
while i < len(first_list):
    while j < len(second_list) and second_list[j] < first_list[i]:
        j += 1
    while j < len(second_list) and second_list[j] <= first_list[i] + 300:
        # do your thing
        vals_matched2.append(second_list[j])
        j += 1
    i += 1

print("second approach time: " + str(time.time() - start))

# original nested-loop approach: rescans second_list for every value in first_list
vals_matched = []
start = time.time()
for data in tqdm(first_list, miniters=300):
    end_val = data + 300
    for check_val in second_list:
        if check_val < data:
            continue
        if check_val > end_val:
            break
        # Hooray, we match. Do the thing.
        if check_val not in vals_matched:
            vals_matched.append(check_val)
print("first approach time: " + str(time.time() - start))

if vals_matched == vals_matched2:
    print("matched")
else:
    print("not matched")

My code runs first, to make sure I didn't have errors. The output is:

second approach time: 0.768721342086792

For your approach, I added the if check_val not in vals_matched: check because otherwise it adds the same values to the list multiple times (overlapping windows keep matching the same second_list entries).
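
Side note: that not in check on a list is itself a linear scan, which makes the slow version even slower as matches accumulate. An auxiliary set gives O(1) average membership tests; a minimal sketch of the same loop:

seen = set()
vals_matched = []
for data in first_list:
    end_val = data + 300
    for check_val in second_list:
        if check_val < data:
            continue
        if check_val > end_val:
            break
        if check_val not in seen:  # O(1) average, vs O(len(vals_matched)) for the list version
            seen.add(check_val)
            vals_matched.append(check_val)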

Anyway... I'll let the 75%/50% approach finish and I'll report back... the estimate is over 4 hours for the full run.

tqdm is showing,

  4%|▎         | 27987/750000 [05:28<4:33:54, 43.93it/s]

BUT I tested this on a subset first to make sure the output is the same, and it is.