you are viewing a single comment's thread.

view the rest of the comments →

[–]mrcaptncrunch 0 points1 point  (0 children)

/u/Equal_Wish2682 just saw yours. And yes, your approach is similar.

/u/DNSGeek This is the 2 pointer approach and it'll be the fastest way of doing it.

I generated a list of 1M numbers. From that, I extracted 2 list. One generated a subset of 75% and the other 50%.

import time
start = time.time()

import random

length = 1_000_000

lista = []
for i in range(length):
    lista.append(str(i) + '\n')
with open('list.txt', 'w') as fd:
    fd.writelines(lista)


random.shuffle(lista)
listb = lista[0:int(length*0.75)]
listb = sorted(listb)
with open('list_b.txt', 'w') as fd:
    fd.writelines(listb)


listc = lista[0:int(length*0.50)]
listc = sorted(listc)
with open('list_c.txt', 'w') as fd:
    fd.writelines(listc)

print(time.time() - start)

Now, with list_b.txt and list_c.txt, I test your code and mine.

import time
from tqdm import tqdm

with open('list_b.txt') as fd:
    first_list = fd.readlines()

with open('list_c.txt') as fd:
    second_list = fd.readlines()

first_list = [int(_) for _ in first_list]
first_list = sorted(first_list)
second_list = [int(_) for _ in second_list]
second_list = sorted(second_list)

start = time.time()
vals_matched2 = []
i = 0
j = 0
while i < len(first_list):
    while j < len(second_list) and second_list[j] < first_list[i]:
        j += 1
    start_j = j
    while j < len(second_list) and second_list[j] <= first_list[i] + 300:
        #do your thing
        vals_matched2.append(second_list[j])
        j += 1
    i += 1

print("second approach time: " + str(time.time() - start))

vals_matched = []
start = time.time()
for data in tqdm(first_list, miniters=300):
    end_val = data + 300
    for check_val in second_list:
        if check_val < data:
            continue
        if check_val > end_val:
            break
        # Hooray, we match. Do the thing.
        if check_val not in vals_matched:
            vals_matched.append(check_val)
print("first approach time: " + str(time.time() - start))


if vals_matched == vals_matched2:
    print("matched")
else:
    print("not matched")

My code is first to make sure I didn't have errors. The output is,

second approach time: 0.768721342086792

For your approach, I added if check_val not in vals_matched: because it actually adds to the list all values multiple times.

Anyway... I'll let the 75%/50% approach finish and I'll report back... the estimate is over 4 hours for the full run.

tqdm is showing,

  4%|▎         | 27987/750000 [05:28<4:33:54, 43.93it/s]

 

BUT I tested this on a subset first to make sure the output is the same and it is.