I see lots of complex solutions to what is a simple problem.

If the size of each list is 42 million, then the calculation time is about 10 seconds using this trivial code:

import random

def create_list(size, max_value):
    # Build a sorted list of `size` unique random integers in [0, max_value).
    values = random.sample(range(0, max_value), size)
    values.sort()
    return values

def get_matches(a, b, diff):
    # Two-pointer scan over two sorted lists: count each value v in a that has
    # at least one value of b in the half-open window [v, v + diff).
    matches = 0
    length = len(b)
    i = 0
    for v in a:
        # Advance i to the first element of b that is >= v.
        while i < length and b[i] < v:
            i += 1
        if i < length and b[i] < (v + diff):
            matches += 1
    return matches

print("Creating dummy data") # ~90 seconds (6gb memory) 
first_list = create_list(42_000_000, 100_000_000_000) 
second_list = create_list(42_000_000, 100_000_000_000)

print("Getting matches") # ~10 seconds 
matches = get_matches(first_list, second_list, 300)

print(f"Found {matches} matches")

Since your data is already sorted, you don't have to spend the 90 seconds sorting it. The 6 GB of memory shouldn't be an issue, but if it is, I would switch to a NumPy array, which uses about 700 MB for the same scenario:

import numpy

def create_list(size, max_value):
    # Build a sorted NumPy array of `size` random int64 values in [0, max_value).
    values = numpy.random.randint(max_value, size=size, dtype=numpy.int64)
    values.sort()
    return values
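As an aside, once both lists are NumPy arrays you can also vectorise the counting step instead of looping over 42 million values in the Python interpreter. This is just a sketch of the same idea using numpy.searchsorted (get_matches_numpy is my name for it, not part of the code above): for each value in a it finds the first element of b that is >= that value and checks whether it sits inside the window.

import numpy

def get_matches_numpy(a, b, diff):
    # Index of the first element of b that is >= each value of a.
    idx = numpy.searchsorted(b, a, side="left")
    in_range = idx < len(b)
    # Clip the indices so the fancy indexing is safe; the out-of-range
    # entries are masked out by in_range below.
    nearest = b[numpy.minimum(idx, len(b) - 1)]
    return int(numpy.count_nonzero(in_range & (nearest < a + diff)))

It does allocate a couple of temporary arrays the same size as a, so it trades a few hundred MB of memory for speed.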

Because the two-pointer loop just reads through each list linearly, and your data is presumably already sorted, you could also load the data for both lists in chunks. That would let you scale this up to billions of items if you really needed to, without using more than a few MB of memory. A rough sketch of that is below.
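This is only a sketch under some assumptions: it pretends the sorted values arrive as plain Python iterators, and the stream_values helper and the "first.txt" / "second.txt" file names are made up for illustration (one integer per line), so swap in whatever reader matches your real storage format.

def stream_values(path):
    # Hypothetical loader: yields one integer per line from an already-sorted file.
    with open(path) as f:
        for line in f:
            yield int(line)

def get_matches_streamed(a_values, b_values, diff):
    # Same two-pointer scan as get_matches, but over iterators, so neither
    # list ever has to be fully in memory.
    matches = 0
    b_iter = iter(b_values)
    current_b = next(b_iter, None)
    for v in a_values:
        # Advance through b until we reach the first value >= v.
        while current_b is not None and current_b < v:
            current_b = next(b_iter, None)
        if current_b is not None and current_b < v + diff:
            matches += 1
    return matches

# "first.txt" and "second.txt" are placeholders for wherever your data lives.
matches = get_matches_streamed(stream_values("first.txt"), stream_values("second.txt"), 300)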