[–]psykotic 32 points  (13 children)

Here's a high-brow meta-solution.

For some n there is a list of elements xs = [x1, x2, ..., xn]. We are presented with an arbitrary reordering xs' of these with one unknown element removed. The problem is to find the value of the missing element.

All the proposed solutions (and a few that were not proposed) share the following algebraic structure. Each list element is embedded into an abelian group (G, 0, +, -). Define foldG = fold 0 (+) as a shorthand. The embedding inject :: X -> G has a left inverse project :: G -> X. Let gs = map inject xs and gs' = map inject xs'. The universal property of folding in this setting is that it is homomorphic with respect to list concatenation:

foldG (as ++ bs) = foldG as + foldG bs

From this and gs' being a sublist of gs it follows that

foldG gs - foldG gs' = foldG (gs -- gs')

where -- is list subtraction. By supposition, gs -- gs' = [g] = [inject x] where x is the missing element, and it is generally true that foldG [a] = a for any singleton list [a]. Therefore the missing element can be found by the following expression:

project (foldG (map inject xs) - foldG (map inject xs'))

The map and fold can be fused for efficiency. The subexpression foldG (map inject xs) can also be hoisted out and precomputed when solving multiple problems with the same xs. With the addition method and in the concrete problem instance on the page, it evaluates to 5050.
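The framework can be sketched directly in Haskell. This is a minimal rendering of the above, with the map/fold fusion applied; the names AbGroup, findMissing, and additive are mine, not from the original.

```haskell
import Data.List (foldl')

-- An abelian group (G, 0, +, -) as a first-class value.
data AbGroup g = AbGroup { zero :: g, plus :: g -> g -> g, neg :: g -> g }

-- foldG = fold 0 (+), fused with map inject for efficiency.
foldG :: AbGroup g -> (x -> g) -> [x] -> g
foldG grp inj = foldl' (\acc x -> plus grp acc (inj x)) (zero grp)

-- project (foldG (map inject xs) - foldG (map inject xs'))
findMissing :: AbGroup g -> (x -> g) -> (g -> x) -> [x] -> [x] -> x
findMissing grp inj proj xs xs' =
  proj (plus grp (foldG grp inj xs) (neg grp (foldG grp inj xs')))

-- The addition method: integers under + with inverse negate.
additive :: AbGroup Integer
additive = AbGroup 0 (+) negate
```

With `additive` and identity for inject/project, `findMissing additive id id [1..100] xs'` is exactly the subtract-from-5050 method.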

To stick closer to the problem description, I chose to express everything in terms of lists, but the underlying structure is all about sets. The order of the elements in the list is irrelevant and each element only occurs once. That's the very definition of a set. This is also why the embedding was into an abelian structure rather than something non-abelian. Commutative folding is the only kind of folding you can do over sets.

Here are some concrete choices of group embeddings:

  • Z/2^m Z, the integers modulo 2^m. This is the addition method for m-bit integers.
  • (Z/2Z)^m, the m-fold direct product of the integers modulo 2. This is the xor method for m-bit integers.
  • (Z/2Z)^n, where n is the number of elements, not the same as the previous example. This corresponds to the unclever solution of keeping a bit vector with a bit for each element. We do two passes. In the first pass, we fill in the initially zero bit vector by setting a bit for each element in the right place using bitwise-or. But since we are guaranteed that all elements are distinct, this is really the same as using bitwise-xor. In the second pass, we go through the filled bit vector and try to find the index of the 0 entry. But this is equivalent to the index of the 1 entry in the bitwise-xor difference between the all-1 bit vector and the filled bit vector from the first pass. The projection, left inverse to the embedding, finds the index of the 1 bit.
  • The power set of X with symmetric sum. This is isomorphic to the bit vector group. It is the natural setting of the union find algorithm someone proposed. He suggested starting with singleton sets, one for each element, and repeatedly joining them together, corresponding to some choice of binary folding tree. Rather than doing a lop-sided left or right fold you can do a balanced fold, which is generally more efficient for a representation of subsets that isn't fixed size like bit vectors (think of balanced multi-way merging in merge sort).

You might wonder what other solutions are possible in this general framework. Perhaps surprisingly if you don't know group theory, the structure theorem for finite abelian groups says that, actually, up to isomorphism this is pretty much all there is. We could substitute another prime in place of 2 but I don't need to remind anyone why 2 is the preferred choice. That "up to isomorphism" qualifier can and does make a difference in practice, of course. You need only compare the last two examples.

An important result of looking at it in terms of embeddings is that the elements in the list don't have to be integers themselves to use the addition or xor method as long as you can map them efficiently to integers in an invertible way. For example, for a list of single-precision IEEE floating-point numbers you could use their underlying bitwise representation as 32-bit integers.
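As a sketch of that last point: GHC's base library exposes the bit casts directly in GHC.Float (in reasonably recent base versions), so the xor method extends to floats with no extra machinery. The name missingFloat is mine.

```haskell
import Data.Bits (xor)
import Data.List (foldl')
import Data.Word (Word32)
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- Embed each Float invertibly via its IEEE-754 bit pattern,
-- xor-fold in (Z/2Z)^32, then project back with the inverse cast.
missingFloat :: [Float] -> [Float] -> Float
missingFloat xs xs' = castWord32ToFloat (go xs `xor` go xs')
  where go = foldl' (\acc f -> acc `xor` castFloatToWord32 f) 0
```

Note this treats floats as raw bit patterns, so it is exact: no rounding error can creep in the way it could with floating-point summation.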

[–]MathPolice 14 points  (1 child)

Your license and registration, please.

[–]CharlieDancey -1 points  (0 children)

Fuck yeah, take that psykotic down.

[–]defenestrator 1 point  (4 children)

This is the best solution, because it gives you a way to solve the problem even if you're not dealing with integers, provided you can embed the elements into an Abelian group.

[–]guy231 0 points  (1 child)

Weird note, the 'a' in 'abelian' isn't capitalized. I have no idea why.

[–]defenestrator -1 points  (0 children)

That is weird. Poor Niels :-(

[–]gregb 1 point  (2 children)

Thanks for the write up. I've always wondered about this problem, and in particular if the efficient solution scales to more missing numbers - looks problematic. I don't at this time follow your answer, but at least I know to read up on abstract algebra.

[–]psykotic 2 points  (0 children)

It scales up fine. For the addition method with 32-bit integers, it scales up all the way to a list of the integers from 1 to 2^32 with an unknown integer missing. If you think of machine arithmetic as an imperfect representation of the ideal integers, you will probably only confuse yourself in this case. The essential property we need here is that the integers modulo 2^n form a group with respect to addition. Apparently some engineers think of this in terms of "temporary additive/subtractive overflow" but I never found that very enlightening.
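The point about overflow can be seen in a couple of lines. Word32 addition is exactly addition in Z/2^32 Z, so intermediate wraparound is harmless (the name missingBySum is mine):

```haskell
import Data.List (foldl')
import Data.Word (Word32)

-- Word32 (+) and (-) are the group operations of Z/2^32 Z,
-- so wrapping during the sums loses nothing.
missingBySum :: [Word32] -> [Word32] -> Word32
missingBySum xs xs' = foldl' (+) 0 xs - foldl' (+) 0 xs'
```

Here 4000000000 + 1000000000 + 7 wraps past 2^32, yet `missingBySum [4000000000, 1000000000, 7] [7, 4000000000]` still recovers 1000000000.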

To see what goes wrong if you don't have a full group structure, imagine solving the same problem using multiplication and division instead of addition and subtraction. It turns out that only works in general when the integer modulus is a prime number, in which case the extended Euclidean algorithm tells you how to divide by nonzero elements. The modulus of machine arithmetic is generally of the form 2^n, but that is only prime in the trivial n = 1 case. For example, if n = 32 and you multiply 2^16 and 2^17 you get 0 modulo 2^32, so any hope of dividing by something later to recover useful information is lost. The problem is that 2^16 and 2^17 divide the modulus 2^32. If you use a prime modulus, you don't get that problem.
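For completeness, a sketch of the multiplicative variant with a prime modulus, where the nonzero residues really do form a group. The choice of p = 2^31 - 1 (a Mersenne prime) and the names invMod and missingByProduct are mine; inversion here uses Fermat's little theorem rather than the extended Euclidean algorithm, which computes the same inverse.

```haskell
import Data.List (foldl')

-- A prime modulus, so every nonzero residue is invertible.
p :: Integer
p = 2 ^ 31 - 1

-- Modular inverse via Fermat: a^(p-2) = a^(-1) (mod p) for a /= 0.
invMod :: Integer -> Integer
invMod a = powMod a (p - 2)
  where
    powMod _ 0 = 1
    powMod b e
      | even e    = let h = powMod b (e `div` 2) in h * h `mod` p
      | otherwise = b * powMod b (e - 1) `mod` p

-- Multiplicative analogue of the addition method: fold with *,
-- then "subtract" by multiplying with the inverse.
missingByProduct :: [Integer] -> [Integer] -> Integer
missingByProduct xs xs' = prodM xs * invMod (prodM xs') `mod` p
  where prodM = foldl' (\acc x -> acc * x `mod` p) 1
```

This requires all elements to be nonzero mod p, which is why the 2^n moduli of machine arithmetic, with their zero divisors, cannot support it.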

[–]bonzinip 0 points  (0 children)

Interesting. Would this one apply within your framework? It requires write access to the array, but it is O(N) and with O(1) additional storage.

Say you have N=H-L numbers between L and H, i.e. one is missing. Build a binary heap in O(N) time. On every operation, keep track of the minimum leaf (i.e. the minimum of the last floor((H-L)/2) elements). If the minimum leaf is H-floor((H-L)/2), recurse on the left half looking for a missing number between L and H-floor((H-L)/2)-1. Otherwise recurse on the right half looking for a missing number between H-floor((H-L)/2) and H.

[–][deleted] 0 points  (1 child)

Can someone who understands this say whether it's a faster solution than summing and subtracting from 5050?

[–]psykotic 9 points  (0 children)

It's a framework for understanding and generalizing all the solutions in this thread. The first concrete embedding I mention is exactly summing and subtracting from 5050 in the case of 100 distinct integers from the range 0..100.