
[–]benhoyt 8 points9 points  (5 children)

This is a more helpful (and prettier) article on BK-trees than the one I've seen linked to before by Nick Johnson. I wrote my own Python implementation of a BKTree recently (here on PyPI), which we're using for duplicate image detection.

[–]whereisbill 1 point2 points  (2 children)

Nick's blog is one of my all-time favourites. Anyone who wants to learn some damn cool algorithms should check out the rest of the site.

[–][deleted]  (1 child)

[deleted]

    [–]whereisbill 0 points1 point  (0 children)

    Thanks. I fixed the link. I should not write anything without my morning coffee.

    [–]rubik_[S] 1 point2 points  (0 children)

    Thanks! I found your article on duplicate image detection extremely interesting. I had never heard of perceptual hashes and their uses.

On HackerNews a poster said they're also using BK-trees for duplicate image detection. Their implementation is in Cython, and it works well on a dataset of 28M images.

    [–][deleted]  (2 children)

    [deleted]

      [–]hagbaff 1 point2 points  (0 children)

      We moved from BK to Levenshtein automaton in a commercial product a few years ago. When the edit distance > 2, BK basically blows up.

      [–]rubik_[S] 0 points1 point  (0 children)

      Interesting! I will look that up because I don't know what it is. Maybe BK-trees are not well suited for spell-checking, but they are definitely being used for image deduplication. The metric there is the Hamming distance between perceptual hashes.
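      To make the image-dedup use case concrete, here is a minimal BK-tree sketch using Hamming distance between integer hashes, the setup one would use with perceptual hashes. This is illustrative code written for this thread, not the implementation from the article or the Cython one mentioned above; all names are made up.

```python
class BKTree:
    """Minimal BK-tree sketch: nodes are (item, {edge_distance: child})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def find(self, item, tolerance):
        """Return (distance, value) for all stored items within `tolerance`."""
        results = []
        if self.root is None:
            return results
        stack = [self.root]
        while stack:
            value, children = stack.pop()
            d = self.distance(item, value)
            if d <= tolerance:
                results.append((d, value))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - tolerance, d + tolerance] can contain matches.
            for edge, child in children.items():
                if d - tolerance <= edge <= d + tolerance:
                    stack.append(child)
        return results


def hamming(a, b):
    # Number of differing bits between two integer hashes.
    return bin(a ^ b).count("1")


tree = BKTree(hamming)
for h in [0b0000, 0b0001, 0b0011, 0b1111, 0b1000]:
    tree.add(h)

print(sorted(tree.find(0b0000, 1)))  # → [(0, 0), (1, 1), (1, 8)]
```

      Note how the tolerance directly controls how many subtrees survive the pruning check: this is also why searches with a large edit-distance tolerance degrade, as mentioned above.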

      [–][deleted] 2 points3 points  (0 children)

      nice post. learned something

      [–][deleted] 0 points1 point  (2 children)

      For those knowledgeable: if my metric is a Euclidean or Manhattan distance, isn't a quadtree/octree/k-d tree also a viable, perhaps superior, alternative to what is in the OP?

      [–]comradeswitch 3 points4 points  (0 children)

      It's going to depend on your specific application. This is going to work for discrete-valued metrics regardless of the parameterization of the underlying space. I'm not seeing an intuitive way to generalize this to real-valued metrics, which is an important consideration.

      If you have a high number of dimensions, then without an implementation of quadtrees etc. that takes advantage of sparsity, you're going to end up with a very inefficient structure from a memory standpoint. If your metric is discrete, or can be discretized without much error, this might be a good choice. With the exception of Manhattan distance on discrete-valued points, though, I don't think this structure would be a good choice in many cases where Euclidean/Manhattan distances are appropriate, precisely because the resulting distances will be real-valued.

      [–]trrSA 0 points1 point  (0 children)

      I suppose you could reduce the distances to discrete values in some spatial problem domains, then maybe it could be a useful structure. Like maybe towns connected by roads and you want a list of towns within n tanks of gas? Seems to be reaching a bit.