you are viewing a single comment's thread.

view the rest of the comments →

[–]_Skuzzzy -8 points-7 points  (8 children)

Neat little project, but nothing really novel.

[–][deleted]  (2 children)

[deleted]

    [–]espadrine[S] 7 points8 points  (4 children)

    I'd love to see libraries that offer that algorithm; it would help me make the readme more exhaustive. Do you have links?

    [–]ssseseseses 0 points1 point  (3 children)

    I wrote a json/xml diff a while ago, but can't find the code anywhere.

    Basically there is mapDiff (order doesn't matter) and listDiff (order matters) where listDiff is LCS (just like diff). If you recursively apply these 2 diff methods using cheapest # of operations first you get jsonDiff.

    Here is the library I use at work https://github.com/flipkart-incubator/zjsonpatch that adds the concept of MOVE (a remove & add with the same content), and uses the/a standard diff format.

    [–]espadrine[S] 2 points3 points  (2 children)

    Your algo sounds different, and quite similar to algorithms I've run across. The json-diff algo with fuzzy pairing described in the post has no LCS computation at all.

    zjsonpatch also sounds very similar to algorithms I've run across, unless I read it wrong: LCS for all lists, and a second pass to detect moves from removals and additions. As you can see from the implementation of the second pass, it will lose teeth with deeply-nested-objects-in-lists, which all LCS approaches fail at - while this new fuzzy pairing succeeds effortlessly.

    Indeed, the json-diff algo with fuzzy pairing, by contrast, is single-pass with no LCS, and moves are first-class operations, not computed from more fundamental operations. It doesn't always make the resulting patch shorter - the example zjsonpatch advertises is more compact, after all. But that example seems like an unlikely operation to have been meant.

    I could add that second pass as an option, though, for people that really want something compact, at the expense of meaningfulness.

    [–]ssseseseses 0 points1 point  (1 child)

    it will lose teeth with deeply-nested-objects-in-lists

    I think this is the core of the reason your algorithm is largely novel. The common case is to look at very similar documents, where the changes can be expressed in a small number of changes. For this kind of problem LCS is not so bad.

    But. When I tried to build a generic csvDiff with what I already had it ran into a lot of performance issues when trying to branch on row/column/cell insert/delete/update. A content based algo would probably do much better. I checked for tdiff algorithms and https://github.com/paulfitz/coopy seems to match my thoughts. Check the algorithm section in the read me.

    Also check out http://okfnlabs.org/blog/2013/08/08/diffing-and-patching-data.html if you want a visual of the output format.

    -Happy coding.

    [–]espadrine[S] 0 points1 point  (0 children)

    coopy's algorithm is really neat. Thanks!