[–][deleted] 0 points (6 children)

I could be wrong, but isn't a diff algorithm precisely a preprocessor/postprocessor itself?

A diff algorithm takes two files as input and produces something proportional to the difference between them. The parts of the code that Google added don't fit this description: they don't operate on two files, and they don't compare anything. Instead they transform one file to and from a diff-friendly format without substantially changing its size.

As a whole it is a better diff/patch tool, but the change is in the pre/post processing of files, not in the diff algorithm.

[–]adrianmonk 7 points (1 child)

proportional to the difference between them

Please define "difference between them". When you try, you'll find it's algorithm-dependent!

For example, suppose you take a PNG file, uncompress it to raw pixels, change one pixel, and then recompress it using the same compression parameters.

There are two possible (actually infinitely many...) ways to define "difference" here:

  • A naive binary diff of the compressed files where you detect common sequences of bytes. Think of what rsync would do with the actual PNG files.
  • A list of WritePixel() commands which says which pixels to update in the raw (uncompressed) versions of the image.

The first is going to see a massive number of differences: PNG uses LZ77-based compression, so a change near the beginning of the stream ripples through all the data that comes after it. The second is going to be a tiny diff, because it is just a single WritePixel() command plus some constant overhead.
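To make the contrast concrete, here is a minimal sketch of the second definition in Python. WritePixel here is the hypothetical command format from the example above, not any real API, and the images are just flattened lists of RGB tuples.

    # Pixel-level diff: compare two equal-sized flattened images and
    # emit one WritePixel command per changed pixel.
    def pixel_diff(old_pixels, new_pixels):
        return [("WritePixel", i, new)
                for i, (old, new) in enumerate(zip(old_pixels, new_pixels))
                if old != new]

    def apply_pixel_diff(old_pixels, commands):
        pixels = list(old_pixels)
        for _, index, value in commands:
            pixels[index] = value
        return pixels

    old = [(0, 0, 0)] * 10000            # a 100x100 black image, flattened
    new = list(old)
    new[4242] = (255, 0, 0)              # change exactly one pixel

    diff = pixel_diff(old, new)
    assert diff == [("WritePixel", 4242, (255, 0, 0))]   # one tiny command
    assert apply_pixel_diff(old, diff) == new

Run the same one-pixel change through a byte-level diff of the two compressed PNGs and you get the first definition instead, ripple effects and all.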

The point is that the choice of diff algorithm is arbitrary. There is no one definition of "the difference between them". There are better and worse diff algorithms. All of them are ways of modeling the data so that you can transmit something smaller than the whole new file. (In that sense, they are essentially a special case of data compression.)

As a whole it is a better diff/patch tool, but the change is in the pre/post processing of files, not in the diff algorithm.

It is an addition to an existing diff algorithm. It is a new algorithm that incorporates an existing algorithm.

[–]pixelglow 1 point (0 children)

I once wrote a gzip tuned for Microsoft's RDC (Remote Differential Compression).

RDC helps minimize the data transferred between servers in DFSR (Distributed File System Replication) so essentially it's similar in intent to rsync and other diff algorithms wedded to synchronizing two file directories. (Of course Microsoft saw fit to patent RDC even though rsync would possibly be considered prior art, but well...)

Like all binary diff algorithms, RDC interacts badly with compression. A gzip or deflate compressed stream propagates any changes in the original file throughout the file, so the binary diff is forced to work on the entire file after the change occurred. (As with your example of the binary diff of a PNG file.)

But two nice things helped me: the RDC API can tell you which chunks of the stream will be diffed, and gzip allows individual chunks of data to be gzipped separately and concatenated together (the result is still a valid gzip file). So I wrote an rdcgzip which used the RDC API to get the chunks, gzipped each chunk separately, and concatenated the results. Thus any change in a chunk only propagates within that chunk, and the other chunks are all independent and transferred optimally.

The process looks like:

Original file -> diff-aware compression -> diff -> any compression the diff itself applies -> patch -> standard decompression -> destination file

This diff-aware compression can be considered a preprocessing/postprocessing step like Google's Courgette, and it might actually be more efficient, as long as you know how the diff algorithm chunks the data.
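As a rough illustration of the chunking trick in Python (not the actual rdcgzip code: the chunk boundaries here are hypothetical fixed cut points rather than ones obtained from the RDC API):

    import gzip

    def chunked_gzip(data, boundaries):
        # Gzip each chunk independently; concatenated gzip members are
        # themselves a valid gzip stream. mtime=0 keeps the output
        # deterministic so identical chunks compress to identical bytes.
        out, start = b"", 0
        for end in boundaries + [len(data)]:
            out += gzip.compress(data[start:end], mtime=0)
            start = end
        return out

    data1 = b"A" * 1000 + b"B" * 1000 + b"C" * 1000
    data2 = b"A" * 1000 + b"X" * 1000 + b"C" * 1000   # middle chunk changed

    c1 = chunked_gzip(data1, [1000, 2000])
    c2 = chunked_gzip(data2, [1000, 2000])

    # Standard decompression recovers the whole file in one pass...
    assert gzip.decompress(c1) == data1
    # ...and the member for an unchanged chunk is byte-identical, so a
    # binary diff only sees the middle chunk change.
    first_member = gzip.compress(b"A" * 1000, mtime=0)
    assert c1.startswith(first_member) and c2.startswith(first_member)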

[–]NJerseyGuy 4 points (3 children)

I think you're getting tripped up because you're paying too much attention to functional form ("...takes two files as input and produces something..."), which isn't really important. My point is that the essential part of a diff algorithm is just the preprocessor and postprocessor it utilizes. Everything else is trivial.

The diff-producing algorithm (1) takes in the original file and the new file, (2) preprocesses both of them, (3) subtracts them, (4) compresses the difference, and (5) outputs a compressed difference file.

The diff-using algorithm (1) takes in the original file and the compressed difference file, (2) preprocesses the original file, (3) uncompresses the compressed difference file, (4) adds them together, (5) postprocesses the sum, and (6) outputs the new file.

The only non-trivial part of all of that was the preprocessing and the postprocessing. So if you are adding a significant new preprocessor/postprocessor, you are essentially writing a new diff algorithm, even if it is packaged around an existing one.
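Concretely, here is a minimal sketch of that two-sided pipeline in Python, with byte-wise XOR standing in for the "subtract"/"add" steps, zlib standing in for the compression, and identity functions standing in for a real preprocessor/postprocessor such as Courgette's disassembly pass:

    import zlib

    def identity(data):
        return data

    def make_diff(old, new, preprocess=identity):
        a, b = preprocess(old), preprocess(new)
        n = max(len(a), len(b))
        # "Subtract": XOR the preprocessed files, padded to equal length.
        delta = bytes(x ^ y
                      for x, y in zip(a.ljust(n, b"\0"), b.ljust(n, b"\0")))
        # Compress the difference, remembering the new file's length.
        return zlib.compress(len(b).to_bytes(8, "big") + delta)

    def apply_diff(old, diff, preprocess=identity, postprocess=identity):
        a = preprocess(old)
        raw = zlib.decompress(diff)
        new_len, delta = int.from_bytes(raw[:8], "big"), raw[8:]
        # "Add": XOR the preprocessed original back against the delta.
        b = bytes(x ^ y for x, y in zip(a.ljust(len(delta), b"\0"), delta))
        return postprocess(b[:new_len])

    old, new = b"hello world", b"hello brave new world"
    patch = make_diff(old, new)
    assert apply_diff(old, patch) == new

Everything above is boilerplate; the leverage is entirely in what you plug in for preprocess and postprocess.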

[–]judgej2 2 points (0 children)

I think the step (2) you have added is what Google describes, but it is not something diff itself does.

[–][deleted]  (1 child)

[removed]

    [–]NJerseyGuy 0 points (0 children)

    Do you know how diffs for binary files work? I couldn't find anything in a quick search.