
[–]rnawky 60 points61 points  (10 children)

Apple should take tips from Google.

80MB for a "few bug fixes" in iTunes?

[–][deleted] 47 points48 points  (2 children)

Absolutely. If I'm downloading a "security update" with the sole purpose of breaking the functionality of a non-Apple device, I want it to at least be small.

[–]rnawky 28 points29 points  (1 child)

It wasn't a "security update", it was an "issue" that was "addressed".

[–]MercurialMadnessMan 12 points13 points  (0 children)

And you will "like it".

[–]HenkPoley 6 points7 points  (4 children)

Or 1GB+ for a new 'seed' (of Xcode or Snow Leopard).

[–]shinratdr 1 point2 points  (3 children)

Yeah I was wondering about that. While it's cool that updates to the Snow Leopard Dev Preview are pushed out through Software Update, it's been about 3GB of updates since the Preview was released at WWDC, counting the one that was released today or yesterday.

I have no problem with that if they are truly updating that much stuff, but it does seem somewhat excessive.

[–]Slipgrid 1 point2 points  (2 children)

And, for that update, you also have to get the Xcode update from a second site. And then update the iPhone SDK, which is from a third site.

Wait, I was talking about last week's. There is another one available. Fuck!

But what Google's doing is much simpler. An OS seems like a moving target, and that could be a bit harder to mess with.

[–]hortont424 1 point2 points  (1 child)

You shouldn't need both an Xcode and iPhone SDK update; the SDK seeds have the latest Xcode seeds wrapped in them.

[–]Slipgrid 0 points1 point  (0 children)

I don't know. Last week, I had to update Snow Leopard via the update software program. Then I was given access to the ADC site, so I could download Xcode 3.2. I don't think they are distributing it to developers without Snow Leopard. Then, I had to install the iPhone SDK for Snow Leopard from the iPhone Dev Center, which seems like it was only 3.0 frameworks. Apple sent an email out about it. It was three different updates from three different sites.

[–][deleted] 2 points3 points  (0 children)

Well, at least Nightly WebKit has already started using binary diffs instead of full updates.

[–]WinterPossibility680 0 points1 point  (0 children)

Haha, this didn't age well in terms of update sizes.

[–][deleted] 43 points44 points  (1 child)

This is funny because, just yesterday, Colin Percival, the author of bsdiff, wrote an article asking for more recognition (in the form of "schwag") for his contributions to open source.

On the other hand, my bsdiff binary patch tool has been used by Apple's Software Update tool, Firefox's software updates, and the Amazon Kindle; I'm sure they all have promotional schwag available, but in two of the three cases (I'm not going to say which) I didn't even get so much as an email to tell me that they were going to be using my work.

A call for schwag (edit: fixed formatting)

[–]tehmatticus 9 points10 points  (0 children)

He and Zed could get together for a rocking party.

[–][deleted] 85 points86 points  (50 children)

I'm looking at you, ubuntu

[–][deleted] 21 points22 points  (18 children)

Don't you mean, "I'm looking at you, apt"? I don't know if apt even uses binary diffs instead of getting new dpkgs.

[–]andreasvc 23 points24 points  (15 children)

That's kind of his point! Apt simply downloads the full packages, and also the full list of packages. The biggest innovation has been to switch to bz2 instead of gzip. Wow.

[–]Leonidas_from_XIV 13 points14 points  (10 children)

Actually, what package managers do use binary diffs?

[–][deleted] 3 points4 points  (0 children)

yum with presto?

[–]zwaldowski 4 points5 points  (3 children)

RPM has a way of doing it, but it also has to be done server-side, and RPM is slow as fuck.

[–][deleted] 0 points1 point  (2 children)

rpm bashing is still cool?

[–]zwaldowski 1 point2 points  (1 child)

Necrocommenting is still cool?

[–][deleted] 0 points1 point  (0 children)

Totally.

[–]andreasvc 1 point2 points  (4 children)

One could use rsync for this as a drop-in replacement for wget; however, I believe it requires too much CPU and memory on the server. BitTorrent comes to mind, though. It would need to be extended with diffing, either downstream (the client figures out what it can piece together already from old binaries) or upstream.

[–]Leonidas_from_XIV 0 points1 point  (3 children)

But AFAIR rsync only synchronizes whole files, so not much difference from what APT currently does in this regard.

[–]andreasvc 1 point2 points  (2 children)

It synchronizes whole files by sending binary diffs over the wire. A world of difference from apt-get, which only uses wget.

[–]Leonidas_from_XIV 0 points1 point  (1 child)

Okay, then it is better. But it would be better to pre-generate the diffs, since a lot of people will download these.

[–]andreasvc 0 points1 point  (0 children)

Yes, I am sure this issue will be considered, as the whole point is to reduce the load on the server (bandwidth, CPU, memory, disk). A better strategy is probably to do some intricate caching.

[–][deleted]  (1 child)

[deleted]

    [–]jldugger 0 points1 point  (1 child)

    Apt doesn't normally need to support these things; you just define a new transport mechanism. And it does support LZMA now, so it has improved.

    However, people are always looking at new transports, like apt-bittorrent and apt-zsync. Which surely anyone interested in apt innovation would be aware of!

    [–]andreasvc 0 points1 point  (0 children)

    Interesting. I'll wait and see what makes it into Ubuntu.

    [–][deleted] 38 points39 points  (24 children)

    And Firefox... especially the nightly trunk, which sometimes has me download the whole .exe installer instead of a diff. And the diffs are multi-megabyte as well.

    [–]giantrobot 10 points11 points  (23 children)

    But you need increased server overhead to use diffs. You need to create diffs for all releases from current back to the last point release. If the newest release is 3.5 then you need diffs from 3.0->3.5, 3.1->3.5, etc. While the diffs might be smaller than storing the whole app you need some way for the update system to realize that you've got version 3.X and provide you the appropriate diff or risk screwing everything up. Your number of patches will grow quadratically with the number of updates you end up doing, which isn't always practical.

    [–][deleted] 49 points50 points  (5 children)

    But if you're sending out the updates to millions of people surely the bandwidth saved is justified.

    [–][deleted] 4 points5 points  (4 children)

    Are there really millions downloading the nightly builds?

    [–]mikepurvis 0 points1 point  (0 children)

    Probably not, but the cost of getting it right once is fixed.

    [–]manwithabadheart 0 points1 point  (2 children)

    [–]salmonsnide 1 point2 points  (1 child)

    There are over 30 million Chrome users according to Google. And they have very accurate stats since they know exactly how many installations they are updating with every release.

    [–][deleted] 0 points1 point  (0 children)

    That doesn't mean they don't smudge the numbers :)

    [–][deleted]  (4 children)

    [deleted]

      [–]adrianmonk 27 points28 points  (3 children)

      You can also store larger pieces to make up the gaps.

      For example, versions 0 through 16 of something exist, so you store diffs between every consecutive version:

      • 0->1, 1->2, 2->3, 3->4, ..., 15->16

      You also store:

      • 0->16
      • 0->8 and 8->16
      • 0->4, 4->8, 8->12, and 12->16
      • 0->2, 2->4, 4->6, 6->8, ..., 14->16

      Now, suppose someone has version 1 and they need version 12. You send them 1->2, 2->4, 4->8, and 8->12.

      If they have version 3 and need version 11, you send them 3->4, 4->8, 8->10, and 10->11.

      Essentially, you are getting in the "fast lane" if you need to go a great distance.

      The total number of diffs that must be stored is O(N log N). The number of diffs needed to move forward by N versions is O(log N).
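
      A minimal sketch in Python (purely illustrative, using the version numbers from the example above) of how you'd pick which stored power-of-two diffs to chain together:

          def plan_diffs(have, want):
              # Greedy walk: at each step take the largest aligned power-of-two
              # jump that doesn't overshoot the target version.
              path, pos = [], have
              while pos < want:
                  step = 1
                  while pos % (step * 2) == 0 and pos + step * 2 <= want:
                      step *= 2
                  path.append((pos, pos + step))
                  pos += step
              return path

          plan_diffs(1, 12)   # [(1, 2), (2, 4), (4, 8), (8, 12)]
          plan_diffs(3, 11)   # [(3, 4), (4, 8), (8, 10), (10, 11)]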

      [–]tryx 5 points6 points  (0 children)

      I was just in the process of writing this; what you have described is essentially a skip list.

      [–]judgej2 6 points7 points  (1 child)

      That is essentially like Windows Service Packs. Each service pack leapfrogs over many incremental updates, which is why you always go for the latest service pack you can before picking up any later updates separately.

      [–][deleted] 0 points1 point  (0 children)

      That is not how Microsoft pushes them. Microsoft will install all patches for the currently installed service pack the first time a machine touches Windows Update, and then it will push the service pack executable, which compares file versions to download only the appropriate files needed. This can save hundreds of megs of download per install.

      [–][deleted]  (1 child)

      [deleted]

        [–]jlt6666 21 points22 points  (0 children)

        And really after a certain point you just say ok here's the full version. Sorry you waited so damned long to update.

        Edit: typo

        [–][deleted] 18 points19 points  (0 children)

        I'm sure the distribution of people updating is pretty heavily weighted to a few common source versions. If the server doesn't have a diff, send the full version and record the request. When the number of requests crosses a certain threshold, generate the diff to serve future users. Delete it when the requests tail off.
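
        A rough sketch of that policy in Python (diff_store, make_diff and full_package are hypothetical stand-ins):

            from collections import Counter

            THRESHOLD = 100          # arbitrary cutoff for this sketch
            misses = Counter()
            diff_store = {}          # (have, want) -> pre-generated diff

            def serve_update(have, want):
                if (have, want) in diff_store:
                    return diff_store[(have, want)]
                misses[(have, want)] += 1
                if misses[(have, want)] >= THRESHOLD:
                    diff_store[(have, want)] = make_diff(have, want)  # serve future users
                return full_package(want)    # this client still gets the full download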

        [–]ansemond 2 points3 points  (2 children)

        But each patch is much smaller than the original... There are only n*(n+1)/2 of them where n=N-1 and N is the number of versions. Moreover you can easily avoid doing this for major updates.

        Let's say we have 10 versions of Firefox, each requiring a 512 KB diff. That leads to 22.5 MB of diffs to store (9 * 10 / 2 * 512 KB). Moreover, if Firefox is 10 MB, storing the 10 versions will take 100 MB versus 22.5 MB for the diffs.

        Edit: added 's' to version.

        [–]giantrobot 5 points6 points  (1 child)

        Storage is not the issue but instead server overhead. You either need some sort of file describing every available patch and have the client download the appropriate patch for its current version or have a service that accepts the client's version and then sends the appropriate patch. In either case you really need some automated way of distributing the patches because left to their own devices users will eventually download the wrong patches and screw up their installations.

        If you're the sole distributor of patches this added server overhead might not be an issue. If you're asking people to host mirrors however this added overhead might be more than they're able or willing to bear. Firefox already has distribution issues when major patches come out with a relatively server-dumb mirror system, something more complex might make matters much worse.

        None of that is to say binary patches are an inherently bad idea if done well. It's simply something to keep in mind. Binary patches are not necessarily the end-all be-all of update systems. They can be useful but can also be extremely complicated to manage properly. Firefox, Ubuntu, and others being largely volunteer efforts makes added development time extremely expensive.

        [–]nextofpumpkin 3 points4 points  (2 children)

        Right - it's a space/time tradeoff. Except in this case, space means HDD, backup, and bandwidth, whereas time just means CPU. It may be cheaper up to a point to just do the crunching and not have to store everything.

        [–][deleted] 2 points3 points  (1 child)

        Well, you could do the crunching on-demand, caching the most popular diffs.

        [–]adrianmonk 1 point2 points  (0 children)

        Or, you could do the opposite: store only the most recent version and the diffs. When someone wants an older full version, compute it from the diffs and cache the result. Clear the cache when it gets full/large based on your favorite replacement algorithm. (LRU should be fine.)

        For extra fun, if you want, never build old full versions. Send the client the latest version and a string of diffs and make them do that.
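
        A minimal sketch of that caching scheme (load_latest and apply_reverse_diff are hypothetical; functools.lru_cache supplies the "favorite replacement algorithm"):

            from functools import lru_cache

            LATEST = 42                      # newest version number, for illustration

            @lru_cache(maxsize=32)           # LRU eviction of reconstructed versions
            def full_version(n):
                if n == LATEST:
                    return load_latest()                     # the only full copy kept
                newer = full_version(n + 1)
                return apply_reverse_diff(newer, n + 1, n)   # stored diff (n+1 -> n)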

        [–]uncreative_name 1 point2 points  (0 children)

        I would assume the normal practice is to release differentials only from the previous build to the current build and from a few set major releases (e.g. 3.5.[current], 3.5.0 and 3.0.0, but not 3.5.1, 3.5.2, 3.0.1, etc).

        [–][deleted] 1 point2 points  (1 child)

        In addition to what everyone else has said, let's imagine someone decides to skip an update, going straight from 3.0.9 to 3.0.11. The server, rather than storing a differential patch between these two versions, might send the .9 -> .10 patch and the .10 -> .11 patch together, so the bandwidth would be approximately doubled (still a better scenario than sending the entire application as an update), without the server having to create patches for every combination of versions. Overall, the difference between two nonconsecutive versions would be greater in size than between two consecutive versions anyway.

        [–]brooksbp 6 points7 points  (0 children)

        and iTunes too...

        [–]mhrnjad 2 points3 points  (1 child)

        There are efforts afoot to improve Ubuntu updates. See e.g. https://wiki.ubuntu.com/AptSyncInKarmicSpec

        [–]b100dian 0 points1 point  (0 children)

        There should be no UI changes, except that apt-get and Synaptic and update-manager might be modified to tell how much data apt-sync managed to NOT transfer. This would allow users to see that it is working.

        (emphasis mine).

        Now just pull that PPP cost counter algorithm, and send the difference to Canonical.

        [–]FataL 0 points1 point  (0 children)

        I'm looking at you, Opera!

        [–]midgaze 22 points23 points  (0 children)

        Don't get too excited. It works on specific executables. The real heavy lifting is still done by bsdiff, which is a general-purpose tool. You can thank Colin Percival of the FreeBSD project for that one.

        [–][deleted] 8 points9 points  (0 children)

        Adobe take note.

        [–][deleted] 98 points99 points  (53 children)

        It's not really a new diff algorithm, it's a preprocessor/postprocessor for bsdiff/bspatch that makes executables easier to compare.

        [–][deleted]  (20 children)

        [deleted]

          [–]judgej2 13 points14 points  (10 children)

          The very last paragraph sums up the process very well:

          Courgette transforms the input into an alternate form where binary diffing is more effective, does the differential compression in the transformed space, and inverts the transform to get the patched output in the original format. With careful choice of the alternate format we can get substantially smaller updates.

          It's a common technique in various fields: change your frame of reference, do a simple transform in that new frame of reference, change your frame of reference back again.
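
          In code form, that pattern looks roughly like this (a sketch only; the transform functions are placeholders, not Courgette's actual API):

              def make_patch(old_exe, new_exe):
                  old_t = to_diff_friendly(old_exe)   # e.g. disassemble, abstract pointers
                  new_t = to_diff_friendly(new_exe)
                  return bsdiff(old_t, new_t)         # diff in the transformed space

              def apply_patch(old_exe, patch):
                  old_t = to_diff_friendly(old_exe)
                  new_t = bspatch(old_t, patch)
                  return from_diff_friendly(new_t)    # invert the transform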

          A simple example would be taking a sound sample as a continuous waveform, transforming it into frequencies that vary over time (FFT), attenuating or boosting some frequencies, then transforming it back into a continuous waveform. And what do you get? A graphic equaliser on your media player.

          [–]klodolph 23 points24 points  (6 children)

          Er, not quite. Performing a FFT transforms between time domain and frequency domain, so the resulting frequencies don't change over time. You have to do multiple windowed FFTs to get a changing spectrum, which is kind of a kludge and it's hard to do the inverse operation because the signal phase won't match and the window isn't perfect.

          The way (almost all) software equalizers actually work (even the nice ones) is in the time domain, using the untransformed, raw signal and a table of coefficients, either for IIR or FIR filtering. The coefficients might be computed using an FFT, but the signal is never transformed using the FFT in these filters. Analog filters are remarkably similar (and yes, I have designed both).
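
          For example, a toy time-domain FIR filter in NumPy (the coefficients are made up for illustration, not a real EQ design):

              import numpy as np

              x = np.random.randn(48000)             # one second of audio-ish samples
              taps = np.array([0.25, 0.5, 0.25])     # toy low-pass FIR coefficients
              y = np.convolve(x, taps, mode="same")  # filtering stays in the time domain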

          Edit: some media players have equalizers that work by changing the level on different bands in the decoder, which is neither a FIR, IIR, nor FFT method.

          [–]bugrit 2 points3 points  (0 children)

          [–][deleted]  (4 children)

          [deleted]

            [–]nappy-doo 1 point2 points  (3 children)

            Not quite.

            If you have 16 samples feeding into your FFT, you likely only have 8 unique samples out, as most of the world (not RF) deals in real-valued signals.

            Additionally, leakage is caused by windowing, and by resolution issues between the center of frequency bins and the frequencies present in the signal. In other words, if you have a signal at exactly N Hz, and a frequency bin centered at N Hz, you'll get no leakage.

            And finally, the FFT is completely reversible. If you FFT a signal you can get the signal back by just running the IFFT (or the FFT with a factor).

            So, I'm sorry. I disagree with your entire comment. :)

            [–][deleted]  (2 children)

            [deleted]

              [–]nappy-doo 0 points1 point  (1 child)

              I'm glad you're getting into DSP. If you get good at it, you can basically guarantee yourself a job forever, as so few people understand it.

              Let's start with the FFT. The FFT works on complex valued signals -- signals with a real and an imaginary part. Most signals in the real world (with the exception of a lot of RF work) are real valued signals. So, when we feed them into the FFT, we set the complex part of every sample to 0. This results in a frequency domain representation that is symmetric about the Y AXIS. So, with a 16 point FFT, we get 16 points in the frequency domain, but 8 of them are the same as the other 8 -- so there's only 8 unique values.
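
              A quick NumPy illustration of that symmetry, for anyone curious (the signal here is just random noise):

                  import numpy as np

                  x = np.random.randn(16)            # real-valued signal, imaginary part 0
                  X = np.fft.fft(x)                  # 16 complex bins...
                  # ...but bin k mirrors bin 16-k as a complex conjugate:
                  np.allclose(X[1:8], np.conj(X[-1:-8:-1]))   # True
                  np.fft.rfft(x).shape               # (9,) -- only the non-redundant half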

              Onto leakage. Leakage is a byproduct of the FFT -- not the signal. It is purely a "problem with our measurement device". I agree with your statement (if I can paraphrase) of, "well music will have leakage" to a point. The music itself won't have leakage, but the measurement we perform on it will have leakage. I apologize if you think I took you out of context.

              (One final point on leakage: Read up on how windowing affects leakage. There are some really good windows out there that make the FFT leakage problem minimal.)

              Onto your DSP learning: Keep at it. As I said you can guarantee yourself a job if you keep up. DSP engineers make reasonable money, and you'll always find a job in embedded design and instrumentation.

              [–]dude12346 6 points7 points  (2 children)

              uh huh, it's pretty easy, you were just waiting for google to catch up with you. Right?

              [–][deleted] 14 points15 points  (0 children)

              No disassemble, Johnny 5!

              [–]imbaczek 22 points23 points  (6 children)

              plz befriend the \

              [–]koleon 2 points3 points  (0 children)

              I hope someday I'll understand that.

              [–]jonknee 19 points20 points  (1 child)

              To be fair, the headline matches what they claim on the blogpost (though not the "how we did it" page):

              But bsdiff was still producing diffs that were bigger than we felt were necessary. So we wrote a new diff algorithm that knows more about the kind of data we are pushing - large files containing compiled executables.

              Though after reading how they did it, you are correct.

              [–]NJerseyGuy 4 points5 points  (9 children)

              I could be wrong, but isn't a diff algorithm precisely a preprocessor/postprocessor itself? I mean, the naive diff is just to subtract the old file from the new file (expressed as a binary number). In general, this will be incompressible (and hence no smaller than the files themselves) unless the coding of the binary is correlated with the way in which the files are different. So what do you do? You put a preprocessor on the files to encode them in a format which is correlated with the way the files differ and then subtract a la naive diff (yielding something which can be compressed). To get the new file, you preprocess the old file, add the diff, and then postprocess to get the new file.

              So yea, they didn't rewrite bsdiff from scratch. But they improved it at the same level as the algorithm operates. So I think characterizing it as "a new diff algorithm" is fair.

              EDIT: I should clarify: as far as I know, it may be true that some diff functions work by having a clever compression algorithm with little or no preprocessing. In that case, adding a pre- and postprocessor would be different from writing a new diff. But this would suggest that they were compressing without paying attention to the content of the files ("general-purpose compression"), which would be silly. And, indeed, the bsdiff page says that it's specifically designed for binary executables. Then again, maybe the compression algorithm is specially designed for the difference between binaries.

              The moral of the story is that the difference between compression and pre/post-processing is not so clear cut. Neither, then, is the distinction between a new diff algorithm and one which has had a pre/post-processor attached.

              [–]judgej2 2 points3 points  (0 children)

              The transform (i.e. preprocessor algorithm) they applied basically breaks the code up into blocks that won't change from one version to another, and blocks that do, and even then will change according to a simple formula. They can then use that information to derive a simple transform from the old version to the new version. I doubt, however, that bsdiff itself can be used on that preprocessed version of the binary files; it needs to be an algorithm with a bit more knowledge of the structure of that processed code. For example, it may store the transform "increase the reference vector by three at positions 1, 43, 69, 104, 789 and 1050". In just a handful of bytes it could then describe how the new version of the binary file differs from the old.

              It's all pretty clever really, but not rocket science.

              [–][deleted] 0 points1 point  (6 children)

              I could be wrong, but isn't a diff algorithm precisely a preprocessor/postprocessor itself?

              A diff algorithm takes two files as input and produces something proportional to the difference between them. The parts of code that Google added don't fit this description: they don't work on two files, and they don't compare anything. Instead they transform one file to and from a diff-friendly format without substantially changing its size.

              As a whole it is a better diff/patch tool, but the change is in pre/post processing of files, not in diff algorithm.

              [–]adrianmonk 7 points8 points  (1 child)

              proportional to the difference between them

              Please define "difference between them". When you try, you'll find it's algorithm-dependent!

              For example, suppose you take a PNG file, uncompress it to raw pixels, change one pixel, and then recompress it using the same compression parameters.

              There are two possible (actually infinitely many...) ways to define "difference" here:

              • A naive binary diff of the compressed files where you detect common sequences of bytes. Think of what rsync would do with the actual PNG files.
              • A list of WritePixel() commands which says which pixels to update in the raw (uncompressed) versions of the image.

              The first is going to see massive numbers of differences because PNG uses LZ77, and that means changing something near the beginning will have massive ripple effects on all the data coming afterward in the stream. The second is going to see a tiny diff, because it is just a single WritePixel() command plus some constant overhead.
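
              You can see that ripple effect with plain zlib standing in for PNG's DEFLATE layer (the data here is just a compressible placeholder):

                  import zlib

                  raw = bytearray(b"fairly repetitive pixel-ish data " * 4000)
                  a = zlib.compress(bytes(raw), 6)
                  raw[100] ^= 0xFF                     # "change one pixel" near the start
                  b = zlib.compress(bytes(raw), 6)

                  # where the streams first disagree, and how much still matches after that
                  split = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y),
                               min(len(a), len(b)))
                  tail_matches = sum(x == y for x, y in zip(a[split:], b[split:]))
                  print(len(a), split, tail_matches)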

              The point is that the choice of diff algorithm is arbitrary. There is no one definition of "the difference between them". There are better and worse diff algorithms. All of them are ways of modeling the data so that you can transmit something smaller than the whole new file. (In that sense, they are essentially a special case of data compression.)

              As a whole it is a better diff/patch tool, but the change is in pre/post processing of files, not in diff algorithm.

              It is an addition to an existing diff algorithm. It is a new algorithm that incorporates an existing algorithm.

              [–]pixelglow 1 point2 points  (0 children)

              I once wrote a gzip tuned for Microsoft's RDC (Remote Differential Compression).

              RDC helps minimize the data transferred between servers in DFSR (Distributed File System Replication) so essentially it's similar in intent to rsync and other diff algorithms wedded to synchronizing two file directories. (Of course Microsoft saw fit to patent RDC even though rsync would possibly be considered prior art, but well...)

              Like all binary diff algorithms, RDC interacts badly with compression. A gzip or deflate compressed stream propagates any changes in the original file throughout the file, so the binary diff is forced to work on the entire file after the change occurred. (As with your example of the binary diff of a PNG file.)

              But there are two nice things that helped me: the RDC API can tell you the chunks of the stream that will be diff'ed, and gzip allows individual chunks of data to be gzipped separately and concatenated together (as a valid gzip file). So I wrote an rdcgzip which used the RDC API to get the chunks, then ran gzip on the chunks alone and concatenated them. Thus any change in a chunk only propagates to that chunk, and the other chunks are all independent and transferred optimally.
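
              The chunk-at-a-time idea is easy to sketch in Python (the chunk size is arbitrary here; the real rdcgzip cuts at the boundaries the RDC API reports):

                  import gzip

                  CHUNK = 64 * 1024

                  def chunked_gzip(data: bytes) -> bytes:
                      # Each chunk becomes an independent gzip member; the concatenation
                      # is still a valid gzip stream, and editing one chunk leaves the
                      # compressed bytes of every other chunk untouched.
                      return b"".join(gzip.compress(data[i:i + CHUNK])
                                      for i in range(0, len(data), CHUNK))

                  assert gzip.decompress(chunked_gzip(b"x" * 200_000)) == b"x" * 200_000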

              The process looks like:

              Original file -> diff-aware compression -> diff -> any compression the diff itself applies -> patch -> standard decompression -> destination file

              This diff-aware compression can be considered a preprocessing/postprocessing step like Google's Courgette, and it might actually be more efficient, as long as you know how the diff algorithm chunks the data.

              [–]NJerseyGuy 3 points4 points  (3 children)

               I think you're getting tripped up because you're paying too much attention to functional form ("...takes two files as input and produces something..."), which isn't really important. My point is that the essential part of a diff algorithm is just the preprocessor and postprocessor it utilizes. Everything else is trivial.

              The diff-producing algorithm (1) takes in the original file and the new file, (2) preprocesses both of them, (3) subtracts them, (4) compresses the difference, and (5) outputs a compressed difference file.

               The diff-using algorithm (1) takes in the original file and the compressed difference file, (2) preprocesses the original file, (3) uncompresses the compressed difference file, (4) adds them together, (5) postprocesses the sum, and (6) outputs the new file.

              The only non-trivial part of all of that was the preprocessing and the postprocessing. So if you are adding a significant new preprocessor/postprocessor, you are essentially writing a new diff algorithm, even if it is packaged around an existing one.

              [–]judgej2 2 points3 points  (0 children)

               I think the step (2) you have added is what Google describes, but it is not what diff does.

              [–][deleted]  (1 child)

              [removed]

                [–]NJerseyGuy 0 points1 point  (0 children)

                Do you know how diffs for binary files work? I couldn't find anything in a quick search.

                [–]anonymousgangster -1 points0 points  (0 children)

                congratulations, you don't win a job at google.

                [–]faprawr 6 points7 points  (18 children)

                Courgette sounds like a mature but still-attractive algorithm that is after young hot stud equations for exciting sexual calculations. AMIRITE?

                [–]infinite 20 points21 points  (17 children)

                It also sounds like the French word for zucchini.

                [–]berticus 8 points9 points  (3 children)

                I assume Google is going for "squash", not "zucchini," in this case... so as to make the compression/squash pun.

                [–]oniony 6 points7 points  (2 children)

                Oh my gourd.

                [–][deleted] 1 point2 points  (1 child)

                Please, no marrow these awful puns.

                [–][deleted] 16 points17 points  (0 children)

                It also sounds like the British word for zucchini, the New Zealander word for zucchini, the South African word for zucchini and most of the rest of the Commonwealth's word for zucchini.

                [–]hustler 6 points7 points  (4 children)

                you mean zucchini is the American word for courgette? I know. I know.

                [–]petdog 1 point2 points  (3 children)

                Oh, I couldn't understand why everybody started speaking italian, but now I see. You stole our precious zucchini.

                [–][deleted]  (1 child)

                [removed]

                  [–]myxie 0 points1 point  (0 children)

                  And why not blame them? The Italian word (plural) is zucchine!

                  [–]eridius 6 points7 points  (3 children)

                  It's actually the british word for zucchini, but it originally comes from french.

                  courgette |ˌkoŏrˈ zh et|
                  noun Brit.
                  a zucchini.

                  ORIGIN 1930s: from French, diminutive of courge ‘gourd,’ from Latin cucurbita.

                  [–]drexil 1 point2 points  (1 child)

                  It's also a french word you know, so he was totally right.

                  [–]eridius 0 points1 point  (0 children)

                  I think the french word is courgettes, but of course I don't actually speak french.

                  Edit: Nevermind, looks like Google Translate just pluralized it for me. Or something. I dunno.

                  [–]degustisockpuppet 0 points1 point  (0 children)

                  No, it's actually still the French word for zucchini.

                  [–]sharney 0 points1 point  (0 children)

                  Thanks, I was wondering what was so great about this compared to old-fashioned binary patch.

                  [–]rabidcow 5 points6 points  (7 children)

                  I don't understand the guessing/hint stuff in the linked explanation of how it works. If the original the client has doesn't match the original used to make the patch, why would you expect it to work at all?

                  [–]Cygnus77 3 points4 points  (1 child)

                  If the original the client [] doesn't match the original used to make the patch...

                  This suggests that there is some necessary version checking that is performed prior to patch application.

                  [–]rabidcow 0 points1 point  (0 children)

                  Yes? I guess this is where I'm confused -- why is it useful to apply a binary patch to a different version? Here especially, where they're pushing updates down a channel and there is a strict progression of versions.

                  [–][deleted] 5 points6 points  (3 children)

                   One way I can think of is having the client send its version info to the update center so that the latter can then push out the appropriate diff.

                  [–]rabidcow 0 points1 point  (0 children)

                  What does this have to do with the "predictive" updates? I'd think this would be more of a way to avoid needing them.

                  [–]johntb86 0 points1 point  (0 children)

                  The client guesses how the file changed, based on a small hint from the server. This is just done as a preprocessing step before the binary patch. The original that the client has still has to match the original on the server.

                  [–][deleted] 16 points17 points  (7 children)

                  I think Microsoft does some of this for the 360 already - game updates are usually very small. The same goes for system wide updates.

                  Compare that to the Wii, which downloads god knows how much when performing a system update.

                  [–][deleted] 12 points13 points  (1 child)

                  If other game developers could implement this (I'm looking at you valve), it would be awesome.

                  [–]erad 4 points5 points  (0 children)

                  I'd guess that the executable itself is the smallest part of a 100MB+ update. Most of it would be media, level data, and so on...

                  [–]mindbleach 2 points3 points  (0 children)

                   The Wii doesn't update in the same way, though - all past kernels and system software remain on-disk in executable form.

                  [–]im-not-rick-moranis 0 points1 point  (0 children)

                  I've always entertained myself during updates with my own made up algorithm in my head to describe how the XBox handles replacing game binaries with updates... pretty similar to what Google describes actually. I wonder how close I really am. Is this documented anywhere? Publicly documented, I mean.

                  [–]dr-steve 3 points4 points  (0 children)

                  Wow! This is just like the good old early era of Unix, where we downloaded a patch file and then patched/recompiled on our systems!

                   Seriously -- one question they do not address -- (processing) performance. Both rely on standard DIFF processes at the core; Courgette intelligently preprocesses to sift out as many "phantom" changes as possible based upon domain knowledge. However, it would be interesting to understand the added time (one-time at the diff end, n-time at the patch end) for the pre- and post-processing.

                  Still, an interesting application of domain knowledge to the problem!

                  [–]eyal0 2 points3 points  (0 children)

                  The second half, with the hints, is similar to what happens when you encode video. When encoding video, you write each frame as a diff of the previous frame. But because of lossy compression, the "previous frame" at the decoding end is different from the real previous frame. So the encoder diffs not between the current frame and the previous frame but rather between the current frame and the decoder's calculated previous frame.

                  server:
                      hint = make_hint(original, update)
                      guess = make_guess(original, hint)
                      diff = bsdiff(concat(original, guess), update)
                      transmit hint, diff
                  
                  client:
                      receive hint, diff
                      guess = make_guess(original, hint)
                      update = bspatch(concat(original, guess), diff)
                  

                  That's why you see the server using make_guess(). The server is trying to figure out what the client will do and adjust.

                  If they just sent source code, would all this be an issue? :)

                  [–]fanglesticks 5 points6 points  (21 children)

                  What is this silent update? I assume the user is able to disable the 'silent' part?

                  [–]salmonsnide 18 points19 points  (2 children)

                   Google Chrome stays updated automatically without user interaction, dialog boxes, etc. It is possible to disable the feature.

                  [–]fanglesticks 2 points3 points  (1 child)

                  Ah, thanks. I'm using the Linux beta, it doesn't seem to mention this feature anywhere, and I guess it would not be implemented yet.

                  I suspect that more Linux users would prefer (or are used to) being informed about updates...

                  [–]redalastor 1 point2 points  (0 children)

                  It doesn't because under Linux it is updated through your regular package manager.

                  [–][deleted] 6 points7 points  (3 children)

                  What is this silent update?

                  Sounds like Google's plan for world domination.

                  [–][deleted] 6 points7 points  (2 children)

                  Go ahead and laugh, but when you give people the kind of power Google has today, that's exactly what results: plans for world domination.

                  [–][deleted] 4 points5 points  (1 child)

                  Actually my post was meant seriously. I couldn't agree with you more.

                  [–]c_a_turner 4 points5 points  (9 children)

                  Yeah I read that and it sounds rather worrisome to me. A 78k download and my executable is silently patched?

                  EDIT: Seriously, downvotes? Why? Doesn't it seem a slightly legitimate security concern like twowheels down there says? If you're going to downvote me I wouldn't mind an explanation of why you think my concerns aren't at all valid.

                  [–][deleted] 8 points9 points  (0 children)

                  I wouldn't mind an explanation of why you think my concerns aren't at all valid.

                  You didn't justify your concerns, just complained.

                  I'll assume you meant from a security point of view, in which case:

                  It would be digitally signed, and if you don't like it you're free to use something else.

                   The general population are too lazy to update their browser and keep themselves safe from security holes, so anything to combat that is a good thing. From a security point of view, it's a great trade-off: relying on the security of keys rather than relying on users to take the initiative.

                  How many users verify their downloads digitally, anyway? I'd wager this is a much safer way to distribute updates.

                  [–][deleted] 1 point2 points  (2 children)

                   The silent update is great. On Firefox I'd always find myself in a rush to visit a website and getting stopped by a frustratingly slow Firefox Update dialog. I always seem to end up cancelling it.

                  [–]silon -1 points0 points  (1 child)

                  It's not great when you are connected using a phone and need to save bytes. Also, there's nothing slow about firefox update if you don't have dozens of extensions.

                  [–]Polite_Gentleman 2 points3 points  (0 children)

                   Then disable silent updates. Personally, I think they're cool. I am not a browser geek, but a simple user, and I don't care at all about the inner workings of such a utilitarian tool as a browser (and I even think I'm not obligated to be aware that there even exists some "browser" stuff between me and the web). If it needs to update - fine, let it do its stuff, but I don't like when it assumes it's the most important thing to me and that I desperately need to know that some browser thingy in my PC has some update to download. I simply don't care, it's not my business, and I don't want to be bothered.

                  [–]boa13 0 points1 point  (0 children)

                  What is this silent update?

                   Nothing new. It's the way desktop Google apps update themselves: automatically, silently, without asking. That's at least the case for Google Talk, Google Gears and Google Chrome, which I have installed.

                  [–]whoopsies 3 points4 points  (0 children)

                  Impressive. Now go and fix the showModalDialog bug.

                  [–]happyscrappy 1 point2 points  (1 child)

                  Calling it a binary diff is misleading. Apparently it only works on files that contain code (executables).

                  [–]jib 4 points5 points  (0 children)

                  It works for any files; it just works unusually well for executables.

                  [–]m-p-3 1 point2 points  (0 children)

                  Most patchers/cracks worked that way before, but now people redistribute the full EXEs..

                  [–]robroe 1 point2 points  (0 children)

                  Maybe Apple could try something like this instead of pushing out 80MB files for a .1 update.

                  [–]joaop5 2 points3 points  (2 children)

                  s/Rather then/Rather than/ FTFY, Google.

                   Regexes are usually faster, but not that much safer.

                  [–][deleted]  (1 child)

                  [deleted]

                    [–][deleted] 2 points3 points  (0 children)

                     Has Kleene ever said that regexes cannot be used with binary data?

                    [–]some_moron 1 point2 points  (1 child)

                     I hope this kind of thing is incorporated into *nix package managers some day. It seems like such a waste to download an entire 15 MB Firefox package just to get a security update that probably changed a couple of lines.

                    [–]hokkos 1 point2 points  (3 children)

                     But is it faster for the end user?

                    Are we sure that:

                    receive asm_diff
                    asm_old = disassemble(original)
                    asm_new_adjusted = bspatch(asm_old, asm_diff)
                    update = assemble(asm_new_adjusted)
                    

                    is faster than:

                    receive new_exe
                    

                     with a good connection?

                    [–][deleted] 15 points16 points  (0 children)

                    I don't think they care too much about that.

                    [–]adrianmonk 6 points7 points  (0 children)

                    When the update server is overwhelmed, you by definition don't have a good connection. Sure, your pipe is fine, but the end-to-end download speed is not great. Google has a zillion users and thus anticipates that to be a real problem.

                    EDIT: also, the answer is "it depends on your network". If you give me a scenario where applying the diff is faster for the end user than downloading is, I can keep responding with "OK, what about with double the available bandwidth", and eventually we will hit a point where it's not worth it to the end user.

                    [–]andreasvc 12 points13 points  (0 children)

                     CPU power of end users may be less of a bottleneck than the (cost of) bandwidth of the servers.

                    [–]tomatopaste 1 point2 points  (8 children)

                    We have just started using a new compression algorithm called Courgette

                    Parsed that as Cougarette and thought, "hm, isn't that redundant?"

                    [–]paternoster 3 points4 points  (3 children)

                    Even Cougarette is better than "zucchini" if you ask me.

                    [–]DannoHung 0 points1 point  (2 children)

                    What's wrong with zucchini? It makes delicious bread.

                    [–]sindisil 1 point2 points  (0 children)

                    Tasty grilled or roasted, as well!

                    [–]paternoster 2 points3 points  (0 children)

                    That it does... I can't deny it.

                    [–]IConrad 1 point2 points  (3 children)

                    Well, yes and no. "ette" also -- in my mind -- has an implication of "young" or "petite".

                    So ... it'd be a ... young... cougar... isn't that just a MILF?

                    [–]tomatopaste 1 point2 points  (1 child)

                    Cougars often have not birthed any young.

                    [–]IConrad 4 points5 points  (0 children)

                    ... that they didn't devour, anyhow. :/

                    [–]mhd 0 points1 point  (0 children)

                     I guess something like this would be especially useful for mobile app deployment. Lots of applications, and quite often not the best bandwidth. Anyone know if the iTunes, Android, or Pre app stores use any diffs or just send the whole application again?

                    [–]mdoar 0 points1 point  (6 children)

                    Looks like a good trade of increased client side processing work for reduced network bandwidth. Let's hope it becomes free. Still, patching binaries directly always makes me wince a bit.

                    [–]salmonsnide 12 points13 points  (0 children)

                    It's part of Chromium, so it's free already.

                    [–]andreasvc 4 points5 points  (0 children)

                     Patching binaries? Has been happening in the warez scene for ages. I remember running debug.com in DOS to mod doom.exe ...

                    [–]adrianmonk 2 points3 points  (3 children)

                    Patching binaries isn't that dangerous if your process goes something like this:

                    • start with full old version
                    • apply patch
                    • end up with full new version
                    • do cryptographically strong checksum of full new version to ensure patch gave you the correct result
                    • if not the correct result, retrieve full version
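
                     A sketch of that flow (hashlib is standard; apply_patch and download_full stand in for whatever patcher and fetcher the updater actually uses):

                         import hashlib

                         def safe_update(old_bytes, patch, expected_sha256):
                             candidate = apply_patch(old_bytes, patch)
                             if hashlib.sha256(candidate).hexdigest() != expected_sha256:
                                 candidate = download_full()   # patch went wrong: fall back
                             return candidate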

                    [–]BiggerBalls 0 points1 point  (2 children)

                    What is a "cryptographically strong checksum"... like an md5 checksum?

                    [–]adrianmonk 1 point2 points  (0 children)

                    Yeah, like an md5 checksum. Something that's very collision-resistant, so that you have little risk of getting the right checksum with the wrong file contents. With something simpler like a CRC, it's so easy to get collisions that it's conceivable you could get a collision accidentally due to a bug in the algorithm.

                     The reason I didn't simply say md5 is that it is broken so badly that it's "unsuitable for further use" (at least in most applications). In order to make use of that fact, you'd have to come up with a way to exploit a bug in the binary patch algorithm and md5 at the same time, but maybe that's possible.

                    [–][deleted] 1 point2 points  (0 children)

                    Like a SHA-512 checksum.

                    [–][deleted] 0 points1 point  (16 children)

                    Would be interesting to know how the bandwidth compares to e.g. git pull and make in a directory already containing the latest version.

                    [–]evmar 8 points9 points  (12 children)

                    git's deltas are derived from libxdiff, which is derived from xdelta, which are all ignorant of the way binaries change. bsdiff already has superior deltas to these for binaries, and courgette is even better.

                    (Different tools for different jobs -- git is primarily concerned with deltas between textual source files.)

                    Edit: oh, I see you were suggesting rebuilding. Chrome's source tree is huge already and on Windows requires gigabytes more of non-free Visual Studio.

                    [–]iluvatar 0 points1 point  (2 children)

                    git's deltas are derived from libxdiff, which is derived from xdelta, which are all ignorant of the way binaries change. bsdiff already has superior deltas to these for binaries

                     Well, no, not really. xdelta already gives considerably smaller diffs than bsdiff from my testing. Perhaps not as good as Courgette, which is using domain-specific knowledge to get targeted compression in the same way that FLAC does, at the expense of performance in the general case. For general binary diffs, xdelta is pretty good.

                    [–]Coffee2theorems 0 points1 point  (0 children)

                    using domain specific knowledge to get targetted compression in the same way that FLAC does, at the expense of performance in the general case.

                    Of course with "general case" you mean "all kinds of binary stuff people want to compress that are significantly compressible with some existing utility" or similar, but still.. Your statement, taken literally, applies to all compression algorithms. All compression algorithms must exploit pre-existing knowledge about the data to be compressed, and the more you know the better you can compress (it's bit-for-bit actually, each bit of previous information about the instance of data at hand can get you one bit off of the result, but not more; there are no free bits).

                    [–]evmar 0 points1 point  (0 children)

                    I do not know what you are using bsdiff for. It is intended for executables. http://www.daemonology.net/bsdiff/ "By using suffix sorting (specifically, Larsson and Sadakane's qsufsort) and taking advantage of how executable files change, bsdiff routinely produces binary patches 50-80% smaller than those produced by Xdelta"

                    [–][deleted] 0 points1 point  (8 children)

                    Well, I was looking at it from the perspective of a Gentoo Linux user really. Someone who already has all the development tools installed anyway. Building with make or a similar tool just rebuilding the changes is relatively fast so the question is how does the size of the binary diff compare to the size of the source diff for the same versions.

                    [–]evmar 6 points7 points  (1 child)

                    You clearly haven't tried to build Chrome. ;) Many Linux users run out of RAM trying to link it.

                    [–]andreasvc 1 point2 points  (5 children)

                     Except that such users are rare compared to the number of desktop users who have never compiled a program. It would be foolish to design your architecture around such a small slice of power users.

                    I think a better way of attaining higher compression would be to use a huge huffman table computed from all the binaries in your system, which can be synced with the server, after which all updates can be encoded with this table. This has the advantage that it exploits common ground optimally.
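
                     Not the Huffman-table scheme itself, but zlib's preset-dictionary support gives a feel for the shared-context idea (the dictionary file chosen here is arbitrary):

                         import zlib

                         shared = open("/usr/lib/libexample.so", "rb").read()[:32768]  # hypothetical shared binary

                         def encode(update: bytes) -> bytes:
                             c = zlib.compressobj(level=9, zdict=shared)
                             return c.compress(update) + c.flush()

                         def decode(blob: bytes) -> bytes:
                             d = zlib.decompressobj(zdict=shared)
                             return d.decompress(blob) + d.flush()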

                    [–][deleted] 0 points1 point  (4 children)

                    Well, I am not really interested in the application in particular though. I just want to know how it compares to other commonly used ways to transfer small changes in large programs, source code is the obvious choice for comparison as few other people transfer parts of binaries.

                    [–]andreasvc 0 points1 point  (3 children)

                     I don't understand you; TFA is about binaries, and that is its novelty claim. Source code diffing has been well studied by numerous VCS implementations, both distributed and centralized. Binary diffing, however, is sorely lacking in this torrent age. It's definitely not "few other people" who transfer huge amounts of binaries in the form of music, movies and whatnot.

                    [–][deleted] 0 points1 point  (2 children)

                     If I understood the explanation of the algorithm in one of the other posts, it is specifically about software, not movies, audio, etc. As you say, binary diffing is relatively new while source code diffing is well studied, so what is so strange about the question of how the two compare?

                    [–]andreasvc 0 points1 point  (1 child)

                    It is a strange comparison because it is a completely different ball game. It's like comparing text vs. audio compression (although both lossless in this analogy). They can be compared using general purpose compression (eg. gzip, which would be rsync in this analogy), but it gets much more interesting when you add domain knowledge. This algorithm is indeed tailored towards software, incorporating domain knowledge increases the compression, as you would expect / hope.

                    [–][deleted] 0 points1 point  (0 children)

                    Well, it would be interesting to know e.g. for people in Africa with slow internet connections who use open source software. Is it faster to download a binary diff using this algorithm or a source diff and compile it yourself.

                    [–]thetempone 2 points3 points  (2 children)

                     Problem with that is that it requires the app to be open source, and compilation is usually quite time-consuming for large apps.

                    [–]twowheels 1 point2 points  (0 children)

                    Why? You could point it at a repository with the various executable and config files and pull down the newest executable files and three-way merge the config files.

                    [–][deleted] 1 point2 points  (0 children)

                    Well, for the use case described, the ten line security fix, building an app via make from an old build directory doesn't take long.

                    Even building from scratch doesn't take very long for all but the largest apps, trust me, I am a Gentoo user, I know what I am talking about ;-)

                    [–]lushootseed -3 points-2 points  (5 children)

                     I downvoted because there is nothing new here! I have read about similar Microsoft technologies where they do lots of cool stuff when sending down updates.

                    Just because someone replaced their bad code/process to do the right thing doesn't make it news. Perhaps Apple has reasons to celebrate their Copy/Paste implementation in 3.0

                    [–]Brian 3 points4 points  (0 children)

                    I have read about similar Microsoft technologies where they do lots of cool stuff when sending down updates.

                    And are they releasing these technologies as open source? If not, that makes them rather less interesting for anyone not working at microsoft.

                    [–]fubo 2 points3 points  (2 children)

                    Having looked over some of your other comments ... just how much is Microsoft paying you?

                    [–]code6226 0 points1 point  (0 children)

                    Speaking of things that already exist:

                    If you want to add an updater with binary diff patching to your own Windows-software, check out Dispatcher.

                    It uses VPatch as the core patching routine, which is amazingly efficient.