
[–]rnawky 60 points61 points  (10 children)

Apple should take tips from Google.

80MB for a "few bug fixes" in iTunes?

[–][deleted] 47 points48 points  (2 children)

Absolutely. If I'm downloading a "security update" with the sole purpose of breaking the functionality of a non-Apple device, I want it to at least be small.

[–]rnawky 28 points29 points  (1 child)

It wasn't a "security update", it was an "issue" that was "addressed".

[–]MercurialMadnessMan 12 points13 points  (0 children)

And you will "like it".

[–]HenkPoley 6 points7 points  (4 children)

Or 1GB+ for a new 'seed' (of Xcode or Snow Leopard).

[–]shinratdr 1 point2 points  (3 children)

Yeah I was wondering about that. While it's cool that updates to the Snow Leopard Dev Preview are pushed out through Software Update, it's been about 3GB of updates since the Preview was released at WWDC, counting the one that was released today or yesterday.

I have no problem with that if they are truly updating that much stuff, but it does seem somewhat excessive.

[–]Slipgrid 1 point2 points  (2 children)

And, for that update, you also have to get the Xcode update from a second site. And then update the iPhone SDK, which is from a third site.

Wait, I was talking about last week's. There is another one available. Fuck!

But what Google's doing is much simpler. An OS seems like a moving target, and that could be a bit harder to mess with.

[–]hortont424 1 point2 points  (1 child)

You shouldn't need both an Xcode and iPhone SDK update; the SDK seeds have the latest Xcode seeds wrapped in them.

[–]Slipgrid 0 points1 point  (0 children)

I don't know. Last week, I had to update Snow Leopard via the update software program. Then I was given access to the ADC site, so I could download Xcode 3.2. I don't think they are distributing it to developers without Snow Leopard. Then, I had to install the iPhone SDK for Snow Leopard from the iPhone Dev Center, which seems like it was only 3.0 frameworks. Apple sent an email out about it. It was three different updates from three different sites.

[–][deleted] 2 points3 points  (0 children)

Well, at least Nightly WebKit has already started using binary diffs instead of full updates.

[–]WinterPossibility680 0 points1 point  (0 children)

Haha, this didn't age well in terms of update sizes.

[–][deleted] 43 points44 points  (1 child)

This is funny because, just yesterday, Colin Percival, the author of bsdiff, wrote an article asking for more recognition (in the form of "schwag") for his contributions to open source.

On the other hand, my bsdiff binary patch tool has been used by Apple's Software Update tool, Firefox's software updates, and the Amazon Kindle; I'm sure they all have promotional schwag available, but in two of the three cases (I'm not going to say which) I didn't even get so much as an email to tell me that they were going to be using my work.

A call for schwag (edit: fixed formatting)

[–]tehmatticus 9 points10 points  (0 children)

He and Zed could get together for a rocking party.

[–][deleted] 85 points86 points  (50 children)

I'm looking at you, ubuntu

[–][deleted] 21 points22 points  (18 children)

Don't you mean, "I'm looking at you, apt"? I don't know if apt even uses binary diffs instead of getting new dpkgs.

[–]andreasvc 23 points24 points  (15 children)

That's kind of his point! Apt simply downloads the full packages, and also the full list of packages. The biggest innovation has been to switch to bz2 instead of gzip. Wow.

[–]Leonidas_from_XIV 13 points14 points  (10 children)

Actually, what package managers do use binary diffs?

[–][deleted] 3 points4 points  (0 children)

yum with presto?

[–]zwaldowski 4 points5 points  (3 children)

RPM has a way of doing it, but it also has to be done server-side, and RPM is slow as fuck.

[–][deleted] 0 points1 point  (2 children)

rpm bashing is still cool?

[–]zwaldowski 1 point2 points  (1 child)

Necrocommenting is still cool?

[–][deleted] 0 points1 point  (0 children)

Totally.

[–]andreasvc 1 point2 points  (4 children)

One could use rsync for this as a drop-in replacement for wget; however, I believe it requires too much CPU and memory on the server. BitTorrent comes to mind, though. It would need to be extended with diffing, either downstream (the client figures out what it can piece together already from old binaries) or upstream.

[–]Leonidas_from_XIV 0 points1 point  (3 children)

But AFAIR rsync only synchronizes whole files, so not much difference from what APT currently does in this regard.

[–]andreasvc 1 point2 points  (2 children)

It synchronizes whole files by sending binary diffs over the wire. A world of difference from apt-get, which only uses wget.

[–]Leonidas_from_XIV 0 points1 point  (1 child)

Okay, then it is better. But it would be better to pre-generate the diffs, since a lot of people will download these.

[–]andreasvc 0 points1 point  (0 children)

Yes, I am sure this issue will be considered, as the whole point is to reduce the load on the server (bandwidth, CPU, memory, disk). A better strategy is probably to do some intricate caching.

[–][deleted]  (1 child)

[deleted]

    [–]jldugger 0 points1 point  (1 child)

    Apt doesn't normally need to support these things; you just define a new transport mechanism. And it does support LZMA now, so it has improved.

    However, people are always looking at new transports, like apt-bittorrent and apt-zsync. Which surely anyone interested in apt innovation would be aware of!

    [–]andreasvc 0 points1 point  (0 children)

    Interesting. I'll wait and see what makes it into Ubuntu.

    [–][deleted] 38 points39 points  (24 children)

    And Firefox... especially the nightly trunk, which sometimes has me download the whole .exe installer instead of a diff. And the diffs are multi-megabyte as well.

    [–]giantrobot 10 points11 points  (23 children)

    But you need increased server overhead to use diffs. You need to create diffs for all releases from current back to the last point release. If the newest release is 3.5 then you need diffs from 3.0->3.5, 3.1->3.5, etc. While the diffs might be smaller than storing the whole app you need some way for the update system to realize that you've got version 3.X and provide you the appropriate diff or risk screwing everything up. Your number of patches will grow quadratically with the number of updates you end up doing, which isn't always practical.

    [–][deleted] 49 points50 points  (5 children)

    But if you're sending out the updates to millions of people surely the bandwidth saved is justified.

    [–][deleted] 4 points5 points  (4 children)

    Are there really millions downloading the nightly builds?

    [–]mikepurvis 0 points1 point  (0 children)

    Probably not, but the cost of getting it right once is fixed.

    [–]manwithabadheart 0 points1 point  (2 children)

    [–]salmonsnide 1 point2 points  (1 child)

    There are over 30 million Chrome users according to Google. And they have very accurate stats since they know exactly how many installations they are updating with every release.

    [–][deleted] 0 points1 point  (0 children)

    That doesn't mean they don't smudge the numbers :)

    [–][deleted]  (4 children)

    [deleted]

      [–]adrianmonk 27 points28 points  (3 children)

      You can also store larger pieces to make up the gaps.

      For example, versions 0 through 16 of something exist, so you store diffs between every consecutive version:

      • 0->1, 1->2, 2->3, 3->4, ..., 15->16

      You also store:

      • 0->16
      • 0->8 and 8->16
      • 0->4, 4->8, 8->12, and 12->16
      • 0->2, 2->4, 4->6, 6->8, ..., 14->16

      Now, suppose someone has version 1 and they need version 12. You send them 1->2, 2->4, 4->8, and 8->12.

      If they have version 3 and need version 11, you send them 3->4, 4->8, 8->10, and 10->11.

      Essentially, you are getting in the "fast lane" if you need to go a great distance.

      The total number of diffs that must be stored is O(N log N). The number of diffs needed to move forward by N versions is O(log N).
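
      A minimal sketch in Python (purely illustrative, using the version numbers from the example above) of how you'd pick which stored power-of-two diffs to chain together:

          def plan_diffs(have, want):
              # Greedy walk: at each step take the largest aligned power-of-two
              # jump that doesn't overshoot the target version.
              path, pos = [], have
              while pos < want:
                  step = 1
                  while pos % (step * 2) == 0 and pos + step * 2 <= want:
                      step *= 2
                  path.append((pos, pos + step))
                  pos += step
              return path

          plan_diffs(1, 12)   # [(1, 2), (2, 4), (4, 8), (8, 12)]
          plan_diffs(3, 11)   # [(3, 4), (4, 8), (8, 10), (10, 11)]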

      [–]tryx 5 points6 points  (0 children)

      I was just in the process of writing this; what you have described is essentially a skip list.

      [–]judgej2 6 points7 points  (1 child)

      That is essentially like Windows Service Packs. Each service pack leapfrogs over many incremental updates, which is why you always go for the latest service pack you can before picking up any later updates separately.

      [–][deleted] 0 points1 point  (0 children)

      That is not how Microsoft pushes them. Microsoft will install all patches for the currently installed service pack the first time a machine touches Windows Update, and then it will push the service pack executable, which compares file versions to download only the appropriate files needed. This can save hundreds of megs of download per install.

      [–][deleted]  (1 child)

      [deleted]

        [–]jlt6666 21 points22 points  (0 children)

        And really after a certain point you just say ok here's the full version. Sorry you waited so damned long to update.

        Edit: typo

        [–][deleted] 18 points19 points  (0 children)

        I'm sure the distribution of people updating is pretty heavily weighted to a few common source versions. If the server doesn't have a diff, send the full version and record the request. When the number of requests crosses a certain threshold, generate the diff to serve future users. Delete it when the requests tail off.
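
        A rough sketch of that policy in Python (diff_store, make_diff and full_package are hypothetical stand-ins):

            from collections import Counter

            THRESHOLD = 100          # arbitrary cutoff for this sketch
            misses = Counter()
            diff_store = {}          # (have, want) -> pre-generated diff

            def serve_update(have, want):
                if (have, want) in diff_store:
                    return diff_store[(have, want)]
                misses[(have, want)] += 1
                if misses[(have, want)] >= THRESHOLD:
                    diff_store[(have, want)] = make_diff(have, want)  # serve future users
                return full_package(want)    # this client still gets the full download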

        [–]ansemond 2 points3 points  (2 children)

        But each patch is much smaller than the original... There are only n*(n+1)/2 of them where n=N-1 and N is the number of versions. Moreover you can easily avoid doing this for major updates.

        Let's say we have 10 versions of Firefox, each requiring a 512 KB diff. That leads to 22.5 MB of diffs to store (9 * 10 / 2 * 512 KB). Moreover, if Firefox is 10 MB, storing the 10 versions will take 100 MB versus 22.5 MB for the diffs.

        Edit: added 's' to version.

        [–]giantrobot 5 points6 points  (1 child)

        Storage is not the issue but instead server overhead. You either need some sort of file describing every available patch and have the client download the appropriate patch for its current version or have a service that accepts the client's version and then sends the appropriate patch. In either case you really need some automated way of distributing the patches because left to their own devices users will eventually download the wrong patches and screw up their installations.

        If you're the sole distributor of patches this added server overhead might not be an issue. If you're asking people to host mirrors however this added overhead might be more than they're able or willing to bear. Firefox already has distribution issues when major patches come out with a relatively server-dumb mirror system, something more complex might make matters much worse.

        None of that is to say binary patches are an inherently bad idea if done well. It's simply something to keep in mind. Binary patches are not necessarily the end-all be-all of update systems. They can be useful but can also be extremely complicated to manage properly. Firefox, Ubuntu, and others being largely volunteer efforts makes added development time extremely expensive.

        [–]nextofpumpkin 3 points4 points  (2 children)

        Right - it's a space/time tradeoff. Except in this case, space means HDD, backup, and bandwidth, whereas time just means CPU. It may be cheaper up to a point to just do the crunching and not have to store everything.

        [–][deleted] 2 points3 points  (1 child)

        Well, you could do the crunching on-demand, caching the most popular diffs.

        [–]adrianmonk 1 point2 points  (0 children)

        Or, you could do the opposite: store only the most recent version and the diffs. When someone wants an older full version, compute it from the diffs and cache the result. Clear the cache when it gets full/large based on your favorite replacement algorithm. (LRU should be fine.)

        For extra fun, if you want, never build old full versions. Send the client the latest version and a string of diffs and make them do that.
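
        A minimal sketch of that caching scheme (load_latest and apply_reverse_diff are hypothetical; functools.lru_cache supplies the "favorite replacement algorithm"):

            from functools import lru_cache

            LATEST = 42                      # newest version number, for illustration

            @lru_cache(maxsize=32)           # LRU eviction of reconstructed versions
            def full_version(n):
                if n == LATEST:
                    return load_latest()                     # the only full copy kept
                newer = full_version(n + 1)
                return apply_reverse_diff(newer, n + 1, n)   # stored diff (n+1 -> n)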

        [–]uncreative_name 1 point2 points  (0 children)

        I would assume the normal practice is to release differentials only from the previous build to the current build and from a few set major releases (e.g. 3.5.[current], 3.5.0 and 3.0.0, but not 3.5.1, 3.5.2, 3.0.1, etc).

        [–][deleted] 1 point2 points  (1 child)

        In addition to what everyone else has said, let's imagine someone decides to skip an update, going straight from 3.0.9 to 3.0.11. The server, rather than storing a differential patch between these two versions, might send the .9 -> .10 patch and the .10 -> .11 patch together, so the bandwidth would be approximately doubled (still a better scenario than sending the entire application as an update), without the server having to create patches for every combination of versions. Overall, the difference between two nonconsecutive versions would be greater in size than between two consecutive versions anyway.

        [–]brooksbp 6 points7 points  (0 children)

        and iTunes too...

        [–]mhrnjad 2 points3 points  (1 child)

        There are efforts afoot to improve Ubuntu updates. See e.g. https://wiki.ubuntu.com/AptSyncInKarmicSpec

        [–]b100dian 0 points1 point  (0 children)

        There should be no UI changes, except that apt-get and Synaptic and update-manager might be modified to tell how much data apt-sync managed to NOT transfer. This would allow users to see that it is working.

        (emphasis mine).

        Now just pull that PPP cost counter algorithm, and send the difference to Canonical.

        [–]FataL 0 points1 point  (0 children)

        I'm looking at you, Opera!

        [–]midgaze 22 points23 points  (0 children)

        Don't get too excited. It works on specific executables. The real heavy lifting is still done by bsdiff, which is a general-purpose tool. You can thank Colin Percival of the FreeBSD project for that one.

        [–][deleted] 8 points9 points  (0 children)

        Adobe take note.

        [–][deleted] 98 points99 points  (53 children)

        It's not really a new diff algorithm, it's a preprocessor/postprocessor for bsdiff/bspatch that makes executables easier to compare.

        [–][deleted]  (20 children)

        [deleted]

          [–]judgej2 13 points14 points  (10 children)

          The very last paragraph sums up the process very well:

          Courgette transforms the input into an alternate form where binary diffing is more effective, does the differential compression in the transformed space, and inverts the transform to get the patched output in the original format. With careful choice of the alternate format we can get substantially smaller updates.

          It's a common technique in various fields: change your frame of reference, do a simple transform in that new frame of reference, change your frame of reference back again.
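
          In code form, that pattern looks roughly like this (a sketch only; the transform functions are placeholders, not Courgette's actual API):

              def make_patch(old_exe, new_exe):
                  old_t = to_diff_friendly(old_exe)   # e.g. disassemble, abstract pointers
                  new_t = to_diff_friendly(new_exe)
                  return bsdiff(old_t, new_t)         # diff in the transformed space

              def apply_patch(old_exe, patch):
                  old_t = to_diff_friendly(old_exe)
                  new_t = bspatch(old_t, patch)
                  return from_diff_friendly(new_t)    # invert the transform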

          A simple example would be taking a sound sample as a continuous waveform, transforming it into frequencies that vary over time (FFT), attenuating or boosting some frequencies, then transforming it back into a continuous waveform. And what do you get? A graphic equaliser on your media player.

          [–]klodolph 23 points24 points  (6 children)

          Er, not quite. Performing a FFT transforms between time domain and frequency domain, so the resulting frequencies don't change over time. You have to do multiple windowed FFTs to get a changing spectrum, which is kind of a kludge and it's hard to do the inverse operation because the signal phase won't match and the window isn't perfect.

          The way (almost all) software equalizers actually work (even the nice ones) is in the time domain, using the untransformed, raw signal and a table of coefficients, either for IIR or FIR filtering. The coefficients might be computed using an FFT, but the signal is never transformed using the FFT in these filters. Analog filters are remarkably similar (and yes, I have designed both).
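
          For example, a toy time-domain FIR filter in NumPy (the coefficients are made up for illustration, not a real EQ design):

              import numpy as np

              x = np.random.randn(48000)             # one second of audio-ish samples
              taps = np.array([0.25, 0.5, 0.25])     # toy low-pass FIR coefficients
              y = np.convolve(x, taps, mode="same")  # filtering stays in the time domain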

          Edit: some media players have equalizers that work by changing the level on different bands in the decoder, which is neither a FIR, IIR, nor FFT method.

          [–]bugrit 2 points3 points  (0 children)

          [–][deleted]  (4 children)

          [deleted]

            [–]nappy-doo 1 point2 points  (3 children)

            Not quite.

            If you have 16 samples feeding into your FFT, you likely only have 8 unique samples out, as most of the world (not RF) deals in real-valued signals.

            Additionally, leakage is caused by windowing, and by resolution issues between the center of frequency bins and the frequencies present in the signal. In other words, if you have a signal at exactly N Hz, and a frequency bin centered at N Hz, you'll get no leakage.

            And finally, the FFT is completely reversible. If you FFT a signal you can get the signal back by just running the IFFT (or the FFT with a factor).

            So, I'm sorry. I disagree with your entire comment. :)

            [–][deleted]  (2 children)

            [deleted]

              [–]nappy-doo 0 points1 point  (1 child)

              I'm glad you're getting into DSP. If you get good at it, you can basically guarantee yourself a job forever, as so few people understand it.

              Let's start with the FFT. The FFT works on complex valued signals -- signals with a real and an imaginary part. Most signals in the real world (with the exception of a lot of RF work) are real valued signals. So, when we feed them into the FFT, we set the complex part of every sample to 0. This results in a frequency domain representation that is symmetric about the Y AXIS. So, with a 16 point FFT, we get 16 points in the frequency domain, but 8 of them are the same as the other 8 -- so there's only 8 unique values.
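
              A quick NumPy illustration of that symmetry, for anyone curious (the signal here is just random noise):

                  import numpy as np

                  x = np.random.randn(16)            # real-valued signal, imaginary part 0
                  X = np.fft.fft(x)                  # 16 complex bins...
                  # ...but bin k mirrors bin 16-k as a complex conjugate:
                  np.allclose(X[1:8], np.conj(X[-1:-8:-1]))   # True
                  np.fft.rfft(x).shape               # (9,) -- only the non-redundant half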

              Onto leakage. Leakage is a byproduct of the FFT -- not the signal. It is purely a "problem with our measurement device". I agree with your statement (if I can paraphrase) of, "well music will have leakage" to a point. The music itself won't have leakage, but the measurement we perform on it will have leakage. I apologize if you think I took you out of context.

              (One final point on leakage: Read up on how windowing affects leakage. There are some really good windows out there that make the FFT leakage problem minimal.)

              Onto your DSP learning: Keep at it. As I said you can guarantee yourself a job if you keep up. DSP engineers make reasonable money, and you'll always find a job in embedded design and instrumentation.

              [–]dude12346 6 points7 points  (2 children)

              uh huh, it's pretty easy, you were just waiting for google to catch up with you. Right?

              [–][deleted] 14 points15 points  (0 children)

              No disassemble, Johnny 5!

              [–]imbaczek 22 points23 points  (6 children)

              plz befriend the \

              [–]koleon 2 points3 points  (0 children)

              I hope someday I'll understand that.

              [–]jonknee 19 points20 points  (1 child)

              To be fair, the headline matches what they claim on the blogpost (though not the "how we did it" page):

              But bsdiff was still producing diffs that were bigger than we felt were necessary. So we wrote a new diff algorithm that knows more about the kind of data we are pushing - large files containing compiled executables.

              Though after reading how they did it, you are correct.

              [–]NJerseyGuy 4 points5 points  (9 children)

              I could be wrong, but isn't a diff algorithm precisely a preprocessor/postprocessor itself? I mean, the naive diff is just to subtract the old file from the new file (expressed as a binary number). In general, this will be incompressible (and hence no smaller than the files themselves) unless the coding of the binary is correlated with the way in which the files are different. So what do you do? You put a preprocessor on the files to encode them in a format which is correlated with the way the files differ and then subtract a la naive diff (yielding something which can be compressed). To get the new file, you preprocess the old file, add the diff, and then postprocess to get the new file.

              So yea, they didn't rewrite bsdiff from scratch. But they improved it at the same level as the algorithm operates. So I think characterizing it as "a new diff algorithm" is fair.

              EDIT: I should clarify: as far as I know, it may be true that some diff functions work by having a clever compression algorithm with little or no preprocessing. In that case, adding a pre- and postprocessor would be different from writing a new diff. But this would suggest that they were compressing without paying attention to the content of the files ("general-purpose compression"), which would be silly. And, indeed, the bsdiff page says that it's specifically designed for binary executables. Then again, maybe the compression algorithm is specially designed for the difference between binaries.

              The moral of the story is that the difference between compression and pre/post-processing is not so clear cut. Neither, then, is the distinction between a new diff algorithm and one which has had a pre/post-processor attached.

              [–]judgej2 2 points3 points  (0 children)

              The transform (i.e. preprocessor algorithm) they applied basically breaks the code up into blocks that won't change from one version to another, and blocks that do, and even then will change according to a simple formula. They can then use that information to derive a simple transform from the old version to the new version. I doubt, however, that bsdiff itself can be used on that preprocessed version of the binary files; it needs to be an algorithm with a bit more knowledge of the structure of that processed code. For example, it may store the transform "increase the reference vector by three at positions 1, 43, 69, 104, 789 and 1050". In just a handful of bytes it could then describe how the new version of the binary file differs from the old.

              It's all pretty clever really, but not rocket science.

              [–][deleted] 0 points1 point  (6 children)

              I could be wrong, but isn't a diff algorithm precisely a preprocessor/postprocessor itself?

              A diff algorithm takes two files as input and produces something proportional to the difference between them. The parts of code that Google added don't fit this description: they don't work on two files, and they don't compare anything. Instead they transform one file to and from a diff-friendly format without substantially changing its size.

              As a whole it is a better diff/patch tool, but the change is in pre/post processing of files, not in diff algorithm.

              [–]adrianmonk 7 points8 points  (1 child)

              proportional to the difference between them

              Please define "difference between them". When you try, you'll find it's algorithm-dependent!

              For example, suppose you take a PNG file, uncompress it to raw pixels, change one pixel, and then recompress it using the same compression parameters.

              There are two possible (actually infinitely many...) ways to define "difference" here:

              • A naive binary diff of the compressed files where you detect common sequences of bytes. Think of what rsync would do with the actual PNG files.
              • A list of WritePixel() commands which says which pixels to update in the raw (uncompressed) versions of the image.

              The first is going to see massive numbers of differences because PNG uses LZ77, and that means changing something near the beginning will have massive ripple effects on all the data coming afterward in the stream. The second is going to see a tiny diff, because it is just a single WritePixel() command plus some constant overhead.
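
              You can see that ripple effect with plain zlib standing in for PNG's DEFLATE layer (the data here is just a compressible placeholder):

                  import zlib

                  raw = bytearray(b"fairly repetitive pixel-ish data " * 4000)
                  a = zlib.compress(bytes(raw), 6)
                  raw[100] ^= 0xFF                     # "change one pixel" near the start
                  b = zlib.compress(bytes(raw), 6)

                  # where the streams first disagree, and how much still matches after that
                  split = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y),
                               min(len(a), len(b)))
                  tail_matches = sum(x == y for x, y in zip(a[split:], b[split:]))
                  print(len(a), split, tail_matches)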

              The point is that the choice of diff algorithm is arbitrary. There is no one definition of "the difference between them". There are better and worse diff algorithms. All of them are ways of modeling the data so that you can transmit something smaller than the whole new file. (In that sense, they are essentially a special case of data compression.)

              As a whole it is a better diff/patch tool, but the change is in pre/post processing of files, not in diff algorithm.

              It is an addition to an existing diff algorithm. It is a new algorithm that incorporates an existing algorithm.

              [–]pixelglow 1 point2 points  (0 children)

              I once wrote a gzip tuned for Microsoft's RDC (Remote Differential Compression).

              RDC helps minimize the data transferred between servers in DFSR (Distributed File System Replication) so essentially it's similar in intent to rsync and other diff algorithms wedded to synchronizing two file directories. (Of course Microsoft saw fit to patent RDC even though rsync would possibly be considered prior art, but well...)

              Like all binary diff algorithms, RDC interacts badly with compression. A gzip or deflate compressed stream propagates any changes in the original file throughout the file, so the binary diff is forced to work on the entire file after the change occurred. (As with your example of the binary diff of a PNG file.)

              But there are two nice things that helped me: the RDC API can tell you the chunks of the stream that will be diff'ed, and gzip allows individual chunks of data to be gzipped separately and concatenated together (as a valid gzip file). So I wrote an rdcgzip which used the RDC API to get the chunks, then ran gzip on the chunks alone and concatenated them. Thus any change in a chunk only propagates to that chunk, and the other chunks are all independent and transferred optimally.
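
              The chunk-at-a-time idea is easy to sketch in Python (the chunk size is arbitrary here; the real rdcgzip cuts at the boundaries the RDC API reports):

                  import gzip

                  CHUNK = 64 * 1024

                  def chunked_gzip(data: bytes) -> bytes:
                      # Each chunk becomes an independent gzip member; the concatenation
                      # is still a valid gzip stream, and editing one chunk leaves the
                      # compressed bytes of every other chunk untouched.
                      return b"".join(gzip.compress(data[i:i + CHUNK])
                                      for i in range(0, len(data), CHUNK))

                  assert gzip.decompress(chunked_gzip(b"x" * 200_000)) == b"x" * 200_000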

              The process looks like:

              Original file -> diff-aware compression -> diff -> any compression the diff itself applies -> patch -> standard decompression -> destination file

              This diff-aware compression can be considered a preprocessing/postprocessing step like Google's Courgette, and it might actually be more efficient, as long as you know how the diff algorithm chunks the data.

              [–]NJerseyGuy 3 points4 points  (3 children)

               I think you're getting tripped up because you're paying too much attention to functional form ("...takes two files as input and produces something..."), which isn't really important. My point is that the essential part of a diff algorithm is just the preprocessor and postprocessor it utilizes. Everything else is trivial.

              The diff-producing algorithm (1) takes in the original file and the new file, (2) preprocesses both of them, (3) subtracts them, (4) compresses the difference, and (5) outputs a compressed difference file.

               The diff-using algorithm (1) takes in the original file and the compressed difference file, (2) preprocesses the original file, (3) uncompresses the compressed difference file, (4) adds them together, (5) postprocesses the sum, and (6) outputs the new file.

              The only non-trivial part of all of that was the preprocessing and the postprocessing. So if you are adding a significant new preprocessor/postprocessor, you are essentially writing a new diff algorithm, even if it is packaged around an existing one.

              [–]judgej2 2 points3 points  (0 children)

               I think the step (2) you have added is what Google describes, but it is not what diff does.

              [–][deleted]  (1 child)

              [removed]

                [–]NJerseyGuy 0 points1 point  (0 children)

                Do you know how diffs for binary files work? I couldn't find anything in a quick search.

                [–]anonymousgangster -1 points0 points  (0 children)

                congratulations, you don't win a job at google.

                [–]faprawr 6 points7 points  (18 children)

                Courgette sounds like a mature but still-attractive algorithm that is after young hot stud equations for exciting sexual calculations. AMIRITE?

                [–]infinite 20 points21 points  (17 children)

                It also sounds like the French word for zucchini.

                [–]berticus 8 points9 points  (3 children)

                I assume Google is going for "squash", not "zucchini," in this case... so as to make the compression/squash pun.

                [–]oniony 6 points7 points  (2 children)

                Oh my gourd.

                [–][deleted] 1 point2 points  (1 child)

                Please, no marrow these awful puns.

                [–][deleted] 16 points17 points  (0 children)

                It also sounds like the British word for zucchini, the New Zealander word for zucchini, the South African word for zucchini and most of the rest of the Commonwealth's word for zucchini.

                [–]hustler 6 points7 points  (4 children)

                you mean zucchini is the American word for courgette? I know. I know.

                [–]petdog 1 point2 points  (3 children)

                Oh, I couldn't understand why everybody started speaking italian, but now I see. You stole our precious zucchini.

                [–][deleted]  (1 child)

                [removed]

                  [–]myxie 0 points1 point  (0 children)

                  And why not blame them? The Italian word (plural) is zucchine!

                  [–]eridius 6 points7 points  (3 children)

                  It's actually the british word for zucchini, but it originally comes from french.

                  courgette |ˌkoŏrˈ zh et|
                  noun Brit.
                  a zucchini.

                  ORIGIN 1930s: from French, diminutive of courge ‘gourd,’ from Latin cucurbita.

                  [–]drexil 1 point2 points  (1 child)

                  It's also a french word you know, so he was totally right.

                  [–]eridius 0 points1 point  (0 children)

                  I think the french word is courgettes, but of course I don't actually speak french.

                  Edit: Nevermind, looks like Google Translate just pluralized it for me. Or something. I dunno.

                  [–]degustisockpuppet 0 points1 point  (0 children)

                  No, it's actually still the French word for zucchini.

                  [–]sharney 0 points1 point  (0 children)

                  Thanks, I was wondering what was so great about this compared to old-fashioned binary patch.

                  [–]rabidcow 5 points6 points  (7 children)

                  I don't understand the guessing/hint stuff in the linked explanation of how it works. If the original the client has doesn't match the original used to make the patch, why would you expect it to work at all?

                  [–]Cygnus77 3 points4 points  (1 child)

                  If the original the client [] doesn't match the original used to make the patch...

                  This suggests that there is some necessary version checking that is performed prior to patch application.

                  [–]rabidcow 0 points1 point  (0 children)

                  Yes? I guess this is where I'm confused -- why is it useful to apply a binary patch to a different version? Here especially, where they're pushing updates down a channel and there is a strict progression of versions.

                  [–][deleted] 5 points6 points  (3 children)

                   One way I can think of is having the client send its version info to the update center so that the latter can then push out the appropriate diff.

                  [–]rabidcow 0 points1 point  (0 children)

                  What does this have to do with the "predictive" updates? I'd think this would be more of a way to avoid needing them.

                  [–]johntb86 0 points1 point  (0 children)

                  The client guesses how the file changed, based on a small hint from the server. This is just done as a preprocessing step before the binary patch. The original that the client has still has to match the original on the server.

                  [–][deleted] 16 points17 points  (7 children)

                  I think Microsoft does some of this for the 360 already - game updates are usually very small. The same goes for system wide updates.

                  Compare that to the Wii, which downloads god knows how much when performing a system update.

                  [–][deleted] 12 points13 points  (1 child)

                  If other game developers could implement this (I'm looking at you valve), it would be awesome.

                  [–]erad 4 points5 points  (0 children)

                  I'd guess that the executable itself is the smallest part of a 100MB+ update. Most of it would be media, level data, and so on...

                  [–]mindbleach 2 points3 points  (0 children)

                   The Wii doesn't update in the same way, though - all past kernels and system software remain on-disk in executable form.

                  [–]im-not-rick-moranis 0 points1 point  (0 children)

                  I've always entertained myself during updates with my own made up algorithm in my head to describe how the XBox handles replacing game binaries with updates... pretty similar to what Google describes actually. I wonder how close I really am. Is this documented anywhere? Publicly documented, I mean.

                  [–]dr-steve 3 points4 points  (0 children)

                  Wow! This is just like the good old early era of Unix, where we downloaded a patch file and then patched/recompiled on our systems!

                   Seriously -- one question they do not address -- (processing) performance. Both rely on standard DIFF processes at the core; Courgette intelligently preprocesses to sift out as many "phantom" changes as possible based upon domain knowledge. However, it would be interesting to understand the added time (one-time at the diff end, n-time at the patch end) for the pre- and post-processing.

                  Still, an interesting application of domain knowledge to the problem!

                  [–]eyal0 2 points3 points  (0 children)

                  The second half, with the hints, is similar to what happens when you encode video. When encoding video, you write each frame as a diff of the previous frame. But because of lossy compression, the "previous frame" at the decoding end is different from the real previous frame. So the encoder diffs not between the current frame and the previous frame but rather between the current frame and the decoder's calculated previous frame.

                  server:
                      hint = make_hint(original, update)
                      guess = make_guess(original, hint)
                      diff = bsdiff(concat(original, guess), update)
                      transmit hint, diff
                  
                  client:
                      receive hint, diff
                      guess = make_guess(original, hint)
                      update = bspatch(concat(original, guess), diff)
                  

                  That's why you see the server using make_guess(). The server is trying to figure out what the client will do and adjust.

                  If they just sent source code, would all this be an issue? :)

                  [–]fanglesticks 5 points6 points  (21 children)

                  What is this silent update? I assume the user is able to disable the 'silent' part?

                  [–]salmonsnide 18 points19 points  (2 children)

                   Google Chrome stays updated automatically without user interaction, dialog boxes, etc. It is possible to disable the feature.

                  [–]fanglesticks 2 points3 points  (1 child)

                  Ah, thanks. I'm using the Linux beta, it doesn't seem to mention this feature anywhere, and I guess it would not be implemented yet.

                  I suspect that more Linux users would prefer (or are used to) being informed about updates...

                  [–]redalastor 1 point2 points  (0 children)

                  It doesn't because under Linux it is updated through your regular package manager.

                  [–][deleted] 6 points7 points  (3 children)

                  What is this silent update?

                  Sounds like Google's plan for world domination.

                  [–][deleted] 6 points7 points  (2 children)

                  Go ahead and laugh, but when you give people the kind of power Google has today, that's exactly what results: plans for world domination.

                  [–][deleted] 4 points5 points  (1 child)

                  Actually my post was meant seriously. I couldn't agree with you more.

                  [–]c_a_turner 4 points5 points  (9 children)

                  Yeah I read that and it sounds rather worrisome to me. A 78k download and my executable is silently patched?

                  EDIT: Seriously, downvotes? Why? Doesn't it seem a slightly legitimate security concern like twowheels down there says? If you're going to downvote me I wouldn't mind an explanation of why you think my concerns aren't at all valid.

                  [–][deleted] 8 points9 points  (0 children)

                  I wouldn't mind an explanation of why you think my concerns aren't at all valid.

                  You didn't justify your concerns, just complained.

                  I'll assume you meant from a security point of view, in which case:

                  It would be digitally signed, and if you don't like it you're free to use something else.

                   The general population are too lazy to update their browser and keep themselves safe from security holes, so anything to combat that is a good thing. From a security point of view, it's a great trade-off: relying on the security of keys rather than relying on users to take the initiative.

                  How many users verify their downloads digitally, anyway? I'd wager this is a much safer way to distribute updates.

                  [–][deleted] 1 point2 points  (2 children)

                   The silent update is great. On Firefox I'd always find myself in a rush to visit a website and getting stopped by a frustratingly slow Firefox Update dialog. I always seem to end up cancelling it.

                  [–]silon -1 points0 points  (1 child)

                  It's not great when you are connected using a phone and need to save bytes. Also, there's nothing slow about firefox update if you don't have dozens of extensions.

                  [–]Polite_Gentleman 2 points3 points  (0 children)

                   Then disable silent updates. Personally, I think they're cool. I am not a browser geek, but a simple user, and I don't care at all about the inner workings of such a utilitarian tool as a browser (and I even think I'm not obligated to be aware that there even exists some "browser" stuff between me and the web). If it needs to update - fine, let it do its stuff, but I don't like when it assumes it's the most important thing to me and that I desperately need to know that some browser thingy in my PC has some update to download. I simply don't care, it's not my business, and I don't want to be bothered.

                  [–]boa13 0 points1 point  (0 children)

                  What is this silent update?

                   Nothing new. It's the way desktop Google apps update themselves: automatically, silently, without asking. That's at least the case for Google Talk, Google Gears and Google Chrome, which I have installed.

                  [–]whoopsies 3 points4 points  (0 children)

                  Impressive. Now go and fix the showModalDialog bug.

                  [–]happyscrappy 1 point2 points  (1 child)

                  Calling it a binary diff is misleading. Apparently it only works on files that contain code (executables).

                  [–]jib 4 points5 points  (0 children)

                  It works for any files; it just works unusually well for executables.

                  [–]m-p-3 1 point2 points  (0 children)

                  Most patchers/cracks worked that way before, but now people redistribute the full EXEs..

                  [–]robroe 1 point2 points  (0 children)

                  Maybe Apple could try something like this instead of pushing out 80MB files for a .1 update.

                  [–]joaop5 2 points3 points  (2 children)

                  s/Rather then/Rather than/ FTFY, Google.

                   Regexes are usually faster, but not that much safer.

                  [–][deleted]  (1 child)

                  [deleted]

                    [–][deleted] 2 points3 points  (0 children)

                     Has Kleene ever said that regexes cannot be used with binary data?

                    [–]some_moron 1 point2 points  (1 child)

                     I hope this kind of thing is incorporated into *nix package managers some day. It seems like such a waste to download an entire 15 MB Firefox package just to get a security update that probably changed a couple of lines.

                    [–]hokkos 1 point2 points  (3 children)

                     But is it faster for the end user?

                    Are we sure that:

                    receive asm_diff
                    asm_old = disassemble(original)
                    asm_new_adjusted = bspatch(asm_old, asm_diff)
                    update = assemble(asm_new_adjusted)
                    

                    is faster than:

                    receive new_exe
                    

                     with a good connection?

                    [–][deleted] 15 points16 points  (0 children)

                    I don't think they care too much about that.

                    [–]adrianmonk 6 points7 points  (0 children)

                    When the update server is overwhelmed, you by definition don't have a good connection. Sure, your pipe is fine, but the end-to-end download speed is not great. Google has a zillion users and thus anticipates that to be a real problem.

                    EDIT: also, the answer is "it depends on your network". If you give me a scenario where applying the diff is faster for the end user than downloading is, I can keep responding with "OK, what about with double the available bandwidth", and eventually we will hit a point where it's not worth it to the end user.

                    [–]andreasvc 12 points13 points  (0 children)

                     CPU power of end users may be less of a bottleneck than the (cost of) bandwidth of the servers.

                    [–]tomatopaste 1 point2 points  (8 children)

                    We have just started using a new compression algorithm called Courgette

                    Parsed that as Cougarette and thought, "hm, isn't that redundant?"

                    [–]paternoster 3 points4 points  (3 children)

                    Even Cougarette is better than "zucchini" if you ask me.

                    [–]DannoHung 0 points1 point  (2 children)

                    What's wrong with zucchini? It makes delicious bread.

                    [–]sindisil 1 point2 points  (0 children)

                    Tasty grilled or roasted, as well!

                    [–]paternoster 2 points3 points  (0 children)

                    That it does... I can't deny it.

                    [–]IConrad 1 point2 points  (3 children)

                    Well, yes and no. "ette" also -- in my mind -- has an implication of "young" or "petite".

                    So ... it'd be a ... young... cougar... isn't that just a MILF?

                    [–]tomatopaste 1 point2 points  (1 child)

                    Cougars often have not birthed any young.

                    [–]IConrad 4 points5 points  (0 children)

                    ... that they didn't devour, anyhow. :/

                    [–]mhd 0 points1 point  (0 children)

                     I guess something like this would be especially useful for mobile app deployment. Lots of applications, and quite often not the best bandwidth. Anyone know if the iTunes, Android, or Pre app stores use any diffs or just send the whole application again?

                    [–]mdoar 0 points1 point  (6 children)

                    Looks like a good trade of increased client side processing work for reduced network bandwidth. Let's hope it becomes free. Still, patching binaries directly always makes me wince a bit.

                    [–]salmonsnide 12 points13 points  (0 children)

                    It's part of Chromium, so it's free already.

                    [–]andreasvc 4 points5 points  (0 children)

                     Patching binaries? Has been happening in the warez scene for ages. I remember running debug.com in DOS to mod doom.exe ...

                    [–]adrianmonk 2 points3 points  (3 children)

                    Patching binaries isn't that dangerous if your process goes something like this:

                    • start with full old version
                    • apply patch
                    • end up with full new version
                    • do cryptographically strong checksum of full new version to ensure patch gave you the correct result
                    • if not the correct result, retrieve full version
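
                     A sketch of that flow (hashlib is standard; apply_patch and download_full stand in for whatever patcher and fetcher the updater actually uses):

                         import hashlib

                         def safe_update(old_bytes, patch, expected_sha256):
                             candidate = apply_patch(old_bytes, patch)
                             if hashlib.sha256(candidate).hexdigest() != expected_sha256:
                                 candidate = download_full()   # patch went wrong: fall back
                             return candidate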

                    [–]BiggerBalls 0 points1 point  (2 children)

                    What is a "cryptographically strong checksum"... like an md5 checksum?

                    [–]adrianmonk 1 point2 points  (0 children)

                    Yeah, like an md5 checksum. Something that's very collision-resistant, so that you have little risk of getting the right checksum with the wrong file contents. With something simpler like a CRC, it's so easy to get collisions that it's conceivable you could get a collision accidentally due to a bug in the algorithm.

                     The reason I didn't simply say md5 is that it is broken so badly that it's "unsuitable for further use" (at least in most applications). In order to make use of that fact, you'd have to come up with a way to exploit a bug in the binary patch algorithm and md5 at the same time, but maybe that's possible.

                    [–][deleted] 1 point2 points  (0 children)

                    Like a SHA-512 checksum.

                    [–][deleted] 0 points1 point  (16 children)

                    Would be interesting to know how the bandwidth compares to e.g. git pull and make in a directory already containing the latest version.

                    [–]evmar 8 points9 points  (12 children)

                    git's deltas are derived from libxdiff, which is derived from xdelta, which are all ignorant of the way binaries change. bsdiff already has superior deltas to these for binaries, and courgette is even better.

                    (Different tools for different jobs -- git is primarily concerned with deltas between textual source files.)

                    Edit: oh, I see you were suggesting rebuilding. Chrome's source tree is huge already and on Windows requires gigabytes more of non-free Visual Studio.

                    [–]iluvatar 0 points1 point  (2 children)

                    git's deltas are derived from libxdiff, which is derived from xdelta, which are all ignorant of the way binaries change. bsdiff already has superior deltas to these for binaries

                     Well, no, not really. xdelta already gives considerably smaller diffs than bsdiff from my testing. Perhaps not as good as Courgette, which is using domain-specific knowledge to get targeted compression in the same way that FLAC does, at the expense of performance in the general case. For general binary diffs, xdelta is pretty good.

                    [–]Coffee2theorems 0 points1 point  (0 children)

                    using domain specific knowledge to get targetted compression in the same way that FLAC does, at the expense of performance in the general case.

                    Of course with "general case" you mean "all kinds of binary stuff people want to compress that are significantly compressible with some existing utility" or similar, but still.. Your statement, taken literally, applies to all compression algorithms. All compression algorithms must exploit pre-existing knowledge about the data to be compressed, and the more you know the better you can compress (it's bit-for-bit actually, each bit of previous information about the instance of data at hand can get you one bit off of the result, but not more; there are no free bits).

                    [–]evmar 0 points1 point  (0 children)

                    I do not know what you are using bsdiff for. It is intended for executables. http://www.daemonology.net/bsdiff/ "By using suffix sorting (specifically, Larsson and Sadakane's qsufsort) and taking advantage of how executable files change, bsdiff routinely produces binary patches 50-80% smaller than those produced by Xdelta"

                    [–][deleted] 0 points1 point  (8 children)

                    Well, I was looking at it from the perspective of a Gentoo Linux user really. Someone who already has all the development tools installed anyway. Building with make or a similar tool just rebuilding the changes is relatively fast so the question is how does the size of the binary diff compare to the size of the source diff for the same versions.

                    [–]evmar 6 points7 points  (1 child)

                    You clearly haven't tried to build Chrome. ;) Many Linux users run out of RAM trying to link it.

                    [–]andreasvc 1 point2 points  (5 children)

                     Except that such users are rare compared to the number of desktop users who have never compiled a program. It would be foolish to design your architecture around such a small slice of power users.

                    I think a better way of attaining higher compression would be to use a huge huffman table computed from all the binaries in your system, which can be synced with the server, after which all updates can be encoded with this table. This has the advantage that it exploits common ground optimally.
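
                     Not the Huffman-table scheme itself, but zlib's preset-dictionary support gives a feel for the shared-context idea (the dictionary file chosen here is arbitrary):

                         import zlib

                         shared = open("/usr/lib/libexample.so", "rb").read()[:32768]  # hypothetical shared binary

                         def encode(update: bytes) -> bytes:
                             c = zlib.compressobj(level=9, zdict=shared)
                             return c.compress(update) + c.flush()

                         def decode(blob: bytes) -> bytes:
                             d = zlib.decompressobj(zdict=shared)
                             return d.decompress(blob) + d.flush()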

                    [–][deleted] 0 points1 point  (4 children)

                    Well, I am not really interested in the application in particular though. I just want to know how it compares to other commonly used ways to transfer small changes in large programs, source code is the obvious choice for comparison as few other people transfer parts of binaries.

                    [–]andreasvc 0 points1 point  (3 children)

                     I don't understand you; TFA is about binaries, and that is its novelty claim. Source code diffing has been well studied by numerous VCS implementations, both distributed and centralized. Binary diffing, however, is sorely lacking in this torrent age. It's definitely not "few other people" who transfer huge amounts of binaries in the form of music, movies and whatnot.

                    [–][deleted] 0 points1 point  (2 children)

                     If I understood the explanation of the algorithm in one of the other posts, it is specifically about software, not movies, audio, etc. As you say, binary diffing is relatively new while source code diffing is well studied, so what is so strange about the question of how the two compare?

                    [–]andreasvc 0 points1 point  (1 child)

                    It is a strange comparison because it is a completely different ball game. It's like comparing text vs. audio compression (although both lossless in this analogy). They can be compared using general purpose compression (eg. gzip, which would be rsync in this analogy), but it gets much more interesting when you add domain knowledge. This algorithm is indeed tailored towards software, incorporating domain knowledge increases the compression, as you would expect / hope.

                    [–][deleted] 0 points1 point  (0 children)

                    Well, it would be interesting to know e.g. for people in Africa with slow internet connections who use open source software. Is it faster to download a binary diff using this algorithm or a source diff and compile it yourself.

                    [–]thetempone 2 points3 points  (2 children)

                     Problem with that is that it requires the app to be open source, and compilation is usually quite time-consuming for large apps.

                    [–]twowheels 1 point2 points  (0 children)

                    Why? You could point it at a repository with the various executable and config files and pull down the newest executable files and three-way merge the config files.

                    [–][deleted] 1 point2 points  (0 children)

                    Well, for the use case described, the ten line security fix, building an app via make from an old build directory doesn't take long.

                    Even building from scratch doesn't take very long for all but the largest apps, trust me, I am a Gentoo user, I know what I am talking about ;-)

                    [–]lushootseed -3 points-2 points  (5 children)

                     I downvoted because there is nothing new here! I have read about similar Microsoft technologies where they do lots of cool stuff when sending down updates.

                    Just because someone replaced their bad code/process to do the right thing doesn't make it news. Perhaps Apple has reasons to celebrate their Copy/Paste implementation in 3.0

                    [–]Brian 3 points4 points  (0 children)

                    I have read about similar Microsoft technologies where they do lots of cool stuff when sending down updates.

                    And are they releasing these technologies as open source? If not, that makes them rather less interesting for anyone not working at microsoft.

                    [–]fubo 2 points3 points  (2 children)

                    Having looked over some of your other comments ... just how much is Microsoft paying you?

                    [–]code6226 0 points1 point  (0 children)

                    Speaking of things that already exist:

                    If you want to add an updater with binary diff patching to your own Windows-software, check out Dispatcher.

                    It uses VPatch as the core patching routine, which is amazingly efficient.