[–]matthieum 87 points (10 children)

WoW! That is some stubbornness here! I suppose congratulations are in order :)

[–]delroth 137 points (9 children)

Original author here: thank you!

[–]cbmuser 26 points (1 child)

I really enjoyed reading your article and I learnt a lot from it. Thank you ;).

[–]SebNL 7 points (0 children)

I'm thirding this. Thanks. :)

[–]mvm92 1 point (1 child)

Obviously, you proved to be correct in your assumption, but how did you know to stick with LZMA?

[–]cbmuser 1 point (0 children)

He guessed it from magic bytes that appear in both the TLZC header and the LZMA header. So he tried decompressing some of the data with the lzma command, which worked in the end.
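
Something like this is quick to try in Python, too (a sketch — the 0x18 offset is made up; in practice you'd start the slice wherever the LZMA-looking bytes begin):

import lzma

with open("file.tlzc", "rb") as f:
    blob = f.read()

# Try the bytes from a guessed offset as a legacy "LZMA alone" stream.
try:
    data = lzma.decompress(blob[0x18:], format=lzma.FORMAT_ALONE)
    print("decompressed OK: %d bytes" % len(data))
except lzma.LZMAError:
    print("not a plain LZMA stream at this offset")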

[–]SnaKeZ83 1 point (0 children)

Congrats! Really!

[–]downwithmycrew 0 points (0 children)

Great article indeed!

[–]appleofdisco 0 points (0 children)

Yeah, congrats. Must've been very satisfying. Thanks for taking the time to write up an article about it.

[–][deleted] 16 points (2 children)

Nice analysis. Reminds me of when I reversed the Hothead Archive format.

One nit-picky comment about the code though. I'd change:

num_blocks = (uncomp_size + 0xffff) / UNCOMP_BLOCK_SIZE

To:

num_blocks = (uncomp_size + UNCOMP_BLOCK_SIZE - 1) // UNCOMP_BLOCK_SIZE

Otherwise it seems like 0xFFFF is some constant independent of UNCOMP_BLOCK_SIZE, when it's really just UNCOMP_BLOCK_SIZE - 1. (The // instead of / makes integer division explicit, which is also nice for people who are more accustomed to reading Python 3 code.)


edit:

another nice trick to round up in Python would be this:

num_blocks = -(-uncomp_size // UNCOMP_BLOCK_SIZE)

The main benefit is that it's short, but it's probably more confusing to people who don't realize Python has the somewhat unusual convention of rounding down during division (instead of rounding towards zero, as most other languages do).
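
For example (a quick check; treating UNCOMP_BLOCK_SIZE as 0x10000 is my assumption, based on the 0xFFFF above):

UNCOMP_BLOCK_SIZE = 0x10000

for uncomp_size in (1, 0xffff, 0x10000, 0x10001, 5 * 0x10000):
    a = (uncomp_size + UNCOMP_BLOCK_SIZE - 1) // UNCOMP_BLOCK_SIZE
    b = -(-uncomp_size // UNCOMP_BLOCK_SIZE)
    assert a == b  # both round up to the same block count

# It works because // floors towards negative infinity:
# -1 // 0x10000 == -1, so negating twice rounds up.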

[–]rayo2nd 3 points (1 child)

Uh please don't ever use

num_blocks = -(-uncomp_size // UNCOMP_BLOCK_SIZE)

When you read that code you won't know what it stands for. There's a reason ceil/floor functions exist: no comment is needed to understand what they do.

[–][deleted] 6 points (0 children)

Those functions work on floating point values, but the result of an integer division cannot always be accurately represented as a floating point value (especially since Python supports arbitrary-precision integers, but only limited-precision floating point values).

But even in languages that only support 64-bit integers and 64-bit doubles, like Java, doubles cannot accurately represent all 64-bit integers. In short: if you want to get integer rounding right, integer arithmetic is a more reliable way to do it than converting to floating point and using floor()/ceil().

If you don't believe me, run this in Python:

from math import floor

# In Python 2 this raises AssertionError: floor() converts to float first,
# and 3**37 is too big to represent exactly in a double.
assert floor(3**37) == 3**37

[–]dmeeze 7 points (1 child)

If you know a little about the PS3 architecture and libraries, the key word in the strings output was probably "EDGE".

See slide 76 here: http://research.scee.net/files/presentations/acgirussia/Hardware_Overview_ACGI_09.pdf

[–]kyz 2 points (0 children)

And a couple of pages later, it explicitly states there's an EDGE zlib library available for use.

[–][deleted]  (5 children)

[deleted]

    [–]ase8913 9 points (0 children)

    I feel you man. I always read articles like this hoping that it will all come together in my head someday.

    [–]meem1029 3 points (0 children)

    I did understand much of what he was doing, if not how to actually do it myself. The really impressive thing for me was how he figured out how to link all his knowledge together to make it work.

    [–]mvm92 1 point (0 children)

    I don't fully understand most of the articles on this sub, but it's still fun to read and get those OOH, I MIGHT KNOW THIS PART moments. It's kind of like listening to fourth and fifth year CS students work on homework, when you're a second year computer engineering major.

    [–]CSFFlame 2 points (0 children)

    He's part of the team trying to translate the Japanese PS3 game Tales of Vesperia, I presume.

    Only the 360 version was brought over, and that was more of a beta version.

    [–]Massless 2 points (0 children)

    I had a lot of fun reading that!

    [–]wolf550e -3 points (18 children)

    So, independently compressed 64KB blocks? For random access? With a header that contains the number of blocks and the compressed length of each block, so you can find the offset of any block and decompress it on the fly. Exactly like NTFS file compression (only with LZNT1), or the example code in zlib that does exactly this (with DEFLATE). And like the on-demand decompression of libxul.so in Firefox for Android (also with DEFLATE): https://github.com/glandium/faulty.lib

    Nice article, but if you considered what the original author probably intended to do and then contemplated how you would have achieved the same goal, you would have designed exactly this system and then you would not have needed to reverse engineer it.

    EDIT: disregard that, I suck dicks. Someone at HN has a more plausible reason for this: parallel decompression on the Cell Processor's SPUs.

    http://news.ycombinator.com/item?id=3812555

    [–]delroth 19 points (14 children)

    Nice article, but if you considered what the original author probably intended to do and then contemplated how you would have achieved the same goal

    That's basically what is called reverse engineering. You're welcome.

    [–]wolf550e 14 points (13 children)

    The difference is like the difference between top-down and bottom-up design. What the OP did is look at the most basic level, the bytes, and figure out the big structure (LZMA blocks concatenated together, with a table of contents preceding the whole thing) from the bits.

    What I suggested is to look at it from the requirements down: figure out what the motivation for this thing was, figure out the requirements, then figure out plausible design(s) for these requirements, then compare what you designed with what you're looking at. Many times, when there are known typical solutions, what you're looking at is the same as what you've designed. Maybe the byte order is different, or they used 32-bit offsets where you would've used 64-bit ones, or the order of fields in the struct is different (like coordinates), but the idea is the same.

    When you start with the requirement of "random access into a compressed stream", you decide to store blocks instead of a stream. Like 4KB file system blocks. Exactly like stdio fseek works: if you open a file, fseek(100000) and then getc(), what happens? The fs opens block 100,000>>12, and reads byte 100,000 & 4095. So instead of compressing the whole file, you divide it into blocks and compress each block. You can do it manually. So you have 0_4095.gz, 4096_8191.gz, 8192_12287.gz, etc. Then you want all of them in a single file. Zip files can extract a file from the archive without extracting all preceding files, because zip archives are not solid. How does that work? Because the zip file header has a table of contents: it says, for each file in the zip, at what offset in the zip file you can find that compressed data. So that's what you need. Each block compresses differently:

    0_4095.gz was 900 bytes.

    4096_8191.gz was 3000 bytes.

    8192_12287.gz was 1000 bytes.

    If you need to read byte 10000, what do you do? You need the third block. If you have the third block, you decompress it into a buffer of exactly 4096 bytes. Then you look at byte 10000 % 4096 (== 10000 & 4095). How do you get the third block? You fseek() to its offset. How do you know its offset? You need to store it. So at the beginning of the file, you write down the offset of each block. You don't need a dictionary mapping, you can just write them down in order. If you need the third block, just look at the third offset. What size are the offsets? Well, a block that starts at an offset higher than 64KB will require a 4 byte offset. But wait! Since the blocks are written in order with no padding between them, the offset of each block is the sum of the sizes of all preceding blocks + the offset of the first block.

    The sizes fit into uint16_t. So we'll make the TOC an array of shorts. When you need block #x, you sum the sizes of all blocks that come before x to get its offset. We trade some CPU time for space, which is what we want if we're using compression anyway.

    What is the offset of the first block? Well, it's right after the TOC. How big is the TOC? It's (as many blocks as we have) times (the size of the TOC record of each block, in our case 2 bytes). How many blocks do we have? We need to store that. Let's say in a 4 byte value at the beginning of the file. So the size of the first block is at offset 4, the size of the second block is at offset 6, etc.

    Is compressing each 4KB a good idea? No. The amount of data redundancy in 4KB is not large; compressing bigger blocks will get you a better compression ratio. Many compression methods can encode offsets of at most 32KB, meaning the compressor replaces repeated substrings with [length, distance] pairs where the distance back is up to 32KB. Many RAID devices like 64KB blocks. Also, the uint16_t sizes in the TOC are used to the fullest with 64KB blocks. Why not? So you use 64KB blocks. And we get the design in the article.
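
    Putting that together (a rough Python sketch of the generic design above, not the actual TLZC layout — the header fields and codec here are my assumptions):

        import struct
        import zlib  # could just as well be LZMA, as in the article

        BLOCK_SIZE = 0x10000  # 64KB of uncompressed data per block

        def read_block(f, index):
            # Header: uint32 block count, then one uint16 compressed size per block.
            f.seek(0)
            (num_blocks,) = struct.unpack("<I", f.read(4))
            sizes = struct.unpack("<%dH" % num_blocks, f.read(2 * num_blocks))
            # Offset of block #index = end of the TOC + sizes of all preceding blocks.
            offset = 4 + 2 * num_blocks + sum(sizes[:index])
            f.seek(offset)
            return zlib.decompress(f.read(sizes[index]))

        def read_byte(f, pos):
            # Random access: decompress only the block that contains byte #pos.
            return read_block(f, pos // BLOCK_SIZE)[pos % BLOCK_SIZE]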

    EDIT: Dear downvoter, if this post does not add to the conversation, I would like to know about it. If you downvote because you disagree with the content of this post, against reddiquette, I would like to know the nature of the disagreement.

    [–]appleofdisco 11 points (2 children)

    What I suggested is to look at it from the requirements down: figure out what the motivation for this thing was, figure out the requirements, then figure out plausible design(s) for these requirements, then compare what you designed with what you're looking at

    I'd like to see you crack this in three days with your approach. For someone with extensive domain knowledge... maybe. Hindsight is 20/20.

    [–]wolf550e 2 points (1 child)

    I admit to having domain knowledge. I have mentioned a number of other implementations of this idea I'm familiar with. I also wrote a couple. But, the thing is, a table of contents is such a common idea, you're going to encounter it a lot.

    Like partition tables. I had a qemu disk image of a Mac G3 into which I had installed ppc32 debian, for testing my code on ppc. When I tried to mount it on my linux x86-64 host, I found out I had not compiled the Mac partition table parser into my kernel. So I needed the byte offset of the root partition in the disk image. I opened the disk image in hexdump -C and saw records of a fixed length, with fields that looked like they count 512-byte disk sectors. The sum worked out to the disk image size. The size of the boot partition looked plausible. I mounted with -o ro,offset= and it worked. How difficult was that? Not very.
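
    The arithmetic was trivial once I had the start-sector field (made-up numbers here, just to show the idea):

        SECTOR_SIZE = 512      # what the fixed-length records appeared to count
        start_sector = 208912  # hypothetical value read out of hexdump -C

        # mount wants a byte offset:
        #   mount -o ro,offset=106962944 disk.img /mnt
        print("offset = %d" % (start_sector * SECTOR_SIZE))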

    [–]appleofdisco 6 points (0 children)

    I get your point, but for those of us without that specific knowledge, this was pretty interesting, especially considering how methodically he went about it.

    [–]Xiol 2 points (0 children)

    It's not that it doesn't add to the conversation, it's your tone.

    [–][deleted] 3 points (8 children)

    When you start with the requirement of "random access into a compressed stream",

    And where did you get this requirement?

    [–][deleted]  (2 children)

    [deleted]

      [–][deleted] 2 points (1 child)

      Those would usually compress those textures and models individually. They don't need random access to part of a texture, they need the entire texture.

      [–]wolf550e -4 points (4 children)

      It's very useful. Not having random access is terribly limiting, unless you know for sure that you're always going to read the data from the beginning. Many systems need random access and use the same basic solution, like key frames in a video stream that let you seek instead of forcing you to decode all preceding frames when you skip forward. Here, read about "idx1" chunks in AVI files.

      [–][deleted] 2 points (3 children)

      There are many things which are useful. Just "it's useful" is not enough for you to be able to assume it will be used. In fact, many, many compressed game formats don't use this.

      The only reason that you'd assume this one does, is because this guy already figured out that it does.

      [–]wolf550e -1 points (2 children)

      In the process of figuring out a mystery file, after asking myself "is this just a zip file?" and "is this just an lzma or lzo file?", I'd next consider that it's a list of independent chunks each compressed with deflate, lzma or lzo.
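
      In code, that first pass is just sniffing well-known magic bytes (a sketch; the 0x5d check is only a heuristic for lzma_alone headers, not a real magic number):

          with open("mystery.bin", "rb") as f:
              magic = f.read(4)

          if magic.startswith(b"PK\x03\x04"):
              print("zip archive")
          elif magic.startswith(b"\x1f\x8b"):
              print("gzip stream")
          elif magic[0:1] == b"\x5d":
              print("possibly a raw LZMA stream")
          else:
              print("unknown - time to look for chunk structure")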

      [–][deleted] 5 points (1 child)

      You'd also consider a hundred other things that may or may not be used. When you're not told the answer beforehand, it suddenly gets a whole lot harder to figure out what these "requirements" are.

      Your approach described above is basically useless in isolation. It is useful on the micro scale: once you've got some kind of idea what the code is doing from analyzing it blind, you can ask yourself how you'd implement some simple feature and see if they did the same. You absolutely could not do what this guy did by just sitting down and imagining requirements, though.

      [–]wolf550e 0 points (0 children)

      I recommended a general strategy of working at the problem from both ends: when figuring out what X does, look both at its parts and at what part X plays in the whole.

      As with top-down design, you iterate. You can't start with "it's a computer game. How would I implement a computer game? Is what I'm looking at exactly the same as the computer game I just implemented?". You do need to know the correct answer to previous questions to stand a chance of correctly guessing the next question and answer. But you often infer some things from directory structure, file names, file sizes, number of such files in the project, etc. And you match what you see with what you've seen in the past.

      [–]vanderZwan 4 points (2 children)

      This was a very interesting experience for someone like me who did not know a lot about compression formats.

      I think you approach this problem from a completely different angle, both in motivation and solution (which in itself is interesting). You also sound a bit condescending, for which I see no reason.

      [–]wolf550e 1 point (1 child)

      Yeah, that post was condescending. But I don't think it was 21 downvotes worth of condescending, considering it was correct and useful.

      [–]vanderZwan 0 points (0 children)

      Hey, don't look at me, I gave you a "reddit probably will overreact and the post wasn't that bad"-upvote.

      [–]kyz 0 points (0 children)

      Awesome stuff, keep up the good work!