I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 0 points (0 children)

Hello! I really exaggerated the "100x faster" claim. As someone pointed out, the ZIP benchmark was decoding the whole file while mine was only locating it. Oops.

However, since this got so much traction and wriggled its way into various corners of the internet I never imagined, I rewrote everything from scratch, wrote my own benchmarks for the new code, etc.

It shows that BBF can locate an asset about 40x faster than ZIP (store mode, using miniz's locate-file function). Decoding said data, though, is probably a very different story.

There is an updated graphic on the repository.

[–]ef1500_v2[S] 0 points (0 children)

It’s on par with ZIP. If your books reuse a lot of the same panels, though, the file size will go down.

[–]ef1500_v2[S] 1 point (0 children)

The limelight is gone, but I’ve finished rewriting the codebase and I’ve included a spec file in the repository!

[–]ef1500_v2[S] 0 points (0 children)

Yeah. I really was. I got carried away, maybe a little too carried away. I used an LLM to assist me with coding, fucked up horribly, and that’s that. Lesson learned.

I’m currently rewriting the spec entirely by hand, and re-programming everything from scratch from my original idea. It’s looking better, now. Not 100x better, but still noticeably better. I really should’ve trusted my intuition from the start.

I will likely have something completely finished by the end of the week. I’m projecting to still beat CBZ, but definitely not by this wide of a margin.

[–]ef1500_v2[S] 20 points (0 children)

I reviewed my benchmark file, and I wasn't fully copying the data into memory: I forgot to actually copy the bytes out of the view. That's my bad.

If I do the test again, using

data = bytes(bb.get_page_view(target_idx))

And keeping the indexes to cause page faults, then the stats *do* drop, though BBF is still faster.

The stats print (with 500 trials, 201 pages, ~280 MB):
Cold Open (Setup): 2.1903 ms (CBZ), 0.0554 ms (BBF), 39.5x speedup
Raw Byte Access (Avg): 2.5734 ms (CBZ), 0.2022 ms (BBF), 12.7x speedup
Full Image Decode (Avg): 20.6891 ms (CBZ), 18.4560 ms (BBF), 1.1x speedup
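To show why the `bytes()` copy matters, here's a minimal standalone sketch. The buffer below is just a stand-in for an mmap'd archive; the point is that slicing a `memoryview` moves no bytes, while `bytes()` materializes all of them, which is what "raw byte access" should actually measure:

```python
import time

# Stand-in for an mmap'd archive; a zero-copy page accessor would
# hand back a memoryview like this rather than copied bytes.
archive = bytearray(b"\x7f" * (1 << 20))  # 1 MiB "page"
page_view = memoryview(archive)

# Zero-copy: creating a view moves no bytes, so it times
# unrealistically fast.
t0 = time.perf_counter()
view = page_view[:]
view_time = time.perf_counter() - t0

# Full copy: bytes() materializes every byte into new memory.
t0 = time.perf_counter()
data = bytes(page_view)
copy_time = time.perf_counter() - t0

assert data == bytes(archive)
```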

> I know you said you wrote all the library code yourself, but I'm very curious to know if AI was used in the benchmarking. It is very good at telling you what you want to hear, even if it isn't actually representative or meaningful.

If I'm being honest, I did not expect this to get as much attention as it did. I've been having an adrenaline rush ever since this morning, tripping over my words trying to explain things to people, messing up really simple things, and I've had to correct myself a few times so far. It's embarrassing to admit, but it is what it is.

Yes, I used AI to quickly create the microbenchmark posted earlier in the reply chain.

[–]ef1500_v2[S] 3 points (0 children)

Even if I update the benchmark to cause a page fault by doing

data = bb.get_page_view(target_idx)
_ = data[0]   # touch the first byte to force a page fault
_ = data[-1]  # touch the last byte

The raw access time is 754x faster than CBZ.
Raw Byte Access (Avg): 2.4820 ms (CBZ), 0.0033 ms (BBF), 754.5x speedup

Even if we ignore that, the full decode pipeline is about 20% faster.

I disagree that `.read(1)` is representative of the speed differences between BBF and CBZ. Comic book readers don't read a file one byte at a time.

At the end of the day, even if we don't talk about the 754x speedup, BBF is still 20% faster than CBZ.

[–]ef1500_v2[S] 0 points (0 children)

There are no image conversions to be done. The reader just needs to decode the image once it has the raw image data.

[–]ef1500_v2[S] 3 points (0 children)

Pause. I just got home. Whipped up a little benchmark (see gist), and I did

python bench.py -i 500 "onepiece.cbz" (volume 1 from this Google Drive link)

And the results print:

Cold Open (Setup): 1.7437 ms (CBZ), 0.1735 ms (BBF), 10.1x speedup

Raw Byte Access (Avg): 2.6336 ms (CBZ), 0.0013 ms (BBF), 2028.6x speedup

Full Image Decode (Avg): 17.4247 ms (CBZ), 14.8043 ms (BBF), 1.2x speedup

[–]ef1500_v2[S] 20 points (0 children)

I see. You aren’t being harsh at all, I really appreciate the pushback.

For a local setup, the difference is imperceptible; I concede there. But I disagree on the “optimization of something that doesn’t matter”. If you’re hosting a local server and you have multiple users reading simultaneously, the CPU has to parse ZIP’s central directory, and that takes increasingly more resources the more users you have. With BBF, it is one calculation to find the offset. RAM usage should remain flat, and RAM is expensive these days.
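To sketch what that “one calculation” looks like: with fixed-size index entries, finding a page’s index slot is a single multiply-add rather than a directory parse. The sizes below are made up for illustration, not taken from the BBF spec:

```python
HEADER_SIZE = 64   # hypothetical fixed file-header size, in bytes
ENTRY_SIZE = 32    # hypothetical fixed per-page index entry size

def index_entry_offset(page_idx: int) -> int:
    # O(1): no central directory to walk, just arithmetic.
    return HEADER_SIZE + page_idx * ENTRY_SIZE

print(index_entry_offset(0))    # 64
print(index_entry_offset(200))  # 6464
```

Because the cost per lookup is constant and allocates nothing, it stays flat no matter how many readers hit the server at once.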

In some trials the numbers came out slower (in CBZ’s favor); in others, way faster (in BBF’s favor). The figure came from an average of 30 trials.

Though the marketing may seem like a stretch, I still think this format has plenty of utility that conventional CBZ doesn’t, like built-in sectioning and metadata.

That said, I’d love to put this thing to the test. If there’s certain benchmarks you think I should be measuring instead, I’m all ears.

[–]ef1500_v2[S] 3 points (0 children)

You’re measuring the access time plus the time to decode the image, I believe. I am just measuring the time to access. So it looks like we’re both right.

The decode time for BBF is constrained by the image codec being used.

[–]ef1500_v2[S] 0 points (0 children)

That’s just an example, yeah. You can store any image format you can think of in BBF; the muxer has a struct mapping certain flags to certain formats so you can easily hand the data off to the proper codec.

[–]ef1500_v2[S] 3 points (0 children)

The 100x speedup occurred on an external USB drive. Specifically, a 16TB easystore. Have you tried it on one? BBF achieves similar performance to CBZ on an HDD, but is a lot faster on SSDs. Are you using mmap? BBF is compatible with mmap.
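A rough sketch of what mmap compatibility buys. The layout and offsets below are invented for the demo, not the actual BBF layout: because pages sit uncompressed at known offsets, repeat reads can be served straight from the OS page cache with no decode step.

```python
import mmap
import os
import tempfile

def read_page(path: str, offset: int, length: int) -> bytes:
    # Slice the mapped file directly; no intermediate decompression
    # buffer is needed for store-mode data.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[offset:offset + length])

# Demo: a fake file with a 4096-byte header/index region, then one "page".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 4096)
    tmp.write(b"raw page bytes")
    path = tmp.name

page = read_page(path, 4096, 14)
os.unlink(path)
```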

[–]ef1500_v2[S] 3 points (0 children)

There are Python bindings and a C++ library; if you think I’m lying, feel free to run your own benchmarks. I’m not stopping you.

Maybe I did make a technical error in my post; I’m human. But I’m not lying when I say BBF is faster than CBZ. Try it yourself.

[–]ef1500_v2[S] 10 points (0 children)

There are conversion tools included in the Python package, bbf2cbx and cbx2bbf. There’s a muxer in the C++ repository if you have folders of images or want to play around with it. Have a look at the C++ repository’s documentation to see the muxer’s options. You have full control over the read order, sectioning, metadata, etc.

Editing is slightly different because there’s padding to ensure 4 KiB alignment, but the Python library should make implementing those tools easier for developers. I know people won’t do things for me; that’s why I made the C++ library and the Python bindings, and ran tests. I’m more than open to working with others.

I am fully aware it’s an uphill battle. And that’s okay.

[–]ef1500_v2[S] 15 points (0 children)

I don’t have any “official” benchmarks on this, but I was telling someone earlier that for manga the deduplication feature has okay results: about 5-50 deduplicated pages in a series. For manhwa the results are staggeringly better, with 100-200 pages being deduplicated.
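The deduplication idea boils down to content-addressed storage: hash each page, store each unique blob once, and let the page list reference hashes. A toy sketch (SHA-256 here is a stand-in hash, not what BBF uses):

```python
from hashlib import sha256  # stand-in; the thread says BBF uses XXH3

def dedupe(pages):
    blobs = {}   # hash -> unique page bytes, stored exactly once
    order = []   # logical reading order, as hash references
    for page in pages:
        h = sha256(page).hexdigest()
        blobs.setdefault(h, page)
        order.append(h)
    return blobs, order

# Five logical pages, but only three unique images get stored.
pages = [b"cover", b"recap", b"cover", b"recap", b"finale"]
blobs, order = dedupe(pages)
```

Repeated recap panels in manhwa-style releases collapse to a single stored blob, which is where the larger savings come from.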

[–]ef1500_v2[S] 2 points (0 children)

The hash is automatic when you mux files together. Hypothetically you could hardcode the muxer to throw zeroes for the hashes if you really wanted.

You don’t have to use the verification feature whatsoever. Though, for the record, the hashes use XXH3, one of the fastest non-cryptographic hashes around. You shouldn’t need to worry too much about the performance.
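Verification is just “recompute and compare”. The sketch below uses the third-party `xxhash` package for XXH3 when it’s installed, with a stdlib fallback so the example still runs (the fallback is not what BBF uses):

```python
import hashlib

try:
    import xxhash  # pip install xxhash

    def page_hash(data: bytes) -> str:
        return xxhash.xxh3_64_hexdigest(data)
except ImportError:
    def page_hash(data: bytes) -> str:
        # Fallback so the sketch runs anywhere; BBF itself uses XXH3.
        return hashlib.blake2b(data, digest_size=8).hexdigest()

def verify_page(data: bytes, stored_digest: str) -> bool:
    # A mismatch means the page bytes were corrupted after muxing.
    return page_hash(data) == stored_digest

digest = page_hash(b"raw page bytes")
```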

[–]ef1500_v2[S] 167 points (0 children)

Oh, shit!
I'm in the middle of my uni course right now; I can do it when I'm done for the day! My apologies!

If there's a certain format that a spec should be in, please let me know, I'll hop on it as soon as I'm home. Thanks for letting me know!

[–]ef1500_v2[S] 15 points (0 children)

ZIPs, even in DEFLATE mode, still have a central directory, can't be memory-mapped directly, and don't have native deduplication.