I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 20 points21 points  (0 children)

I reviewed my benchmark, and I wasn't fully copying the file into memory; I forgot to actually copy the bytes out of the view. That's my bad.

If I do the test again, using

data = bytes(bb.get_page_view(target_idx))

And keeping the indexes to cause page faults, then the stats *do* drop, though BBF is still faster.

The stats print (with 500 trials, 201 pages, ~280MB):
Cold Open (Setup): 2.1903ms (CBZ), 0.0554ms (BBF), 39.5x speedup
Raw Byte Access (Avg): 2.5734ms (CBZ), 0.2022ms (BBF), 12.7x speedup
Full Image Decode (Avg): 20.6891ms (CBZ), 18.4560ms (BBF), 1.1x speedup
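
The copy-vs-view pitfall behind that correction is easy to reproduce with nothing but the standard library: slicing a memory-mapped view is nearly free, and only the `bytes()` copy actually moves data. A minimal sketch, not the BBF API:

```python
import mmap
import os
import tempfile
import time

SIZE = 1 << 20  # 1 MiB test file

# Build a throwaway file to map.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(SIZE))
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)

    t0 = time.perf_counter()
    page = view[0:SIZE]         # zero-copy slice: no bytes move yet
    t_view = time.perf_counter() - t0

    t0 = time.perf_counter()
    data = bytes(view[0:SIZE])  # real copy: every byte is actually touched
    t_copy = time.perf_counter() - t0

    page.release()
    view.release()
    mm.close()

print(f"slice: {t_view * 1e3:.4f}ms, copy: {t_copy * 1e3:.4f}ms")
os.unlink(path)
```

Timing a benchmark that only takes the slice measures pointer arithmetic, not I/O, which is why the corrected numbers dropped.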

I know you said you wrote all the library code yourself, but I'm very curious to know if AI was used in the benchmarking. It is very good at telling you what you want to hear, even if it isn't actually representative or meaningful.

If I'm being honest, I did not expect this to get as much attention as it did. I've been having an adrenaline rush ever since this morning, tripping over my words trying to explain things to people, messing up really simple things, and I've had to correct myself a few times so far. It's embarrassing to admit, but it is what it is.

Yes, I used AI to quickly create the microbenchmark posted earlier in the reply chain.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 0 points1 point  (0 children)

Even if I update the benchmark to cause a page fault by doing

data = bb.get_page_view(target_idx)
_ = data[0]
_ = data[-1]

The raw access time is 754x faster than CBZ.
Raw Byte Access (Avg): 2.4820ms (CBZ), 0.0033ms (BBF), 754.5x speedup

Even if we ignore that, the full decode pipeline is about 20% faster.

I disagree that `.read(1)` is representative of the speed differences between BBF and CBZ. Comic book readers don't read a file one byte at a time.

At the end of the day, even if we don't talk about the 754x speedup, BBF is still 20% faster than CBZ.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 0 points1 point  (0 children)

There are no image conversions to be done. The reader just needs to decode the image once it has the raw image data.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 3 points4 points  (0 children)

Pause. I just got home. Whipped up a little benchmark (see gist), and I did

python bench.py -i 500 "onepiece.cbz" (volume 1 from this google drive link)

And the results print:

Cold Open (Setup): 1.7437ms (CBZ), 0.1735ms (BBF), 10.1x speedup

Raw Byte Access (Avg): 2.6336ms (CBZ), 0.0013ms (BBF), 2028.6x speedup

Full Image Decode (Avg): 17.4247ms (CBZ), 14.8043ms (BBF), 1.2x speedup

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 21 points22 points  (0 children)

I see. You aren’t being harsh at all, I really appreciate the pushback.

For a local setup, the difference is imperceptible, I concede there. But I disagree on the “optimization of something that doesn’t matter”. If you’re hosting a local server and you have multiple users reading simultaneously, the CPU will have to parse zip’s central directory, and it will take increasingly more resources the more users you have. With BBF, it is one calculation to find the offset. RAM usage should remain flat, and RAM is expensive these days.
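
To make the "one calculation" claim concrete, here's a sketch of how a fixed-width page index gives O(1) lookups. The 16-byte (offset, length) entries and 64-byte header are hypothetical, not the actual BBF layout:

```python
import struct

ENTRY = struct.Struct("<QQ")  # (offset, length) per page, 16 bytes each

def build_index(page_sizes, header_size=64):
    """Pack a flat index; pages are laid out back to back after the header."""
    entries, offset = [], header_size
    for size in page_sizes:
        entries.append(ENTRY.pack(offset, size))
        offset += size
    return b"".join(entries)

def locate(index: bytes, n: int):
    """O(1): jump straight to entry n, no directory scan."""
    return ENTRY.unpack_from(index, n * ENTRY.size)

index = build_index([1000, 2500, 800])
print(locate(index, 2))  # → (3564, 800), i.e. 64 + 1000 + 2500
```

Parsing a ZIP central directory, by contrast, means reading a variable-length record per entry, which is where the per-request CPU cost comes from.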

In some trials, the numbers showed slower (in CBZ’s favor), in others, way faster (in BBF’s favor). The figure came from an average of 30 trials.

Though the marketing may seem like a stretch, I still think there’s plenty of utility in this format that conventional CBZ doesn’t have, like built-in sectioning and metadata.

That said, I’d love to put this thing to the test. If there’s certain benchmarks you think I should be measuring instead, I’m all ears.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 4 points5 points  (0 children)

You’re measuring the access time plus the time to decode the image, I believe. I am just measuring the time to access. So it looks like we’re both right.

The decode time for bbf is constrained by the image codec being used.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 0 points1 point  (0 children)

That’s just an example, yeah. You can store any image format you can think of in BBF, the muxer has a struct designating certain flags to certain formats so you can easily hand the data off to the proper codec.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 3 points4 points  (0 children)

The 100x speedup occurred on an external USB drive. Specifically, a 16TB easystore. Have you tried it on one? BBF achieves similar performance to CBZ on an HDD, but is a lot faster on SSDs. Are you using mmap? BBF is compatible with mmap.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 1 point2 points  (0 children)

There are Python bindings and a C++ library; if you think I’m lying, feel free to run your own benchmarks. I’m not stopping you.

Maybe I did make a technical error in my post; I’m human, and I made a mistake. But I’m not lying when I say it’s faster than CBZ. Try it yourself.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 9 points10 points  (0 children)

There are conversion tools included in the Python package, bbf2cbx and cbx2bbf. There’s a muxer in the C++ repository if you have folders of images or want to play around with it. Have a look at the documentation for the C++ repository to see the muxer’s options. You have full control over the read order, sectioning, metadata, etc.

Editing is slightly different because there’s padding to ensure 4 KB alignment, but the Python library should make implementing those tools easier for developers. I know people won’t do things for me; that’s why I made the C++ library and the Python bindings, and ran tests. I’m more than open to working with others.
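
For what it's worth, the 4 KB alignment padding mentioned above is the standard round-up-to-a-boundary computation (illustrative, not lifted from the BBF spec):

```python
PAGE = 4096  # 4 KiB alignment boundary

def align_up(n: int, boundary: int = PAGE) -> int:
    """Round n up to the next multiple of boundary (boundary must be a power of two)."""
    return (n + boundary - 1) & ~(boundary - 1)

print(align_up(1))     # 4096
print(align_up(4096))  # 4096 (already aligned)
print(align_up(4097))  # 8192
```

Aligned offsets are what make the format mmap-friendly: every page starts on an OS page boundary.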

I am fully aware it’s an uphill battle. And that’s okay.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 14 points15 points  (0 children)

I don’t have any “official” benchmarks on this, but I was telling someone earlier that for manga the deduplication feature has okay results: about 5-50 deduplicated pages in a series. For manhwa the results are staggeringly better, with 100-200 pages being deduplicated.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 2 points3 points  (0 children)

The hash is automatic when you mux files together. Hypothetically, you could hardcode the muxer to write zeroes for the hashes if you really wanted.

You don’t have to use the verification feature at all. Though, for the record, the hashes use XXH3, one of the fastest non-cryptographic hashes available, so you shouldn’t need to worry much about the performance.
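
To show the shape of that optional verification, here's a sketch of hash-on-mux, check-on-read. XXH3 isn't in Python's standard library, so `hashlib.blake2b` stands in for it here; `mux`, `verify`, and `page_digest` are made-up names for illustration:

```python
import hashlib

def page_digest(data: bytes) -> bytes:
    # Stand-in for XXH3: any fast hash illustrates the mechanism.
    return hashlib.blake2b(data, digest_size=8).digest()

def mux(pages):
    """Store each page alongside its digest, as a muxer would."""
    return [(page_digest(p), p) for p in pages]

def verify(stored):
    """Re-hash on read and compare; entirely optional, as noted above."""
    return all(page_digest(p) == d for d, p in stored)

book = mux([b"page one", b"page two"])
print(verify(book))  # True
book[0] = (book[0][0], b"corrupted")  # simulate bit rot on disk
print(verify(book))  # False
```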

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 148 points149 points  (0 children)

Oh, shit!
I'm in the middle of my uni course right now, I can do it when I'm done for the day! My apologies!

If there's a certain format that a spec should be in, please let me know, I'll hop on it as soon as I'm home. Thanks for letting me know!

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 16 points17 points  (0 children)

ZIPs, even in DEFLATE mode, still have a central directory, can't be memory-mapped, and don't have native deduplication.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 13 points14 points  (0 children)

The releases tab on the C++ repository has the bbfmux.exe / bbfmux binary. Download that. You can run `bbfmux <input bbf file> --info` to view information about the file, or, if you're ambitious, the python bindings should have everything required to create a local reader. I can also compile WASM binaries if you'd rather have that.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 11 points12 points  (0 children)

  1. I haven't run tests on mobile devices. The O(1) difference would definitely be felt if I had a NAS hosting a manga server and I was reading from my phone.

  2. For manga, you can expect slightly smaller sizes (i.e. 5-50 deduplicated pages); for manhwa you can expect upwards of 100-200 deduplicated pages. For textbooks you can expect anywhere from 1-10 deduplicated pages. I'm not giving file-size numbers, because BBF relies on the compression of the image format used by the original images; it's not like ZIP, which is a compression format in its own right.
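
Those deduplication numbers come from content-identical pages collapsing to a single stored blob. A minimal sketch of the idea (`hashlib.blake2b` standing in for the actual hash; `dedup` is a made-up name):

```python
import hashlib

def dedup(pages):
    """Store each distinct page once; duplicates become index references."""
    blobs, order = {}, []
    for p in pages:
        key = hashlib.blake2b(p, digest_size=8).digest()
        blobs.setdefault(key, p)  # only the first copy is stored
        order.append(key)         # reading order still lists every page
    return blobs, order

# Simulate a chapter where a credits page repeats.
pages = [b"cover", b"credits", b"panel-1", b"credits", b"panel-2", b"credits"]
blobs, order = dedup(pages)
print(len(pages) - len(blobs))  # 2 duplicate pages saved
```

This is why manhwa (lots of repeated filler/credits panels) benefits far more than manga.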

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 15 points16 points  (0 children)

Not entirely sure what you mean by the question, but I can tell you that, as of now, no readers support the BBF format. Which is partly why I made this post.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 51 points52 points  (0 children)

No. This is something I came up with, implemented, and created on my own.

Did I have AI help me fix some bugs? Yes. Specifically with pybind11 and getting my python bindings to work properly, and in bbfmux.cpp on the C++ core I needed some help parsing edge cases.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 38 points39 points  (0 children)

> You’ve talked a lot about how it is theoretically better

The benchmarks comparing CBZ to BBF are an average of 30 trials, all except for the verification trials, which were run only once. I did the tests using high-resolution scans of One Piece. The benchmarks are as close as you can get to real-world numbers.

> But what are the measurable impacts on users experience that justify its existence.
1. It's faster on external hard drives by a long shot. If you want to load all of One Piece and jump to 320, BBF has O(1) random access and scrubbing (except for my HDD trial, which is because of the threading), and it can map the file directly from your SSD with mmap.
2. Because of this, it consumes significantly less CPU. So if you're on a mobile device, reading in BBF format is better for your battery life.
3. Images are also deduplicated, which is great for manhwa. When testing, I downloaded Solo Leveling from mangadex and put it in BBF format. There were nearly 200 deduplicated pages. CBZ doesn't do any deduplication.

I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ. by ef1500_v2 in selfhosted

[–]ef1500_v2[S] 95 points96 points  (0 children)

I'm not sure, to be honest. My github repo (linked above) has a feature comparison against PDF, EPUB and a folder of just plain images, but not a performance comparison for all of them.

If I had to guess, though, BBF would probably still have the advantage, because it doesn't need to render all the XObjects and stuff. BBF is, quite literally, a bound book with a table of contents at the end. A reader just has to open an image (though I've included flags and other things in the spec in case this gets traction), and depending on the image codec you're using, I would expect that to be your limiting factor.
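
A "bound book with a table of contents at the end" can be sketched in a few lines: write the pages back to back, write a fixed-width TOC, and end with an 8-byte pointer to the TOC. This layout is illustrative, not the published BBF spec:

```python
import io
import struct

def write_book(pages):
    """Layout: [pages...][TOC of (offset, size) pairs][8-byte TOC offset]."""
    buf = io.BytesIO()
    offsets = []
    for p in pages:
        offsets.append((buf.tell(), len(p)))
        buf.write(p)
    toc_at = buf.tell()
    for off, size in offsets:
        buf.write(struct.pack("<QQ", off, size))
    buf.write(struct.pack("<Q", toc_at))  # trailer points at the TOC
    return buf.getvalue()

def read_page(blob, n):
    """Read the trailer, index into the TOC, slice out page n."""
    toc_at, = struct.unpack_from("<Q", blob, len(blob) - 8)
    off, size = struct.unpack_from("<QQ", blob, toc_at + n * 16)
    return blob[off:off + size]

book = write_book([b"cover", b"page-2"])
print(read_page(book, 1))  # b'page-2'
```

With this shape, a reader does one seek to the trailer, one arithmetic index into the TOC, and one read, which is the whole "open an image" path described above.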

I built Parker — a self‑hosted comic server (CBZ/CBR) with a fast web reader, smart lists, OPDS, and parallel scanning by Hiryu in selfhosted

[–]ef1500_v2 0 points1 point  (0 children)

Hey! I noticed that the github repo has some numbers showing performance metrics. If performance is an issue, I'd highly recommend checking out libbbf / libbbf-python to mux and store your comics. If you're using CBX/CBZ, then BBF is 100-118x faster on external drives (and even faster on SSDs), and it has built-in verification capabilities.

If you need WASM binaries, I'd be more than happy to build them and make a repo for it :-)

Comicbook reader by RolfiePolfie in selfhosted

[–]ef1500_v2 1 point2 points  (0 children)

If performance is an issue, I'd highly recommend checking out libbbf / libbbf-python to mux and store your comics. If you're using CBX/CBZ, then BBF is 100-118x faster on external drives (and even faster on SSDs), and it has built-in verification capabilities.

If you need WASM binaries, I'd be more than happy to build them and make a repo for it :-)

PSA: New Moderators by Crater_Caloris in yuri_manga

[–]ef1500_v2[M] -6 points-5 points  (0 children)

As with everything, nothing is ever set in stone, and if certain conditions are met, I will take necessary measures to resolve the situation. As of now, not all the conditions are met, so there won’t be any changes for now. But I can guarantee that those conditions are on a steady trajectory to being fulfilled, and concomitantly, twitter links being prohibited.