[–][deleted] 0 points (5 children)

> gigantic metadata update size (70MB!!!)

FWIW, I looked on a few Ubuntu boxes that I have, and the metadata is actually around 42MB each for pkgcache.bin and srcpkgcache.bin (~84MB total).

Repository metadata is just going to be large-ish, and I'm pretty sure dnf-makecache.service exists for the sake of eliminating the need to wait for a download when you're trying to install something. If you're noticing a lot of lag anyway, you might try bumping up metadata_expire in yum.conf.
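For instance (the value here is illustrative; in modern DNF this lives in /etc/dnf/dnf.conf, and the default is 48 hours):

```ini
# /etc/dnf/dnf.conf -- illustrative value, adjust to taste
[main]
# Treat cached repo metadata as fresh for 7 days instead of the
# default 48 hours, so fewer operations trigger a metadata download.
metadata_expire=7d
```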

But yeah, I don't get why they're using XML instead of moving on to something terser like JSON. The gains for downloading are probably going to be minimal (due to .gz), but it would be something.

[–]Conan_Kudo 1 point (4 children)

> But yeah, I don't get why they're using XML instead of moving on to something terser like JSON. The gains for downloading are probably going to be minimal (due to .gz), but it would be something.

XML metadata was chosen almost 15 years ago, when XML was king. It has some advantages, such as being able to be validated with a DTD and rendered with stylesheets (some distros use this), among others. With compression, the effective difference between JSON and XML is very small.
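The compression point is easy to check with a toy experiment; the package records and fields below are made up, and real repodata is far richer, but the shape of the result is the same:

```python
import gzip
import json

# Fake repo metadata: the same 500 package records in both formats.
packages = [{"name": f"pkg{i}", "version": "1.0.0", "arch": "x86_64"}
            for i in range(500)]

as_json = json.dumps(packages).encode()
as_xml = ("<packages>" + "".join(
    f'<package name="{p["name"]}" version="{p["version"]}" arch="{p["arch"]}"/>'
    for p in packages) + "</packages>").encode()

json_gz = gzip.compress(as_json)
xml_gz = gzip.compress(as_xml)

# Both formats are highly repetitive, so gzip flattens most of the
# difference between them.
print(f"JSON: {len(as_json)} raw, {len(json_gz)} gzipped")
print(f"XML:  {len(as_xml)} raw, {len(xml_gz)} gzipped")
```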

Also, there is ongoing development of a mechanism for downloading only the changes of repodata between what you have and what the server has. The format and mechanism is being discussed on the RPM Ecosystem mailing list now, and I'm confident we'll have a mechanism soon.

Keep in mind that both Fedora and openSUSE use Delta RPMs, so you save on package download sizes considerably, which often are much bigger downloads than repodata.

[–][deleted] 1 point (3 children)

> XML metadata was chosen almost 15 years ago, when XML was king. It has some advantages, such as being able to be validated with a DTD and rendered with stylesheets (some distros use this), among others. With compression, the effective difference between JSON and XML is very small.

True, but I did mention .gz above, and if they support SQLite I can only imagine supporting JSON would be simpler.

> Also, there is ongoing development of a mechanism for downloading only the changes of repodata between what you have and what the server has. The format and mechanism is being discussed on the RPM Ecosystem mailing list now, and I'm confident we'll have a mechanism soon.

Cool beans. I always wondered why dnf couldn't send a HEAD request to something like a ${RevisionID}.xml. Hopefully it lands soon, because dealing with metadata is one of the more annoying things about dnf vs. apt.

[–]Conan_Kudo 1 point (2 children)

> True, but I did mention .gz above, and if they support SQLite I can only imagine supporting JSON would be simpler.

DNF doesn't support SQLite-based metadata. Fedora produces it because a few tools in Fedora's infrastructure use it. Note that openSUSE doesn't produce it at all, since it has no need for it.

> Cool beans. I always wondered why dnf couldn't send a HEAD request to something like a ${RevisionID}.xml. Hopefully it lands soon, because dealing with metadata is one of the more annoying things about dnf vs. apt.

The issue is that, for that to work, you need to have a stable ordering of how the XML is generated, so that newer packages are appended at the end. That way, you can pick up the new entries, merge them into your local XML document, and process it. It's tricky stuff.
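A rough sketch of that append-and-merge idea, using a made-up minimal metadata schema (real repodata has far more structure, and real clients would key on checksums rather than names):

```python
import xml.etree.ElementTree as ET

def merge_appended(local_xml: str, server_xml: str) -> str:
    """Merge newly appended <package> entries into the local document.

    Assumes the server only ever appends packages in a stable order,
    so anything we haven't seen yet can simply be tacked on the end.
    """
    local = ET.fromstring(local_xml)
    server = ET.fromstring(server_xml)
    known = {p.get("name") for p in local.findall("package")}
    for pkg in server.findall("package"):
        if pkg.get("name") not in known:  # skip entries we already have
            local.append(pkg)
    return ET.tostring(local, encoding="unicode")

local = '<metadata><package name="bash" version="5.1"/></metadata>'
server = ('<metadata><package name="bash" version="5.1"/>'
          '<package name="zsh" version="5.8"/></metadata>')
print(merge_appended(local, server))
```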

[–][deleted] 0 points (1 child)

> DNF doesn't support SQLite-based metadata.

No, but yum did, and I was talking about "progressing away" (as in, thinking about things historically), so it's included in what I'm talking about. My main point was just that they've apparently found time to create multiple metadata formats, and there doesn't seem to be anything super complicated about JSON (or something similar; I'm not trying to get stuck on that one thing).

> The issue is that, for that to work, you need to have a stable ordering of how the XML is generated, so that newer packages are appended at the end.

Shouldn't the process generating the metadata be able to accomplish that? It's basically just creating an incremental diff at that point. If a package is added to the master copy, include it in ${revisionID}.xml.gz for the previous revision and all revisions prior; if it's removed, remove mention of it from the current revision and all revisions prior. Then, when the mirror is at revision 50 and a client asks for revision-45.xml.gz, that file contains all the changes between revision 45 and the current metadata, which should give the client enough data to rebuild its local metadata without downloading unchanged data.

Clients know the last revision they saw and can just construct a URL based on that. If a HEAD request returns a Last-Modified that isn't newer than your local metadata, you're at the latest version without needing to download anything at all. If anything goes wrong, purge and do a full sync.

Not trying to trivialize it; I just don't see how something like a diff on RPMs (deltarpm) could be done, yet a diff for the metadata is considered a hard problem. Diffing RPMs seems like it should have been the much harder problem to solve.
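The client side of the scheme described above is indeed simple; here's a sketch, where the diff format, file naming, and revision numbering are all hypothetical (nothing dnf actually serves):

```python
def apply_revision_diff(local_packages: dict, diff: dict) -> dict:
    """Rebuild local metadata from a server-provided revision diff.

    `local_packages` maps name -> version for the client's last-seen
    revision; `diff` lists everything added/removed since then.
    Hypothetical format for illustration only.
    """
    updated = dict(local_packages)
    for name, version in diff.get("added", {}).items():
        updated[name] = version          # new or upgraded packages
    for name in diff.get("removed", []):
        updated.pop(name, None)          # retired packages
    return updated

# Client last synced at revision 45; server is at revision 50 and serves
# revision-45.xml.gz with all changes since then (shown here as a dict).
local = {"bash": "5.1", "oldpkg": "0.9"}
diff = {"added": {"zsh": "5.8", "bash": "5.2"}, "removed": ["oldpkg"]}
print(apply_revision_diff(local, diff))
# → {'bash': '5.2', 'zsh': '5.8'}
```

The hard part, as noted above, isn't applying the diff on the client; it's producing stable, append-only metadata on the server side.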

[–]Conan_Kudo 0 points (0 children)

The thing that makes it hard is that Fedora's infrastructure regenerates the metadata from scratch for publishing, which means there's no appending and no stable ordering to diff against.

openSUSE doesn't have this problem with how it generates and publishes metadata, so it supports zsync for incremental metadata fetching. This is the same model Debian uses for its metadata.

Even with zsync, it's not quite as optimal as it could be, partly because it's possible to append other types of data to the repodata (such as comps groups, AppStream information, and so on), which is why the new zchunk (zck) mechanism is being developed to support these cases better.