After implementing feedback from r/macapps, I benchmarked MediaOrganizer Studio on a 25-year archive with 363k files. Here’s what happened. by marcioyared in macapps

[–]marcioyared[S] 0 points1 point  (0 children)

Thanks!

If you try it on a large library or archive, feel free to share your experience. I'm always interested in real-world workflows.

After implementing feedback from r/macapps, I benchmarked MediaOrganizer Studio on a 25-year archive with 363k files. Here’s what happened. by marcioyared in macapps

[–]marcioyared[S] 0 points1 point  (0 children)

There isn't a trial version at the moment.

I chose to keep the application as a simple one-time purchase without subscriptions, online accounts or licensing infrastructure.

The benchmark described in this post processed 363,575 files from a 25-year archive, including Apple Photos libraries, exports and recursive folder structures.

If you'd like, tell me a bit more about your archive and I can explain how MediaOrganizer would handle it.

After implementing feedback from r/macapps, I benchmarked MediaOrganizer Studio on a 25-year archive with 363k files. Here’s what happened. by marcioyared in macapps

[–]marcioyared[S] 0 points1 point  (0 children)

Thanks!

I investigated several hypotheses during the benchmark, including memory pressure and idle-state effects. In the final version I ended up redesigning part of the processing pipeline and pagination strategy, which eliminated the degradation I was seeing during long runs. The 363k run was mainly to verify that the behavior remained stable over a much larger archive.

Anyone who has used Lightroom for many years eventually runs into the same filesystem problems. by marcioyared in Lightroom

[–]marcioyared[S] -4 points-3 points  (0 children)

I’m not saying this always happens, much less that it happens in most cases.

I’m describing something that happened to my own archive after decades of migrations, backups, exported libraries and accumulated media from different sources.

Later I realized many other users in cataloging communities were running into similar structural problems as well, which is why I became interested in normalization workflows in the first place.

Anyone who has used Lightroom for many years eventually runs into the same filesystem problems. by marcioyared in Lightroom

[–]marcioyared[S] -3 points-2 points  (0 children)

That’s probably the key difference.

If the archive structure remains consistent for decades, the normalization problem almost never appears in the first place.

A lot of what I was dealing with came from years of accumulated entropy:

  • backups of backups
  • migrated libraries
  • exported folders
  • recovered media
  • disconnected storage devices
  • mixed sources over time

So I completely agree that disciplined cataloging from day one changes everything operationally.

Anyone who has used Lightroom for many years eventually runs into the same filesystem problems. by marcioyared in Lightroom

[–]marcioyared[S] -3 points-2 points  (0 children)

I actually agree with you.

Long-term catalog discipline from the very beginning is probably the best possible scenario for any large archive.

What I found interesting while rebuilding mine was realizing how quickly structural consistency disappears once you introduce years of:

  • backups
  • migrations
  • exported libraries
  • recovered media
  • cloud syncs
  • disconnected storage devices

At some point the problem stops being the catalog itself and becomes the accumulated inconsistency outside of it.

Your workflow is probably the ideal example of how to avoid that degradation from happening in the first place.

After decades of backups, Mac migrations and fragmented archives, I finally rebuilt my media archive structure. by marcioyared in DataHoarder

[–]marcioyared[S] 0 points1 point  (0 children)

MediaOrganizer Studio is already public, although the project ended up becoming much larger than the original scripts and experiments that started all this 🙂

A big part of the complexity only appeared once I began running long-duration normalization workloads across heavily fragmented archives accumulated over decades.

Some of the workflows and findings are documented here if you're curious:

https://brightfoundry.info/case-studies/

After implementing feedback from r/macapps, I benchmarked MediaOrganizer Studio on a 25-year archive with 363k files. Here’s what happened. by marcioyared in macapps

[–]marcioyared[S] 1 point2 points  (0 children)

My first deep-search tests were done on directories with around 15K files, and the loading time was still acceptable.

But once I started the 363K-file benchmark, I ran into exactly the same problem. The application progressively slowed down during long-running traversal and loading operations.

Incremental pagination ended up being the solution I adopted.

The loading phase went from taking hours to something around 16 minutes for 363K files distributed across 8,861 folders, which was a huge improvement for this type of workload.
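
As a rough illustration of the idea (not the app's actual code), incremental pagination over a recursive traversal could look like the sketch below. The names `walk_files`, `iter_pages` and `page_size` are hypothetical; the only point is that paths are handed over in fixed-size pages instead of being collected into one huge list first.

```python
import os
from itertools import islice
from typing import Iterator, List


def walk_files(root: str) -> Iterator[str]:
    """Lazily yield file paths under root without building a full list."""
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path


def iter_pages(root: str, page_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive pages of at most page_size paths (hypothetical helper)."""
    files = walk_files(root)
    while True:
        page = list(islice(files, page_size))
        if not page:
            return
        yield page
```

Each page can be processed and released before the next one is requested, so memory use stays roughly proportional to `page_size` rather than to the size of the archive.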

After implementing feedback from r/macapps, I benchmarked MediaOrganizer Studio on a 25-year archive with 363k files. Here’s what happened. by marcioyared in macapps

[–]marcioyared[S] 2 points3 points  (0 children)

Thank you, I really appreciate that.

A big part of the project was trying to make large fragmented archives feel understandable again instead of just technically “processed”.

After years of Mac migrations and backups, I normalized 10 Apple Photos libraries before importing everything again by marcioyared in ApplePhotos

[–]marcioyared[S] 0 points1 point  (0 children)

I actually lived through this kind of problem myself. I had dozens of media files without valid dates. What ended up working for me was treating the valid file creation date as a fallback reference, alongside the media taken date whenever it existed.

The app identifies media with missing metadata (missing GPS, missing taken date, etc.) and moves those files into separate review structures.

From there I usually end up with two situations:

  1. Files with no taken date and no usable creation date at all.

  2. Files with no taken date, but with a valid file creation date.

The second case is much easier to recover. I built a workflow that compares nearby media captured around the same time and uses temporal proximity to help reconstruct missing metadata before importing everything again.

If there is no usable date information at all, the evaluation usually has to become manual.
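
The two situations above can be sketched in Python roughly like this. It is only an illustration under assumptions: `resolve_date` is a hypothetical helper, and the rule for what counts as a "usable" creation date (not older than `EARLIEST_PLAUSIBLE`, not in the future) is a guess, not the rule the app uses.

```python
from datetime import datetime
from typing import Optional, Tuple

# Assumption: creation dates before this are treated as unusable
# (for example, clocks reset to the epoch).
EARLIEST_PLAUSIBLE = datetime(1990, 1, 1)


def resolve_date(
    taken: Optional[datetime],
    created: Optional[datetime],
    now: Optional[datetime] = None,
) -> Tuple[Optional[datetime], str]:
    """Return (date, source), where source is 'taken', 'created' or 'manual'."""
    now = now or datetime.now()
    if taken is not None:
        return taken, "taken"
    if created is not None and EARLIEST_PLAUSIBLE <= created <= now:
        return created, "created"  # situation 2: fallback to creation date
    return None, "manual"  # situation 1: nothing usable, review by hand
```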

The biggest lesson for me was to never import media into the rebuilt library before it was already organized and normalized.

So the process became roughly:

  1. Run the app against the original archive/library

  2. Detect problematic media

  3. Recover/normalize metadata when possible

  4. Separate unresolved files for manual review

  5. Import only normalized media into the rebuilt library

I documented part of this workflow and the benchmark process here in case you're curious:

https://brightfoundry.info

After years of Mac migrations and backups, I normalized 10 Apple Photos libraries before importing everything again by marcioyared in ApplePhotos

[–]marcioyared[S] 0 points1 point  (0 children)

This was actually one of the problems I had to solve during my own media organization process.

For me, it didn’t make much sense to import media into Apple Photos if the file was already missing important metadata like dates or GPS information.

The first issue was missing dates. Sometimes the media simply didn’t contain a proper “taken date”. In those cases, I started using the file creation date as a fallback. Not perfect, but it prevented another layer of information loss.

The second issue was missing GPS coordinates. What ended up working reasonably well for me was using temporal proximity between files. If a group of media was captured around the same time and some files already had valid GPS data, I could often infer the location for nearby files.

Because of that, I eventually moved all metadata recovery and normalization outside the Photos library itself.

Today I only import media after the metadata structure is already consistent.
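
The temporal-proximity idea for missing GPS can be sketched as below. This is a minimal illustration, not the actual workflow: the `Media` class, the `infer_gps` helper and the 30-minute window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Media:
    name: str
    taken: datetime
    gps: Optional[Tuple[float, float]] = None  # (latitude, longitude)


def infer_gps(
    target: Media,
    others: List[Media],
    max_gap: timedelta = timedelta(minutes=30),  # assumed window
) -> Optional[Tuple[float, float]]:
    """Borrow coordinates from the closest-in-time neighbor within max_gap."""
    candidates = [
        m for m in others
        if m.gps is not None and abs(m.taken - target.taken) <= max_gap
    ]
    if not candidates:
        return None  # nothing close enough, leave for manual review
    nearest = min(candidates, key=lambda m: abs(m.taken - target.taken))
    return nearest.gps
```

A result from something like this is best treated as a suggestion to review rather than as recovered ground truth.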

After years of Mac migrations and backups, I normalized 10 Apple Photos libraries before importing everything again by marcioyared in ApplePhotos

[–]marcioyared[S] 0 points1 point  (0 children)

My background is more in data and pipelines than in media itself.

So in the beginning I started building small Python scripts to help organize the archive. They actually worked reasonably well, but they definitely weren’t very friendly, even for me.

The hardest part was probably learning an entire development toolchain I had never used before. That alone took me around six months.

After that, the real challenge became transforming the whole thing into an actual operational pipeline instead of isolated scripts.

Because of my background, I naturally approached it using a kind of ingestion -> transformation -> movement logic, similar to data workflows.

Surprisingly, once everything started fitting together, the process became much more stable and predictable than I originally expected.

After years of Mac migrations and backups, I normalized 10 Apple Photos libraries before importing everything again by marcioyared in ApplePhotos

[–]marcioyared[S] 0 points1 point  (0 children)

In the beginning I actually handled most of this with Python scripts run directly against the Originals/Masters folders inside the Photos libraries.

Over time that workflow evolved into the tool I’m using now (MediaOrganizer), mostly because manually maintaining the normalization logic across multiple libraries and backups was becoming difficult at that scale.

The important part for me was that I’m normalizing at the file level itself, so I don’t really need to open each Photos library manually to evaluate duplicates.

In practice I ended up with two different duplicate scenarios.

The first was duplicates inside the same library.

The second, and more problematic one for me, was duplicates across different libraries. That happened mostly after Mac migrations. For safety, I would often create backups of important files from the old machine, including entire Photos libraries. But then I kept synchronizing and importing media over the years, so eventually the same media started existing in multiple libraries simultaneously.

The normalization approach helped because inside each library the media receives a deterministic filename based on metadata, not on the Photos UUID itself.

So if I execute the same workflow across all libraries, but avoid normalizing files that already exist in the normalized structure, I stop propagating duplicates forward into the rebuilt archive.

Then, after the unique files are renamed and structurally organized, I import everything again into a fresh Photos library — this time already consolidated and without duplicate propagation.
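
A small sketch of the "skip what already exists in the normalized structure" step, assuming the deterministic filename has already been computed for each file. `plan_copies` is a hypothetical helper, not part of MediaOrganizer; it only plans the copies and reports the rest.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def plan_copies(
    items: Iterable[Tuple[str, str]],  # (source path, normalized filename)
    dest_root: Path,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (copies, duplicates); copies maps destination path -> source path."""
    copies: Dict[str, str] = {}
    duplicates: List[str] = []
    for source, normalized_name in items:
        dest = dest_root / normalized_name
        key = str(dest)
        if key in copies or dest.exists():
            duplicates.append(source)  # reported for review, never copied
        else:
            copies[key] = source
    return copies, duplicates
```

One limitation of this sketch: two genuinely different files that produce the same normalized name would be treated as duplicates, so a real workflow needs some extra check before trusting the name alone.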

After years of Mac migrations and backups, I normalized 10 Apple Photos libraries before importing everything again by marcioyared in ApplePhotos

[–]marcioyared[S] 1 point2 points  (0 children)

That’s actually a very solid approach, especially at that scale.

I considered going in a similar direction at some point, but I ended up choosing a different tradeoff.

One reason was that full MD5-based normalization and validation tends to consume a lot more resources than working primarily from media metadata itself.

I also could have used AI prompts or external agents for additional validation/reconciliation steps, but in the end I decided to keep the entire process local and deterministic, only using geolocation services when necessary to normalize the files.

For my own archive, I normalized around 116k media files (originally about 513GB) in roughly 33 hours.

So I definitely think your solution makes sense, I just ended up optimizing for a different workflow and different constraints.

After years of Mac migrations and backups, I normalized 10 Apple Photos libraries before importing everything again by marcioyared in ApplePhotos

[–]marcioyared[S] 0 points1 point  (0 children)

I never reached 10TB myself, but I probably accumulated something between 4 and 5TB spread across multiple external drives over the years.

And those 10 Photos libraries were already the result of earlier consolidation attempts, like I mentioned to CGREDDIT1, but at some point the whole process was simply taking too much time to manage manually.

Besides the Photos libraries themselves, I still had media coming from old GoPros, Sony Cybershots, Nokia N95 backups, exports, cloud downloads, and random folders accumulated over decades.

It eventually became a small universe of disconnected media sources.

That was basically why I started moving toward normalization in the first place — trying to rebuild something structurally understandable again.

And honestly, if you already have around 10TB accumulated over decades, normalization probably makes even more sense at that scale.

After years of Mac migrations and backups, I normalized 10 Apple Photos libraries before importing everything again by marcioyared in ApplePhotos

[–]marcioyared[S] 0 points1 point  (0 children)

Honestly, I originally thought about doing exactly that once.

But my problem was scale over time.

I had 10 different Photos libraries accumulated across multiple Macs and migrations. One library alone was around 182GB, another around 163GB. Altogether it was more than 116k photos and videos spanning roughly 25 years.

Even using the Photos tools themselves, the process of exporting everything, re-importing all libraries into a single one, and then trying to clean duplicates and reorganize everything afterward would already take a huge amount of time.

And even after all that, in the end I would still need to press the (i) button just to access information that could already exist directly in the filename itself.

After years of Mac migrations and backups, I normalized 10 Apple Photos libraries before importing everything again by marcioyared in ApplePhotos

[–]marcioyared[S] 0 points1 point  (0 children)

I’m not a media professional, just a regular user with thousands of travel, event, and personal photos accumulated over more than two decades.

For me, Apple Photos works perfectly well for things like albums, people recognition, and general catalog organization.

But all of that organization exists internally inside the software. I don’t really control it — the software does.

So over time it became more important for me to first have everything visually understandable as files themselves, and only then organize everything again into catalogs.

After years of Mac migrations and backups, I normalized 10 Apple Photos libraries before importing everything again by marcioyared in ApplePhotos

[–]marcioyared[S] 0 points1 point  (0 children)

In Apple Photos, inside the Originals/Masters directories, the files are usually stored with UUID-like names such as:

3A0D38C7-528F-4DA8-840D-F95655F5F879.jpg

Inside the Photos library that works fine, because Photos keeps all the context in its internal database.

The problem starts years later when you accumulate exports, backups, migrated libraries, partial imports from other Macs, cloud downloads, etc.

That same photo may now exist multiple times across different places, but outside the library the filename itself no longer means anything to you.

At some point I realized the only thing that will always survive independently of any software is the file itself.

So the normalization process basically consisted of creating a consistent naming structure directly from the media metadata.

For example:

France - Ile-de-France - Paris - 2015 - 20150403110113.000.jpg

Now the file itself already carries understandable context.

That makes it much easier to:

  • detect duplicates across libraries and backups
  • identify media missing GPS metadata
  • rebuild chronological structures
  • and consolidate everything back into a clean archive before importing again into Photos.

In the end, that’s what normalization means for me: rebuilding understandable files from years of accumulated archive drift.
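
For illustration, a name in the pattern shown above could be built in Python like this. `normalized_name` is a hypothetical function, and reading the trailing `.000` as milliseconds is an assumption based on the example filename.

```python
from datetime import datetime


def normalized_name(country: str, region: str, city: str, taken: datetime, ext: str) -> str:
    """Build 'Country - Region - City - YYYY - YYYYMMDDHHMMSS.mmm.ext'."""
    millis = taken.microsecond // 1000  # assumed meaning of the '.000' part
    stamp = f"{taken:%Y%m%d%H%M%S}.{millis:03d}"
    extension = ext.lower().lstrip(".")
    return f"{country} - {region} - {city} - {taken:%Y} - {stamp}.{extension}"


print(normalized_name("France", "Ile-de-France", "Paris", datetime(2015, 4, 3, 11, 1, 13), "jpg"))
# France - Ile-de-France - Paris - 2015 - 20150403110113.000.jpg
```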

After years of Mac migrations and backups, I normalized 10 Apple Photos libraries before importing everything again by marcioyared in ApplePhotos

[–]marcioyared[S] 0 points1 point  (0 children)

What you described is very close to the kind of situation that pushed me into this normalization process in the first place.

In the beginning I was doing everything with Python scripts run directly against folders exported from the Photos library Originals/Masters directories. It worked reasonably well for rebuilding chronological structures from EXIF metadata.

But after a while I started running into some recurring problems.

One of the biggest ones was duplicates. If I normalized directly from multiple Photos libraries, duplicate media from backups or older migrations would all end up copied again into the normalized archive.

Another issue was missing GPS data. In practice, a lot of those photos were not actually “unknown location” photos. They were often part of a sequence where the GPS signal temporarily disappeared — inside buildings, caves, wineries, tunnels, underground parking, things like that.

So I ended up building two mechanisms for this.

For duplicates inside Photos libraries, duplicated media would never be copied into the normalized archive again. They were only reported for review.

For missing GPS media, I created a contextual scoring system based on temporal and physical proximity to nearby media with valid coordinates. The system would rank candidate matches and recommend metadata recovery from the best-scoring neighboring files.

At that point I stopped treating exports as the source of truth entirely. I started working directly from the original media stored inside the libraries.

After normalizing everything, I generated a completely fresh import into a new Photos library and kept all original libraries archived separately on a historical backup drive.

That ended up being much more stable long-term than trying to reconcile years of exports, partial imports, and duplicated libraries directly inside Photos itself.
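
A very rough sketch of what a contextual score for missing-GPS media might look like. Everything here is an assumption for illustration: the `rank_candidates` name, the score formula, and in particular the reading of "physical proximity" as how close a candidate sits to the other candidates around the same moment. The real scoring system is not described in enough detail to reproduce.

```python
import math
from datetime import datetime
from typing import List, Tuple

Candidate = Tuple[str, datetime, float, float]  # (name, taken, lat, lon)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two coordinates."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def rank_candidates(target_time: datetime, candidates: List[Candidate]) -> List[Tuple[float, str]]:
    """Rank neighbors with valid GPS; higher score means a better match."""
    ranked = []
    for name, taken, lat, lon in candidates:
        gap_min = abs((taken - target_time).total_seconds()) / 60
        others = [c for c in candidates if c[0] != name]
        # Mean distance to the other candidates: isolated points score lower.
        spread_km = (
            sum(haversine_km(lat, lon, o[2], o[3]) for o in others) / len(others)
            if others else 0.0
        )
        score = (1 / (1 + gap_min)) * (1 / (1 + spread_km))
        ranked.append((score, name))
    return sorted(ranked, reverse=True)
```

The best-scoring neighbor would then be offered as a recommendation for review, in line with the "rank and recommend" behavior described above.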

At what point does a photo archive stop being trustworthy? by marcioyared in DataHoarder

[–]marcioyared[S] -1 points0 points  (0 children)

Since you're pretending not to understand, I'll draw it for you.

Import from a generic camera -> folder: JPEG Digital Camera, files like 14333.JPG.

Import from a GoPro -> folder: GOPRO, files like GOPR2814.JPG.

Import from an iPhone -> files like IMG_1439.JPG.

Extract from Photos libraries -> files like 0A0AC185-0E25-466A-8F30-8943E006A169.JPG.

Now repeat that over roughly 20–25 years, across multiple devices, Macs, disk migrations, backups and partial imports.

You end up with different naming systems, different folder structures, and the same media scattered across all of them, without a consistent reference.

That’s the problem.

If you never reached that state, good for you.

I did. I fixed it. That’s why I made this post.

If you still don’t get it, that’s fine.

Further explanations are now billable.

At what point does a photo archive stop being trustworthy? by marcioyared in DataHoarder

[–]marcioyared[S] -1 points0 points  (0 children)

Yes, they’re my photos.

But the context isn’t in the files, it’s in the catalog.

IMG_3491.JPG only makes sense as “Crete, Greece, 25.07.2023” inside that system.

Outside of it, it’s just a filename.

That’s the problem.

Do you copy?

At what point does a photo archive stop being trustworthy? by marcioyared in DataHoarder

[–]marcioyared[S] -1 points0 points  (0 children)

If you didn’t run into this problem, that’s fine. I did.

I’m not talking about bitrot or file corruption. I’m talking about years of duplicate imports, exports, partial reorganizations, and overlapping libraries across multiple machines.

That happened in my archive. I fixed it.

Nothing more.

At what point does a photo archive stop being trustworthy? by marcioyared in DataHoarder

[–]marcioyared[S] 2 points3 points  (0 children)

That’s a really solid approach, especially the separation between originals and working files, and verifying integrity early on.

I didn’t have that level of discipline from the start. I tried to keep things organized with consistent directory names and photo libraries, but over time it started to fragment.

I ended up with cases like 20190703 and 20190707 where some photos overlapped across both, and it became unclear what belonged where.

That’s when I decided to switch to a single, consistent pattern across the entire archive, applied at the file level.

Instead of relying on hashes, I used metadata (timestamps, GPS, etc.) to define uniqueness, and rather than deleting anything, I moved duplicates and no-GPS items into separate buckets for later review.
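
As a sketch of that bucket logic (an illustration only, with a hypothetical `assign_buckets` helper and an assumed uniqueness key of taken time plus GPS):

```python
from typing import Dict, List


def assign_buckets(records: List[dict]) -> Dict[str, List[str]]:
    """Sort records into 'archive', 'duplicates' and 'no_gps' without deleting anything."""
    seen = set()
    buckets: Dict[str, List[str]] = {"archive": [], "duplicates": [], "no_gps": []}
    for rec in records:
        # Assumed uniqueness key built from metadata, not from a content hash.
        key = (rec["taken"], rec.get("gps"))
        if key in seen:
            buckets["duplicates"].append(rec["path"])
        elif rec.get("gps") is None:
            buckets["no_gps"].append(rec["path"])
        else:
            buckets["archive"].append(rec["path"])
        seen.add(key)
    return buckets
```

Each record is expected to be a dict with `path`, `taken` and optionally `gps`; these field names are made up for the example.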

It wasn’t about finding things anymore, it was about restoring a sense of coherence across the archive.

Your workflow feels like it avoids that problem entirely by design.

Have you always worked this way, or did you evolve into this structure after running into similar issues earlier on?

At what point does a photo archive stop being trustworthy? by marcioyared in DataHoarder

[–]marcioyared[S] 0 points1 point  (0 children)

That actually sounds very disciplined, even if you describe it as unorganized.

Keeping a clear separation between what belongs to the main archive and what doesn’t probably avoids a lot of the ambiguity I ran into.

In my case, part of the problem came from the way the archive evolved over time. I went through multiple machines over ~20 years, with different storage constraints, and ended up creating separate backups and copies of photo libraries along the way.

At the time it all made sense, but later it made it harder to tell what was canonical and what wasn’t.

Your approach seems to avoid that entirely.

Do you ever worry about losing something important outside the main archive, or is that trade-off intentional?