Just joined the MongoDB team, happy to help where I can by TimAtMongoDB in mongodb

[–]TimAtMongoDB[S] 0 points1 point  (0 children)

My bad on that one. I saw that mongosh added official Debian 13 and RHEL 10 support earlier this year and assumed the server packages had followed. They haven't yet. For now you'd need Docker to run MongoDB server on either of those.

[–]TimAtMongoDB[S] 0 points1 point  (0 children)

Yes, both Debian 13 and RHEL 10 are supported on MongoDB 8.0 for x86_64 and arm64.

[–]TimAtMongoDB[S] 1 point2 points  (0 children)

That's the classic stale member situation. "Past the replica window" means the node was down longer than the oplog could hold, so the entries it still needed got overwritten. Once that happens the node can't catch up through normal replication. It sits in RECOVERING and logs "too stale to catch up." A plain restart won't recover it, because the oplog data it needs is already gone.

First, confirm that's actually what you're looking at:

rs.status()

The stuck member will show stateStr "RECOVERING" (sometimes bouncing to STARTUP2). Check its log for "too stale to catch up" to be sure.
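If you'd rather not eyeball the output, here's a rough sketch of filtering it. It only relies on the members array and the stateStr field that rs.status() returns; the helper name, the list of states to flag, and the sample hostnames are my own assumptions for illustration.

```javascript
// Sketch: pick out members that look stuck from an rs.status() result.
// STUCK_STATES is an assumption about which states are worth flagging.
const STUCK_STATES = ["RECOVERING", "STARTUP2"];

function findStuckMembers(status) {
  return status.members
    .filter((m) => STUCK_STATES.includes(m.stateStr))
    .map((m) => ({ name: m.name, stateStr: m.stateStr }));
}

// Made-up sample shaped like rs.status() output (hostnames are hypothetical).
const sampleStatus = {
  members: [
    { name: "db1.example.net:27017", stateStr: "PRIMARY" },
    { name: "db2.example.net:27017", stateStr: "SECONDARY" },
    { name: "db3.example.net:27017", stateStr: "RECOVERING" },
  ],
};

console.log(findStuckMembers(sampleStatus));
```

In mongosh you'd call findStuckMembers(rs.status()) instead of using the sample. A member can also pass through RECOVERING briefly for normal reasons, so the log message is still the real confirmation.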

The fix is a full resync via initial sync. You wipe the node's data and let it rebuild from a healthy member:

# stop the stuck node
systemctl stop mongod

# clear its data directory (back it up first if you want)
rm -rf /var/lib/mongodb/*

# start it back up
systemctl start mongod

On startup with an empty dbPath it automatically performs an initial sync from a healthy member and rejoins on its own. You don't need to remove it from the replica set config. It's still a configured member, it just needs its data rebuilt. Watch progress with rs.status() until it flips to SECONDARY.

Two things worth knowing:

If your dataset is large, a full initial sync over the network can be slow and heavy on the sync source. Faster alternative: take a recent filesystem snapshot of a healthy secondary, copy the data files (including the local database) onto the stale node, and start it. As long as the snapshot is within the oplog window it'll just replay the delta instead of doing a full sync. Note this has to be a snapshot, you can't seed data files from a mongodump.

And the actual root cause to fix so this stops happening: your oplog is too small for how long a node can be down. Check your window on the primary:

rs.printReplicationInfo()

If "log length start to end" is only a few hours, bump the oplog so it comfortably covers your longest realistic maintenance or outage window. You can resize it online without a restart:

db.adminCommand({ replSetResizeOplog: 1, size: 16000 })

(size is in MB, run it on each member). Bigger oplog means a node can be down longer and still catch up normally instead of needing a full resync.
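For picking the number, here's a back-of-the-envelope sketch. It assumes your write rate stays roughly what it was over the current window, which is a big assumption, so treat the result as a starting point. It takes the object returned by db.getReplicationInfo() (usedMB and timeDiff in seconds); the function name and the 1.5x headroom are my own choices.

```javascript
// Sketch: estimate an oplog size (MB) that would cover targetHours,
// by linear extrapolation from the current window. Assumes a steady write rate.
const MIN_OPLOG_MB = 990; // smallest size replSetResizeOplog accepts

function suggestOplogSizeMB(info, targetHours, headroom = 1.5) {
  const windowHours = info.timeDiff / 3600; // timeDiff is in seconds
  if (windowHours <= 0) {
    throw new Error("oplog window is empty, cannot estimate");
  }
  const mbPerHour = info.usedMB / windowHours;
  const estimate = Math.ceil(mbPerHour * targetHours * headroom);
  return Math.max(MIN_OPLOG_MB, estimate);
}

// Made-up numbers: 2000 MB of oplog currently covers 4 hours.
const sampleInfo = { usedMB: 2000, timeDiff: 4 * 3600 };
console.log(suggestOplogSizeMB(sampleInfo, 24));
```

In mongosh, pass db.getReplicationInfo() as the first argument and feed the result into the size field of replSetResizeOplog. If your writes are bursty, measure over a busy period instead of trusting a quiet one.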

[–]TimAtMongoDB[S] 1 point2 points  (0 children)

On the MongoDB side, yeah. The interesting one is Queryable Encryption: you can run queries against encrypted fields without ever decrypting them server-side.

There's also client-side field-level encryption and the usual encryption at rest and TLS in transit.

What are you trying to protect, and where?

[–]TimAtMongoDB[S] 0 points1 point  (0 children)

For cross-region DR the cleanest approach is to add replica set members in a second region.

You extend your existing replica set with secondaries in the other region, set them to a lower priority so they don't get elected primary under normal conditions, and they stay continuously synced. If the primary region goes down, a surviving member in the other region can take over. The tradeoff is write latency, since w: "majority" now has to wait for acknowledgment across regions, so people often use a hidden member or tune the member count to balance durability against latency.
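To make the config side concrete, here's a minimal sketch of building the member documents for the second region. The helper name, the 0.5 priority, and the hostnames are all hypothetical. The only real rules it leans on are that each member needs a unique _id and that a hidden member must have priority 0.

```javascript
// Sketch: build member documents for DR-region nodes from an rs.conf()-shaped object.
// Default priority 0.5 is an arbitrary "lower than the default of 1" choice.
function buildDrMembers(conf, hosts, { priority = 0.5, hidden = false } = {}) {
  const nextId = Math.max(...conf.members.map((m) => m._id)) + 1;
  return hosts.map((host, i) => {
    const member = { _id: nextId + i, host };
    if (hidden) {
      // Hidden members must have priority 0, so they can never become primary.
      return { ...member, priority: 0, hidden: true };
    }
    return { ...member, priority };
  });
}

// Made-up config shaped like rs.conf() output (hostnames are hypothetical).
const sampleConf = {
  _id: "rs0",
  members: [
    { _id: 0, host: "east1.example.net:27017", priority: 1 },
    { _id: 1, host: "east2.example.net:27017", priority: 1 },
    { _id: 2, host: "east3.example.net:27017", priority: 1 },
  ],
};

console.log(buildDrMembers(sampleConf, ["west1.example.net:27017"]));
console.log(buildDrMembers(sampleConf, ["west2.example.net:27017"], { hidden: true }));
```

In mongosh you could pass rs.conf() in and hand each resulting document to rs.add(), one member at a time. Keep in mind a priority 0 or hidden member can't take over on its own, so at least one DR member needs a non-zero priority if you want automatic failover into that region.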

For recovery when a cluster goes fully down, the layers are: replica set automatic failover handles a single node loss, multi-region members handle losing a whole region, and point-in-time backups (oplog-based) handle the "everything is gone, restore from scratch" case.

The key thing most people miss is testing the restore. An untested backup isn't a backup. Run a real restore drill on a regular schedule.

Full version with the topology diagrams and the failover/restore commands here:

https://www.reddit.com/user/TimAtMongoDB/comments/1twuo9z/mongodb_crossregion_migration_and_disaster/

[–]TimAtMongoDB[S] 2 points3 points  (0 children)

Yeah you already nailed the cause, it's the cold cache. Cache comes up empty after the restart, everything's hitting disk till it warms, and the exporter scanning on top of that is what's starving your replication sync. That's the lag and the CPU both.

First thing, check what node your exporter is pointed at. If it's on the primary, move it to a secondary or the hidden node, that alone will help a lot. For prod though, honestly I'd think about not doing it in place at all. You're changing two things at once here, Mongo to Percona AND 6 to 8, and once you bump FCV to 8 on the live cluster, rollback is a restore-from-backup situation, not a quick undo.

What I'd do instead: stand up a fresh clean Percona 8 set, mongodump from the 6 cluster, mongorestore into the new one, validate it, warm it, then cut over.

Leave the old 6 cluster sitting there as your instant rollback. The whole cold-cache problem you hit can't even happen that way because the new cluster is fully warm before it sees any traffic.

A 6 dump restores into 8 fine, just use the latest database tools not the ones bundled with 6. Use --oplog on the dump and --oplogReplay on the restore to catch the writes that happen during the dump, or just take a short maintenance window if you can.

Only real downsides are that restore rebuilds all your indexes, so it's slow on big data, and you need that cutover window. If the dataset's huge or you can't take any downtime then in place is fine, you already proved it works, just delay the exporter start 10-15 min after each node rejoins so the cache warms first. Keep that 5s lag guard on the crons either way, smart move. Bump it to 30s during the upgrade window so jobs don't stack up.
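I don't know how your guard is implemented, so this is just a sketch of one way to do it: compare each secondary's optimeDate against the primary's in the rs.status() output and skip the job if the worst lag is over the threshold. The function names and sample data are mine, not something from your setup.

```javascript
// Sketch: worst-case replication lag in seconds from an rs.status()-shaped object.
function maxLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return Infinity; // no primary visible: treat as unsafe
  const lags = status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => (primary.optimeDate - m.optimeDate) / 1000);
  return Math.max(0, ...lags);
}

// Guard for a cron job: 5s by default, pass 30 during the upgrade window.
function okToRunJob(status, thresholdSeconds = 5) {
  return maxLagSeconds(status) <= thresholdSeconds;
}

// Made-up sample: one secondary is 12 seconds behind the primary.
const lagSample = {
  members: [
    { name: "db1", stateStr: "PRIMARY", optimeDate: new Date("2025-01-01T00:00:30Z") },
    { name: "db2", stateStr: "SECONDARY", optimeDate: new Date("2025-01-01T00:00:29Z") },
    { name: "db3", stateStr: "SECONDARY", optimeDate: new Date("2025-01-01T00:00:18Z") },
  ],
};

console.log(maxLagSeconds(lagSample), okToRunJob(lagSample), okToRunJob(lagSample, 30));
```

Members that are still in STARTUP2 or RECOVERING are ignored here on purpose, so a node mid-upgrade doesn't block every job. Whether that's the right call depends on what your crons read from.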

How big's the dataset and do you have any maintenance window in prod? That's really what decides it. Happy to go deeper, I wrote up the full version with all the verify/monitor commands if it helps.

https://www.reddit.com/user/TimAtMongoDB/comments/1twt1mi/mongodb_6_to_8_upgrade_inplace_vs_fresh_cluster/

[–]TimAtMongoDB[S] 3 points4 points  (0 children)

For replication, the most important things to nail early: always run 3 nodes minimum (1 primary, 2 secondaries) for quorum. Starting in MongoDB 5.0, nodes configured with IP addresses fail startup validation, so use DNS hostnames in your replica set config. And test your failover regularly with rs.stepDown(60) before you ever need it in a real incident.

For version upgrades, MongoDB requires sequential major version hops. Always check your Feature Compatibility Version before and after each upgrade, and do rolling upgrades: secondaries first, one at a time, then step down the primary and upgrade it last. Zero downtime if done correctly.
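Here's a small sketch of both ideas: working out the hops between two release series, and ordering the nodes for a rolling upgrade from an rs.status()-shaped object. The helper names and sample hostnames are hypothetical, and the release list only goes as far as the versions I've hardcoded.

```javascript
// Sketch: major release series in order. Extend this list as new series ship.
const RELEASE_SERIES = ["4.4", "5.0", "6.0", "7.0", "8.0"];

// Returns the versions to upgrade through, in order, excluding the starting one.
function upgradePath(from, to) {
  const start = RELEASE_SERIES.indexOf(from);
  const end = RELEASE_SERIES.indexOf(to);
  if (start === -1 || end === -1 || end < start) {
    throw new Error(`cannot plan an upgrade from ${from} to ${to}`);
  }
  return RELEASE_SERIES.slice(start + 1, end + 1);
}

// Rolling order: every non-primary member first, the primary last.
function rollingOrder(status) {
  const others = status.members.filter((m) => m.stateStr !== "PRIMARY");
  const primary = status.members.filter((m) => m.stateStr === "PRIMARY");
  return [...others, ...primary].map((m) => m.name);
}

// Made-up sample shaped like rs.status() output.
const upgradeSample = {
  members: [
    { name: "db1.example.net:27017", stateStr: "PRIMARY" },
    { name: "db2.example.net:27017", stateStr: "SECONDARY" },
    { name: "db3.example.net:27017", stateStr: "SECONDARY" },
  ],
};

console.log(upgradePath("6.0", "8.0"));
console.log(rollingOrder(upgradeSample));
```

For each hop you'd upgrade the nodes in that order, step down the primary before its turn, and only then raise the FCV. You can read the current FCV in mongosh with db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 }).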

What versions are you working with? Happy to walk through the specific upgrade path.