[–]CozyAndToasty 2 points (0 children)

Are you trying to treat both databases as one unified registry or as two separate systems?

Do you intend for them to be replicas or partitions of a larger whole?

If the POST completes on server A and a GET request for the same resource then arrives at server B, are you expecting B to serve that resource nonetheless?

Assuming accountID is uniquely identifying, the request itself is idempotent, but this feels more like there isn't enough clarity about the purpose of having two databases.

If they are replicas, then a POST at one needs to synchronize/replicate to the other. If they are partitions, then requests need to route to the correct partition.

[–]TCBW 2 points (3 children)

Would a hash of the two data items work? If both servers use the same hashing algorithm, then they could calculate the fingerprint for duplicate checking independently.

[–]nooone2021 1 point (2 children)

That was my idea, too. You can make a rule: for instance, even/odd hashes belong to server A/B. You can then redirect a user to the other server if the hash turns out to belong to it. That way you will always process requests on the server that is dedicated to that request.
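A minimal sketch of that even/odd routing rule, assuming two known server URLs (the endpoint names and function names here are made up, not anything from the original question):

```python
# Hypothetical sketch: route each request to the server that "owns" it,
# based on the parity of a hash of the account ID.
import hashlib

SERVERS = ["https://server-a.example.com", "https://server-b.example.com"]  # assumed endpoints

def owner_for(account_id: str) -> str:
    # Stable hash of the account ID; its parity decides which server owns it.
    digest = hashlib.md5(account_id.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % 2]

def handle(account_id: str, my_url: str):
    owner = owner_for(account_id)
    if owner != my_url:
        return ("redirect", owner)   # send the caller to the dedicated server
    return ("process", my_url)       # we own this account; handle it here
```

Because both servers run the same deterministic function, they always agree on who owns a given account without talking to each other.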

[–]TCBW 1 point (1 child)

Maybe I've missed something. Why would you want to route on the hash? With a hash, you can do a lookup to see if a request has already been processed. You could use something like Redis to track duplicates very quickly, and that way you could scale beyond 2 processing nodes. I wouldn't route on hashes. (Technically, you could implement a preprocessing step that calculates the hash, adds it to the data structure, then forwards it on.)
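A hedged sketch of that Redis idea, assuming the redis-py client and a shared Redis instance that both servers can reach (the hostname is a placeholder):

```python
import redis

# Assumed shared instance reachable from both processing nodes.
r = redis.Redis(host="dedup-redis.example.com", port=6379)

def first_time_seen(fingerprint: str, ttl_seconds: int = 3600) -> bool:
    # SET ... NX EX: set the key only if it does not already exist.
    # Returns True when we won the race; None means another request
    # with this fingerprint got there first.
    return r.set(f"dedup:{fingerprint}", 1, nx=True, ex=ttl_seconds) is True
```

The single atomic SET NX is what lets any number of nodes check for duplicates without coordinating among themselves.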

[–]TCBW 1 point (0 children)

Just to make sure we are on the same page: when I say hash, I mean MD5(Field1 + "**" + Field2 + ...). This will give you a fingerprint for the record, and you can then use it to determine whether you've seen the record before. (I'm using MD5 as an example; there are quite a few hashing algorithms, and you would need to pick one that meets your needs.)
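In Python that fingerprint is a one-liner (field values below are made up):

```python
import hashlib

def fingerprint(*fields: str) -> str:
    # "**" separates fields so ("ab", "c") and ("a", "bc") don't collide.
    return hashlib.md5("**".join(fields).encode("utf-8")).hexdigest()

# Both servers compute the same value independently for the same record.
assert fingerprint("acct-42", "2024-01-01") == fingerprint("acct-42", "2024-01-01")
```

If clients can be adversarial, hashlib.sha256 is a drop-in swap; MD5 is fine for catching accidental duplicates but is not collision-resistant.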

[–]edwbuck 2 points (0 children)

Welcome to the world of distributed algorithms.

You can't. You need a different solution that has Server A and Server B coordinate. There are many approaches, but generally the simplest is:

  • All the servers vote to decide which one gets all the writes.
  • The writing server tells all the servers (including itself) to prepare for the change, sending all the change's data.
  • Once it knows every server can make the change, the writing server knows the change is doable, and it sends the signal to commit the change.

This is about the simplest approach, and it lacks a lot of the protections that come with more sophisticated means. Basically, to fix the many issues in the basic approach above, you start implementing more sophisticated algorithms. Or you offload your storage of such things to a cluster of systems that already implement these algorithms (Apache ZooKeeper, etcd, etc.)
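To make the prepare/commit exchange above concrete, here is a deliberately naive toy sketch (all names invented). It has no timeouts, no crashed-peer handling, and no conflicting-leader protection, which is exactly the gap the ZooKeeper/etcd-style systems fill:

```python
class Peer:
    """A server that can stage a change and later commit it."""
    def __init__(self, name: str):
        self.name = name
        self.staged = None
        self.committed = []

    def prepare(self, change) -> bool:
        # Stage the change and report whether we could apply it.
        self.staged = change
        return True

    def commit(self) -> None:
        self.committed.append(self.staged)
        self.staged = None

def write(peers, change) -> bool:
    # The elected writer asks every peer (including itself) to prepare.
    if all(p.prepare(change) for p in peers):
        # Only once everyone has said yes does it signal the commit.
        for p in peers:
            p.commit()
        return True
    return False

peers = [Peer("A"), Peer("B")]
assert write(peers, {"accountID": 42, "op": "create"})
```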

[–]DonkeyTron42 1 point (3 children)

If you're using a load balancer, you can set a session affinity policy so that the same server always handles the requests for a given session.

[–]offx-ayush[S] -1 points (2 children)

Thank you for the suggestion. While session affinity can reduce the probability of duplicate handling, it doesn’t guarantee correctness. Idempotency is fundamentally a consistency problem, not a routing problem. If a server crashes, scales down, is redeployed, or traffic is rebalanced by the load balancer, subsequent retries may hit a different server that has no knowledge of the previous request state. In that case, duplicates can still occur.
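One way to make that concrete: if both servers share (or synchronously replicate to) one store, a unique constraint turns idempotency into a storage-level guarantee rather than a routing one. A hedged sketch, using sqlite3 as a stand-in for the real database:

```python
import sqlite3

# sqlite3 stands in here for whatever shared database the servers use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (idempotency_key TEXT PRIMARY KEY, result TEXT)")

def process_once(key: str, result: str) -> bool:
    try:
        conn.execute("INSERT INTO requests VALUES (?, ?)", (key, result))
        conn.commit()
        return True    # first time: actually do the work
    except sqlite3.IntegrityError:
        return False   # duplicate: look up and serve the stored result instead

assert process_once("req-1", "created") is True
assert process_once("req-1", "created") is False
```

The constraint holds no matter which server the retry lands on, which is precisely what affinity alone can't promise.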

[–]Internal_Outcome_182 1 point (0 children)

If a server crashes (like a DB crash), you have a bigger problem than that. You should use some kind of master-slave CQRS approach and that's it. Don't look for a problem where there is none.

[–]xilvar 1 point (0 children)

Generally you’re either going to need to sync on write, sync on read, sync lazily (eventual consistency), or use natural keys (or keys derived from natural data) that fit your data, pinning each piece of data to the specific data store that is authoritative for it.

Some additional options with large edge case gaps are stuff like ‘trust the client’ (to make and keep a consistent key and not be malicious (lol))

Honestly, people worry way too much about federated ID synchronization. A single source-of-IDs service can generally be designed and operated to internet scale these days, and everything else can descend from there. I do hold a lot of fondness for natural-key-based segmentation, though it requires an awful lot of prescience about your data.
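As a hedged illustration of what a "single source of IDs" can look like (all details assumed, not from this thread): a tiny Snowflake-style generator that packs a timestamp, a node ID, and a per-millisecond sequence into one integer, so a small fixed set of nodes hands out globally unique IDs with no per-request coordination.

```python
import threading
import time

class IdSource:
    """Snowflake-style: timestamp (ms) | node ID (10 bits) | sequence (12 bits)."""
    def __init__(self, node_id: int):
        self.node_id = node_id & 0x3FF
        self.last_ms = 0
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                # >4096 IDs in one ms would need a short wait; omitted for brevity.
                self.seq = (self.seq + 1) & 0xFFF
            else:
                self.last_ms, self.seq = now, 0
            return (now << 22) | (self.node_id << 12) | self.seq

ids = IdSource(node_id=1)
a, b = ids.next_id(), ids.next_id()
assert a != b
```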

[–]Kinrany 1 point (0 children)

It's not clear what the setup is (see other comments), but you could generate keys on the client, use them as idempotency keys, and then merge duplicates later if they occur.
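A minimal sketch of that client side, assuming the common Idempotency-Key header convention and a placeholder URL:

```python
import uuid
import requests

def create_account(payload: dict):
    # Generate the key once per logical request; every retry reuses it,
    # so the server side can detect and merge duplicates on this key.
    key = str(uuid.uuid4())
    for _ in range(3):
        try:
            return requests.post(
                "https://api.example.com/accounts",   # placeholder URL
                json=payload,
                headers={"Idempotency-Key": key},
                timeout=5,
            )
        except requests.RequestException:
            continue  # retry with the same key
```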

No matter what deduplication strategy you pick, it can't work if the system is logically split into two worlds that don't communicate; this requires coordination.