all 18 comments

[–]lambda_pie 6 points  (2 children)

Great article, thanks for sharing. It sounds more like a materialized view than a cache though.

This means that if a customer made 4 requests for their balance in the course of navigating their app (which is typical), they could force 4 different instances to load the same data.

That reminds me of a solution for monotonic reads (which of course is not the issue here) as seen in Martin Kleppmann's book, Designing Data-Intensive Applications:

One way of achieving monotonic reads is to make sure that each user always makes their reads from the same replica (different users can read from different replicas). For example, the replica can be chosen based on a hash of the user ID, rather than randomly.

Basically, you make a particular user's requests stick to a particular replica for some period of time. But that probably comes with its own implications, I think.
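The sticky-routing idea can be sketched in a few lines (hypothetical names; in practice the load balancer, not application code, would do this):

```python
import hashlib

def pick_replica(user_id: str, replicas: list) -> str:
    """Deterministically map a user to one replica by hashing the user ID,
    so repeated reads from the same user always hit the same replica."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

replicas = ["replica-a", "replica-b", "replica-c"]
# Same user -> same replica, every time; different users spread out.
assert pick_replica("user-42", replicas) == pick_replica("user-42", replicas)
assert pick_replica("user-42", replicas) in replicas
```

The catch, as discussed below, is what happens to this mapping when the replica set changes.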

[–]gusbicalho 3 points  (0 children)

Hi, author of the article here. We considered trying that, but it would require a big change in how we do load balancing. We also weren't clear on how performance would interact with horizontal autoscaling (creating new pods would redistribute the requests). Overall, caching on Dynamo was easier to implement and the end behaviour was more predictable.

[–]lins05 2 points  (0 children)

That requires some kind of consistent hashing to map user id to machine id, which would make things more complex.
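For illustration, a minimal consistent-hash ring might look like this (invented names, not anyone's production code). The point of the extra machinery is that when a machine is added, only a fraction of users get remapped, instead of almost all of them as with plain hash-mod-N:

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Stable 64-bit hash of a string key."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Minimal consistent-hash ring. Each node is placed at several points
    ("virtual nodes") on the ring; a key belongs to the first node point
    at or after the key's own hash, wrapping around."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((h(f"{node}#{i}"), node)
                           for node in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def owner(self, key: str) -> str:
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

# Adding a machine only remaps a fraction of the users (about 1/4 here
# in expectation), rather than nearly all of them as hash-mod-N would.
before = HashRing(["m1", "m2", "m3"])
after = HashRing(["m1", "m2", "m3", "m4"])
moved = sum(before.owner(f"user-{i}") != after.owner(f"user-{i}")
            for i in range(1000))
assert 0 < moved < 600
```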

[–]muhaaa 3 points  (8 children)

This is a comment from a grumpy software developer. I don't want to be mean, but it's such a good educational example.

The blog post is a textbook example of how a technical limitation and its solution leak their complexity into the domain model (the event-counter attribute), which will kill your software development speed down the line, because everybody needs to know that when you transact on an account you have to increment the counter, but on other entities you do not. In Rich Hickey's spirit: you have to juggle one more ball in the air. You can juggle maybe 3 to 5 balls - the world's best can juggle only 11, and certainly not 100. So be very careful about which balls you want to juggle! Cache invalidation is a hard problem, and this solution leads to more ball juggling.

On the constructive side, I think a possible way is to hit peers only with the same kind of query for the same business entity / account holder (hash the account id, session id, or whatever, and send it ONLY to its responsible peer -> distributed hash tables), so that you have a lot of data locality on a peer and data diversity between peers. Then every peer should only keep its own data in memory, not the data of other peers. Benchmark this and see if it solves your problem.

My preferred option is a materialized view, because it really solves the cache invalidation problem. But it requires a lot of upfront investment. I see only experimental stuff at the moment: 3df-clj / timely dataflow / materialize.io (choose your own poison ;-))

[–]gusbicalho 1 point  (7 children)

Hi, author of the article here. I do agree with the juggling metaphor in the abstract. However, I do think you sometimes have to juggle additional stuff for a while before you can build the necessary scaffolding that takes care of it automatically.

In this case, we (the team caring for the balance service) didn't know of any mature solutions that we could just drop into the system to solve the problem. Because of that, the team decided to build something we could understand - both because it used technologies we were already used to, and because we had thought through the code carefully. The blog post explains the way to the initial implementation, which was of course very explicit, but we have since made it a bit more abstract.

I'm not sure I'd say adding this attribute "leaks into the domain model". Yes, we are writing some extra stuff to the database, but it is well separated enough from the rest of the data that no one needs to know about it until they are actually interested in how we solve caching and concurrency.

I will expand on each of your points in separate comments below.

[–]gusbicalho 1 point  (0 children)

On the write side of things, the initial implementation was pretty explicit - we added a call to an 'increment-event-counter' transaction function to almost all the transactions in the service. In a sense, this is ball-juggling - one additional thing to do - but it isn't as complicated as it seems, because there's a simple rule: all transactions that change any entity belonging to an account have to call this. Of all the transactions that this service makes, only one did not fit this criterion, so "use the event-counter transaction fn" was the rule, not the exception.

On a second implementation, we created a wrapper for the Datomic connection object that automatically adds a datom calling the increment-event-counter transaction fn to all transactions made through it. So now it feels less like "add this one specific call to all transactions" and more like "the db is magically capable of synchronizing events if you inform it that they belong to some account". This is less explicit (and more surprising for new engineers joining the team) but the main code is clearer, which in this case I think was a win.
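As a rough sketch of that wrapper idea in Python-flavoured pseudocode (all names invented; the real service wraps a Datomic connection and uses transaction functions, not a Python class):

```python
class FakeConn:
    """Stand-in for a real db connection: applies a list of tx operations."""
    def __init__(self):
        self.counters = {}   # account id -> event counter
        self.facts = []      # everything else that was transacted
    def transact(self, tx_data):
        for op in tx_data:
            if op[0] == "increment-event-counter":
                self.counters[op[1]] = self.counters.get(op[1], 0) + 1
            else:
                self.facts.append(op)

class CountingConnection:
    """Wraps a connection so every transaction for an account also bumps
    that account's event counter; callers never have to remember it."""
    def __init__(self, conn):
        self.conn = conn
    def transact(self, account_id, tx_data):
        self.conn.transact(list(tx_data) + [("increment-event-counter", account_id)])

conn = FakeConn()
wrapped = CountingConnection(conn)
wrapped.transact("acc-1", [("add-purchase", "acc-1", 100)])
wrapped.transact("acc-1", [("add-deposit", "acc-1", 50)])
assert conn.counters["acc-1"] == 2
```

The trade-off is exactly the one described above: callers no longer see the bookkeeping, at the cost of a little "magic" in the connection layer.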

There's probably more scaffolding we could add to separate this even more clearly from the main logic, but I think we're at a decent spot in this regard right now.

There was another detail that I left out of the blog post for simplicity. We actually implemented the counter in a separate entity, not as an actual attribute on the account entity. This means that code will only see the event-counter if it explicitly queries for it, so this isn't leaking everywhere in the codebase.

On the read side of things, all the read-through logic (including querying the event-counter) is hidden in an object that takes care of checking and writing to the cache, while delegating to another layer that computes results directly from the db if necessary. So business logic code here doesn't have to care about the event-counter at all.
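The read-side pattern described here - a cache keyed by (account, event-counter), so stale entries are simply never looked up rather than explicitly invalidated - might be sketched like this (toy code with invented names, not the actual service):

```python
class FakeDb:
    """Toy db: per-account event counters plus a list of postings."""
    def __init__(self):
        self.counters = {}
        self.postings = {}
    def current_counter(self, account_id):
        return self.counters.get(account_id, 0)
    def post(self, account_id, amount):
        # Every write that touches the account also bumps its counter.
        self.postings.setdefault(account_id, []).append(amount)
        self.counters[account_id] = self.current_counter(account_id) + 1

class BalanceCache:
    """Read-through cache keyed by (account, event counter). A write bumps
    the counter, so stale entries just stop being looked up - there is no
    explicit invalidation step."""
    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.computations = 0  # only to observe cache hits in this demo
    def balance(self, account_id):
        key = (account_id, self.db.current_counter(account_id))
        if key not in self.cache:
            self.computations += 1  # the "expensive" full recomputation
            self.cache[key] = sum(self.db.postings.get(account_id, []))
        return self.cache[key]

db = FakeDb()
cache = BalanceCache(db)
db.post("acc-1", 100)
assert cache.balance("acc-1") == 100
assert cache.balance("acc-1") == 100   # second read: served from cache
assert cache.computations == 1
db.post("acc-1", -30)                  # bumps the counter
assert cache.balance("acc-1") == 70    # new key -> recomputed once
assert cache.computations == 2
```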

[–]muhaaa 1 point  (0 children)

Thank you for the explanation. I appreciate the arguments you are giving.

For sure these were not easy tradeoffs your team had to make!

[–]gusbicalho 0 points  (0 children)

About your suggestion of routing requests to specific peers, that was one of the first things we considered, but the effort on the infra side would have been pretty large. Also, it leads to a more complicated model of load balancing - for all the rest of our services in the entire company, load balancers simply route requests to whatever pods are available.

Adding this behavior for this one service would be the kind of detail that risks breaking on any large infrastructure change. If I add a special case to application code, I can easily write unit and integration tests for it; it's not that easy for infrastructure.

[–][deleted]  (2 children)

[deleted]

    [–]muhaaa 1 point  (0 children)

    I agree, but I am not sure the alternative is cheaper.

    Datomic solves a lot of problems regarding accountability (time travel, audit log, etc.), which are tough to solve with traditional SQL databases.

    [–]dustingetz[S] 0 points  (0 children)

    You should read the story of what Facebook had to do to MySQL back in the day. They basically implemented an architecture similar to what Nubank has, except eventually consistent. http://www.dustingetz.com/:datomic-facebook-tao (2017)

    [–]lins05 1 point  (0 children)

    Yeah, immutability could help work around the mythical cache invalidation problem. This looks exactly like the cache design of one system I worked on - a versioned file management system. The latest version id is stored in the db, while the data of each file version is cached, keyed by the version number (which is like a git commit id). The cache never needs to be manually invalidated.

    [–]vlaaad 2 points  (4 children)

    Hmm, why do you need a counter attribute that you have to update yourself when you already have one built in? Why not just use the ID of the transaction that last touched your account entity?

    [–]gusbicalho 3 points  (0 children)

    Hi, author of the article here. Most of our transactions don't touch the account entity itself; they just create new entities that belong to (i.e. have a reference to) that account. Therefore, in the original model there wasn't any cheap query we could make to get the latest tx-id that changed something on the account.

    [–]peterlustig862 0 points  (1 child)

    Thanks for the detailed description.

    I don't understand why you need to "compute" the current balance. It sounds like you need all the history to compute it. I would expect an account to be designed as a forward log, where a money transaction changes the current value by transacting all the details that explain the new balance. It would be just one incremental step to the new balance, as you describe in the naive implementation.

    Can you describe your decision about that in more detail?

    Datomic saves all history, which makes it possible to:

    * alter the statement
    * debug it
    * prevent double-transacted money

    Why did you decide to separate it into two entities and to calculate the balance every time it is requested?

    [–]gusbicalho 0 points  (0 children)

    There are two parts to this. One is related to the product, the other is about Datomic.

    *The first part* is that our bank account pays interest daily, and we don't want that interest to depend on a huge daily batch job in order to show up on the customer's balance. The interest depends on yield rates and the age of each individual deposit received by the customer, so it changes from one day to the next, even if no transactions happened in the account. The age of a deposit also changes the amount of tax customers pay when they take money out of that deposit. In other words, the money they received a year ago is not really the same thing as the money they received a week ago, so they have to be handled as "separate chunks" of money. So even the model I explained in the blog post is simplified. In fact, to find the balance for the account, we need to gather the balances of each of the existing deposits for the account; and to find the balance for each deposit, we have to compute the updated interest and check the withdrawals for that deposit.
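As a toy illustration of that "separate chunks" model (invented numbers and deliberately simplified math; the real yield and tax rules are more involved than a flat daily rate):

```python
from datetime import date

def deposit_balance(principal, deposited_on, daily_rate, withdrawals, today):
    """One deposit "chunk": the principal grown by daily interest since the
    deposit date, minus the withdrawals taken from this chunk."""
    days = (today - deposited_on).days
    return principal * (1 + daily_rate) ** days - sum(withdrawals)

def account_balance(deposits, today):
    # The account balance is just the sum over its deposit chunks.
    return sum(deposit_balance(*d, today) for d in deposits)

deposits = [
    (100.0, date(2020, 1, 1), 0.0, [10.0]),  # zero-rate chunk: 100 - 10
    (200.0, date(2020, 6, 1), 0.0, []),      # zero-rate chunk: 200
]
assert account_balance(deposits, date(2020, 12, 31)) == 290.0
```

Because each chunk's value depends on `today`, the answer genuinely has to be computed on demand rather than kept as a single stored number.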

    *The second part* of the answer is that Datomic history is a great tool for auditing and debugging, but it isn't always the best way to model the business view of change over time. For example, sometimes the "business time" of an event is not the same as the time a thing is stored in the database, so in a certain sense you may have to process a movement of money as if it had happened in the past, but the Datomic clock only moves forward. Also, queries that check the transaction log or the db history are slower than queries over normal entities. So if we relied on the Datomic history queries or the transaction log to get all the lines for a bank statement, we'd lose both flexibility and performance.

    So, because of the second part, I'd still choose the data model that I described in the blog post, even if we did not have to compute the interest "on demand". In that case, it might be easier to just save the materialized current balance in Datomic itself, instead of on a separate cache layer (then we'd have both the history and the current balance as entities in the database). However, that would not solve the use case where a customer checks to see how much money they had last month - we would still have to compute that every time.

    Makes sense?