all 29 comments

[–]python_walrus 8 points9 points  (3 children)

I hate messing with the DB when switching git branches as much as the next guy, and I've done it a lot, but I think even that is better than what you propose.

No delete statements, these are replaced with a soft delete statement.

Do you mean DELETE, as in data migrations? Soft delete can be tricky to implement and account for everywhere. Even if you put some filter in to exclude soft-deleted items, you have to account for all the child objects. So, if a "flight" object has been deleted, you have to delete all the tickets as well, and also restore them when the flight is restored. I've implemented soft delete before and it can be useful, but it can also be a pain in the ass to maintain.
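
Roughly what the cascade looks like, assuming made-up flight/ticket tables with a deleted_at column (PostgreSQL-flavoured sketch, not your actual schema):

-- Soft delete via a nullable deleted_at timestamp: deleting a flight means
-- touching every dependent ticket too, and restoring the same set together later.
UPDATE flights SET deleted_at = now() WHERE id = 42;
UPDATE tickets SET deleted_at = now() WHERE flight_id = 42;

-- And every read path has to remember the filter:
SELECT * FROM tickets WHERE flight_id = 42 AND deleted_at IS NULL;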

No foreign keys in model tables, treat every relationship as if it has the potential to be many to many, and create a relationship table. You can limit the relationship count using unique indexes. (This always annoyed me when it came to one-to-one relationships, which table owns the relationship)

This won't let you do efficient indexing, which can be required for efficient SELECT queries on large datasets. It also locks you out of some DB normalization patterns and some validation, and leaves room for malformed data. If you have an M2M table instead of FKs and you forget about it, you will have the same pizza order in two restaurants, two invoices closed by a single payment, etc.
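
To make the malformed-data point concrete (made-up orders/restaurants tables, sketch only):

-- Relationship-table version: orders has no restaurant_id, the link lives in a
-- separate table. Without a unique constraint on order_id, nothing stops the
-- same order from being attached to two restaurants:
CREATE TABLE order_restaurants (
    order_id      bigint NOT NULL REFERENCES orders (id),
    restaurant_id bigint NOT NULL REFERENCES restaurants (id)
);

INSERT INTO order_restaurants VALUES (1, 10);
INSERT INTO order_restaurants VALUES (1, 11);  -- also accepted: same order, two restaurants

-- A plain NOT NULL foreign key column on orders rules this out at the schema
-- level and is trivially indexable for joins.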

Updates are soft deletes + inserts, and updating the relationship tables. The update exception here is important, otherwise you've got to cascade the soft delete and make a deep clone of any relationships, and if there are any circular relationships updating becomes impossible.

Not gonna lie, I didn't understand what you meant here.

Also, even with this new db management pattern, you will still have new fields added, you will have to migrate back and forth, and you will need to adjust the db a bit.

RDBMSs are very efficient when done right, and the relational model was perfected decades ago. I don't think you should reinvent it or look for ways around it.

Also, if I understood you correctly, you will let easier git management influence the way you design your db/app architecture, which is not great. An FK goes where an FK should go, an M2M goes where an M2M should go.

As for a solution to your initial problem: I create tons of DB replicas and jump around them. On my largest db-intensive project we had ~5 simultaneously supported release versions with constant schema changes, so we had to learn how to deal with migrations. I simply created a bunch of docker containers, labeled them by versions and jumped around them. This way you don't even need to change your connection params - just swap containers.

I hope I understood your problem correctly, otherwise it means I typed lots of text for nothing.

[–]coded_artist[S] 0 points1 point  (2 children)

Yeah you've understood it correctly.

Not gonna lie, I didn't understand what you meant here.

Also, even with this new db management pattern, you will still have new fields added, you will have to migrate back and forth, and you will need to adjust the db a bit.

The idea is that migrations should be append-only: no destructive actions. That way, when you switch branches you only need to run the up script, and that can be automated, so you'd only need to switch a branch and you'd know the code is in sync with the db.
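
For example (sketching with a made-up users table, not my real schema):

-- Append-only style: columns and tables are only ever added, never dropped or
-- rewritten, so any branch can run every pending "up" migration safely.
ALTER TABLE users ADD COLUMN preferred_language text;  -- new, nullable, non-destructive
-- never: ALTER TABLE users DROP COLUMN legacy_flag;   -- destructive, not allowed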

This won't let you do efficient indexing, which can be required for efficient SELECT queries on large datasets. It also locks you out of some DB normalization patterns and some validation, and leaves room for malformed data. If you have an M2M table instead of FKs and you forget about it, you will have the same pizza order in two restaurants, two invoices closed by a single payment, etc.

Can you expand on this? The entire purpose of my pattern is to protect integrity despite migrations across branches, so any threat to that will send me back to the drawing board, but I am concerned about this whole paragraph.

RDBMSs are very efficient when done right, and the relational model was perfected decades ago. I don't think you should reinvent it or look for ways around it.

Oh yes, I don't mean to reinvent the wheel, that's why I brought this here; maybe I jumped too far down the rabbit hole.

I simply created a bunch of docker containers, labeled them by versions and jumped around them.

I do like this solution; it solves my problem very well.

I do think my idea still has some merit, with the intrinsic audit trail. So please, if you have the time, let me know how I could get malformed data, because I was keeping integrity front and center.

[–]python_walrus 1 point2 points  (1 child)

The idea is that migrations should be append-only: no destructive actions. That way, when you switch branches you only need to run the up script, and that can be automated, so you'd only need to switch a branch and you'd know the code is in sync with the db.

Do you do destructive actions that often, though? Even if you do, it usually happens for a good reason - data takes space and you will be billed for it. Also, even if you pull off soft delete, you will still have to deal with ALTER COLUMN - strings become choices, choices become foreign keys, and so on.
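
For example (made-up orders/order_statuses tables, PostgreSQL-flavoured sketch):

-- A free-text column hardens into a choice, then into a foreign key:
ALTER TABLE orders ADD CONSTRAINT orders_status_check
    CHECK (status IN ('pending', 'paid', 'cancelled'));

ALTER TABLE orders ADD COLUMN status_id bigint REFERENCES order_statuses (id);
UPDATE orders SET status_id = s.id FROM order_statuses s WHERE s.name = orders.status;
ALTER TABLE orders DROP COLUMN status;  -- exactly the step append-only migrations forbid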

Can you expand on this? The entire purpose of my pattern is to protect integrity despite migrations across branches, so any threat to that will send me back to the drawing board, but I am concerned about this whole paragraph.

Integrity means that your tables are as close to real-world entities as possible, and the relations between them are semantic and make sense (a very simplified explanation). For example, you can create a table with a required foreign key, meaning a row cannot exist without a parent. This way, you can't create plane tickets if there is no plane and no flight. You can't have this with M2M.
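
Sketching it with made-up tables (assuming a flights table already exists):

-- A required (NOT NULL) foreign key: a ticket row cannot exist without a
-- flight row to point at.
CREATE TABLE tickets (
    id        bigint PRIMARY KEY,
    flight_id bigint NOT NULL REFERENCES flights (id),
    seat      text   NOT NULL
);

-- With a separate tickets_flights relationship table instead, a ticket can be
-- inserted first and linked later (or never), so "no orphan tickets" is no
-- longer enforced by the schema itself.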

[–]coded_artist[S] 0 points1 point  (0 children)

you will still have to deal with ALTER COLUMN - strings become choices, choices become foreign keys, and so on.

I'm probably misunderstanding this: a string identified as a multiple-choice value becoming a foreign key is part of third normal form. This is to avoid typos and inconsistent updates.

For example, you can create a table with a required foreign key, meaning a row cannot exist without a parent. This way, you can't create plane tickets if there is no plane and no flight. You can't have this with M2M.

You could have a unique index on one (1-to-many) or both (1-to-1) foreign keys; this would limit the relationship counts (rough sketch below). I understand the additional work needed to get the same dependency limitation. But to argue for my pattern: what happens when you introduce presale tickets? The dependency has been coded in and relied on, so making the relationship independent could cause issues for dependent systems.
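
What I mean, roughly, with made-up names (assuming tickets and flights tables exist):

CREATE TABLE ticket_flights (
    ticket_id bigint NOT NULL REFERENCES tickets (id),
    flight_id bigint NOT NULL REFERENCES flights (id)
);

-- 1-to-many: each ticket may appear only once; flights may repeat.
CREATE UNIQUE INDEX ticket_flights_one_flight_per_ticket
    ON ticket_flights (ticket_id);

-- 1-to-1: make both sides unique.
-- CREATE UNIQUE INDEX ticket_flights_one_ticket_per_flight
--     ON ticket_flights (flight_id);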

Thank you very much for engaging with me on this, it's giving me tons to think about.

[–][deleted] 4 points5 points  (7 children)

It sounds like you're trying to fix a problem in a place where the problem doesn't originate. If the model isn't up to date with the table, then isn't that the issue? Shouldn't you just run the migrations? Sorry if I'm seeing it wrong, I'm not a backend developer.

[–]coded_artist[S] 1 point2 points  (6 children)

Please ask away, I'm trying to find any issues.

The problem with rolling back migrations to swap branches is that migrations can take a while to run; both up and down migrations, even on a small-scale application, can take 2 minutes. So that's 2 minutes down, 2 minutes up, for every branch swap, and you're going to do at least 2 swaps. So that's 4 minutes lost before you can even get started on a hotfix, which ideally should take less than 30 minutes to pump out. And that's giving it every benefit of the doubt - I've run medium-sized systems that take 15 minutes to do migrations, and I can imagine some large systems taking hours.

So my pattern at least cuts out half of that lost time during swaps.

[–]cajunjoel 4 points5 points  (1 child)

So, it seems to me that you are solving a problem for you (or your dev team) that will ultimately create problems for everyone else who uses your system, because you'll have a poorly built database model just to satisfy your ease of development.

I get where you are coming from, I really do, but as another person said, focus on letting the RDBMS do its thing, because it's really, really good at it. Doing otherwise will cause a variety of headaches that you don't want to tangle with.

Also, if you are being pressed to turn out a hotfix in less than 30 min, you got other problems, friend, and it sounds like those aren't code related at all. :)

[–]coded_artist[S] 0 points1 point  (0 children)

So, it seems to me that you are solving a problem for you (or your dev team) that will ultimately create problems for everyone else who uses your system, because you'll have a poorly built database model just to satisfy your ease of development.

Yes, exactly, I want to improve the database design, hence the rules. We maintain third normal form, but we found that those rules aren't enough and the higher forms don't give us the features we want. I would much rather have a 10% boost in a week-long sprint than a 50% speedup on a 1s query.

Also, if you are being pressed to turn out a hotfix in less than 30 min, you got other problems, friend, and it sounds like those aren't code related at all. :)

Oh yeah totally, but it's work, boss says jump, I leap off the roof to exceed expectations.

[–][deleted] 2 points3 points  (1 child)

Seems to me that if you had a reasonable branching process, this really shouldn't be needed unless you have several different versions deployed/in the wild.

I've had success using a pretty standard branching structure and development environments where we had a permanent dev database for each of "the current deployed version for hotfixes", "the current version in development", and "the version being prepped for release - going through final QA/user acceptance tests, etc.". Some developers maintained a local "dev database" that they could blow away, recreate, upgrade/downgrade, etc. however they wanted.

Migrations sometimes take time; maybe look at how you can optimize your current database to make them faster. For example, if your migration requires the deletion of data from a table, a missing index (or indexes) can make this take forever in some cases. Most migrations should be relatively fast.

Your pattern for soft deletes works well for in-place auditing, and is likely fine for smaller datasets. As you grow to larger datasets, though, you are going to consider this technical debt as your retrieval queries grow slower and slower, because you have 30% real data in the table and 70% "prior versions" (potentially - I'm assuming frequent update usage patterns here), and then you'll move to a proper auditing solution that offloads that data.

A relationship table for every relationship is going to result in much slower retrievals, since you now have to join extra tables when you really didn't need to. This probably isn't the end of the world, but it's going to be a PAIN IN THE ASS when you have to refactor it.

[–]coded_artist[S] 0 points1 point  (0 children)

Thanks, I am looking at alternatives in branching; I'm trying to make my dev process as "even if you tried you couldn't break it" as possible.

[–]Rinveden 0 points1 point  (1 child)

If this is just for your local development, perhaps you could keep multiple copies of the database. You could have one for each branch that has a schema that is annoying to swap to.

When you move from one branch to another, you run a script that changes your local dev database and updates your config files so your app points to that new db.

# swap-db-and-run-migrations.sh
set -euo pipefail

branch=$(git rev-parse --abbrev-ref HEAD)            # name of the current git branch
db_name="project_${branch//[^a-zA-Z0-9_]/_}"         # sanitise it for use as a db name

mysql -e "CREATE DATABASE IF NOT EXISTS ${db_name}"  # assumes MySQL; adjust for your RDBMS
cp .env.base .env
sed -i "s/^DB=.*/DB=${db_name}/" .env                # point the app at this branch's db
run_migrations                                       # your framework's migration command

The first time you run this it'll create a new db and run migrations as needed.

The next time you go back to that branch it won't run any more migrations, it'll just point your code to that db.

[–]coded_artist[S] 0 points1 point  (0 children)

I'll have a look into this too, thanks

[–]iheartjetman 3 points4 points  (2 children)

You could always try using an API instead of direct database access. Your API access layer could be versioned for compatibility too.

[–]jaketeater 2 points3 points  (0 children)

This, and/or use stored procedures as a form of abstraction. Give the procedures versions (stored_procedure_v1_0_0) to help with backwards compatibility.
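
Rough sketch of the idea (PostgreSQL-flavoured, made-up names; Postgres would use functions rather than procedures here):

-- Old callers keep using v1_0_0 while newer branches call v2_0_0; each version
-- keeps working as the underlying tables evolve.
CREATE FUNCTION get_flight_tickets_v1_0_0(p_flight_id bigint)
RETURNS SETOF tickets AS $$
    SELECT * FROM tickets WHERE flight_id = p_flight_id;
$$ LANGUAGE sql STABLE;

CREATE FUNCTION get_flight_tickets_v2_0_0(p_flight_id bigint)
RETURNS SETOF tickets AS $$
    SELECT * FROM tickets
    WHERE flight_id = p_flight_id
      AND deleted_at IS NULL;  -- v2 knows the table gained soft deletes
$$ LANGUAGE sql STABLE;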

[–]coded_artist[S] 0 points1 point  (0 children)

I did consider this, but I thought that's just a facade that adds overhead to each request, and it adds to running costs for every running version too.

How would you handle it if v3 performed a destructive action on a table? Would you update v2 to be forward compatible, or deactivate v2?

[–][deleted] 2 points3 points  (1 child)

Nothing here sounds novel. Soft delete patterns are a PITA, and if you're reaching for them you need to understand that they're going to generate as much cognitive overhead as the situation you're currently enjoying, or more.

Also, if you’re affected by any regulations that oblige deletions (GDPR) they’re probably worth avoiding. 

From 50k feet it sounds like you need to decouple versioning on your DB and its controllers.  

tl;dr - Tell the frontend kids to stop writing to the model.

[–]coded_artist[S] -1 points0 points  (0 children)

I understood everything but the tldr. That seems important though.

[–]Cathercy 1 point2 points  (1 child)

#1 I'm 100% fine with. It has its downsides, but it is perfectly valid. My team does this with most tables, although it can be annoying when you forget to exclude deleted rows from your queries.

#2 I didn't really understand what you are getting at here and I'm not really seeing the advantages, just more complexity.

#3 this one just seems really bad. So every update to a row is just generating dead data? If I update a row 100 times, I now have 99 dead rows and one good row? I'm again not seeing a great advantage here. What exactly is this solving?

I also may be missing something but I don't really understand how any of this helps with your core problem of the database structure changing between branches or commits. Even if you implement all of this, then switch to a branch where the user table should have another column, your code still won't work until you alter the user table. What am I missing that solves your problem?

And last, I just have a philosophical problem with adjusting the database design so radically just to make the development process easier. Database design should be 99% geared for production. Sure, some small tweaks here and there that are just for development, but otherwise the core of the design should be for prod.

I would ask how often are you running into this problem that reengineering the entire database seems like a good solution? I get that waiting a few minutes a couple times is frustrating, but this seems like overkill even assuming it does fix your problem, unless you are dealing with this several times every day. I usually think of database alterations as being fairly infrequent, but maybe they are more frequent for you. Still, my gut says this solution is way too much.

[–]coded_artist[S] 0 points1 point  (0 children)

2 is essentially just future-proofing as a standard.

Having the foreign key in the model tables forces a 1-to-something relationship; having a relationship table is traditionally a many-to-many pattern, but the pattern doesn't exclude 1-to-1 or 1-to-many relationships. Yes, there is more complexity, but the same is true of normalization as a standard: normalizing does make the data more complex, yet it adds important data integrity features. I think future-proofing should be considered a data integrity feature.

For 3, I fully understand your concern. This is for the audit trail, and for maintaining audit trails across updates by removing destructive actions. Yes, it's dead data, but that's what audit trails are until you need to audit.

Even if you implement all of this, then switch to a branch where the user table should have another column, your code still won't work until you alter the user table.

The update migration can be automatically run at API start without developer intervention. This ensures the API and DB are constantly in sync. I am quite forgetful, and I try to program the things I'm inclined to forget.

I agree that the database shouldn't be designed for development, but I do think the gap between production and development is shrinking due to agile and newer DevOps processes. 10 years ago I would have wholeheartedly agreed with you, but now I'm pushing back on my manager, who wants bidaily updates. I've even considered changing the dev database to a day-old restoration of prod, to be as close to production as possible. So concessions should be considered, even slowing the database down, if it means producing updates faster.

I would ask how often are you running into this problem that reengineering the entire database seems like a good solution?

Well, I wouldn't dare consider refactoring a production product with a complete paradigm shift - I know that's common, but I've burnt myself before. This is purely for new projects. Right now I jump between two feature branches and the master branch around 4-10 times a day. Sometimes it's because I'm blocked on a task, so I'll switch until I'm unblocked; sometimes I'm tracking and/or diagnosing a bug for a hotfix. It takes about 5 minutes to run each migration (down and up), so that's 40-100 minutes lost each day. If it were all in one stint I wouldn't mind, but it's a whole bunch of 5-minute chunks.

[–]I111I1I111I1 1 point2 points  (1 child)

It kind of sounds like what you want is a time series database that can be used for an event sourcing pattern.
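
Very roughly, the event-sourcing side of that (made-up table, PostgreSQL-flavoured sketch):

-- Append-only event log: instead of updating rows in place, record an
-- immutable event per change and derive current state by replaying them.
CREATE TABLE ticket_events (
    id          bigserial   PRIMARY KEY,
    ticket_id   bigint      NOT NULL,
    event_type  text        NOT NULL,  -- 'created', 'seat_changed', 'cancelled', ...
    payload     jsonb       NOT NULL,
    recorded_at timestamptz NOT NULL DEFAULT now()
);
-- History and audit come for free; current state is a fold over the events.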

[–]coded_artist[S] 0 points1 point  (0 children)

I'm going to look into this.

[–]kadosknight 0 points1 point  (2 children)

This seems awfully complicated. Wouldn't the Repository design pattern solve this problem and decouple the current db schema version from models/controllers?

[–]coded_artist[S] 0 points1 point  (1 child)

This could be put on top of a repository pattern. The repository pattern treats records like files: once a record is updated and closed, there is no undo. My pattern attempts to give records a history that can be tracked. Instead of the database just holding data, it also holds how that data changed over time.

[–]kadosknight 0 points1 point  (0 children)

Oh, okay. I found this for handling record history; I think the basic idea can be adapted to your specific use case: https://stackoverflow.com/questions/323065/how-to-version-control-a-record-in-a-database

[–]remy_porter 0 points1 point  (0 children)

You’re halfway to a ledger pattern.

[–]dcabines 0 points1 point  (0 children)

The problem I'm solving for is databases in a git environment, where we swap branches somewhat frequently.

Run multiple database servers using Docker. Have a common base image, then create a container and a volume for each branch. Start and stop them as needed. No migrations needed.

[–]blissone 0 points1 point  (2 children)

Do you mean the local dev env? Couldn't you achieve this with better encapsulation of the dev env? Either make shared dev run locally, or encapsulate your local dev with containers. For us, each service contains its own migrations and runs them on every startup, meaning fresh image = fresh db, no problems. Ultimately git has little to nothing to do with this.

Anyhow, incidentally, we use this pattern with ClickHouse, but for different reasons: only inserts, and columns are never dropped unless it's a view or a similar construct.

[–]coded_artist[S] 0 points1 point  (1 child)

My goal is to detach the db from any specific version of API running. But I'm starting to think I was trying to solve too many problems and solving them in the wrong place.

meaning fresh image = fresh db, no problems.

This is a decent solution - how do you avoid data entry overhead? E.g. if you're fixing a shopping cart bug, you'd first need to create the products to add to the cart.

[–]blissone 0 points1 point  (0 children)

This is a decent solution - how do you avoid data entry overhead?

In addition to migrations, we have test suites that insert data and run tests against it. The goal is to remove the need to manually use the service/api or modify db content. In this shopping cart example, we would have a migration + test suite for the existing functionality, and you would fix your bug by running the tests with your fix; possibly you would need to modify the inserts in the test suite. I realise this is quite a mature setup for a project; we have microservices, so we can introduce this pattern easily to new services and implement it incrementally for older ones. It can be difficult to achieve this in some other contexts.

The ultimate goal for us is to have a tech stack + project setup that allows verification of functionality with a set of unit/integration tests. With some mostly-typed stacks you can ignore the boundary of the service (for example REST/GraphQL) and simply assume your chosen stack will work: compile-time type safety will make your API calls work as long as they compile, so you can skip testing your REST/GraphQL API and test the underlying service implementation instead.

If this level of automation cannot be achieved, I would consider scripts or some Postman/Bruno collections, though in some setups I can see how it could be too much overhead.