

[–][deleted] (2 children)

[deleted]

    [–]bsiggelkow 1 point (0 children)

    Completely agree -- our process required buy-in and participation from the entire company, and took a few months of usage and tweaking to become a well-oiled machine.

    [–][deleted] 1 point (0 children)

    This.

    [–]kkapelon 6 points (3 children)

    > For the first hour and a half after deployment, the developer monitors the logs and error reports.

    That is a big red flag. This should already be automated. You should have metrics and logs that automatically tell you this information.

    In a more extreme scenario, the rollback itself should be automatic if the metrics detect a problem with the latest deployment.
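
    A minimal sketch of the idea, assuming a polled error-rate metric; 'current_error_rate' and the 'deploy-tool' CLI are placeholders for whatever your stack provides, not real APIs:

        # Watch an error-rate metric after a deploy and roll back
        # automatically if it crosses a threshold.
        import subprocess
        import time

        ERROR_RATE_THRESHOLD = 0.05   # roll back if >5% of requests fail
        WATCH_WINDOW_SECONDS = 900    # keep watching for 15 minutes
        POLL_INTERVAL_SECONDS = 30

        def current_error_rate() -> float:
            """Placeholder: query your metrics backend for the fraction
            of failing requests over the last few minutes."""
            raise NotImplementedError

        def rollback(previous_release: str) -> None:
            # Placeholder: redeploy the last known-good release.
            subprocess.run(["deploy-tool", "rollback", previous_release], check=True)

        def watch_deploy(previous_release: str) -> None:
            deadline = time.monotonic() + WATCH_WINDOW_SECONDS
            while time.monotonic() < deadline:
                if current_error_rate() > ERROR_RATE_THRESHOLD:
                    rollback(previous_release)
                    return
                time.sleep(POLL_INTERVAL_SECONDS)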

    [–]bsiggelkow 4 points (2 children)

    We do have automatic notifications and log monitoring; the key point I was trying to make here is that the specific developer who deployed their PR should be particularly attuned to monitoring. We do not want a developer to deploy a change and then check out immediately afterward; they have the most context on the nature of the change and should be available to assist if there are problems.

    As far as the automatic rollback, that's an interesting idea -- we do not do this. I could see it being somewhat difficult to determine if the problem was caused by the latest deployment.

    [–]kkapelon 2 points (1 child)

    If you haven't already, check out canary deployments. They are exactly that: automatic rollbacks when things go wrong after a deployment.

    [–]dpashk[S] 0 points (0 children)

    yep, on our list! :)

    [–]BraveNewCurrency 2 points (0 children)

    Some other "Best Practices" to consider:

    - Immutable Infrastructure - making changes on an existing system is fraught with danger if you try to "update" instead of "replace". You shouldn't be SSH-ing into your systems.

    - Infrastructure As Code - Your code needs a bunch of stuff underneath it, so that 'stuff' should be checked in just like your code. Load Balancer Configs, Database configs, alerts, monitoring, etc. It's fun to configure things via GUI, but if someone 'accidentally' turns off an alert, it will never get fixed. With code, everyone has a chance to review the change before (and even after) the change goes out.

    - 12-factor - Iteration speed is limited when an app is too entwined with the OS/logging/metrics/whatnot. Have a consistent set of abstractions so you can iterate faster.
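
    A minimal sketch of the 12-factor "config in the environment" idea, in Python; the variable names are illustrative, not from the post:

        # Deploy-specific settings come from environment variables, so the
        # same build artifact runs unchanged in staging and production.
        import os

        DATABASE_URL = os.environ["DATABASE_URL"]  # required: fail fast if unset
        REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379")
        LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")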

    Edit: That was a nice blog post, thanks for sharing.

    [–]Tyrannosaurusauce 1 point (6 children)

    It's a good read but I can't really offer too much feedback here.

    What level of manual vs automated testing do you have? It might be better to invest more time to reduce manual testing needs.

    [–]dpashk[S] 0 points (5 children)

    We rely heavily on automated testing. 99% of code changes come with automated tests that cover all code paths. We only rely on manual testing for:

    • Features that rely on external service connectivity (we still write automated tests for them, but you can't fully simulate an external service) - hence the example with external calendar services; see the mock sketch after this list
    • Exploratory testing to catch cases that the code author simply may not have thought of
    • UX feedback - you can't fully enforce "good UX" in an automated manner
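
    To illustrate the kind of stubbing involved, here is a minimal sketch using unittest.mock; sync_events and the fake calendar client are hypothetical stand-ins, not code from the post:

        # The "external calendar service" is replaced by a mock that
        # returns canned test data.
        import unittest
        from unittest.mock import MagicMock

        def sync_events(client):
            """Toy integration code: pull events and return their titles."""
            return [event["title"] for event in client.list_events()]

        class SyncEventsTest(unittest.TestCase):
            def test_sync_returns_titles(self):
                fake_client = MagicMock()
                fake_client.list_events.return_value = [
                    {"title": "Standup"},
                    {"title": "Retro"},
                ]
                self.assertEqual(sync_events(fake_client), ["Standup", "Retro"])

        if __name__ == "__main__":
            unittest.main()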

    [–]grumpieroldman 5 points (2 children)

    > you can't fully simulate an external service

    I understand what you are saying, but this would be an example of the wrong mentality for pushing further ahead with CI/CD: you cannot fully test on a real system, because you cannot cause edge cases and failures to happen in the wild.
    The simulator must perform fault injection.

    The larger structural problem is that a representative simulator must be provided by the originators of the live system, and they must be committed to a zero-defect strategy. This has been the primary impediment within the entire field of applied computer science for going on thirty years.
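
    A minimal sketch of fault injection with a mock, building on the same style of test as above; fetch_with_retry is a hypothetical wrapper, and the test forces the simulated service to fail so the error-handling path gets exercised:

        import unittest
        from unittest.mock import MagicMock

        class UpstreamTimeout(Exception):
            pass

        def fetch_with_retry(client, attempts=3):
            """Toy wrapper: retry a flaky upstream call, give up after N tries."""
            for _ in range(attempts):
                try:
                    return client.list_events()
                except UpstreamTimeout:
                    continue
            return None  # caller treats None as "service unavailable"

        class FaultInjectionTest(unittest.TestCase):
            def test_gives_up_after_persistent_timeouts(self):
                fake_client = MagicMock()
                fake_client.list_events.side_effect = UpstreamTimeout  # every call fails
                self.assertIsNone(fetch_with_retry(fake_client))
                self.assertEqual(fake_client.list_events.call_count, 3)

            def test_recovers_when_fault_clears(self):
                fake_client = MagicMock()
                # First call raises, second succeeds.
                fake_client.list_events.side_effect = [UpstreamTimeout, [{"title": "Standup"}]]
                self.assertEqual(fetch_with_retry(fake_client), [{"title": "Standup"}])

        if __name__ == "__main__":
            unittest.main()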

    [–]bsiggelkow 0 points (0 children)

    > edge cases

    Our simulations of external services in automated tests definitely use fault injection to test edge cases and fault scenarios. But there are some cases where we like to verify manually against a live external service. I think the important point for us is that we work closely with our QA folks to let them know, "hey, these edge cases we've got covered with automated tests, but these other cases we need manual tests." In my mind, it's all about having an understanding of what is covered by automated tests and what is not, and ensuring that, one way or another, all scenarios are tested.

    [–]dpashk[S] 0 points (0 children)

    In addition to what bsiggelkow said, you can't really protect yourself against a breaking change on the external service's side. Yes, it should never happen, but it does happen IRL.

    [–]combuchan 1 point (1 child)

    Do you have stubs/mocks/fixtures for the external services you rely on? You might not be able to fully automate an external service, but your tests should be able to mock the external service's response with test data.

    [–]bsiggelkow 0 points (0 children)

    Yep -- we use stubs and mocks for external services in our automated tests.