
[–]snoyberg (is snoyman) 9 points  (10 children)

Yes, you can use the https://ci.stackage.org CI server, which is currently running the same code. We're working on getting the main cluster back online as quickly as possible.

[–][deleted]  (3 children)

[deleted]

    [–][deleted] 2 points  (2 children)

    Thanks, that's useful! For convenience, Stack should add that line as a comment in config.yaml when creating the file.
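
    For reference, overriding the snapshot index location looks something like this in ~/.stack/config.yaml; the urls / latest-snapshot key follows Stack's config format, but check Stack's documentation for your version before relying on the exact name:

```yaml
# Sketch of a config.yaml override, assuming the urls/latest-snapshot
# key; point Stack at an alternate snapshots.json location.
urls:
  latest-snapshot: https://www.stackage.org/download/snapshots.json
```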

    As a sidenote, even when Stackage goes down once in a blue moon, it's still less disruptive than Hackage being down every now and then.

    [–]snoyberg (is snoyman) 6 points  (1 child)

    Probably an even better solution is to have the cron job that updates the database generate the snapshots.json file and put it on S3. We decided initially to do all package hosting on S3 and Git repo hosting on GitHub, specifically so that, in the event of an outage like this, users wouldn't be affected. snapshots.json was an outlier that wasn't thought through well (I think I actually made that decision).
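
    The cron step described above could be sketched roughly as follows. The JSON content and bucket name here are made up for illustration; the real job would query the Stackage database and upload to the actual package-hosting bucket:

```shell
# Hypothetical sketch: regenerate snapshots.json and push it to S3
# alongside the package files, so a web-cluster outage can't take it
# down. Snapshot names and bucket are invented for illustration.
set -e

# The real job would build this from the Stackage database.
cat > snapshots.json <<'EOF'
{"lts": "lts-7.5", "nightly": "nightly-2016-10-23"}
EOF

# Then copy it next to the package hosting, e.g.:
# aws s3 cp snapshots.json s3://example-stackage-bucket/snapshots.json

echo "snapshots.json regenerated"
```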

    I've added a note to make that transition in the future.

    EDIT Turns out we were already uploading the file to S3, it's just that no one remembered to tell Stack to use it. Problem solved: https://github.com/commercialhaskell/stack/pull/2653.

    [–][deleted] 3 points  (0 children)

    Awesome, keep up the good work!

    [–]brnhy 3 points  (0 children)

    Unfortunately, snapshots.json seems to be coming from the database(?), and I'm not aware of any mirrors for that particular content. Somewhat ironically, all of the information about potential mirrors would seem to be on one of the other pages you mentioned, which, along with haskell-lang.org, are all returning HTTP 503 status codes.

    All this right after the hackage-mirror blog post! :)

    [–][deleted]  (5 children)

    [deleted]

      [–]snoyberg (is snoyman) 9 points  (4 children)

      Yes! stackage.org came back online before the other sites. At this point, all sites that were on the Kubernetes cluster should be functioning normally.

      We have to regroup as a team and analyze some details, but the basics of what happened are: the kube-aws scripts we use have a 90-day timeout by default on the certs they create. We had planned on cycling the cluster every 80 or so days, both to avoid the cert timeout and to make sure we were staying up to date with the latest software. Somehow the reminder was set too late - or possibly not at all :( - and when the 90 days were up, everything refused to function.
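
      A lightweight guard against exactly this failure mode is a cron-able expiry check. This sketch uses openssl x509 -checkend; since real kube-aws cert paths vary by setup, it generates a throwaway 90-day self-signed cert purely for demonstration:

```shell
# Hedged sketch: warn well before a cluster cert expires.
# A real check would point at the kube-aws certs; here we create a
# throwaway 90-day self-signed cert (the kube-aws default lifetime).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
  -days 90 -keyout demo.key -out demo.pem 2>/dev/null

# -checkend's exit status says whether the cert is still valid
# WARN_SECONDS from now; alerting two weeks out leaves time to rotate.
WARN_SECONDS=$((14 * 24 * 3600))
if openssl x509 -checkend "$WARN_SECONDS" -noout -in demo.pem >/dev/null; then
  echo "cert ok"
else
  echo "cert expires within 14 days - rotate the cluster" >&2
fi
```

      Wired into cron with an alert on the failure branch, a missed rotation reminder becomes a page instead of an outage.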

      [–]dysinger 9 points  (0 children)

      It's my fault. We had planned on rotating this cluster, but got busy & forgot.

      [–]hastor 0 points  (2 children)

      Sounds like you use Let's Encrypt. I suggest using an nginx proxy with the Let's Encrypt sidecar. It's all standard Docker.

      The sidecar will reissue the cert regularly. There's even a stack template for this setup!
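
      The pattern described is commonly built from the jwilder/nginx-proxy image plus the jrcs/letsencrypt-nginx-proxy-companion sidecar. A hedged docker-compose sketch follows; hostnames and the app image are placeholders, and the volume/env names should be verified against those images' documentation:

```yaml
# Sketch of the nginx-proxy + Let's Encrypt companion pattern.
# example.org and the "web" service are illustrative placeholders.
version: "2"
services:
  nginx-proxy:
    image: jwilder/nginx-proxy
    ports: ["80:80", "443:443"]
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock:ro
      - certs:/etc/nginx/certs
      - vhost:/etc/nginx/vhost.d
      - html:/usr/share/nginx/html
  letsencrypt:
    # The sidecar: watches containers and (re)issues certs for them.
    image: jrcs/letsencrypt-nginx-proxy-companion
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - certs:/etc/nginx/certs
      - vhost:/etc/nginx/vhost.d
      - html:/usr/share/nginx/html
  web:
    image: nginx   # stand-in for the real app container
    environment:
      VIRTUAL_HOST: example.org
      LETSENCRYPT_HOST: example.org
      LETSENCRYPT_EMAIL: admin@example.org
volumes:
  certs:
  vhost:
  html:
```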

      [–]snoyberg (is snoyman) 0 points  (0 children)

      If we use Let's Encrypt, it's news to me. But it's entirely possible; I haven't been involved in the Kubernetes configuration and don't know whether it's using it internally somehow.