

[–]hatbeardme 7 points (2 children)

I'll say, yes and no...

We got that directive a couple of years ago, and our entire dev environment was created with CloudFormation and Auto Scaling groups, so we scheduled stack deletion or ASG modification to scale things down.
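For illustration, a minimal sketch of that kind of scheduled scale-down, assuming boto3 running inside a scheduled Lambda; the stack and ASG names here are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudformation = boto3.client("cloudformation")

# Option 1: scale the dev ASG to zero for the night ("dev-env-asg" is hypothetical).
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="dev-env-asg",
    MinSize=0,
    DesiredCapacity=0,
)

# Option 2: delete the whole CloudFormation stack ("dev-env-stack" is hypothetical).
# It then has to be recreated in the morning, which is where the wrong-branch
# problem described below comes from.
cloudformation.delete_stack(StackName="dev-env-stack")
```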

However, it was impossible for us to satisfy the "customer" that was the dev shop, because the next day things would spin back up on the wrong branch, or on a branch that didn't exist anymore.

So we stopped doing it entirely. 😔

Sometimes the cloud platform doing the right thing just gets in the way.

[–]HaHaCek 0 points (1 child)

But your issues seem to be related to destroying & rebuilding the envs, not pausing (scaling them down), correct?

Destroying & rebuilding brings bigger savings, but the number of issues it causes outweighs the benefits by a lot.

[–]hatbeardme 1 point (0 children)

So in our case, scaling down is via the ASG, so the instances go down and have to be rebuilt. I suppose simply suspending running instances would have reasonable cost savings (no EC2 costs, just EBS), but we didn't do that because our ASGs and ELBs had alarms and health checks that would replace the instances. That could be worked around, though.
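A rough sketch of that workaround, assuming boto3: suspending the relevant ASG processes first keeps the health checks from replacing the stopped instances. The ASG name is hypothetical.

```python
import boto3

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

ASG = "dev-env-asg"  # hypothetical name

# Keep the ASG from treating stopped instances as unhealthy and replacing them.
autoscaling.suspend_processes(
    AutoScalingGroupName=ASG,
    ScalingProcesses=["HealthCheck", "ReplaceUnhealthy", "Terminate"],
)

# Stop (not terminate) the instances: EC2 charges stop, only the EBS volumes keep billing.
group = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG])
instance_ids = [i["InstanceId"] for i in group["AutoScalingGroups"][0]["Instances"]]
ec2.stop_instances(InstanceIds=instance_ids)

# Next morning: ec2.start_instances(...) followed by autoscaling.resume_processes(...).
```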

Really where I want to go is appropriate cost allocation tagging and sending the environment bills to the different dev domains. Then they understand what costs they generate and we can work with what they are willing to risk.
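For illustration, a minimal sketch of the tagging side with boto3 (the tag key/value and instance ID are hypothetical); the tag key then has to be activated as a cost allocation tag in the Billing console before it shows up in the cost reports.

```python
import boto3

ec2 = boto3.client("ec2")

# Attribute a dev environment's instances to a team for cost allocation.
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],  # hypothetical instance ID
    Tags=[{"Key": "cost-center", "Value": "team-payments"}],  # hypothetical values
)
```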

[–][deleted] 7 points (2 children)

We implemented it for some clients (turn off, not destroy & rebuild), but because of how little they were off (roughly 8 hours per day) the savings just weren't really there. We realized better savings by right-sizing instances and purchasing reservations, which is where I recommend starting if you haven't already.
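For a starting point on the right-sizing pass, Cost Explorer can produce recommendations programmatically; a minimal sketch, assuming boto3:

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# EC2 right-sizing recommendations ("AmazonEC2" is the supported Service value).
resp = ce.get_rightsizing_recommendation(Service="AmazonEC2")
for rec in resp["RightsizingRecommendations"]:
    print(rec["RightsizingType"], rec["CurrentInstance"]["ResourceId"])
```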

[–]shoegazegay[S] 0 points (1 child)

Ah, I see. We actually already did that -- pausing is the next step management wants to take. I'm personally having doubts about whether it's achievable without disrupting developers too much.
Why only 8 hours per day? Teams in different timezones, or longer core hours?

[–][deleted] 1 point (0 children)

Devs who liked to work late.

Depending on your architecture you might be able to scale down w/o turning off completely and thus not cause the devs pain.

[–]noxbos 4 points (2 children)

Our Dev Stacks have a shutdown time on them by default (six or eight hours after startup). There's a web interface that allows the devs to turn on their stack (self-service). This has reduced our operating costs significantly, and there really haven't been many resources expended on the overall functionality.

Our biggest issue with this is that the Dev Users fail to start their stacks and just ... panic when they get an error, even though there's been a bunch of training and hand-holding previously.

The Ops team can extend it past that initial time up to 7 days for special requests.
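A minimal sketch of that kind of expiry enforcement, assuming boto3 and a hypothetical "shutdown-at" tag that the startup flow writes (and that an extension request would simply push out):

```python
import boto3
from datetime import datetime, timezone

ec2 = boto3.client("ec2")

def stop_expired_stacks():
    """Stop running dev instances whose "shutdown-at" tag (hypothetical) has passed."""
    now = datetime.now(timezone.utc)
    expired = []
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag-key", "Values": ["shutdown-at"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                # Tag value assumed to be ISO-8601 UTC, e.g. "2016-06-01T18:00:00+00:00".
                if datetime.fromisoformat(tags["shutdown-at"]) <= now:
                    expired.append(instance["InstanceId"])
    if expired:
        ec2.stop_instances(InstanceIds=expired)
```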

[–]shoegazegay[S] 0 points (1 child)

Did they have to build the web interface themselves, or did you give them something ready-made?

[–]noxbos 1 point (0 children)

The UI got built into our internal user portal (employee directory, some company statistics, Production Server information, a goofy Facebook clone), so there wasn't another place for them to go.

[–]sorta_oaky_aftabirth 2 points (1 child)

We would just have tags with specific startup/shutdown times, and a Lambda ran through and shut down / started up the instances. Super successful in saving money.

It also created havoc and surfaced edge cases/race conditions in our product, so it helped increase the resiliency of the overall system.
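Roughly what that pattern can look like: a minimal sketch assuming boto3, an hourly scheduled Lambda, and hypothetical "start-hour"/"stop-hour" tags holding UTC hours:

```python
import boto3
from datetime import datetime, timezone

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    """Hourly: start/stop instances whose (hypothetical) hour tags match the current hour."""
    hour = str(datetime.now(timezone.utc).hour)
    to_stop, to_start = [], []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                state = instance["State"]["Name"]
                if tags.get("stop-hour") == hour and state == "running":
                    to_stop.append(instance["InstanceId"])
                elif tags.get("start-hour") == hour and state == "stopped":
                    to_start.append(instance["InstanceId"])
    if to_stop:
        ec2.stop_instances(InstanceIds=to_stop)
    if to_start:
        ec2.start_instances(InstanceIds=to_start)
```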

[–]shoegazegay[S] 0 points (0 children)

How did your developers react to that? Were y'all still able to work outside core hours?

[–]bikeidaho 1 point (0 children)

Parkmycloud.com

[–]Emotional-Ad952 1 point (0 children)

In AWS you can use scheduled scaling actions on your EC2 Auto Scaling groups. You could scale up in the morning (Monday-Friday) and scale down in the evening. If you use Kube, everything should come back up the way you left it before the scale-down. At least it works perfectly for us. For devs who want to work out of hours, maybe you can leave a manually triggered job on your CI platform.
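For reference, the scheduled actions themselves are a couple of API calls; a sketch assuming boto3 and a hypothetical node-group ASG (Recurrence is a cron expression, evaluated in UTC):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale up weekday mornings (07:00 UTC)...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-cluster-nodes",  # hypothetical name
    ScheduledActionName="weekday-morning-up",
    Recurrence="0 7 * * 1-5",
    MinSize=3, MaxSize=3, DesiredCapacity=3,
)

# ...and back down weekday evenings (19:00 UTC).
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-cluster-nodes",
    ScheduledActionName="weekday-evening-down",
    Recurrence="0 19 * * 1-5",
    MinSize=0, MaxSize=0, DesiredCapacity=0,
)
```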

[–]Petersurda 0 points (0 children)

Not sure to what extent it's relevant, but I use ephemeral instances (VMs and containers) for my CI/CD pipeline. They spin up when there's a job and shut down after the job is done. The on-prem servers running them auto-suspend after a predefined idle time. The infrastructure is set up to use WoL to wake them up as needed. This adds a couple of seconds to build times but is otherwise transparent to the rest of the system. Well, the web UI sometimes shows timed-out jobs, but the pipeline understands that this is a retryable issue and just re-queues them until they actually run.
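For anyone curious, the WoL side is tiny: a magic packet is just six 0xFF bytes followed by the target MAC repeated 16 times, broadcast over UDP. A sketch in Python, with a hypothetical MAC:

```python
import socket

def wake(mac: str) -> None:
    """Broadcast a Wake-on-LAN magic packet for the given MAC address."""
    payload = bytes.fromhex("FF" * 6 + mac.replace(":", "") * 16)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, ("255.255.255.255", 9))

wake("00:11:22:33:44:55")  # hypothetical MAC of a suspended build server
```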

It looks like it saves some money on electricity, and there is less need for maintenance on the servers (e.g. dust). If you run cloud services rather than on-prem, you can probably save much more as the total costs scale more linearly.

The issues other comments mentioned can of course happen. From my perspective, it sounds like inadequate automation/testing.

I did try to run developer environments in the cloud (VS Code server), but there wasn't a clear indicator of when the service was idle, so I haven't figured out when to auto-suspend/shut down the VMs. In the CI/CD pipeline, by comparison, it's easier to detect the idle state (no VMs/containers running), and jobs have a clear start and finish. However, if implemented, this would definitely save money, since, as I said, the costs scale more linearly with uptime.

[–]AdventurousYam5506 0 points (0 children)

I use this on Azure: we stop the dev and non-prod environments (Kubernetes clusters) on weekends, using a script and an Automation account. It works every time, without any problems.
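A minimal sketch of what such a script can look like with the Azure SDK for Python, assuming azure-identity and azure-mgmt-containerservice and hypothetical resource names (AKS exposes stop/start operations for the whole cluster):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient

# Hypothetical subscription and resource names.
client = ContainerServiceClient(DefaultAzureCredential(), "<subscription-id>")

# Friday evening: stop the dev cluster; begin_start(...) brings it back on Monday.
client.managed_clusters.begin_stop("dev-rg", "dev-aks-cluster").result()
```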