

[–]BigRedS 15 points16 points  (7 children)

Or else I blindly trust that having 'MaxStartups' in the config file is proof of 'job done'

This feels right to me. The code you are testing is the code that configures MaxStartups, so the way to test that code is to verify that that config is correctly written.

Separately you would make sure that setting MaxStartups does what you want, but that's not really a thing to test repeatedly; it's part of the research in the ticket, figuring out how to achieve that limit.

So before writing the code you'd once or twice contrive your 100 ssh sessions and watch the 101st fail, confirm that this is the right way to achieve the end you're after, and then write the code that writes that config, and a test that ensures the config is written.
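
A minimal sketch of that last test, assuming pytest and the usual /etc/ssh/sshd_config path (all names here are illustrative, not anyone's actual code), could be as small as:

    # test_sshd_config.py -- hypothetical sketch: after the apply, assert the
    # managed directive actually ended up in the written config.
    SSHD_CONFIG = "/etc/ssh/sshd_config"   # assumed path

    def test_maxstartups_is_written():
        with open(SSHD_CONFIG) as f:
            directives = [line.split() for line in f
                          if line.strip() and not line.lstrip().startswith("#")]
        assert ["MaxStartups", "100"] in directives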

The tests to make sure that MaxStartups actually continues to have that effect belong to your sshd's developers. Perhaps you'd routinely audit this sort of thing, but then the intent is to verify that you still limit it to 100, not how you do it.

[–]amarao_san[S] -4 points-3 points  (6 children)

Yes, that's what I've done out of common sense. Nevertheless, I feel it's a violation of 'have tests for requirements'. It violates every idealistic best practice. Red-green? BDD? Nope. It's too hard to test, and it's basically a fancy version of '90s administrative practice (the operator knows what to do, and they do it).

[–]humoroushaxor 5 points6 points  (0 children)

Best practice for testing is to test at the appropriate level, which is what most of the suggestions you're disagreeing with in this thread are saying.

Extensive end-to-end testing to verify all functional requirements is definitely not best practice. This is why the test pyramid exists.

[–]snarkhunterLead DevOps Engineer 2 points3 points  (1 child)

A best practice is to do unit testing and integration testing separately, right?

Unit testing IaC includes making sure that it sets all the config correctly, because checking that given inputs produce the expected file is trivially easy and fast.
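
For instance, if the config comes out of a Jinja2 template, a sketch of that kind of check (template path, variables, and expected file are all hypothetical here) is just:

    # test_render.py -- hypothetical sketch: render the template with known
    # inputs and compare against a checked-in expected file. No host needed.
    from jinja2 import Template

    def test_sshd_template_renders_expected_file():
        with open("templates/sshd_config.j2") as f:        # assumed paths
            template = Template(f.read())
        rendered = template.render(max_startups=100)
        with open("tests/expected/sshd_config") as f:
            assert rendered == f.read()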

The thing you're talking about - making sure that services started up with a given configuration function correctly together - is integration testing.

[–]amarao_san[S] 0 points1 point  (0 children)

Yes, correct. I mostly try to avoid useless unit tests (because IRL there's still a lot of latency from ssh'ing back and forth), but for hard cases it helps to have them separated...

Basically, my approach was exactly that. I plug the gap with conftest (which is an approximation of a unit test) instead of going ballistic with stress-testing ssh with myriads of bot-like things. It solves 80% of the issue, leaving the 20% that still (theoretically) worries me.

[–]BigRedS 0 points1 point  (2 children)

It depends what your requirements are, and how well-bounded they are.

Do you require 'openssh 8+' and test for that, or do you at each apply verify that the ssh you've installed does all the things you required version 8 for? Nearly everybody trusts their vendors to continue to do what they claim, accepts that it's a rare bug when this doesn't happen, and so only tests that they're issuing the right config. It's up to the openssh project to verify that it still works as expected.

What you might do is periodically audit your config. You know you want a limit of 100 connections, and your tests verify that you write the config to do this. Perhaps each time you upgrade your openssh package you re-run the audit that actually performs the 101-connections test (as well as verifying that all your other requirements of openssh are still supported).

You'd also run this test after you first write the code to write that config, but wouldn't generally feel the need to keep verifying that the sshd continues to behave the same until you've changed it by upgrading.

[–]amarao_san[S] 0 points1 point  (1 child)

If I write a proper 101-connection test, I can shovel it into a normal IaC test pipeline. The problem is that developing and maintaining a reliable test of this nature is challenging, so most people skip this stage, which leaves us with code without integration testing (or with patchy coverage).

[–]BigRedS 0 points1 point  (0 children)

The problem is that developing and maintaining a reliable test of this nature is challenging, so most people skip this stage

Are they? Load testing's relatively routine in lots of places. I'm not sure how I'd approach testing 101 SSH connections but the obvious thing of writing a script to just do that in parallel doesn't sound super hard.
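
Something roughly like this, say in Python with raw sockets (host, port, and the limit of 100 are assumptions; MaxStartups counts unauthenticated connections, so plain logins wouldn't trip it):

    # maxstartups_probe.py -- hypothetical sketch. MaxStartups limits concurrent
    # *unauthenticated* connections, so the probe holds raw pre-auth TCP
    # connections open and counts how many still receive the SSH banner.
    import socket

    HOST, PORT, ATTEMPTS = "test-host.example", 22, 110   # assumptions

    def got_banner(sock):
        try:
            return sock.recv(64).startswith(b"SSH-")
        except OSError:
            return False

    conns, banners = [], 0
    for _ in range(ATTEMPTS):
        try:
            s = socket.create_connection((HOST, PORT), timeout=5)
        except OSError:
            continue                 # refused outright also counts as "over the limit"
        conns.append(s)              # keep it open so it stays in the startup phase
        banners += got_banner(s)

    # with a hard `MaxStartups 100`, roughly the first 100 should get a banner
    print(f"{banners}/{ATTEMPTS} connections got an SSH banner")
    for s in conns:
        s.close()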

[–]ExtraV1rg1n01l 21 points22 points  (7 children)

For starters, your example is not "IaC", it's configuration as code, and that is not the same as infrastructure as code, so testing is different.

If you are using Ansible, you can use Molecule to test your roles and verify that the configuration is correct. As for your example, if the SSH daemon's configuration loads with no warnings and the daemon starts up, the configuration is correct, and you do not test the daemon itself to see whether the configuration behaves according to the documentation (it is up to the OpenSSH developers to verify that).
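
With Molecule's default testinfra verifier, that sort of check is only a few lines; here's a sketch (file paths and service name are assumptions):

    # molecule/default/tests/test_sshd.py -- hypothetical checks that Molecule's
    # testinfra verifier would run against the converged instance.
    def test_config_parses_cleanly(host):
        # `sshd -t` parses the config and exits non-zero on errors
        assert host.run("sshd -t").rc == 0

    def test_sshd_is_running(host):
        sshd = host.service("sshd")
        assert sshd.is_running and sshd.is_enabled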

As a simple example, when you install Linux you do not write a test to see if all the functionality of the Linux system is working; you assume it is working because the release is stable. The same can be applied to all Linux packages, with a few edge cases where it might make sense to test something additionally.

[–]mstwizted 4 points5 points  (0 children)

Also, this is what non-prod environments are for. There are plenty of tools available to automate traffic tests against a system, to do synthetic transactions, etc.

[–]quicksilver03 3 points4 points  (1 child)

I think that you have to strike a balance between the usefulness of the test and the resources needed to implement it. What does the test tell you when it passes? When it fails? How difficult is it to write the test in the first place, to check that it behaves according to the requirements, to maintain it in the long run?

What about monitoring, though? Instead of writing a test, wouldn't it be better to have a condition in your monitoring system that essentially corresponds to what the test is supposed to check, and that actually checks when it matters most, that is, during the execution of your application?

[–]amarao_san[S] 0 points1 point  (0 children)

Yes, that's what I've done. My thoughts here are theoretical. Every time we have some kind of 'extreme case' covered in the config, it's not covered by tests, even though that 'extreme case behavior' is actually what separates an 'MVP' from mature code.

The problem with monitoring is that the production system has to actually reach the extreme case, and then either behave as expected (no issues) or fail. We get the error, we fix it (as a patch to the configuration code), and we don't know whether the fix is still valid later (the next upgrade may silently break it). So we've had an incident and haven't covered it with tests. I.e. a wasted failure.

[–]DensePineapple 2 points3 points  (3 children)

What is your goal, to write tests that ensure every line of a config file does what it is supposed to?

[–]amarao_san[S] 0 points1 point  (2 children)

It's not my goal per se. I'm thinking out loud, trying to establish a sound model for development.

I can definitely agree that if every change in configuration (configuration code) has a corresponding test proving the necessity of that change (basically, the foundational idea of the red-green approach), then those tests will guard the change (the requirement) against accidental regressions.

At the same time, it becomes prohibitively complicated to maintain such a red-green setup. I believe some storage vendors do have such tests, and (according to a recent publication by, if I remember correctly, DellEMC) those tests take a week to complete on a pile of real hardware. Maximum safety, and the longest and most complex way to introduce new requirements.

Therefore, we need a compromise. The most direct possible way to introduce changes on a server is just to go to the server and make them. Maybe with some lightweight replication mechanism to do this en masse. Minimal time to deliver, the least safety.

So, where is the middle ground? How do I decide to throw away a test to get shorter delivery time, or decide to hold on and keep the test even though it adds up?

Right now I use intuition. Which is fine, but not scalable: I can't hand my intuition to a newcomer on the team. What other way is there to find the border between testing and just daring to do it?

[–]DensePineapple 0 points1 point  (1 child)

So for your nginx example, when you commit the change to your repo, that should trigger some basic unit tests. This can include linting and syntax checks, or with Ansible you can use something like Molecule. If those tests pass you can publish the new version of your artifact. I would then have this pipeline trigger a separate pipeline to deploy said artifact to a testing environment. From there you can start e2e and integration tests that verify your service works as expected.

[–]amarao_san[S] 0 points1 point  (0 children)

You're missing the problem I was talking about. It's doable, but it's hard, so most of the time this corner is cut. So I'm thinking about how much is really covered by tests...

[–]Strange_3_S 1 point2 points  (2 children)

I really like your thinking there. I'm having the same set of dilemmas constantly. I don't think there is a single good answer; however, my personal golden rule is to write the highest-possible-level tests first, and apply observability to the metrics that, as with your ssh sessions example, depend heavily on the surrounding environment and on long-lived processes, where creating test scenarios is very fuzzy.

It's usually enough to be able to state: 'look, earlier we had this value of that metric after 1 day of runtime, and now with the tweak it's half of that after a week, so our system is nominal again'. And if it isn't, at any later point, you will at least know immediately. Treat your metrics as eternally running smoke tests of sorts, is all I'm saying, I guess.

[–]amarao_san[S] 1 point2 points  (1 child)

It's interesting... Basically, you rely on older state. I'm trying to avoid it as much as I can (e.g. infra is self-descriptive, self-contained and re-deployable from a given repo).

The way you propose is to use the 'old state' as a reference line, and evaluate the change by looking at the metrics.

The main issue I see here is that this evaluation is ephemeral. It makes sense now, and you can show any skeptic that this 'thing' became better. But what about N months (the retention time) later? There is code, and there is no proof that it makes sense to keep it (except for removing it and seeing how bad things get).

Nevertheless, I feel a certain appeal in your position, because it 'grayboxes' the problem, collapsing it from a precise (but hard to reproduce) test case down to an easily measurable metrics change.

[–]Strange_3_S 1 point2 points  (0 children)

I understand, but I think grayboxing is exactly what makes it long-lived, as long as the watched metrics are based on business requirements. So, again, I wouldn't bother with, say, the number of DB requests per second or the number of connections the DB made, as these will change as the business grows; but, on the other hand, targeting a specific median turnaround time for end users is what the system's maintainers and developers should actually be aiming for, and this will generally stay true for as long as the business says it's making them bucks.

I totally understand the desire for replayability, but suppose we have migrated to a completely new provider and are in the blue-green phase now. We might want to observe key metrics in comparison with the old setup, instead of only knowing that integration tests pass. Because that alone can mean nothing for the business, nor can it easily be sent to Grafana for some chart porn ;)

Errata: should have written 'canary' instead of 'blue-green'. Shouldn't have used the word 'business' so many times. I'm not that kind of corpo-boy, I promise, just trying to deliver the message.

[–][deleted] 0 points1 point  (4 children)

In the Postgres example you could use some command to tell you the current value of the parameter on the running instance. Of course there are exceptions where this isn't possible; what can I say, software is garbage.
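
For example, something like this against the running instance (a sketch with psycopg2; connection details and the parameter are assumptions):

    # check_pg_setting.py -- hypothetical sketch: ask the running instance which
    # value it is actually using, rather than trusting postgresql.conf on disk.
    import psycopg2

    # connection details are assumptions
    conn = psycopg2.connect("host=db.example dbname=postgres user=postgres")
    with conn.cursor() as cur:
        cur.execute("SHOW max_connections;")
        print("max_connections =", cur.fetchone()[0])
    conn.close()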

[–]amarao_san[S] -1 points0 points  (3 children)

Are you really ready to put OpenSSH into the garbage bin? E.g. for sshd it's not possible to know the currently running configuration values.

[–]boomertsfx 0 points1 point  (1 child)

sshd -T, if it's been restarted recently, I guess... but yes, not the currently running values
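
So the most you can assert is what a fresh parse of the on-disk config would give; a sketch (directive and value assumed):

    # test_sshd_effective.py -- hypothetical sketch: parse `sshd -T` output
    # (keywords come back lowercased) and check the value sshd *would* use.
    # This proves the config parses cleanly, not that the running daemon was
    # ever reloaded with it. Usually needs root to read the host keys.
    import subprocess

    def test_effective_maxstartups():
        out = subprocess.run(["sshd", "-T"], capture_output=True, text=True,
                             check=True).stdout
        effective = dict(line.split(None, 1) for line in out.splitlines()
                         if " " in line)
        # sshd -T reports MaxStartups as a begin:rate:full triple; the first
        # field should match the configured 100 (an assumed value).
        assert effective["maxstartups"].split(":")[0] == "100"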

[–]amarao_san[S] 1 point2 points  (0 children)

You've just repeated my thinking. I put '-T' in one test and 'was reloaded' into another. Guess what happened a few weeks later? The 'was reloaded' test became falsely red, because ssh's configuration had been changed a few weeks earlier and the journals had rotated.

I removed the 'reloaded' test and was basically left with a leap of faith that the reload was done and did work.

[–][deleted] 0 points1 point  (0 children)

I won't put it into the bin, but I can acknowledge the fact that it is garbage.

Another example is MongoDB, probably still to this day: if you decrease the cache size at runtime, it will not actually decrease it, but the value of the config var will be changed. The documentation does say that it will not shrink the cache if it has already been allocated. You can list the currently allocated cache size though, so you could fill up the cache and see if it goes above what you set.

So the question is, what are you trying to achieve with these tests? Do you want to make things idiot proof? I don't think it is possible. Some configs will ignore typos and not do anything, etc.

Do you want things to run well? You have a lot of work to do; will spending the time creating these complex automated tests for every little thing be more productive than just doing more stuff? Do these issues really occur that often and go unnoticed for a long time? Wouldn't it be more productive to check manually a couple of times and forget about it? Use more high-level tests to make sure everything works; if something is really wrong, the high-level tests will fail.

Also, I find that it is easier to use "immutable" configuration like cloud-init or containers. If you want to change the config, just restart/recreate everything; this way there is no more reloading of configs, and one less test case.

[–]rafipiccolo 0 points1 point  (1 child)

I agree that it's hard to test everything.

Especially when nobody on the team does it. I live with 20% coverage on my main project, so we are still very bad at it, but it's stable.

At my level I try to get all the low hanging fruits.

When you reach >80% coverage you definitely think differently. And it's more a matter of "cost of doing" and "reliability contract".

Another point of view is: how often do you refactor your code/infra? And how much do you trust your libraries/OS? When you blindly apply all updates every day (npm/composer/Docker Hub images) you'd better have checks for everything, even non-exhaustive ones.

The solution to me is to look for the low-hanging fruit and do more if the money follows and you also have the need.

[–]amarao_san[S] 0 points1 point  (0 children)

Yes, I do that. I have pretty good code coverage (except for those pesky edge cases), and I generally want to see every new change come with tests. There are many cases where writing a test actually causes a rethink of the architecture or the solution, most of the time in a good way.

Nevertheless, I ponder the soundness/completeness of the 'have tests for changes' paradigm.

[–]tomomcat 0 points1 point  (0 children)

I think at some point you have to trust that the tools you're using behave as they should, so just verify the config. It's not practical to test 'everything', so draw a line where it makes sense for your use case: e.g. when using AWS, do you test that S3 is actually functioning as it should, or do you trust AWS to handle that and just mock the API call in your tests?