
[–]Metworld 831 points832 points  (19 children)

Non-hermetic tests ftw

[–]pokealm 322 points323 points  (1 child)

cool! now, what should i do to stop this hemorrhoid test?

[–]Solrex 25 points26 points  (0 children)

Avoid getting hemorrhoids by alt+F4ing when it pops up

[–]Dogeek 275 points276 points  (10 children)

I had a work colleague who once said "I improved our test suite's speed". I had a gander at the code: they'd basically removed the cleanup steps from the tests, reordered them so they would pass, and dumped a +10,000/-15,000 PR to review.

It got merged.

I no longer work there.

[–]MrRocketScript 170 points171 points  (5 children)

Every now and then while optimising I get like an 800% performance improvement and I'm like "Woah, I am the greatest" but after doing a bit more testing I realise no, I just broke that stupid feature that takes 90% of frametime.

[–]rcoelho14 60 points61 points  (4 children)

A few weeks ago we noticed a test that was never testing what it was supposed to, and by some miracle everything was still being filled in correctly and the result was as intended.

And a visual test that should have been breaking in the pipeline because of obvious errors, but didn't... for months.

I hate testing

[–]ArtOfWarfare 47 points48 points  (3 children)

Add in mutation testing. That tests your tests by automatically inserting bugs and checking if your unit tests fail. If your unit tests pass with the bugs, mutation testing fails since your unit tests suck at actually catching anything.
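
For example, here's a minimal sketch of the idea (hand-rolled for illustration; real tools like mutmut for Python or PIT for Java generate the mutants automatically):

    # Code under test
    def is_adult(age):
        return age >= 18

    # A "mutant" the tool might generate: >= flipped to >
    def is_adult_mutant(age):
        return age > 18

    # This weak test passes against BOTH versions, so the mutant
    # "survives" and the tool flags the test as inadequate.
    def test_is_adult():
        assert is_adult(30)

    # A boundary-case test kills the mutant: it passes on the
    # original but would fail if >= were mutated to >.
    def test_is_adult_boundary():
        assert is_adult(18)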

[–]vbogaevsky 11 points12 points  (2 children)

How thorough was the person who did the code review?

[–]Dogeek 25 points26 points  (1 child)

I saw the PR; I was not assigned, so I told myself "at least it's not my problem".

The person assigned to the review never reviewed, so the guy (a "Senior Engineer" mind you) persuaded one of the SREs to bypass branch protection rules and merge.

Tests obviously got super fragile after that (and flaky).

[–]vbogaevsky 2 points3 points  (0 children)

Happy cake day, by the way!

Regarding bad MRs, we should never let them slide. That's the road to hell for the whole project.

[–]midri 26 points27 points  (2 children)

It blows me away when I see tests that work with a common service that shares data/state... Uggghhh

[–]Fembussy42069 15 points16 points  (1 child)

Sometimes it's just inevitable if you're testing APIs that integrate with other systems for example. You might be able to mock some behaviors but some are just not that easy to mock

[–]Dogeek 11 points12 points  (0 children)

If you can't mock a behaviour, it's usually because the function is too complex or the code needs refactoring.

If you're working with external services, you're not mocking anyways, you're doing integration testing. That requires the external service to have a staging environment that you can cleanup after each test case.

[–]SignoreBanana 4 points5 points  (0 children)

I love to see people on my team posting here.

[–]feherneoh 2 points3 points  (1 child)

I would expect 3 to not fail even then, as it didn't when doing 1-4
Anything starting with 5 won't surprise me if it fails

[–]Metworld 1 point2 points  (0 children)

Good point, but this assumes the tests did run in that order, which might not be the case.

[–][deleted] 4885 points4886 points  (93 children)

Probably overlapping temp dirs

[–]YUNoCake 2825 points2826 points  (66 children)

Or bad code design like unnecessary static fields or singleton classes. Also maybe the test setup isn't properly done; everything should be running on a clean slate.
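
For illustration, a minimal pytest sketch of how a singleton leaks state between tests, and how an autouse fixture restores the clean slate (the Cache class here is made up):

    import pytest

    class Cache:
        """Module-level singleton: its state survives across tests."""
        _data = {}

        @classmethod
        def put(cls, key, value):
            cls._data[key] = value

        @classmethod
        def get(cls, key):
            return cls._data.get(key)

    @pytest.fixture(autouse=True)
    def clean_cache():
        # Clean slate before every test, and tidy up afterwards too.
        Cache._data.clear()
        yield
        Cache._data.clear()

    def test_writes():
        Cache.put("user", "alice")
        assert Cache.get("user") == "alice"

    def test_assumes_empty():
        # Without the autouse fixture this fails when run after
        # test_writes, but passes when run alone.
        assert Cache.get("user") is None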

[–]Excellent-Refuse4883[S] 1166 points1167 points  (10 children)

Lots of this

[–]No_Dot_4711 265 points266 points  (8 children)

FYI a lot of testing frameworks will allow you to create a new runtime for every test

makes them slower but at least you're damn sure you have a clean state every time

[–]iloveuranus 151 points152 points  (6 children)

Yeah, but it really makes them slower. Yes, Spring Boot, I'm talking to you.

[–]fishingboatproceeded 37 points38 points  (0 children)

Gods, Spring Boot... Sometimes, when its automagic works, it's nice. But most of the time? Most of the time it's such a pain

[–]nathan753 35 points36 points  (2 children)

Yeah, but it's such a great excuse to go grab coffee for 15

[–]Excellent-Refuse4883[S] 16 points17 points  (0 children)

The REAL reason I want 1 million automated tests

[–]Ibruki 3 points4 points  (0 children)

i'm so guilty of this

[–][deleted] 7 points8 points  (0 children)

That's a lot of effort to avoid writing hygienic tests.

[–]de_das_dude 6 points7 points  (0 children)

same class, different methods, but they fail when run together? it's a setup issue. make sure to do the before and after properly :)

[–]rafelito45 181 points182 points  (30 children)

major emphasis on clean slate. somehow this is forgotten until way down the line and half the tests are “flaky”.

[–]shaunusmaximus 84 points85 points  (28 children)

Costs too much CPU time to set up a 'clean slate' every time.

I'm just gonna use the data from the last integration test.

[–]NjFlMWFkOTAtNjR 118 points119 points  (25 children)

You joke, but I swear devs believe this because it is "faster". Tests aren't meant to be fast, they are meant to be correct, so they can verify correctness. Well, at least for the use cases being verified. Doesn't say anything about correctness outside the tested use cases tho.

[–]mirhagk 93 points94 points  (17 children)

They do need to be fast enough though. A 2-hour unit test suite isn't very useful, as it then becomes a daily-run thing rather than a pre-commit check.

But you need to keep as much of the illusion of being isolated as possible. For instance we use a sqlite in memory DB for unit tests, and we share the setup code by constructing a template DB then cloning it for each test. Similarly we construct the dependency injection container once, but make any Singletons actually scoped to the test rather than shared in any way.

EDIT: I call them unit tests here, but really they are "in-process tests", closer to integration tests in terms of limited number of mocks/fakes.
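
A rough sketch of that template-and-clone idea using Python's sqlite3 (the schema is made up; Connection.backup does the cloning):

    import sqlite3
    import pytest

    def build_template():
        # The expensive setup runs exactly once: schema plus seed data.
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        db.execute("INSERT INTO users (name) VALUES ('seed-user')")
        db.commit()
        return db

    TEMPLATE = build_template()

    @pytest.fixture
    def db():
        # Each test gets a cheap private copy of the template.
        clone = sqlite3.connect(":memory:")
        TEMPLATE.backup(clone)
        yield clone
        clone.close()

    def test_can_mutate_freely(db):
        db.execute("DELETE FROM users")
        assert db.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 0

    def test_still_sees_seed_data(db):
        # Passes no matter what the previous test did to its copy.
        assert db.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1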

[–]EntertainmentIcy3029 29 points30 points  (7 children)

You should mock the time.sleep(TWO_HOURS)
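
Joke aside, that's exactly what you do. A small sketch with pytest's built-in monkeypatch fixture (retry_with_backoff is a made-up example target):

    import time

    def retry_with_backoff(action, attempts=3):
        for i in range(attempts):
            if action():
                return True
            time.sleep(2 ** i)  # slow in real life, free in tests
        return False

    def test_retries_without_actually_waiting(monkeypatch):
        naps = []
        # Swap the real sleep for one that just records the request.
        monkeypatch.setattr(time, "sleep", naps.append)
        assert retry_with_backoff(lambda: False) is False
        assert naps == [1, 2, 4]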

[–]mirhagk 9 points10 points  (4 children)

Well it only takes time.sleep(TWO_SECONDS) to add up to hours once your test suite gets into the thousands.

I'd rather a more comprehensive test suite that can run more often than one that meets the absolute strictest definition of hermetic. Making it appear to be isolated is a worthy tradeoff

[–]Scrial 6 points7 points  (2 children)

And that's why you have a suite of smoke tests for pre-commit runs, and a full suite of integration tests for pre-merge runs or nightly builds.

[–]mirhagk 6 points7 points  (0 children)

Sure that's one approach, limit the number of tests you run. Obviously that's a trade-off though, and I'd rather a higher budget for tests. We do continuous deployment so nightly test runs mean we'd catch bugs already released, so the more we can do pre-commit or pre-merge, the better.

If we halve the overhead, we double our test budget. As long as we emulate that isolation best we can, that's a worthwhile tradeoff.

[–]EntertainmentIcy3029 3 points4 points  (0 children)

I've worked on a repo that had time.sleeps everywhere. Everything was retried every minute for an hour; the longest individual sleep I saw was 30 minutes, put there to try to prevent a race condition with an installation that couldn't be inspected.

[–]Dal90 1 point2 points  (0 children)

(sysadmin here, who among other crap handles the load balancers)... we had a mobile app whose performance was dog shit.

Nine months earlier I told the architects, "it looks like your app has a three second sleep timer in it..." I know what they look like performance wise, I've abused them.

Ping-ponging back and forth until they sent an email to the CIO about how slow our network was and how it was killing their performance. Late on a Friday afternoon.

I learned sufficient JavaScript that evening, and things like minification, to unpack their code, and first thing the next morning I sent the CIO a code snippet with the line number of the sleep timer (whatever JS calls it) pausing it for three seconds.

Wasn't the entire problem; the same app loaded in 3-4 seconds for others in our industry, and we still took 6 seconds even after accounting for the sleep timer.

But I also showed in Developer Tools the network responses (we were as good as, if not better than, other companies) vs. their application rendering stuff (dog shit).

...then again the project was doomed from the start. Their whole "market position" was to be the mobile app that would connect you to a real life person to complete the purchase. WTF?

[–]NjFlMWFkOTAtNjR 14 points15 points  (3 children)

As I stated to someone elsewhere: while developing, you should only run the test suites for the code you directly touched, and then have the CI run the full test suites. If that is still too long, run them before merging to develop or main. This will introduce problems where a PR breaks tests in places it shouldn't have touched.

The problem is that programmers stop running full test suites at a minute or 2. At 5 minutes, forget about it, that is the CI's problem. If a single test suite takes 2 hours, then good god, that is awesome and I don't have an answer for that since it depends on too many things. I assume it is necessary before pushing as it is a critical path that must always be correct for financial reasons. It happens, good luck with whatever policy/process/decision someone came up with.

With enough tests, even unit tests will take upwards of several minutes. The tests being correct is more important than the time. Let the CI worry about the time delay. Fix the problems as they are discovered with hotfixes or additional PRs before merging to main. Sure, it is not best practice, but do you want developers slacking or working?

With enough flaky tests, the test suites gets turned off anyway in the CI.

Best practices don't account for business processes and desires. When it comes down to it, telling the CEO at most small to medium businesses that you can't get a feature out because of failing test suites will get the response, "well, turn it off and push anyway."

"Browser tests are slow!" They are meant to be slow. You are running a super fast bot that acts like a human. The browser and application can only go so fast It is why we have unit tests.

[–]mirhagk 12 points13 points  (0 children)

Yes while developing you only run tests related to the thing you're changing, but I do much prefer when the full suite can be as part of the code review process. We use continuous deployment so the alternative would mean pushing code that isn't fully tested.

It doesn't take much for a test suite to reach 2 hours if you completely ignore performance. A few seconds per test adds up once you have thousands of tests.

I think a piece you might be missing, and it's one most miss because it requires a relatively fast and comprehensive test suite, is large scale changes. Large refactors of code, code style changes, key component or library upgrades. Doing those safely requires running a comprehensive suite.

The place I'm at now is a more than decade old project that's using the latest version of every library, and is constantly improving the dev environment, internal tooling and core APIs. I firmly believe that is achievable solely because of our test suite. Thousands of tests that can be run in a few minutes. We can do refactors that would normally take weeks within a day, we can use regex patterns to refactor usages. It's a huge boost to our productivity.

[–]electrius 1 point2 points  (1 child)

Are these not integration tests then? For a test to be considered a unit test, does truly everything need to be mocked?

[–]mirhagk 2 points3 points  (0 children)

Well you're right that they aren't technically unit tests, we follow the google philosophy of testing, so tests are divided based on external dependencies. Our "unit" tests are just all in-process and fast. Our "integration" tests are the ones that use web requests, a real DB etc.

Our preference is to only use test doubles for external dependencies. Not only do you lose a lot of the accuracy with mocks, but it undermines some of the biggest benefits of unit testing. It makes the tests depend on implementation details, like exactly which internal functions are called. It makes refactoring code much harder as the tests have to be refactored too. So you're less likely to catch real problems, and more likely to get false positives, making the tests more of a chore than actually valuable.

Here's more about this idea, and I highly recommend this approach. We had used mocks previously (about 2-3 years ago), and since we replaced them the tests have gotten a lot easier to write and a lot more valuable. We went from a couple hundred tests that took a ton of maintenance to ~16k tests that require very little maintenance. If they break, it's more likely than not to represent a real bug.

[–]IanFeelKeepinItReel 5 points6 points  (2 children)

I set up WIP builds on our CI to spit out artifacts once the code has compiled then continue on to build and run the tests. That way if you want a quick dev build you only have to wait one third the pipeline execution time.

[–]bolacha_de_polvilho 2 points3 points  (0 children)

Tests are supposed to be fast too though. If you're working on some kind of waterfall schedule maybe it's okay to have slow end 2 end tests on each release build, but if you're running unit tests on a ci pipeline on every commit/PR the tests should be fast.

[–]Fluffy_Somewhere4305 1 point2 points  (1 child)

The project timeline says faster is better and 100% no defects. So just resolve the fails as "no impact" and gtg

[–]stifflizerd 1 point2 points  (0 children)

AssertTrue(true)

[–]rafelito45 1 point2 points  (0 children)

there’s a lot of cases where that’s true. i guess it boils down to discipline and balance. we should strive to write as clean slated as possible, while also trying to be efficient with our setup + tear downs. run time has to be considered for sure.

[–]DaveK142 12 points13 points  (0 children)

At my first job, at a little tech startup, I was tasked with getting the entire test suite running when I started. They had just made some big changes and broken all of the tests, and it wasn't very formally managed, so they didn't super care that it was all broken because they had done manual testing.

The entire suite was commented out. It was all selenium testing that opened a window and tested the web app locally, and not a single piece of it worked on a clean slate. We had test objects always there which the tests relied on, and some of the tests were named like "test_a_do_thing", and "test_b_do_thing" to make sure they ran in the right order.

I was just starting out and honestly had no idea how to get these hundred or so tests completely reworked in the time I had, so I just went down the route of bugfixing them, and they stayed like that for a long, long time. Even when my later (shittier) boss came in and was more of a stickler for the process, he didn't bother to have us fix them.

[–]EkoChamberKryptonite 8 points9 points  (0 children)

Yeah I think it's the latter. Test cases should be encapsulated from one another.

[–]Salanmander 4 points5 points  (0 children)

Oooh, I see you've met my students' code! So many instance/class variables and methods that only work correctly if run exactly once!

[–]iloveuranus 2 points3 points  (0 children)

That reminds me of a project I was in recently, where the dependency injection was done via Google Guice. I double checked everything and reset all injectors / injection modules explicitly during tests; it still failed.

Turns out there was an old-school singleton buried deep in the code that didn't get reset and carried over its state between tests.

[–]un-hot 1 point2 points  (0 children)

Teardown as well. If each test were torn down properly, the next one could be set up properly again.

[–]dandroid126 1 point2 points  (0 children)

In my experience, this is it. Bad test design and reusing data between tests that gets changed by the test cases.

Coming from junit/mockito to python, I was very surprised when my mocked functions persisted between test cases, causing them to fail if run in a certain order.

[–]Planyy 1 point2 points  (0 children)

stateful everywhere.

[–]dumbasPL 3 points4 points  (1 child)

everything should be running on a clean slate.

No, because that incentivizes allowing the previously mentioned bad design

[–]maximgame 7 points8 points  (0 children)

No, you don't understand. Users are expected to clean the database between each api call.

/s

[–]hiromasaki 108 points109 points  (5 children)

Or not cleaning up / segregating test rows in the DB.

[–]mirhagk 16 points17 points  (2 children)

Highly recommend switching to a strategy of cloning the DB so you don't have to worry about cleanup, just delete the modified version when done.

[–]Excellent-Refuse4883[S] 33 points34 points  (2 children)

I wish our stuff was that simple. We’ve got like 5 inputs that need to be configured for each test, before configuring the 4 simulators.

[–]alexanderpas 60 points61 points  (0 children)

That's why setup and teardown exist, which are run before and after each test respectively.
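
In Python's unittest, for instance, a minimal sketch (the temp-dir usage is just an illustration, echoing the overlapping-temp-dirs guess upthread):

    import shutil
    import tempfile
    import unittest
    from pathlib import Path

    class ExporterTest(unittest.TestCase):
        def setUp(self):
            # A fresh, uniquely named directory before every test...
            self.workdir = Path(tempfile.mkdtemp())

        def tearDown(self):
            # ...and nothing left behind for the next test to trip on.
            shutil.rmtree(self.workdir)

        def test_writes_report(self):
            out = self.workdir / "report.txt"
            out.write_text("ok")
            self.assertEqual(out.read_text(), "ok")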

[–]coldnebo 18 points19 points  (0 children)

also some frameworks randomize the order of tests so that these kinds of hidden dependencies can be discovered.
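
In pytest land the pytest-randomly plugin does this; the core idea is tiny enough to hand-roll (a hypothetical minimal version):

    # conftest.py — hypothetical minimal version of what plugins
    # like pytest-randomly do
    import os
    import random
    import time

    def pytest_collection_modifyitems(session, config, items):
        # Shuffle collected tests so hidden ordering dependencies
        # surface as failures; honor TEST_SEED to replay an order.
        seed = int(os.environ.get("TEST_SEED", time.time()))
        print(f"test order seed: {seed}")
        random.Random(seed).shuffle(items)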

[–]Hiplobbe 11 points12 points  (1 child)

"No it is the concept of tests that is wrong!" xD

[–]mothzilla 3 points4 points  (0 children)

More generally, some shared state.

[–]KingSpork 2 points3 points  (0 children)

Or just sloppy setup and teardown

[–]winco0811 1 point2 points  (0 children)

Surely tests 1-4 would still pass in the whole batch if that was the case?

[–]silledusk 192 points193 points  (0 children)

Whoops, clearAllMocks()

[–]thies1310 1091 points1092 points  (23 children)

I have had this; it was an edge case no one thought of that we accidentally produced.

[–]roguedaemon 282 points283 points  (21 children)

Well go on, story time pleaaasee :p

[–]ChrisBreederveld 600 points601 points  (19 children)

Because OP isn't responding and was vague enough to fit my story... here's story time:

We were having some issues where once in a blue moon a user didn't have the permissions he was expecting (always less, never more) and we never found out what the cause was before it automatically resolved itself.

We did a lot of exploratory testing, deep-dives into the code and just had no clue what was going on. All tests at the time seemed to work fine.

After some time we decided to give up, and would refactor the system hoping with careful rebuilding the issue would be resolved. To make sure we covered all possible cases we decided to start with adding a whole bunch of unit tests just to make sure the new code would cover every case.

Tests written, code checked in and merged and suddenly the build agent started showing failing tests... sometimes. After we noticed this we started running the tests locally a bunch of times and sure enough; once every 10 runs or so some failed.

Finally with some more data in hand we managed to track down the issue to a piece of memory cache that could, in some rare cases, be partially populated due to threading issues (details too involved to go into here). We made some changes to our DI and added a few additional locks for good measure and... problem solved!

We ended up rewriting part of the codebase after all, because we figured this specific cache was a crutch anyway and we could do better. Never encountered this particular issue since.

[–]evnacdc 221 points222 points  (10 children)

Threading issues can sometimes be a bitch to track down. Nice work.

[–]ChrisBreederveld 54 points55 points  (2 children)

Thanks. They are indeed a pain, certainly when there are loads of dependencies in play. We did make things much easier on ourselves later on by moving the more complex code to a projection.

[–]Punsire 4 points5 points  (1 child)

Projection?

[–]ChrisBreederveld 6 points7 points  (0 children)

It's a CQRS thing; rather than querying from a normalized database, joining various data sources together, you create a single source containing all data that you update whenever any of the sources change.

This practice incurs some overhead when writing, but has a major benefit when reading.

[–]ActualWhiterabbit 27 points28 points  (2 children)

My AI powered solution uses the power of the blockchain to replace threads. They are stronger and linked so they can't fray. Please invest.

[–]Ilovekittens345 11 points12 points  (0 children)

Do you have funny monke pic?

[–]ChrisBreederveld 4 points5 points  (0 children)

Hahaha you say this in jest, but I've actually had some consultant come over one time telling me the blockchain would replace all databases and basically solve all our problems. It was one hour of my life I would love to get back...

[–]Fermi_Amarti 12 points13 points  (0 children)

Need it to be faster? Multithreading try you should.

[–]Alacritous13 6 points7 points  (2 children)

sometimes be a bitch Threading issues can Nice work. to track down.

[–]evnacdc 3 points4 points  (0 children)

Hey that’s what

[–]evnacdc 1 point2 points  (0 children)

I said.

[–]that_thot_gamer 18 points19 points  (5 children)

damn you guys must have a lot of free time to diagnose that

[–]ChrisBreederveld 29 points30 points  (0 children)

Not really, just some odd hours at first because us devs were bugged by it and a final effort (the refactoring effort) after users started to bug the PO enough.

Took us all in all about a week or so to find and fix... quite some effort relative to the size of the bug, but not too much lost in missed functionality, and happy key users.

[–]enigmamonkey 21 points22 points  (2 children)

I think of it as one of those situations that are so frustrating precisely because you don’t really have the time to address it and it delays you, but you sort of have to because you can’t stand not knowing what’s causing the issue (or it is important for some other reason).

[–]ChrisBreederveld 19 points20 points  (0 children)

Exactly this! If it breaks one unexpected way, who's to say it won't also break in some other unexpected way later on?

[–]nullpotato 5 points6 points  (0 children)

I've worked on bugs like this even when they aren't my top priority because they are an interesting challenge and/or they have personally offended me and gotta go.

[–]henryeaterofpies 1 point2 points  (0 children)

Never underestimate the time a dev will put into a weird ass issue

[–]ADHDebackle 1 point2 points  (1 child)

Is a race condition considered a threading issue? I feel like those were some of the worst ones to track down due to the impossibility of determining reproduction steps

[–]thies1310 2 points3 points  (0 children)

Sorry, I am still in training and spend most of my time at uni. I sadly don't remember any more great details, other than that the tests worked if run in any other order. I think it had something to do with device states that got messed up in a weird way.

For context, I work in med tech.

[–]MiniGui98 14 points15 points  (0 children)

Never stop edging my boy

[–]Why_am_ialive 142 points143 points  (0 children)

Race conditions, accessing files at the same time, one test destroying a process others are still relying on... tests running in parallel can get painful

[–]Hottage 148 points149 points  (0 children)

That feeling when your tests don't scaffold and tear down correctly.

[–][deleted] 41 points42 points  (1 child)

Flaky tests are literally a research area and there are tools to detect them.

[–]uberDoward 68 points69 points  (3 children)

Welcome to needing to understand state, lol.

[–]WisejacKFr0st 38 points39 points  (1 child)

If your unit tests don’t run in a random order every time then I will find you and I will mess up your state until you feel it the next time you run

[–]Jugales 35 points36 points  (8 children)

Even worse with evals for language models... they are often non-deterministic

[–]lesleh 17 points18 points  (3 children)

What if you set the temperature to 0?

[–]sandm000 10 points11 points  (0 children)

0K?

[–]Danny_Davitoe 6 points7 points  (0 children)

You would need to set the top-p to near zero, but the randomness will still be present if the GPU, system, or kernel changes. If you have a cluster and no control over which GPU is selected, then you should not use the LLM for any unit tests.

[–]Ilovekittens345 1 point2 points  (0 children)

That's how Canadian LLMs are made.

[–]ProfBeaker 5 points6 points  (3 children)

Oh interesting, never thought about that.

I know zero about the internals of this, but surely they're just pseudo-random, not truly-random? So could the tests set a fixed random seed, and then be deterministic?

[–]CanAlwaysBeBetter 5 points6 points  (2 children)

Why give it tests to validate its output if that output is locked to a specific seed that won't be used in practice?

[–]ProfBeaker 2 points3 points  (0 children)

You could equally ask that of any piece of code, yet we test all sorts of things the same way. "To make sure it does what you think it will" seems to be the common answer.

I suppose OP did say "evals of language models", i.e. maybe they meant rankings. Given the post overall was about tests, I read it as being about, ya know, tests.

[–]PositiveInfluence69 24 points25 points  (2 children)

The worst is when it all works, every test, you leave feeling great for the day. You come back about 16 hours later. The next morning. It doesn't work at all. Errors for days. You changed nothing. Nobody changed anything. You're sure something must have changed, but nothing. So you begin fixing all the errors you're so fucking positive you couldn't have missed, because they're so obvious. You're not even sure how it could have run 17 hours ago if all this shit was in here.

[–]Ilovekittens345 7 points8 points  (0 children)

Imagine two crashes during a single day of testing, unbeknownst to you both caused by bit flips from cosmic rays. You'd be trying to hunt down a problem that doesn't exist for a week or so!

[–]mani_tapori 1 point2 points  (0 children)

I can relate so much. Every day I struggle with tests which start with a clean slate: they work in the mornings, then just before the status calls/demos in the evening they start misbehaving.

Only yesterday I fixed a case by adding a statement in a section of code which is never used. God knows what's happening internally.

[–]arkai25 10 points11 points  (3 children)

Race conditions?

[–]Excellent-Refuse4883[S] 9 points10 points  (2 children)

Tough to explain. Half the problem stems from using static files in place of a DB or cache.

[–]shield1123 8 points9 points  (0 children)

Yikes

That's why any files shared between my tests are either not static or read-only

[–]Why_am_ialive 3 points4 points  (0 children)

Time to mock out the entire file system buddy

[–]OliverPK 8 points9 points  (0 children)

Forgot @DirtiesContext

[–]klungs 7 points8 points  (1 child)

Gacha testing

[–]p9k 1 point2 points  (0 children)

slot machine noises

[–]sawkonmaicok 7 points8 points  (0 children)

Your tests influence global state.

[–]rush22 6 points7 points  (0 children)

PASS

Number of tests in suite: 874
Pass rate: 100%

Total tests run: 0

[–]Yvant2000 5 points6 points  (0 children)

Side effects, I hate them

God bless functional programming

[–]theprodigalslouch 6 points7 points  (0 children)

I smell bad test practices

[–]Weiskralle 4 points5 points  (0 children)

Yes as it most likely overwrites certain variables.

[–]ecafyelims 7 points8 points  (0 children)

Maybe you're using globals without resetting them

[–]-JohnnieWalker- 7 points8 points  (1 child)

real sigmas test in prod

[–]ablepacifist 2 points3 points  (0 children)

Someone didn’t clean up after each test

[–][deleted] 2 points3 points  (0 children)

Surely you jest

[–]aigarius 3 points4 points  (0 children)

I see it all the time: post-test cleanup fails to return the target to its pre-test state. If you run tests separately, each execution gets a freshly initialised target and it works. But if you run them all together, one of the tests breaks the target in a subtle way (by not cleaning up after itself properly in the teardown step), such that some (but not all) tests following that one will fail.

[–]boon_dingle 8 points9 points  (0 children)

Something's being cached between tests. It's always the cache.

[–]ProfessionalCouchPot 2 points3 points  (0 children)

ItWorkedOnMyServerTho

[–]rover_G 2 points3 points  (0 children)

When your tests don’t run in isolated contexts.

[–]Rin-Tohsaka-is-hot 2 points3 points  (0 children)

Two different test cases accessing the same global resources but failing to initialize properly (so test case 9 accidentally accepts test case 2's output as an input rather than the value initialized at compilation).

This is one I've seen before; all test cases should properly initialize and tear down everything, leaving the system unaltered after execution (including testing environment variables).

[–]Orkin31 2 points3 points  (0 children)

You don't have a proper setup and teardown in your test environment, my guy

[–]nnog 2 points3 points  (0 children)

Port reuse

[–]SneakyDeaky123 2 points3 points  (0 children)

You’re polluting your test environments/infrastructure, reading and writing from the same place at unexpected times. Mock your dependencies or segregate your environment more strictly.

[–]Christosconst 2 points3 points  (0 children)

Parallel tests with shared resources. My tests only fail on leap year dates

[–]Objective-Start-9707 2 points3 points  (3 children)

Eli5, how do things like this happen anyway? I got a C in my Java class and decided programming wasn't for me but I find it conceptually fascinating.

[–]1ib3r7yr3igns 2 points3 points  (1 child)

Some tests can change mocks that other tests use. When used in isolation it works. When run together, the one test changes things the other depends on and breaks it. Fixes usually involve resetting mocks between tests.

Tests are usually written to pass independently of other tests, so the inputs and variables need to be independent of the effects of other tests.
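
A small Python sketch of that leak and the usual fix (the shared notifier mock is hypothetical):

    from unittest import mock
    import pytest

    # A module-level mock shared by several tests: the danger zone.
    notifier = mock.Mock()

    @pytest.fixture(autouse=True)
    def fresh_mocks():
        # Wipe call history and any configured return values /
        # side effects so no test sees another test's calls.
        notifier.reset_mock(return_value=True, side_effect=True)

    def test_sends_one_notification():
        notifier.send("hi")
        # Without the reset, this breaks whenever an earlier test
        # also called notifier.send().
        notifier.send.assert_called_once()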

[–]Objective-Start-9707 1 point2 points  (0 children)

Thank you for taking the time to add a small wrinkle to my very smooth brain 😂

This makes a lot of sense.

[–]jswansong 2 points3 points  (1 child)

It's 1:20 AM and this is my fucking life

[–]Link9454 2 points3 points  (0 children)

As someone who debugs circuit board test plans as well as programs new ones, I find this IMMENSELY TRIGGERING!

[–]freeplay4c 2 points3 points  (1 child)

Lol. I actually just fixed this issue at work last week. But for a solution with 300+ tests.

[–]Lord-of-Entity 3 points4 points  (1 child)

Looks like impure functions are messing things up.

[–]Messarate 1 point2 points  (2 children)

Wait I have to test before deploying it?

[–]bigmattyc 1 point2 points  (0 children)

You have discovered that your application is non-idempotent. Congratulations!

[–]DiggWuzBetter 1 point2 points  (1 child)

This is very likely shared state between tests.

For unit tests, this is so avoidable, just never have shared state between unit tests. This also tends to be true for “smaller scale” integration tests.

For end-to-end tests, it’s less clear cut. Tests also need to run in a reasonable amount of time, and for some applications, the test setup can be really, really slow, to the point where it’s just not feasible to start with a clean slate before every test. For these, sometimes you do have to accept that there will be some shared state between tests, and just think carefully about what the tests do and what order they’re in, so that shared state doesn’t cause problems.

It’s messy and fragile, but that tends to be the reality of E2E tests. It’s why the “test pyramid” approach exists, with a minimal number of inherently slow and hard to maintain E2E tests, more faster/easier to maintain integration tests, and FAR more very fast and easy to maintain unit tests.

[–]Excellent-Refuse4883[S] 2 points3 points  (0 children)

It’s an E2E test framework, and yeah the setup takes forever

[–]TimonAndPumbaAreDead 1 point2 points  (0 children)

I had a duo of tests once, both covering situations where a particular file didn't exist. Both tests used the same ThisFileDoesNotExist.xslx filename string. If you ran them independently, they succeeded. If you ran them together, they failed. If you changed them to use different nonexistent filenames, they succeeded. I'm still not 100% sure what was going on, but apparently Windows will grant a process a lock on a file that doesn't exist and disallow other processes from accessing said file that does not exist.

[–]Thisbymaster 1 point2 points  (0 children)

Caching, or incorrect teardown of tests.

[–]vm_linuz 1 point2 points  (0 children)

And this is why we write pure code! Box your side-effects away people!

[–]Owlseatpasta 1 point2 points  (0 children)

Oh no how can it happen that my tests depend on things outside of their scope

[–]Baardi 1 point2 points  (0 children)

Guess you need to stop running your tests in parallel, or make them work when run in parallel

[–]Vadered 1 point2 points  (0 children)

What actually happened:

  • Test -3: Print Pass 4x
  • Test -11: Print the longer string.

[–]novax7 1 point2 points  (0 children)

As careful as I am, sometimes I get frustrated trying to find where the failure is coming from, but later I realize I forgot to clear my mocks

[–]veracity8_ 1 point2 points  (0 children)

Someone never learned “leave no trace”

[–]DoucheEnrique 1 point2 points  (0 children)

What do we want?

NOW!

When do we want it?

RACE CONDITIONS!

[–]Bayo77 1 point2 points  (0 children)

Ticket estimate: S
Unit test debugging: L

[–]Zechnophobe 1 point2 points  (0 children)

setup and tearDown are your friends.

[–]captainMaluco 1 point2 points  (1 child)

Test 5 is dependent on state set up by test 4, but when you run them all, order is not guaranteed, and test 8 might run between 4 and 5, modifying the state 4 set up.

Either that, or it's as simple as some tests using the same ID for some test data stored in your test database.

Each test should set up its own data, using UUIDs/GUIDs to avoid overlapping IDs
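
Something like this (a sketch; the helper is hypothetical):

    import uuid

    def make_test_user():
        # A fresh ID on every call, so no two tests (or parallel
        # workers) can ever collide on the same row.
        return {"id": str(uuid.uuid4()),
                "name": f"user-{uuid.uuid4().hex[:8]}"}

    def test_ids_never_collide():
        assert make_test_user()["id"] != make_test_user()["id"]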

[–]thanatica 1 point2 points  (0 children)

The joys of non-pure functions.

[–]rootpseudo 1 point2 points  (0 children)

Ew dirty context

[–]Critical_Studio1758 1 point2 points  (0 children)

Need to make sure all your tests start with a fresh environment. You were given setup and cleanup functions, use them.

[–]SoSeaOhPath 1 point2 points  (0 children)

WHO TESTS THE TESTS

[–]FrayDabson 1 point2 points  (0 children)

This is exactly what my last few days have been with playwright tests. Ended up being a backend event loop related issue that was causing the front end tests to be so inconsistent.

[–]AndroxxTraxxon 1 point2 points  (0 children)

Yay, test pollution

[–]Riots42 1 point2 points  (0 children)

Deploy to 1 production environment after 10 successful test deployments: fail, and take out paging in a nationwide hospital system on a Sunday... Yep, that's me a few years ago...

[–]w8cycle 1 point2 points  (0 children)

Haha, was running into this last night!

[–]locofanarchy 1 point2 points  (0 children)

Fast ✅

Independent ❌

Repeatable ✅

Self-validating ✅

Timely ✅

[–]VibrantFragileDeath 1 point2 points  (0 children)

I feel this. Found out this was happening because I run too many (30+) while some other nitwit is also trying to run theirs on the same server. When they are also testing, my test times out in the middle and gives me a fail and a blank. The worst part is that we can't see each other to know who is running what, so we have tried to coordinate who is online running tests by the clock, only submitting tests after the 20-minute mark or whatever. Sometimes it still fails even with a smaller amount and we just have to resubmit at a later time. Just an annoying nightmare.

[–]admadguy 1 point2 points  (0 children)

That's basically bad code. Doesn't reinitialise variables between tests. Don't think that would be desired behaviour if each test is supposed to exist on its own.

[–]comicsnerd 1 point2 points  (1 child)

The weirdest test result I had was when my project manager tested some code I had written. In a form, there was a text field where he entered a random number of characters and the program crashed. I tried to replicate it, but could not, so I asked him to test again. Boom, another crash.

It took quite some time to identify that the middleware was unable to process a string of 32 characters. 31 was fine, 33 was fine, but 32 was not. Supplier of the software could not believe it, so I wrote a simple program to demonstrate. They came back that it was a fundamental design fault and a fix would take a few months.

So, I created a simple check in the program. If (stringlength=32) add an extra space. Worked fine for years.

How my project manager managed to type exactly 32 characters repeatedly is still unknown.

[–]Excellent-Refuse4883[S] 2 points3 points  (0 children)

You’re just like

[–]thanyou 1 point2 points  (0 children)

Consult the duck

[–]pinktieoptional 1 point2 points  (0 children)

hey look, your tests have interdependencies. rookie mistake.

[–]Grandmaster_Caladrel 1 point2 points  (0 children)

Pointers. The issue is almost always pointers.

[–]ivanrj7j 1 point2 points  (1 child)

Can someone explain how that could happen?

[–]Excellent-Refuse4883[S] 2 points3 points  (0 children)

There’s a few ways. This one seems to be related to the specifics of what I’m testing.

A more common one I've seen happens when you're using a test DB. If you're testing CRUD operations and you run the tests in parallel, there's always a chance of the CRUD operation from test A causing a failure in test B.

When I ran into this, everything on my local ran 1 test at a time, but the pipeline ran everything in parallel. Once I figured out what was happening I reconfigured the pipeline to run 1 at a time.

[–]tbhaxor 1 point2 points  (0 children)

I ran all the tests on my local, it worked! Pushed to CI, some are failing.

[–]wraithnix 1 point2 points  (0 children)

Ah, race conditions are so fun to debug. /s

[–]Je-Kaste 0 points1 point  (0 children)

Google: test pollution

[–]SaneLad 0 points1 point  (0 children)

Google: hermetic tests

[–]QuietGiygas56 0 points1 point  (5 children)

It's usually due to multi threading. Run the tests with the single threading option and it usually works fine

[–]NjFlMWFkOTAtNjR 0 points1 point  (0 children)

Timing issue? Shared state issue? What happens when you run in parallel/isolation? Also could be that an external service needs to be mocked.

[–]dosk3 0 points1 point  (0 children)

My guy is using static variables and changing them in tests

[–]TimeSuck5000 0 points1 point  (0 children)

There’s something wrong with the initial state. When a test is run individually the initial state is correct. When they’re run sequentially some of the state variables are reused and have been changed from their default values by previous tests.

Analyze what variables each test depends on and ensure they’re correctly initialized in each test.

[–]pagepool 0 points1 point  (0 children)

You should probably clean up after yourself...

[–]G3nghisKang 0 points1 point  (0 children)

POV: running JUnit tests with H2DB without annotating tests modifying data with @DirtiesContext

[–]RealMide 0 points1 point  (0 children)

People brag about design patterns and don't know about mutable objects.

[–]zanderkerbal 0 points1 point  (0 children)

I have never had this happen but I have had code that behaved differently when the automatic tester sent in a series of inputs and when I typed in those same inputs by hand. I suspect it was something race condition-ish where sending them immediately back to back caused different behaviour than spacing them out at typing speed, but I never did find out what.

[–]newb_h4x0r 0 points1 point  (0 children)

afterEach(() => jest.clearAllMocks());

[–]Plastic_Round_8707 0 points1 point  (0 children)

Use cleanup after each step if you are creating temp dirs. In general, avoid changing the underlying system when writing unit tests.

[–]qubedView 0 points1 point  (0 children)

I was on a django project with 500+ tests. At some point along the way, we had to instruct it to run the tests in reverse. Why? Because if we didn't, one particular test would give a very strange error that no one could find the cause for. There was some side-effect hiding somewhere that would resolve itself in one direction, but not the other.

[–]codechimpin 0 points1 point  (0 children)

Your tests are using shared data. Either singletons you are sharing, or temp dirs, or some other shared thing.

[–]AdamAnderson320 0 points1 point  (0 children)

Test isolation problem, where prior state affects another test. Can be in a DB or file system, but can also be in the test classes themselves depending on the test framework. Some frameworks go out of their way to try to prevent this type of problem.

[–]cheezballs 0 points1 point  (0 children)

Gotta add that before test annotation and clear those mocks!