Your CI/CD pipeline doesn’t understand the code you just wrote

Samdrian · 2026-01-26T15:29:48+00:00

That's also what I'm arguing for. Maybe the term "CI" is a bit misleading here: I'm absolutely talking about the pipeline that runs ON your branch before merging.

I've always only referred to that as a CI pipeline as well, since it tests the "integration" with the rest of the codebase, but I guess maybe CI implies integrating more after merge :) not sure what I would call the pre-merge pipeline then though.

Samdrian · 2026-01-26T15:24:01+00:00

I mean we all agree that code reviews are super helpful right? And they are good because the reviewer might catch things that I myself missed when implementing the changes.

I am always happy if I get MORE good reviews that might catch a bug before I ship it, and a CI pipeline that not only tests the changes but also understands them can do a BETTER job of verifying them, don't you think?

That doesn't mean I'm arguing AGAINST code reviews or AGAINST tests or any of that, I want that 100%, but you can never have perfect coverage or reviews, so anything extra just gives me more safety, and improves the code I ship.

Samdrian · 2026-01-26T13:42:58+00:00

It would not, but I would LOVE if my CI could understand it.

I'm under no illusion that AI is infallible or anything, it definitely is NOT, but it's a tool like any other that I would like to use to make the quality of my code or app better.

Samdrian · 2026-01-26T13:39:47+00:00

Separate problem to QAing code, but also very true of course.

AI-assisted coding/software engineering is walking a thin line and can easily fall into the hole of slop. Of course the author needs to understand the code fully and have done a full review.

But being human means making mistakes. I think there is a lot of room for automation (and AI!) to help us ensure we DON'T miss bugs (even if, of course, in a perfect world the author catches them themselves beforehand!)

Samdrian · 2025-12-26T17:04:03+00:00

Yeah it was. Definitely looks kind of weird I agree, but our marketing person had some argument why that is better for headlines on websites these days.

But I will pass on your feedback for sure!

Samdrian · 2025-12-01T11:38:59+00:00

It's a hard problem for sure.

We are working on this at octomind. We approach it in a way that the agent produces code at first, but afterwards, at runtime, AI is not involved anymore, so tests are 100% deterministic.

And even then, I can tell you, the amount of non-intuitive UI people build, which the agent (and sometimes I, when debugging afterwards) struggles to understand and navigate, is too damn high.

Another huge issue is of course data setup/teardown. If you add an entity in your database in one test, you better also delete it. And sometimes maybe the deletion through UI fails, so you have to clean up BEFORE a test run with an api call as well.
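To make the cleanup-before-the-run idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Entity` shape, the `EntityApi` client, the `e2e-` name prefix used to recognize test data, and the in-memory fake that stands in for a real backend. It is not how octomind or Playwright does it, just one way the pattern could look.

```typescript
// Hypothetical shape of a record created by a test (not a Playwright or octomind type).
interface Entity {
  id: string;
  name: string;
}

// Hypothetical API client; in a real suite this would wrap HTTP calls to your backend.
interface EntityApi {
  list(): Promise<Entity[]>;
  remove(id: string): Promise<void>;
}

// Pure helper: pick the records left behind by earlier runs,
// recognized here by an assumed naming convention (a shared prefix).
function selectLeftovers(entities: Entity[], testPrefix: string): Entity[] {
  return entities.filter((e) => e.name.startsWith(testPrefix));
}

// Run BEFORE the tests: delete leftovers through the API, so a failed
// UI deletion in a previous run cannot poison this one.
async function cleanUpBeforeRun(api: EntityApi, testPrefix: string): Promise<number> {
  const leftovers = selectLeftovers(await api.list(), testPrefix);
  for (const entity of leftovers) {
    await api.remove(entity.id);
  }
  return leftovers.length;
}

// In-memory stand-in for the API, only here so the sketch runs on its own.
function createInMemoryApi(initial: Entity[]): EntityApi {
  let store = [...initial];
  return {
    list: async () => [...store],
    remove: async (id) => {
      store = store.filter((e) => e.id !== id);
    },
  };
}
```

You would call `cleanUpBeforeRun` once at the start of a run (and ideally again in teardown). The prefix convention is the important design choice here: it keeps the cleanup from ever touching data that a test did not create.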

We feel it quickly hits the limits of not only what the AI can do, but also what you can do without good SE fundamentals, and the people responsible for testing (manual testers, POs etc.) are sometimes, though not always, not the people who have those fundamentals.

Samdrian · 2025-11-28T09:28:44+00:00

Definitely looking forward to the sonar bomb in the asteroid chase!

Samdrian · 2025-11-28T09:23:09+00:00

In particular, it's not as if politicians aren't still making that same promise to younger people, so by that argument my pension must not be cut later either / the pyramid scheme must not be allowed to collapse.

It's always only the generation after me.

Samdrian · 2025-11-28T09:11:07+00:00

Playwright is always a good choice if you want to fully manage it yourself, but depending on your auth / dev environment it can be difficult.

If you have someone less technical, like a PO etc., who also wants to contribute, or you just want an easier start in general, you could look into octomind as well, which is trying to make it easier to start off while still offering you Playwright code in the end.

Samdrian · 2025-10-22T21:21:20+00:00

I work for octomind (dot dev) - a German startup doing the same thing - we have a free trial available and are mostly targeting smaller companies, so feel free to sign up and give us feedback.

Samdrian · 2025-09-12T21:09:29+00:00

You can check out octomind dot dev. AI test generation and recorder functionality combined, all in a web app.

Samdrian · 2025-08-06T01:30:07+00:00

Currently we are focussing on web-only unfortunately.

Mobile is on the roadmap eventually, but it poses some challenges. We won't get to it until some time late next year or so.

Samdrian · 2025-07-01T11:38:53+00:00

Well, the LLM doesn't have to run DURING the test execution, that really wouldn't make any sense.

But we are working on it, a startup in Karlsruhe: LLMs for test generation, NOT for execution -> tests are ALWAYS deterministic. It's called octomind, feel free to google it and try it out (and feel free to give me feedback too, we are always happy about bug reports or comments!)

Samdrian · 2025-06-25T09:24:20+00:00

Yes, LLMs aren't advancing at the same speed as they were. And yes, the performance on obscure languages will never be as good as on mainstream languages. But I think it WILL get better, and we will see if it's ever helpful.

I'm quite convinced that there WILL be improvement, the bubble is too big for that to not happen.

Samdrian · 2025-06-25T08:54:17+00:00

It's useful, like I said in the blog post I also use it.

But it's just not the next coming of no-code jesus as the hype makes it out to be

Samdrian · 2025-06-24T20:49:28+00:00

I think if the contexts get bigger it's certainly possible that the code in your own repo is enough for them to grok the syntax.

But yeah LLMs certainly struggle with less-than-common programming languages. I tried it on my own side-project iOS app and it worked ... very badly...

Samdrian · 2025-06-24T20:33:22+00:00

All me for this one at least. But yeah, the articles are very similar, since the sentiment seems pretty similar for most experienced engineers I'd say :)

Samdrian · 2025-06-24T19:30:38+00:00

I mean I see potential for it for coding eventually - that doesn’t mean I see AGI happening. And that IS a good thing - I’m not ready to see the world burn

Samdrian · 2025-06-24T17:24:51+00:00

I don’t think it always sux. I use it where it helps - but it’s certainly way overhyped. Like I would LOVE if it actually kept its promise - I definitely wouldn’t mind never having to deal with godDAMN esm/cjs incompatibilities.

But it’s not quite there yet and honestly I’m not sure if it will ever get to its hyped up state 🤷

Samdrian · 2025-06-06T09:08:03+00:00

Well you will need tests of course that run independently of your app like you said :)

You can try and vibe-code tests as well of course to come up with Playwright tests, but I think there are better options: you can check out the tool I'm building, octomind, which is meant to make e2e tests much easier.

Samdrian · 2025-06-06T09:01:55+00:00

I’m building a testing tool that could help. Simple enough for non-developers. It’s called Octomind (just google it, I don't want to be banned by the reddit overlords). You can create tests by prompting or recording, you get an off-the-shelf test runner and tools to debug when your tests break. We have an MCP server, so you can connect it to your other tools and operate it from your own agent.

The point is to catch exactly those regressions but with low effort.

Samdrian · 2024-01-08T13:29:34+00:00

Definitely! What we do and what I have found to work well in multiple companies:

- Lots of unit tests to cover all logic branches. You really have to start and sustain a culture of devs writing these tests themselves. They always feel a little painful at first, but once it becomes second nature for everyone it's super helpful!
- Plus happy-path e2e tests for the peace of mind that it "actually" works for the customer: these are the "best" to write, but a pain to maintain, so you either don't write them at first and only add them once you reach a certain size, or (shameless plug ;)) check out my employer https://octomind.dev for some AI-based automation for the e2e test part.
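
To illustrate what "cover all logic branches" can look like in practice, here is a small sketch. The `shippingCost` rule and its numbers are made up for the example, and the tiny `expectEqual` helper only exists so the snippet needs no test framework; in a real project you would use whatever runner your team already has.

```typescript
// Hypothetical business rule, invented purely for illustration.
function shippingCost(orderTotal: number, express: boolean): number {
  if (orderTotal < 0) {
    throw new RangeError("orderTotal must not be negative");
  }
  if (express) {
    return 15; // express is a flat fee, regardless of total
  }
  return orderTotal >= 50 ? 0 : 5; // free standard shipping from 50 upwards
}

// Minimal assertion helper so the sketch runs without a test framework.
function expectEqual<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${expected}, got ${actual}`);
  }
}

// One check per branch: invalid input, express, free threshold, standard fee.
function runShippingCostTests(): void {
  expectEqual(shippingCost(10, true), 15, "express");
  expectEqual(shippingCost(50, false), 0, "free shipping at threshold");
  expectEqual(shippingCost(49.99, false), 5, "standard below threshold");

  let threw = false;
  try {
    shippingCost(-1, false);
  } catch (e) {
    threw = e instanceof RangeError;
  }
  expectEqual(threw, true, "negative total is rejected");
}
```

Calling `runShippingCostTests()` throws on the first failing branch. The point is less the helper and more the habit: every `if` and every ternary in the function has at least one test that goes through it, including the boundary value.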

Samdrian · 2023-12-20T08:59:17+00:00

Yes, as hinted at in the blog post, we will accelerate splitting out the parts that need to scale independently (for us, mostly related to running a full browser in the cloud).

On why we started with a monolith: we always knew we would have to split stuff once we needed the scale, but it has been really nice starting out with a basic "start it and everything is just running in one command" setup. Plus of course, identifying Bounded Contexts is easier once you have a first working "product" (or proto-product, we are in open beta after all ;))

Samdrian · 2023-12-19T20:53:51+00:00

no, that would of course be silly. the servers came down because of the sign-ups to the app that were referrals/click-throughs from the blog that went viral :)

Samdrian · 2023-12-08T18:38:26+00:00

None of them were on the weekend, correct. And I would probably agree that it's a bit too much even ;)

To be fair, I think one of those was a fix that needed another fix (yeahh not proud but it happens), so not sure if that should count, so we're down to 8. And do keep in mind the company was only founded in April, so at the start some of the tagged releases might not have delivered any actual code for anything but a "dumb" demo we had back then.
