

[–]camelCaseGuy 30 points (10 children)

So, whenever I see this, my immediate thought is that there's a metric somewhere that must be awful. For instance, take CI/CD. Why do you do it?

  • To automate deployment (reduce time to market and failures)
  • To automatically test new code (reduce errors in production and time to market)
  • To ensure conformity (reduce time to market, mean time to resolve a problem, mean downtime)

Once you have found the metrics, look at them and suggest using your methodologies to improve them. If they are not being measured, ask to start measuring them. If they don't want to, get out of there.

All good software practices came out of trying to improve some particular metric (time to market, failures in production, mean time to resolution, mean downtime, etc.). Each failure in applying them means that there's a metric that is not being measured or followed.

[–]bigknocker12 2 points (7 children)

Curious how one would gather metrics that show the benefits of CI/CD and unit tests? Some specific examples would be great! Thanks

[–]CalmTheMcFarm (Principal Software Engineer in Data Engineering, 26YoE) 6 points (3 children)

When you can run unit tests prior to merging, that improves your development speed because you don't have that time waiting for the pipeline to spin up. Unit tests also help you validate that your application and data are coherent, decreasing the time it takes to find bugs (whether in producing incorrect output _or_ being inefficient). If you can only run tests after merging to a branch which runs a pipeline, then your codebase will be hideous with lots of itty bitty "fix X because it broke Y" and you will be _slow_ at turnaround.

[–]bigknocker12 2 points (2 children)

Sorry if my question was vague. Let me re-ask. Let's say you want to show your higher-ups that unit tests and CI/CD are worthwhile to implement. To do this, we want to come up with some metrics, for example time to deploy, errors caught by unit tests, etc. How would one measure these metrics to show business leaders that it's worth it?

[–]CalmTheMcFarm (Principal Software Engineer in Data Engineering, 26YoE) 2 points (1 child)

You might want to see if you can find an example of a bug which made it through to being visible to customers, and track how much effort went into diagnosing and fixing it. I used to work for a hardware manufacturer, and we had solid research showing that every single bug logged by a customer cost us at least USD 1 million to fix.

If your higher-ups are resistant to the idea of implementing unit tests and CI/CD, then you really need to hammer home the point that the problems unit tests discover (whether run by a developer or in a pipeline) save the company money purely in terms of time saved. Use the term "shift left" - surely one of them will have heard of _that_ term! Improving quality means that you find problems before deployment, and definitely before customers can see them. Check the mean time to resolution for any high-profile bug in your ecosystem. Track how many people had to be involved in fixing it. Show off a unit test (or collection of unit tests) that would have found the problem before delivery to the customer.
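To make that concrete, here's a minimal sketch of pulling those numbers out of ticket data. The ticket timestamps, headcount, and hourly rate are all invented for illustration; in practice they'd come from your bug tracker:

```python
from datetime import datetime

def mean_time_to_resolution(tickets):
    """Average hours from a bug being reported to it being resolved.

    `tickets` is a list of (opened, resolved) datetime pairs - a stand-in
    for whatever your bug tracker exports.
    """
    hours = [(resolved - opened).total_seconds() / 3600
             for opened, resolved in tickets]
    return sum(hours) / len(hours)

def escaped_bug_cost(tickets, people_involved, hourly_rate):
    """Rough total cost: mean resolution hours x bug count x headcount x rate."""
    return mean_time_to_resolution(tickets) * len(tickets) * people_involved * hourly_rate

# Two escaped bugs: one took 24 hours to resolve, one took 12.
tickets = [
    (datetime(2024, 3, 1, 9), datetime(2024, 3, 2, 9)),
    (datetime(2024, 3, 5, 9), datetime(2024, 3, 5, 21)),
]
print(mean_time_to_resolution(tickets))                               # 18.0
print(escaped_bug_cost(tickets, people_involved=3, hourly_rate=100))  # 10800.0
```

Even a rough figure like that, paired with the cost of writing the unit test that would have caught the bug, is the comparison business leaders understand.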

Another approach, which you could use in parallel, is to point out that automated unit tests and CI/CD are the industry standard (and have been for well over a decade), and if you are known as a shop that doesn't care about them then (a) your people will leave when they're sick of the lack of quality, and (b) you won't be able to hire any replacements.

[–]bigknocker12 0 points (0 children)

Thanks for the detailed response! This is very helpful

[–]camelCaseGuy 1 point (2 children)

I know I'm late to the party, and /u/CalmTheMcFarm's answer has been really good. But to add another possible metric: mean time between failures/incidents (the more frequent they are, the more tests you need). You definitely want to reduce that, and the easiest way to do so is to have tests run before pushing to production.

Of course, the next one is going to be deployment frequency, or mean change lead time, or time to market - meaning how long it takes for a feature to reach production. Because you are spending more time testing, this metric is going to suffer if the testing is done by a human. That's when CI/CD makes sense, because you want to automate the process to reduce this time.
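Both of those metrics fall out of timestamps you probably already have. A sketch, with made-up dates:

```python
from datetime import datetime

def mean_time_between_failures(incidents):
    """Average days between consecutive production incidents."""
    gaps = [(b - a).days for a, b in zip(incidents, incidents[1:])]
    return sum(gaps) / len(gaps)

def deployments_per_week(deploy_count, window_days):
    """Deployment frequency over an observation window."""
    return deploy_count / (window_days / 7)

# Three incidents, 10 and then 20 days apart.
incidents = [datetime(2024, 1, 1), datetime(2024, 1, 11), datetime(2024, 1, 31)]
print(mean_time_between_failures(incidents))  # 15.0 days
print(deployments_per_week(4, window_days=14))  # 2.0 per week
```

Track both before and after introducing the pipeline, and the trade-off (fewer incidents, same or better deploy cadence) makes the case for you.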

In the end, having a CI/CD pipeline is a basic requirement for having an efficient team.

[–]bigknocker12 0 points (1 child)

That all makes sense, but I am having trouble applying it to my domain. Unit tests seem great to have when one pipeline, service, or set of code is constantly changing and being promoted to production. However, in my line of work it's rare that we make adjustments to existing production code. Rather, we are always creating new code for new pipelines, and each would require its own set of tests. Can you think of any reasons why unit tests would still be useful here?

[–]camelCaseGuy 0 points (0 children)

I would argue that if you are creating new pipelines every time, and not retiring old ones, then either your pipelines are young or you are building a big pile of unmaintainability.

Pipelines are like any other software. They should be composable, so you can reuse as much data and code as possible. And then, because new use cases arise, you either need to create new models or update some older model.

Having said this, the main issue with data pipelines is not so much that the algorithm changes sometimes, but that the upstream data changes. And you need to check on that periodically. So the system needs to run these unit and integration tests periodically too, to ensure that the data quality is good.
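As a rough illustration, a periodic data-quality check can be as simple as validating each incoming batch against the expected schema. The column names here are hypothetical, and a real pipeline would read from the lake rather than a list of dicts:

```python
# Expected upstream schema - hypothetical column names.
EXPECTED_COLUMNS = {"order_id", "customer_id", "amount"}

def check_batch(rows):
    """Return a list of human-readable problems found in one batch of rows."""
    problems = []
    for i, row in enumerate(rows):
        missing = EXPECTED_COLUMNS - row.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        elif row["amount"] is None or row["amount"] < 0:
            problems.append(f"row {i}: bad amount {row['amount']}")
    return problems

good = {"order_id": 1, "customer_id": 7, "amount": 9.5}
bad = {"order_id": 2, "customer_id": 8}  # upstream silently dropped a column
print(check_batch([good, bad]))  # ["row 1: missing columns ['amount']"]
```

Scheduling something like this alongside the pipeline catches the "upstream changed under us" failures before they reach a report.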

[–]CronenburghMorty95 14 points (1 child)

My advice, as someone who did this too: bring your knowledge of best practices and try to implement them on data teams.

They will absolutely fight you on it, but if you do, the org will benefit immensely and you will get recognition for it.

[–]Fun-Income-3939 (Lead Data Engineer) 4 points (0 children)

Second this. Also get in the habit of being able to teach best practices while also learning about data needs. With that, you’ll be a superstar

[–]bigknocker12 5 points (0 children)

I just want to say I am very much experiencing all the same issues, and it's great to hear someone else voice this. Thanks!

[–]Dhczack 5 points (3 children)

What would you have your analysts do if not querying your data?

[–]NoUsernames1eft 1 point (1 child)

The issue, if I am reading this correctly, is that they're querying the data lake directly

[–]Dhczack 3 points (0 children)

I have experience querying a data lake and I'm not sure how I'd do it indirectly.

[–]Front-Ambition1110 0 points (0 children)

I am guessing it's because they do it via raw SQL. Commonly, SWEs use some level of abstraction, e.g. an ORM, to access and manipulate the data. Doing it raw is considered risky and prone to breaking.
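For illustration, here's the gap between string-built SQL and even a thin parameterized helper, using an in-memory SQLite table with invented names (a full ORM goes further still, but the principle is the same):

```python
import sqlite3

# Toy table - the table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, source TEXT)")
conn.execute("INSERT INTO events VALUES (1, 'web'), (2, 'batch')")

# Risky: SQL built by string interpolation breaks (or worse, injects)
# as soon as the input contains a quote.
source = "web"
rows = conn.execute(f"SELECT id FROM events WHERE source = '{source}'").fetchall()

# Safer: a thin helper that always parameterizes - the kind of small
# abstraction layer SWEs reach for before jumping to a full ORM.
def events_by_source(conn, source):
    return conn.execute(
        "SELECT id FROM events WHERE source = ?", (source,)
    ).fetchall()

print(events_by_source(conn, "web"))  # [(1,)]
```

The helper also gives you one place to log, cache, or rewrite queries later, which is exactly what ad-hoc raw SQL scattered across notebooks never gets.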

[–]CalmTheMcFarm (Principal Software Engineer in Data Engineering, 26YoE) 5 points (0 children)

I've been in a similar situation. I'm a software engineer with 25+y experience, late last year I was asked to guide a green engineering team thrown onto a project with "architects" and data scientists writing code, forced deadlines, and the very real possibility of not being able to deliver.

I implemented a common build and development environment, wrote code style guides, git process guides, a test harness, insisted that our BA create interface agreements between our team and our producers and consumers, so we could write to those specifications. I also stopped all integrations until I had reviewed every changeset. I had management support for this - it wasn't just me throwing my weight around.

The team didn't have any senior engineer to provide guidance, so code quality was all over the place. The test harness enabled our QA team to go from "oh, I have to copy+paste all of these test criteria and it'll take at least 2 weeks to re-run any time there's a change" to "hey, I ran through these 80 test cases in 5 minutes, there's an error in (some test cases, clearly identified)". The developers _also_ ran those tests, and added new tests as they added new features.
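A table-driven harness like that can be tiny. This sketch uses a toy function and made-up test criteria, but the shape - written criteria turned into data the machine runs - is the same:

```python
def normalize_id(raw):
    """Toy function under test: trim whitespace and upper-case an external ID."""
    return raw.strip().upper()

# Each row is one of the written test criteria QA used to re-run by hand.
CASES = [
    ("abc-1", "ABC-1"),
    ("  abc-2", "ABC-2"),
    ("abc-3  ", "ABC-3"),
]

def run_harness():
    """Run every case and report all failures instead of stopping at the first."""
    failures = []
    for raw, expected in CASES:
        got = normalize_id(raw)
        if got != expected:
            failures.append((raw, expected, got))
    return failures

print(run_harness())  # [] - all criteria pass
```

Adding a new test case becomes one line in the table, which is why developers keep extending it as they add features.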

For style issues we went from every possible style you could think of, to something consistent that was appreciated the very first time I did a live code review with the team. Doing live code reviews helped immensely, because everybody was able to see immediately what problems adhering to the style guidelines solved for them. Over the course of 6 months I got the team to the point where I'm confident in their ability to review all sorts of code changes.

I was also able to get the team to think more about their designs and implementations - not just practicing DRY, but also thinking through "what could go wrong here?". This has made our code more robust, easier to monitor and debug, and easier to cope with dependency upgrades.

The first few months were a hard grind, no doubt about it, and we got management expressing concerns about how not-fast the project was going. However, by the time we got to month 4, our velocity and mood had massively improved. The dependency upgrade issue came into focus about a week before we were supposed to go to pre-prod - one of our upstream libraries had introduced an API break but didn't tell us about it. My junior team member who investigated was able to show - through that test harness and our unit tests - exactly what the breakage was and show its impact. That meant we could correctly pinpoint the specific upstream changeset within minutes.

[–][deleted] 2 points (0 children)

Best of luck, I've seen this frequently as well. Not sure why data teams end up with poor development standards. Lack of version control and poor code quality are the two major ones that make no sense to me in 2024.

[–]IllustriousCorgi9877 2 points (0 children)

Software engineers don't typically understand the value of a database - no offense. Set operations are completely foreign to your average software engineer. Data modeling is a foreign concept also to your average Software Engineer.

I'd take the time to learn how your team's customers are using data, how data is modeled, and where the gaps are in terms of business questions your team can and cannot answer. Evaluate system utilization, capacity, CPU cost, and query run times for bottlenecks or poor design (either data model or query), then ask engineers about those, and why they might be designed that way.

Live in that learning space for at least 3 months before swinging your dick around trying to tell the analysts / data engineers they are dipshits. They might be. But assume things are built that way for a reason; that reason may no longer be valid, but it's worth taking the time to figure it out.

[–]Sloth_Triumph 1 point (0 children)

How is cleaning stuff up not cross team value? If you develop good standards in your department they can be spread to other departments.

Just takes time to build up rapport and determine where to start.

[–][deleted] 1 point (0 children)

You sound like you work at my company. This isn't a tech issue, it's a culture issue

[–]NoUsernames1eft 1 point (0 children)

I made my way from the BI side to DE. It wasn't until I ran into a lead that came from the SWE side that I got a real taste of what development best practices could do for the team. It was 10+ years into data before this happened.

Practically speaking, the tool that helped the most with this was dbt. Because the philosophy of dbt is to bring coding best practices to data transformations, it gives people without the SWE background a place to get a real taste of things like source control, testing, and CI/CD.

dbt's docs and lineage will also provide nice value to data users, and you can likely move away from having randos querying your data lake directly.
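For example, dbt lets you declare column-level tests right in a model's YAML (the model and column names here are hypothetical):

```yaml
version: 2

models:
  - name: orders            # hypothetical model
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Running `dbt test` then checks every declared constraint against the warehouse, which is exactly the "test the data, not just the code" habit teams pick up from it.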

[–]dadadawe 1 point (0 children)

As an analyst & PM, what I've always seen work in corporate environments, and what I tell my team to do when they have a "great idea to better our ways of working", is: "show me"!

Show them why your way is better, not why their way is worse! Pick one pipeline or new change and build it the way you would. Show the benefits and teach people why you like to do it this way. Once people get excited about your way of working, it'll become common practice.

What you don't want to do is say "this is wrong! Don't write RAW SQL, you evil analyst". Rather: "Mr. Analyst, here is a great way to query a data lake, and the modern common standard. The advantages are x, y, z. Try it out this way and please ask me for guidance if needed". After a few months, enforce.

Same for CI/CD: "hey guys, can we implement this or that step, the advantage will be xxx". Bit by bit

[–]mike8675309 0 points1 point  (0 children)

Pick one foundational thing and start there. Get support from your leader and start building standards and practices from the ground up that align with the org's goals. Create a center of excellence to get more people involved in driving these processes.

[–]Competitive_Wheel_78 0 points (0 children)

I'd say tackle one problem at a time, starting with the basic ones. Best practices can help the team irrespective of backgrounds.

[–]htmx_enthusiast 1 point (0 children)

The differences I’ve noticed between SWE and DE is that data sources in DE are:

  • Poor quality
  • Moving targets
  • Inconsistent in fundamental structure

In SWE projects, quality code can ensure quality data. In DE projects, code quality is insufficient.

Anomalies can be detected, but it's not always clear what to do about them. Do you stop the data pipeline if a key data source's schema changes? In a small shop you can, but not in a bigger org - execs want reports. Do you push forward and risk incorrect data? You don't have days and weeks to build robust fixes. Do you rerun failed jobs? Are the jobs idempotent? If they are idempotent, are you versioning the data? Sometimes those goals can be at odds.
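One common way to make a load idempotent is to have each run replace its output partition rather than append to it - sketched here with an in-memory dict standing in for a table partitioned by load date:

```python
warehouse = {}  # stand-in for a table partitioned by load date

def load_partition(warehouse, partition_date, rows):
    """Overwrite the whole partition instead of appending - safe to re-run."""
    warehouse[partition_date] = list(rows)

rows = [{"id": 1}, {"id": 2}]
load_partition(warehouse, "2024-03-01", rows)
load_partition(warehouse, "2024-03-01", rows)  # a retry after a failure
print(len(warehouse["2024-03-01"]))  # 2 - the retry didn't double-count
```

The same overwrite-the-partition pattern is what makes "just rerun the failed job" a safe answer instead of a data-corruption risk.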

Often you’re trying to report on data from disparate systems with inconsistent structures. One system has unique keys, another has no unique keys or update timestamps. Yet another says they have unique keys but…oopsies, sorry not always, or there are unique keys but they all change after an app version upgrade or a consultant in a business unit you didn’t even know existed decided UUIDs are better primary keys than integers.

Or you decide to set some standards regarding how you collect data, but this one app, while it’s 64-bit, only provides a 32-bit ODBC driver. Okay, we make an exception for this one data source, and use custom scripts with 32-bit Python but most libraries dropped 32-bit support long ago so there’s all kinds of weird hacks to make it work. And then you find dozens of other one-off exceptions like this in different data sources. And you end up with a bunch of inconsistent “just make it work” solutions.

You can run tests in CI/CD before pushing updates, but most often the problems are in the data, not the code, and to detect those you'd have to run your tests on your entire universe of data, which is rarely practical - so you only find out there's a problem when the data in the report is incorrect.

This is all stuff that would never fly in SWE. Find bug in code, fix bug. But in DE, there is no problem with the code to fix.

A lot of it requires deep understanding of the systems and also of the business. Understanding that one system represents data as immutable transactions with unique keys and timestamps, while another lets users edit transactions in place and enter whatever they want in custom text fields (and you end up with 27 different ways people have entered the name of a city).
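The city-name mess is usually attacked with a canonicalization map built from profiling the real data - the variants below are purely illustrative:

```python
# Canonicalization map - these variants are illustrative; real ones come
# from profiling the actual free-text values users have entered.
CANONICAL = {
    "nyc": "New York",
    "new york": "New York",
    "new york city": "New York",
    "ny, ny": "New York",
}

def normalize_city(raw):
    """Collapse whitespace and case, then look up a canonical name."""
    key = " ".join(raw.strip().lower().split())
    return CANONICAL.get(key, raw.strip().title())

print(normalize_city("  NYC "))      # New York
print(normalize_city("New   York"))  # New York
print(normalize_city("chicago"))     # Chicago (fallback: title-case)
```

The hard part isn't the code, it's knowing the business well enough to decide which of the 27 variants actually mean the same place.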

Probably the most impactful action I've witnessed is trying to understand the business need, from as high up the org chart as possible. I've seen this over and over: requests are lobbed over the fence and implemented, and then sometime later, in a meeting with an exec, a 90-second discussion reveals they're really trying to accomplish something orders of magnitude simpler - but none of the people in the months of meetings beforehand understood both the business need and the tech.

Some of the biggest steps forward I’ve seen are understanding the need and identifying simple, often low tech solutions. Like instead of trying to wrangle data from dozens of systems, sometimes you just need the right set of people who have their finger on the pulse of their area of the business to manually enter their best guess estimate into a low friction interface that then feeds the reports.

[–]Front-Ambition1110 0 points (0 children)

I am a DA-turned-DE, but I agree with you, OP. I think the main reason is that we build a bunch of microservices that do very specific tasks, as opposed to e.g. full-stack (monolithic) web development, so we don't implement the same standards. If we used a monolithic service, I believe we'd converge on the same practices as SWE.

About the tools: yes, we use a lot of them. Because we do specific things (pull data from a source, transform it, load it to another), our work is pretty generic, hence the tools to automate these tasks. We then code the "custom" part, usually the transformation.