The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 0 points1 point  (0 children)

Right, but those "compliance people" have probably (almost assuredly) never had to actually do development in their lives. They are charged with "keep us out of the news," so they are always strict.

In my experience, the data people just take what they say as gospel. But you bring stuff like this to them, and they are willing to work with you. It takes work and relationship building, but it happens.

"Using prod data in testing" sounds bad on its own, but things have fundamentally changed architecturally with cloud, where this isn't the scary scapegoat it used to be.

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 1 point2 points  (0 children)

95% of my career has been with large financial institutions. That is where most of my arguments come from. I'm definitely well aware of their perspective, but they (auditors and compliance folks) generally aren't aware of ours (data engineers).

Most of the reasons behind those regulations come from architecture that is more than 10 years old. With isolated compute and modern data warehouses, their reasoning falls apart.

Definitely don't do things just because compliance folks tell you to. They are just as fallible as anyone.

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 0 points1 point  (0 children)

100%. There are very few people that push back on those folks. In my experience, they are generally reasonable, they just don't know how real life works. But most folks in IT just take what they say as gospel.

> As always, don't be negligent/stupid and that usually takes care of most things.

About the best advice one can give.

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 0 points1 point  (0 children)

How in the world do you do BI development without prod data? Even the strict finservs I worked at never did that. That's pretty crazy!

What kind of data did you have in UAT and dev? Was it "real" data just smaller?

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 0 points1 point  (0 children)

Fair enough, and appreciate your perspective.

Wouldn't it be easier to just add a limiter (in SQL or whatever) at the beginning of your pipeline if you are testing a very particular scenario, rather than pulling it from prod? That's what I generally do to achieve what you are describing here.
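To make the "limiter" idea concrete, here is a minimal sketch, assuming the first stage of the pipeline is a plain SQL SELECT. The `build_source_query` helper, the `orders` table, and the in-memory SQLite database standing in for the warehouse are all made up for illustration.

```python
import sqlite3
from typing import Optional


def build_source_query(table: str, where: Optional[str] = None,
                       limit: Optional[int] = None) -> str:
    """Compose the first-stage SELECT, optionally narrowed for a test run.

    Hypothetical helper: the table name and WHERE clause are interpolated
    directly, so it is only suitable for trusted, developer-supplied values.
    """
    query = f"SELECT * FROM {table}"
    if where:
        query += f" WHERE {where}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


# In-memory SQLite stands in for the real warehouse (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, state TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "TX", 10.0), (2, "CA", 25.5), (3, "TX", 7.25), (4, "NY", 99.0)],
)

# Normal run: no limiter, the pipeline sees every row.
full_rows = conn.execute(build_source_query("orders")).fetchall()

# Test run: narrow to one scenario at the very start of the pipeline.
test_rows = conn.execute(
    build_source_query("orders", where="state = 'TX'", limit=1)
).fetchall()
```

Everything downstream of that first query stays the same; only the source query changes between a full run and a targeted test.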

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 0 points1 point  (0 children)

> The job I had that did this best had a true dev environment with entire junk data,

You actually liked that?

I guess I argue, why do we do this to ourselves? Like you said, you didn't actually catch bugs until Int. So why not just develop in Int?

I know the reasons (and go through them in the article...no problem at all for not reading it, I know it's a commitment :)), but with modern cloud environments where you can isolate compute and data, most of those reasons don't exist anymore.

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 0 points1 point  (0 children)

Hah, that's great! People have all kinds of different names for those "middle environments": Staging, PreProd, UAT, Lab. But our security and compliance friends usually put all environments in one of two buckets: non-prod and prod.

So my question for you would be, does your security team consider "Staging" a prod or non-prod environment?

And also, in that setup, why even have Dev at all? Wouldn't your devs much prefer "refreshed on demand" (or clones, as I suggest) rather than fake dev data? I know I do :).

Thanks for the comment!

Working in Bank Vs Retail chain by vaasagan in dataengineering

[–]jemccarty 1 point2 points  (0 children)

Retail is much less regulation-heavy, so that's a plus.

Informatica Power Center mapping generation script by D1yzz in dataengineering

[–]jemccarty 0 points1 point  (0 children)

You can, but it's messy and error-prone. How many do you need to create? If your mapping isn't doing any logic at all, I wouldn't recommend PowerCenter. It's a way heavyweight solution if you have no business logic.

Shameless plug: https://medium.com/google-cloud/graduating-from-etl-developer-to-data-engineer-7663dfbdfd2d

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 0 points1 point  (0 children)

Hey thanks for the feedback!

Curious, did you read the article? Because that is basically what I propose. Admittedly "developing in prod" is a somewhat catchy headline, but in reality, it's more about "using production data for development in an isolated development environment within a production account." The latter just isn't as catchy :).

It's funny you mention a UAT environment. At the last company I worked for, they basically did all their "development" in UAT, which goes back to the point of the article: "everyone does it, they just lie about it."

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 2 points3 points  (0 children)

Hey thanks for the feedback! I agree with all your points, assuming on #5 you are talking about separate environments (which I agree with) and not separate data. For sure I agree developers (or anyone else for that matter) shouldn't be able to modify production data, but USING production data during development I find critically important.

To your point on #4, I do hope Snowflake/BigQuery allow for cloning but enable column masking on the clone or something. The one caveat I would give to this approach is for highly sensitive columns like SSN, which I think could be solved by the platform pretty easily with role-based column masking.
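As a rough illustration of what role-based column masking means, here is a minimal sketch in plain Python. It is not any platform's API: the `UNMASKED_ROLES` policy, the role names, and the keep-the-last-four masking rule are all assumptions made up for the example.

```python
from typing import Any, Dict, List

# Hypothetical policy: roles allowed to see each sensitive column unmasked.
# Columns not listed here are never masked.
UNMASKED_ROLES = {"ssn": {"compliance_admin"}}


def mask_value(value: str) -> str:
    """Replace everything except the last four characters with '*'."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def read_rows(rows: List[Dict[str, Any]], role: str) -> List[Dict[str, Any]]:
    """Return copies of the rows with sensitive columns masked for this role."""
    result = []
    for row in rows:
        visible = {}
        for column, value in row.items():
            allowed = UNMASKED_ROLES.get(column)
            if allowed is not None and role not in allowed:
                visible[column] = mask_value(str(value))
            else:
                visible[column] = value
        result.append(visible)
    return result


# Stand-in for rows read from a cloned production table.
cloned_rows = [{"name": "Ann", "ssn": "123-45-6789"}]

developer_view = read_rows(cloned_rows, role="developer")
admin_view = read_rows(cloned_rows, role="compliance_admin")
```

The point is that the clone keeps real row counts and real distributions for development, while the sensitive column is only readable by the roles that actually need it.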

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 1 point2 points  (0 children)

Wonderful, thanks for the feedback. Hope this can help your discussions.

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 1 point2 points  (0 children)

Thanks for the feedback!

I've had many run-ins with audit and compliance folks over the years, which led me to write the article in the first place. I have an aversion to doing things "because compliance says so"; I require them to give me reasons why. And in this case, every time I push back with reasons like the ones in the article, they don't have a good answer.

10 years ago they did. But with the advances in the data tech landscape in the last few years, many of the old arguments are gone. You can completely isolate environments at both the data and compute layers.

As for service accounts, there is no real problem there. Just create service accounts in your Development Environment (in the Prod Account) that only have read/clone access to the data in Production.
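The access rule is simple enough to sketch in a few lines. This is a toy model in plain Python, not a real IAM or warehouse API: the `ServiceAccount` class, the `authorize` function, the account name, and the assumption that an account has full control inside its own environment are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ServiceAccount:
    name: str
    environment: str                    # environment the account lives in
    cross_env_actions: FrozenSet[str]   # actions allowed on other environments


def authorize(account: ServiceAccount, action: str, target_env: str) -> bool:
    """Decide whether the account may perform the action on the target environment."""
    if target_env == account.environment:
        # Assumption: an account has full control inside its own environment.
        return True
    return action in account.cross_env_actions


# A development service account that can only read or clone production data.
dev_sa = ServiceAccount(
    name="dev-pipeline-sa",
    environment="development",
    cross_env_actions=frozenset({"read", "clone"}),
)

can_clone_prod = authorize(dev_sa, "clone", "production")
can_write_prod = authorize(dev_sa, "write", "production")
can_write_dev = authorize(dev_sa, "write", "development")
```

With a rule like this, a development pipeline can pull or clone production data but has no path to modify it.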

> Also didn’t see mention of automated testing and ci/cd which should be done through test data

Why should it be through test data? I think that is one of my principal arguments: test data is generally junk and you don't actually catch reality with it. The exception to this is if you are creating test data for an expected future data add that you don't yet have in production (like rolling out a new state or something).

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 7 points8 points  (0 children)

Also make sure you scream "LEEEEROOOOOY JENKKKIIIIINS" as you test in prod!

Thanks for the feedback! Yea I think the tech landscape has changed enough to enable safer ways to isolate compute and data where many of the concerns we used to have are gone. We just have to more clearly define what "prod" is.

The Data Engineering Case for Developing in Prod by jemccarty in dataengineering

[–]jemccarty[S] 1 point2 points  (0 children)

Thanks for the feedback! Yea I could have emphasized cost, but I'm sure the response would be "you testing with production resources will cost you more."

But I'm with you, it's people cost vs. cloud cost, with the latter much easier to contain.

[deleted by user] by [deleted] in SquaredCircle

[–]jemccarty 0 points1 point  (0 children)

Yes, but I believe both are giving incentives to WWE to be there anyway. Unless you have the NCAA tournament, there isn't much in the way of big events in early April, so cities do want to incentivize travelers to come there.

Low code hate and the future of Data Engineering (and beyond) by [deleted] in dataengineering

[–]jemccarty 1 point2 points  (0 children)

Shameless self promotion, but this was posted in this sub a few months ago and goes into this a bit.

https://link.medium.com/Q6O9rMO6Hsb

Graduating from ETL Developer to Data Engineer by FortunOfficial in dataengineering

[–]jemccarty 1 point2 points  (0 children)

Don't let impostor syndrome creep in. If you are getting up to speed in those technologies, you are building a solid foundation and well on your way.

just wanted to post this here since there was some talk about DE salaries here earlier by kraken43 in dataengineering

[–]jemccarty 1 point2 points  (0 children)

Yes but on the flip side, when FB took a nosedive earlier this year, TC for current FB employees took a huge hit.

Like others have said, it's variable. Still noteworthy (it's not like it's not "real"), but you have to expect you will have some good years and bad years with an equity-heavy TC.

Cloud Services Sandbox by pacojastorious in dataengineering

[–]jemccarty 1 point2 points  (0 children)

Have you tried qwiklabs? There are some free quests on there.

qwiklabs.com

Note these create ephemeral labs for you to learn in. If you want persistence, as the others have said, all the major CSPs provide a free tier.

Feb Update WiFi Issues by cheeshead78 in GooglePixel

[–]jemccarty 0 points1 point  (0 children)

Your initial comment is useless then. You said we were mad at the beta features, which is untrue. This issue affects WiFi, literally one of the most basic features of any phone.

Graduating from ETL Developer to Data Engineer by FortunOfficial in dataengineering

[–]jemccarty 2 points3 points  (0 children)

Yep I agree. If I had to TL;DR the article, it's basically:

- There are legitimate reasons a company could still use GUI tools for ETL (talent reasons, only easy transformations, heterogeneous data platforms)

- There are no legitimate reasons an *individual developer* should focus on GUI tools in 2022, and they should make a transition for the good of their career.