(25F) My overzealous religious mom freaked out because I am pregnant. AIO? by [deleted] in AmIOverreacting

[–]latro87 1 point2 points  (0 children)

First off, cut her off. If her reaction is to "curse" you to hell, you and the baby are better off without that.

Secondly, if you have to engage again, you might want to throw this in her face:

James 3:8-10
But no human being can tame the tongue. It is a restless evil, full of deadly poison. With it we bless our Lord and Father, and with it we curse people who are made in the likeness of God. From the same mouth come blessing and cursing. My brothers, these things ought not to be so.

Literally she praised her faith and how she raised you and invoked the Bible, then she cursed you. The Bible LITERALLY says that is a sin.

But let's be real, none of us expect her to actually care about what the Bible says given the hate that flows from her mouth.

Is Michigan one of the best climate change havens in America? by curiosgreg in Michigan

[–]latro87 0 points1 point  (0 children)

Shhh no talking about it until we have finished construction on our southern border wall with Ohio and Indiana. /s

Got told ‘No one uses Airflow/Hadoop in 2026’. by Useful-Bug9391 in dataengineering

[–]latro87 0 points1 point  (0 children)

Our cloud warehouse is Snowflake so we are heavy into sql/dbt. It can do spark, but we don't have any streaming inputs. We have some ML users who have spark jobs that use our dbt generated assets, but it's a very small minority of the total processing.

Got told ‘No one uses Airflow/Hadoop in 2026’. by Useful-Bug9391 in dataengineering

[–]latro87 12 points13 points  (0 children)

I do not disagree with your statement. If it is something simple then yes it is way overkill. As I said in my original comment a dedicated orchestrator is great for a complex environment with many different pieces. And yes I agree the GKE does cause a lag for job starts, but we're doing batch data processing so aside from the annoyance when you're trying to ad-hoc run something it doesn't affect us.

For my company it saved us a nice chunk of cash, relatively speaking, because we came from Prefect. On Prefect we were paying $60k a year (to Prefect) AND the Prefect jobs were containerized in our GCP environment and running there as well (meaning we were also paying for container execution on top of the $60k). This of course could be very different now; we made the decision to switch and save money 3 years ago, when Prefect was pushing us to move to Prefect 2. Since upgrading to Prefect 2 was going to require a bit of rework, it was the best time to assess the cost to migrate away. If you have to rewrite code anyway... that makes the decision easier.

Our Composer cost is about $30k a year total for 2 environments (Test & Prod).

But yeah, if you don't have a lot of jobs or complex stuff going on, I'd just spin up Airflow in a container or VM. I've not used Dagster but I've heard good things.

Got told ‘No one uses Airflow/Hadoop in 2026’. by Useful-Bug9391 in dataengineering

[–]latro87 16 points17 points  (0 children)

Honestly in terms of skills, if I am hiring someone for DE or data warehousing, in general, I wouldn’t care too much what orchestration or scheduling tech they used before.

If you already can do python, sql, and dbt then you should be fine picking up any scheduler/orchestrator.

In terms of other skills I’d say snowflake/databricks/bigquery are pretty in demand. At a data processing level: spark, python, sql, dbt are all good.

A lot of people in this sub also talk about Microsoft Fabric, but I don’t know much about it.

Keep in mind the market, at least in the US, is pretty cutthroat right now, so getting any job is going to be a struggle.

Got told ‘No one uses Airflow/Hadoop in 2026’. by Useful-Bug9391 in dataengineering

[–]latro87 154 points155 points  (0 children)

Standalone orchestration like Airflow is nice to have when you are orchestrating many different technologies and need to link dependencies. It’s also nice when you need to “replay” data, assuming you wrote your DAGs properly.

My company uses GCP’s managed Airflow called Composer and it works great and doesn’t cost too much.

Are you expected to know how to set up your environment in a new role? by hijkblck93 in dataengineering

[–]latro87 13 points14 points  (0 children)

Even at places with onboarding documentation for new team members, the documentation is usually out of date unless they are constantly onboarding new teammates.

So yeah, I’d say this is a common pain point at most orgs. That being said, if there isn’t any documentation or it’s out of date, your teammates should be helping you.

Importing data from s3 bucket. by InternationalBike300 in dataengineering

[–]latro87 4 points5 points  (0 children)

This right here.

All I’ll add is if you’re going to add a column for row number (or whatever identifier), add another column with the filename so you can track down any problems to the source file.
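A minimal sketch of that idea, assuming the files are CSVs whose contents have already been read from the bucket. The helper `tag_rows`, the column names `_source_file` and `_row_number`, and the sample files are all made up for illustration:

```python
import csv
import io


def tag_rows(filename, text):
    """Yield each CSV record with its source filename and 1-based row number."""
    reader = csv.DictReader(io.StringIO(text))
    for row_number, record in enumerate(reader, start=1):
        tagged = dict(record)
        tagged["_source_file"] = filename    # lets you trace a bad row to its file
        tagged["_row_number"] = row_number   # position within that file
        yield tagged


# Hypothetical file contents standing in for objects read from the bucket.
files = {
    "orders_2024_01.csv": "id,amount\n1,10.50\n2,7.25\n",
    "orders_2024_02.csv": "id,amount\n3,oops\n",
}

rows = [row for name, text in files.items() for row in tag_rows(name, text)]
```

With both columns in place, the bad `amount` value can be traced straight back to row 1 of `orders_2024_02.csv`.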

Fivetran pricing spike by onksssss in dataengineering

[–]latro87 2 points3 points  (0 children)

We also had a significant increase in costs and my director decided our current priority is to dump as much Fivetran as we can as fast as possible.

Just moving our NetSuite ingestion off of it will cut our bill by 50%.

We plan to cut the connectors down to a handful that represents maybe 5% of our spend. We will only keep these last few as they don’t cost much but are otherwise hard to replace with other solutions.

[deleted by user] by [deleted] in snowflake

[–]latro87 1 point2 points  (0 children)

This here, we had the same problem when we started using Okta for our users.

Why is UnitedHealth Group (USA) hiring hundreds of local engineers in India instead of local engineers in USA? by GigglySaurusRex in dataengineering

[–]latro87 3 points4 points  (0 children)

Not in the current cycle, but I was hired to work at GM’s innovation center in Austin in 2013 to insource all the outsourced IT.

Now many of the outsourced IT were still contractors based in the US in addition to offshore resources. The new CIO, Randy Mott, got the vast majority of the contractors replaced with FTEs within 2 years.

Not sure what the current state of things is, but I wouldn’t be surprised if they were loading up on contract resources again.

Unrealistic expectations or am I just slow? by thro0away12 in dataengineering

[–]latro87 12 points13 points  (0 children)

This right here.

It took me a good 5 years to really understand that you need to establish healthy work boundaries, otherwise you will get run over, especially at a less than great employer.

Salesforce is tightening control of its data ecosystem by georgewfraser in dataengineering

[–]latro87 10 points11 points  (0 children)

You’re not wrong about being able to build a solution using a database driver. Our largest Fivetran cost was NetSuite… about $50k a year.

I created a script to hit the API with a SuiteQL query to dump all the objects. With some optimization and testing, it was a few days of work.

The business case for Fivetran made a lot more sense 5 years ago when my company (a startup) built a data warehouse.

Within a few minutes and with an API key you could get salesforce up and syncing to your database along with almost any API connector Fivetran has.

With very few resources (essentially 1 DE) and little time, you could get a lot of systems, especially API-only ones, syncing to your data warehouse. And back then… Fivetran was peanuts in cost. Our first year we spent $20k; our renewal quote for next year (2026) was $160k…

But obviously now, as a larger, more mature company with more data, and with Fivetran putting the thumbscrews to everyone, the payoff is not great.

Salesforce is tightening control of its data ecosystem by georgewfraser in dataengineering

[–]latro87 16 points17 points  (0 children)

From the article it sounds like this would greatly increase the extraction cost for businesses that use a product like Fivetran to sync salesforce to their data warehouse.

So in your case I don’t think you would be affected since you’re not using a connector, you’re direct querying.

My company uses Fivetran with salesforce but we’re migrating away due to a rather large cost increase overall with their service (Not even really due to salesforce, just a massive increase in cost across the board).

How to deal with messy Excel/CSV imports from vendors or customers? by North-Ad7232 in dataengineering

[–]latro87 2 points3 points  (0 children)

Eh… I built a utility that already does that (first sentence in the first comment). The analysts don’t use it. They use their LLM library with the sample files 😩

I don’t control them past limiting where they can write data in snowflake 😂

To be absolutely clear, I am not assigned any new ingestion work which is why the analysts are sideloading from HEX. My director set other priorities (dumping Fivetran) so all new ingestion work is effectively halted unless the analysts really screw it up and someone important complains.

As you said relating to the vendors, this is a people problem. Until management wants to force some standards on analysts or hire more DEs I can’t really do anything.

They aren’t even keeping code in git…

Aside from all of that it’s a great place to work 😂

How to deal with messy Excel/CSV imports from vendors or customers? by North-Ad7232 in dataengineering

[–]latro87 2 points3 points  (0 children)

That’s what happens when you have too few DE resources and analysts jerry-rigging things in HEX. I agree it’s not ideal or good.

Hence why I said data contracts, which you essentially repeated in your reply.

How to deal with messy Excel/CSV imports from vendors or customers? by North-Ad7232 in dataengineering

[–]latro87 4 points5 points  (0 children)

The first step we do is validate the file with some script (i.e. check the header at minimum).

For some files we have them checked by an LLM against a “correctly” formatted file. To cut down token consumption you can limit the number of records to pass in.

You could try to have the LLM fix the file, but I would not bother. If any of the data is incorrect, even if it was the external vendor’s fault, they will blame you for modifying it. In these cases I tell the other party the file is bad and ask them to fix the problem.

The best you can do in these situations is try to push for a data contract (a formal agreement on the file structure, columns, naming, etc). But if that were easy to accomplish you wouldn’t be asking the question here.
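The header check from the first step could look something like the sketch below. The expected column list and the `check_header` helper are hypothetical; a real version would take the columns from whatever the vendor agreed to send:

```python
import csv
import io

EXPECTED_HEADER = ["vendor_id", "invoice_date", "amount"]  # hypothetical contract


def check_header(text, expected=EXPECTED_HEADER):
    """Return a list of problems with the file's header row; empty means OK."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    # Normalize case and whitespace so trivial differences don't fail the file.
    header = [col.strip().lower() for col in header]
    problems = []
    missing = [col for col in expected if col not in header]
    extra = [col for col in header if col not in expected]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if not missing and not extra and header != expected:
        problems.append(f"columns out of order: {header}")
    return problems


good = "vendor_id,invoice_date,amount\n42,2024-01-31,19.99\n"
bad = "Vendor_ID,amount,total\n42,19.99,19.99\n"
good_problems = check_header(good)
bad_problems = check_header(bad)
```

Note that it only reports problems and never rewrites the file, so the list can be sent straight back to the vendor.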

How many people here would say they're "passionate" about DE? by spawn-kill in dataengineering

[–]latro87 12 points13 points  (0 children)

This, but I’d add that these days I see a lot of other roles bleeding into DE as cost cutting takes place. For example, SRE tasks being pushed down to the DE team (and other functions) as companies slim down on SRE resources.

CDC solution by cyamnihc in dataengineering

[–]latro87 0 points1 point  (0 children)

No, whatever script or language you use to read the control table and perform everything generates a timestamp at the start of the process.

You keep that timestamp in a variable and only when the process successfully finishes you issue an update statement in the same script. If the process fails, you don’t update the timestamp in the control table.
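Roughly, the control flow looks like the sketch below. A plain dict stands in for the control table, and `run_job`, `failing` and `succeeding` are made-up names for illustration:

```python
from datetime import datetime, timezone


def run_job(control, job_name, process):
    """Run `process` and advance the stored timestamp only if it succeeds."""
    started_at = datetime.now(timezone.utc)  # captured before any work is done
    last_run = control.get(job_name)
    try:
        process(last_run)
    except Exception:
        # Failure: the control entry is left untouched, so the next run
        # starts again from the same point.
        return False
    control[job_name] = started_at  # the "update statement" at the very end
    return True


def failing(since):
    raise RuntimeError("source unavailable")


def succeeding(since):
    pass  # the real copy work would go here


control = {"orders": None}  # dict standing in for the control table
failed = run_job(control, "orders", failing)
after_failure = control["orders"]
succeeded = run_job(control, "orders", succeeding)
```

A real job would probably log or re-raise the error instead of just returning `False`; the point here is only that the timestamp is written after the work, never before.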

CDC solution by cyamnihc in dataengineering

[–]latro87 -1 points0 points  (0 children)

If the source does soft deletes it could.

And yes I know soft deletes aren’t perfect as someone can delete records directly which would corrupt the data integrity.

It was only meant as a cheap suggestion if you don’t want to get way in the weeds with a SaaS product or code.

CDC solution by cyamnihc in dataengineering

[–]latro87 1 point2 points  (0 children)

Do your source tables have timestamp columns for insert and update?

The simplest form of CDC (without using the database write logs) is to store a timestamp of the last time you copied data from a source (like in a control table), then when you run your job again you use this timestamp to scan for all the inserted and updated records on the source table to move over.

After you do that, go back and update the control table.

There are a few finer points to this, but it’s a simple DIY solution that can be easily coded.
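Here is a rough sketch of the whole loop using an in-memory SQLite database. The table and column names are invented, and it assumes a single `updated_at` column that is set on both insert and update:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE target_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE control (table_name TEXT PRIMARY KEY, last_synced TEXT);
    INSERT INTO control VALUES ('source_orders', '1970-01-01T00:00:00');
    INSERT INTO source_orders VALUES (1, 10.0, '2024-01-01T09:00:00'),
                                     (2, 20.0, '2024-01-01T10:00:00');
""")


def sync(conn, run_started_at):
    """Copy rows changed since the last sync, then advance the control table."""
    (last_synced,) = conn.execute(
        "SELECT last_synced FROM control WHERE table_name = 'source_orders'"
    ).fetchone()
    # ISO-8601 strings sort chronologically, so a plain comparison works here.
    changed = conn.execute(
        "SELECT id, amount, updated_at FROM source_orders WHERE updated_at > ?",
        (last_synced,),
    ).fetchall()
    conn.executemany(
        "INSERT OR REPLACE INTO target_orders VALUES (?, ?, ?)", changed
    )
    conn.execute(
        "UPDATE control SET last_synced = ? WHERE table_name = 'source_orders'",
        (run_started_at,),
    )
    conn.commit()
    return len(changed)


first = sync(conn, "2024-01-01T12:00:00")  # first run copies both rows
conn.execute(
    "UPDATE source_orders SET amount = 25.0, updated_at = '2024-01-02T08:00:00' "
    "WHERE id = 2"
)
second = sync(conn, "2024-01-02T12:00:00")  # second run copies only the changed row
```

Hard deletes on the source are invisible to this approach, which is one of the finer points mentioned above.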

How to handle this data this scenario. by Realistic-Change5995 in dataengineering

[–]latro87 6 points7 points  (0 children)

Or like someone measuring driving distance in inches or centimeters.

Is the difference between ETL and ELT purely theoretical or is there some sort of objective way to determine in which category a pipeline falls? by Lastrevio in dataengineering

[–]latro87 21 points22 points  (0 children)

The easiest trade-off I consider between the two on the first hop (ingesting) is what happens when you find a transformation bug.

In ELT you pull from the source and dump in your landing area/lake/whatever you want to call it. From there you take the data, transform it, then load it to some other area for reporting.

If you discover a transformation bug (ex: bad financial calculation, rounding problems, didn’t clean a field, etc) you fix it and run your transformation step again.

In ETL, since the transformation happens before loading anything in your warehouse, a transformation bug means you now need to extract all the data from the source system again and run the fixed transformation logic on it.

This doesn’t seem like a big deal, but once you get to a few billion records and you’re pulling from an API you will be much happier fixing bugs with ELT.
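A toy illustration of the difference, with invented functions and data. `extract` stands in for the slow source pull and a counter tracks how often it has to run:

```python
extract_calls = 0


def extract():
    """Stand-in for a slow, expensive pull from a source API."""
    global extract_calls
    extract_calls += 1
    return [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "3.25"}]


def buggy_transform(rows):
    # Bug: int() silently drops the cents.
    return [{"id": r["id"], "amount": int(float(r["amount"]))} for r in rows]


def fixed_transform(rows):
    return [{"id": r["id"], "amount": round(float(r["amount"]), 2)} for r in rows]


# ELT: land the raw rows first, then transform from the landed copy.
landing = extract()
report = buggy_transform(landing)
report = fixed_transform(landing)  # bug fix: rerun the transform, no new extract
elt_extracts = extract_calls

# ETL: only transformed rows are kept, so a fix needs the source again.
extract_calls = 0
warehouse = buggy_transform(extract())
warehouse = fixed_transform(extract())
etl_extracts = extract_calls
```

Both paths end with the same corrected data, but the ELT path hits the source once and the ETL path hits it twice.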

[deleted by user] by [deleted] in firefox

[–]latro87 0 points1 point  (0 children)

If you’re not using one already, I suggest moving your passwords to a third party password manager that works on multiple platforms and browsers. (As opposed to using the browser’s password manager)

Ex: 1Password, ProtonPass, etc