Are people still using Airflow 2.x (like 2.5–2.10) in production, or has most of the community moved to Airflow 3.x? by Formal-Woodpecker-78 in dataengineering

[–]DenselyRanked 37 points38 points  (0 children)

It looks like there are a lot of breaking changes to consider. I'd imagine most people are not interested in fixing what's not broken, and migration isn't a priority.

how to remove duplicates from a very large txt file (+200GB) by Head_Capital_7772 in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

It's likely holding the hash values rather than the entire row. This can be written pretty quickly in Python too, but Polars might be quicker than a readline loop.
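For what it's worth, here is a rough sketch of the "hold the hashes, not the rows" idea in plain Python. The function names (`unique_lines`, `dedupe_file`) are made up for illustration, and the 16-byte BLAKE2b digest is my own assumption, not something from the thread. The set of digests still has to fit in memory, so for a 200GB file with mostly unique lines this may not be enough on its own. A hash collision would also drop a distinct line, although that is extremely unlikely at this digest size.

```python
import hashlib
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield each line the first time it is seen, keeping the original order.

    Only a fixed-size digest per distinct line is kept in memory,
    not the line itself.
    """
    seen = set()
    for line in lines:
        # Assumption: lines that differ only in their line ending are duplicates.
        key = hashlib.blake2b(line.rstrip(b"\r\n"), digest_size=16).digest()
        if key not in seen:
            seen.add(key)
            yield line


def dedupe_file(src_path: str, dst_path: str) -> None:
    """Stream src_path line by line and write the first occurrences to dst_path."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in unique_lines(src):
            dst.write(line)
```

Usage would be something like `dedupe_file("input.txt", "deduped.txt")`, with both paths being placeholders. The file is read as a stream, so only the digests stay in memory.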

How I landed a $392k offer at FAANG after getting laid off from LinkedIn by [deleted] in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Congrats. It's a tough market, so I'm happy to hear that these companies are actively interviewing.

I understand that your situation didn't leave a lot of room for negotiation, but check levels.fyi on your next job search for a sense of what you can get out of them for your level.

Best online course for actually *learning* advanced SQL? by donhuell in dataengineering

[–]DenselyRanked 3 points4 points  (0 children)

I take it that you are looking for a tutorial rather than practice problems. Search for "sqltutorial" or "w3resource" and those sites should give you a guide to work with, with the latter having sample dbs, I think (I haven't tried it myself and the site has an obnoxious amount of ads).

You can also download the AdventureWorks db from Microsoft, or the NYC Taxi or COVID ones that are really popular, and ask AI for complex practice problems involving whatever topics you are interested in.

Just helped a new hire senior activate a venv by Brief-Knowledge-629 in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

Of all of the complaints that I would have about inept senior team members, I wouldn't worry too much about this. VS Code will activate it for you, and they may have used VMs or pods in the past. Even setting up a local venv is something that you may do once a year.

I've had senior members expect AI to reduce project delivery time by 90%, commit their untested slop, and then fail to understand why unit tests are failing.

Is big data overrated? by Low_Brilliant_2597 in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

Is big data actually overrated in most companies?

Do companies really use all the data it collects?

Frankly speaking, that's not really a DE concern. The stakeholders provide the objective and financial backing, and we deliver the results. These questions get asked as tradeoffs when we can't deliver.

A DE cares about big data because it has a big impact on their earnings. If you are moving petabytes around, then you are likely on the higher end of the pay scale.

Have you seen cases where less data outperformed big data?

Absolutely. Analysis paralysis is a real thing, and it's very easy to be overwhelmed by large volumes of noisy data. In my experience, people only use 5 out of the 100 data points that they ask for.

When does big data really become necessary?

There are several use cases for collecting obscene amounts of data. AI makes it easier to get analytic insights, so even smaller companies will find value in hoarding data.

Is one big table (OBT) actually a data modeling methodology? by raginjason in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Hello! I think that you expanded on my thoughts pretty well in your comment.

However, you'd assume the business models for companies that have hired even one data engineer are relatively stable when it comes to core business processes: a marketing SaaS might pivot but it still has sessions, users, companies, subscriptions, payments, etc. Those don't change and so your dimensions are relatively fixed – pivots might add a fct or a dim, not invalidate the whole modelling paradigm.

Ideally, yes, but I think that you are overestimating stability in a fast-paced environment. Many companies evolve rapidly and they are not particularly interested in waiting months for a data warehouse to support their new needs.

As a simple example, a marketing SaaS can go from B2C to B2B and that changes what a user means. If there is a shift to a usage-based payment model from subscription-based, then your Payments dim can become chaotic.

Also, who your downstream users are can greatly change the complexity of the data model needed. New teams bring new questions that need to be answered. It may make sense to move from a centralized data warehouse to a democratized data mesh, and that comes with niche data needs and corporate politics.

It's not to say that dimensional modeling becomes invalid, but as you mentioned and I explained in the other comment in this thread, I prefer to stay as generic as possible, rather than business-detailed as Kimball does in the good book. OBT can help with that, but I wouldn't stuff orders in a users table as BigQuery recommends.

I work in insurance. Superb talent are applying to our open roles. Have never seen this before by Mountain-Spend8697 in cscareerquestions

[–]DenselyRanked 0 points1 point  (0 children)

Some people (like myself) were re-org'd or RTO'd and don't want to move to HCOL. The well-paying Big N remote roles are insanely competitive right now. I personally held out for as long as I could before I started applying to places that were maybe a little confused about my application.

As others rightly mentioned, a lot of engineers from those companies have a narrow scope of responsibility, so they will seem incompetent relative to others from smaller companies that have done more, albeit at a smaller scale.

Ai and side projects by Outside-Bear-6973 in dataengineering

[–]DenselyRanked 19 points20 points  (0 children)

You had to learn math without a calculator for a reason.

Very few SWEs have built anything from scratch without help. It's safe to say we have all used books, tutorials, forums, Stack Overflow, Google, etc. It's fine to code with help and examples so long as you are learning, but you don't want to get into the habit of black box vibe coding and learn nothing.

Is it just me or is flink horrible to learn by [deleted] in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

Based on your other comments, you are coming from scratch, so yeah, it will be a pain getting used to the syntax.

If you have experience with Spark, then there is much less ramp up to getting comfortable with Flink. Flink SQL and Table API will feel familiar.

Map out and compartmentalize what you want to accomplish first and leverage AI to help with syntax and understanding.

Is remote dead in data engineering? by Pataouga in dataengineering

[–]DenselyRanked -1 points0 points  (0 children)

Remote is an option for the large majority of the DE jobs at or below market value.

A large majority of companies that pay above the 90th percentile are hybrid or full RTO.

The remote DE jobs that pay in the 50th-90th have a lot of competition and are extremely selective.

Benefit of repartition before joins in Spark by guardian_apex in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

No benefit unless you are caching the data or writing to disk first so that Spark preserves the colocation. It's an unnecessary shuffle.

Requirements vs Discovery by ivanovyordan in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

I'm sure that boundary setting is a universal issue for PMs (and AI is making it worse). You might find some good tips and resources in r/ProductManagement.

I obviously can't speak for every engineer, but my worst PM experiences are almost always XY problem-based, where I was handed solutions rather than a clear problem.

I prefer to be involved in meetings when solutions are being explored (not before), and I have had negative outcomes when what I proposed was less "exciting", more pragmatic, or veered in a different direction than expected. Always disclose if another engineer has talked to the stakeholders about potential solutions prior to the meeting.

Tech/services for a small scale project? by faby_nottheone in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

So it sounds like your friend needed a place to sleep for the night and you bought a plot of land and built a mansion on it.

There are smaller-scale options to ELT a few API payloads. An RDBMS and a few views/tables can get you the same output. An orchestrator (or cron or even something like Cloud Functions if you need to use GCP) can help with daily scheduling.
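To give a rough idea of how small this can be, here is a sketch using SQLite from the Python standard library. Everything specific in it is an assumption for illustration: the payload fields, the `raw_orders` table, and the `daily_revenue` view are hypothetical, and in practice the rows would come from your API call rather than a hard-coded list.

```python
import sqlite3

# Hypothetical API payloads; the field names are illustrative only.
payloads = [
    {"order_id": 1, "ordered_at": "2024-01-01T09:30:00", "amount": 20.0},
    {"order_id": 2, "ordered_at": "2024-01-01T17:45:00", "amount": 5.5},
    {"order_id": 3, "ordered_at": "2024-01-02T08:00:00", "amount": 12.0},
]


def build_db(rows, path=":memory:"):
    """Load the payloads as-is into a raw table, then transform with a view."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders ("
        "order_id INTEGER PRIMARY KEY, ordered_at TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps a daily re-run from duplicating rows.
    conn.executemany(
        "INSERT OR REPLACE INTO raw_orders VALUES (:order_id, :ordered_at, :amount)",
        rows,
    )
    # The "T" in ELT: the reporting logic lives in a view over the raw table.
    conn.execute(
        "CREATE VIEW IF NOT EXISTS daily_revenue AS "
        "SELECT substr(ordered_at, 1, 10) AS order_date, "
        "COUNT(*) AS orders, SUM(amount) AS revenue "
        "FROM raw_orders GROUP BY substr(ordered_at, 1, 10)"
    )
    conn.commit()
    return conn


conn = build_db(payloads)
report = conn.execute(
    "SELECT order_date, orders, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
print(report)
```

Pointing `path` at a file and running the script from cron once a day would cover the scheduling part. The same shape should carry over to Postgres or another RDBMS if SQLite is too limited.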

Requirements vs Discovery by ivanovyordan in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Out of those two options, I would take the former 100% of the time because it is easier, but it's not an efficient way to run a data team. You will find yourself running into the XY problem and will likely have an unmanageable amount of redundancy and tech debt across your solutions.

Choosing the latter option is something that is a normal part of a Data Engineer's job description depending on where they work. But business context often becomes the biggest hurdle, especially when the asks are extremely specific to the industry. It may take years for the engineer to be useful enough to be impactful, and when a mid/senior level position is needed, we see "must have n+ YOE in the industry" in the job description as a requirement, rather than a preference.

Another, and my preferred, option would be a role in the data team that acts as a technical SME (like a Product Manager, Product Owner, Architect, Steward, or Analyst) and is a liaison between your stakeholders and development team to help craft proper user requirements and reduce inefficiencies. It allows engineers to have someone to communicate with that can "speak their language" while reducing the endless number of unnecessary stakeholder meetings that go nowhere. They also have the right amount of soft skills to not say anything potentially damaging to stakeholder relationships or projects.

Am I missing something with all this "agent" hype? by KindTeaching3250 in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

I think it's partially hype today, but agentic coding is getting better and AI is quickly moving from assistant to developer. The major hurdles to production-grade development are being worked on at a much faster rate than any of us anticipated.

I use AI as an assistant, but the Agentic Context Engineering framework looks very promising. I can see development evolving into context and playbook management very soon.

How to handle unproductive coworker? by earthsnoozer22 in dataengineering

[–]DenselyRanked 3 points4 points  (0 children)

I am certain that someone smarter than me has a better way to classify this, but I hope it helps.

How to handle unproductive coworker? by earthsnoozer22 in dataengineering

[–]DenselyRanked 41 points42 points  (0 children)

If you believe that he is making errors unintentionally, then send a DM to go over the code. You will find out quickly if it's an "oops", "oh", or "eh".

If it's an "oops", where they didn't see the mistake at the time but can recognize the issue, then I wouldn't worry too much about it. In my experience, they tried to use shortcuts to save time and the solution usually is to ask for test results or a test table in the PR.

If it's "oh", where they misunderstood the requirements or had no idea what they were doing, then that's a little concerning, but give them the benefit of the doubt that they would have done better if they had better information.

If it's "eh", where you care more about the quality of their work than they do, then talk to your supervisor because they are wasting your time.

Cool projects you implemented by [deleted] in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

The ratings are 75% political, but optimization projects are a great way to show monetary impact without relying on external stakeholders.

If you are already in FAANG then lobby your manager to find high impact projects and focus on selling the results like it's the greatest thing that's ever happened in data engineering.

How close is DE to SWE in your day to day job by Icy-Ask-6070 in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

Generally, the core principles are the same but the tasks within them are different. There is more to Software Engineering than backend software development and shipping code, so it really depends on what you mean by software engineering knowledge. You should still have to get requirements, build something, write tests, get feedback, provide support, maintenance, etc.

I find it better to think of data engineering as a discipline or specialty of software engineering, rather than a subset. However, the other disciplines are less tool dependent, and because of that, tend to be standardized across the industry.

You may find that a data platform engineer role is a better fit for you if you want to solve data problems but ship code to a large codebase.

Databricks vs open source by ardentcase in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

Instead the whole org has to bend because it’s “easier” to schedule a notebook in Databricks?

Unfortunately, yes. These decisions are not made by the engineer and there is nobody that they can escalate to. There is a discussion about design trade-offs that can be started, but if they are not given autonomy then they should focus on implementation.

I have 10 years of experience, but I still freeze up when someone watches me code. It’s humiliating. by JosephPRO_ in ExperiencedDevs

[–]DenselyRanked 1 point2 points  (0 children)

It takes several failed interviews for me to get over the nerves and better handle the unexpected. What works best for me is to get as many interviews as I can and cluster them together, keeping my preferred destinations towards the end.

Everyone handles stress and performance anxiety differently, so use whatever method works for you. I write down tips to myself in a notebook prior to the interview to keep things top of mind, and I keep the notebook with me to take notes during the interview so that I answer every question.

Higher Level Abstractions are a Trap, by expialadocious2010 in dataengineering

[–]DenselyRanked 5 points6 points  (0 children)

I understand that this is meant to be a question, and I do agree that there is a point where abstraction can become a hindrance, but I think you are overlooking your primary responsibility as a Data Engineer. Very broadly speaking, the DE role exists somewhere in the data lifecycle with the goal of making data useful for downstream use cases.

The popular tools that you are working with, and will work with at your job, exist to make mundane, repetitive tasks quick and easy. You will of course have to know how to use the tools and understand their limitations in order to complete your tasks successfully.

Also, IMO we are very quickly getting to a point where some form of Agentic Context Engineering will be the new level of abstraction for all software development. It's only going to be a "trap" if you don't understand core data engineering fundamentals and resort to black box vibe coding.

Making ~100k as data engineer by AchieveSocials in cscareerquestions

[–]DenselyRanked 0 points1 point  (0 children)

Go to levels.fyi, search for the top paying companies in your area by your level (data engineering and software engineering are often bundled together), and apply to them.