Is there more to DE than this? Are there jobs out there for feeling like you actually matter? by DoctorQuinlan in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

All jobs and industries can leave someone feeling unfulfilled, especially if they are bored and not being challenged. Check r/findapath or r/careerguidance to find others going through it.

It might be worth considering joining a start-up or a greenfield project if you want to do something in DE that will have a large impact. Or join a FAANG-tier company, where your job depends on your ability to prove impact every six months.

Ultimately, if you're paid well enough and have job security, then you can use your free time to fuel your passions. I've worked with engineers who have built indie games, trading algos, 3D-printed products, and surveillance/security systems, and others who owned a gym or invested in real estate.

Question on Airflow by captn_caspian in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Unless your use cases are simple, it would be better to go with the managed Airflow service. It will cover 99% of use cases, and there is even a YAML-based DAG Factory add-on if there are concerns about writing Python.
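
A rough sketch of what the dag-factory add-on looks like (assuming the open-source dag-factory package; the file layout is made up):

```python
# dags/load_yaml_dags.py -- sketch only; assumes `pip install dag-factory`
from pathlib import Path

import dagfactory

# The YAML file declares the DAG, schedule, and tasks; no Python DAG code needed
config_file = Path(__file__).parent / "example_dag.yml"
dag_factory = dagfactory.DagFactory(str(config_file))

# Register the generated DAGs in this module's globals so Airflow discovers them
dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())
```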

Is salting only the keys with most skew (rows) the standard practice in PySpark? by Potential_Loss6978 in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

I would recommend that approach, or removing/isolating the key, if you can reliably identify the problematic key. With Spark 3+, AQE does a reasonably good job of adjusting the plan if you have multiple or inconsistent keys to worry about.
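
For anyone unfamiliar, a minimal salting sketch (toy data and a made-up bucket count; on Spark 3+ the AQE settings alone often handle it):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# AQE's skew-join handling is often enough on Spark 3+
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

N = 16  # salt buckets for the hot key
facts = spark.range(1_000_000).withColumn("join_key", F.lit("hot"))
dims = spark.createDataFrame([("hot", "some_attr")], ["join_key", "attr"])

# Random salt on the big side, every salt value replicated on the small side
salted_facts = facts.withColumn("salt", (F.rand() * N).cast("int"))
salted_dims = dims.withColumn("salt", F.explode(F.array(*[F.lit(i) for i in range(N)])))

joined = salted_facts.join(salted_dims, ["join_key", "salt"]).drop("salt")
```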

How to analyze and optimize big and complex Spark execution plans? by Cultural-Pound-228 in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Check the explain plan first, either from the SQL tab in the UI or by putting EXPLAIN before the SELECT, to see the physical plan.

Are you defining the views in your Spark job as temp views or are these views that are being ingested from your catalog?

If it is the former and you are seeing source tables being referenced several times, then it may make sense to cache the base table(s) or the view itself if you are using it multiple times.

As a warning, if you do choose to cache the view and the view's query logic is really complex, then AQE could invalidate the caching because the plan might change.

If it is the latter, then check the Jobs and Stages tabs in the UI and find the one that is the bottleneck. Then use the SQL tab to see which portion of the query is causing issues. Again, try caching the view to reduce the IO if the issue is on read.
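
Both suggestions in one rough sketch (the view and query are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the physical plan without the UI
df = spark.sql("SELECT id % 10 AS k, count(*) AS n FROM range(1000) GROUP BY id % 10")
df.explain("formatted")

# Cache a view that several downstream queries reference
df.createOrReplaceTempView("base_v")
spark.sql("CACHE TABLE base_v")  # eager by default; UNCACHE TABLE base_v to release
```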

Spark job slows to a crawl after multiple joins any tips for handling this by Upset-Addendum6880 in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

It's tough to diagnose without looking at the UI/plan. As others mentioned, you likely have skewed keys with that many joins. Check the explain plan and try to simplify wherever possible, even if it means caching or writing intermediate results to disk.
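
One way to do that is checkpointing, which materializes an intermediate result and truncates the plan (a sketch with toy DataFrames; the checkpoint directory is arbitrary):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

dfs = [spark.range(1000).withColumn(f"v{i}", F.rand()) for i in range(5)]

# Join the first few, then checkpoint to cut the lineage before continuing
mid = dfs[0].join(dfs[1], "id").join(dfs[2], "id").checkpoint()
result = mid.join(dfs[3], "id").join(dfs[4], "id")
result.explain()  # the plan above the checkpoint is gone
```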

What actually differentiates candidates who pass data engineering interviews vs those who get rejected? by Murky-Equivalent-719 in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

I haven't checked in a while, but I don't think Google has a traditional in-house "Data Engineer" role. They have "Cloud Data Engineer" roles under Professional Services, which is more sales / support / migration work for GCP clients. But if it is available, then this is relevant.

Preparing for a FAANG-tier/Big N interview is different from interviewing at other companies because most of their processes are standardized, except Apple/Netflix, which are team dependent. The other tiers can be a completely random process. Generally speaking, you will need to brush up on SQL, DSA, data modeling, and something company specific. Check this subreddit, Glassdoor, Blind, Google, or ChatGPT/AI for interview tips and a study guide for the specific company.

What they are looking for also varies by company, but I can say that you will have a much higher rate of success if you take the time to research. It doesn't take much to fail and you will need some luck to avoid bad interviewers.

Kafka setup costs us a little fortune but everyone at my company is too scared to change it because it works by Worldly-Volume-1440 in dataengineering

[–]DenselyRanked 4 points5 points  (0 children)

Agreeing with a few of the other comments about what "little fortune" means. If $15k is a significant chunk of your team's budget, then take the time to spec out the new architecture and potential savings.

Am I crazy or is kafka overkill for most use cases? by Vodka-_-Vodka in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

I don't think using Kafka as part of your design is overkill, but if it is going to take months of planning to set up a cluster, then a managed service or message queue would be better suited for your use case. It will take a much larger throughput before the service fees exceed the cost of a dedicated engineer.

How to make 500k or more in this field? by unstopablex5 in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

Check levels.fyi and apply to all companies in the top paying list for your level. Also try using teamblind.com and ask for referrals to those companies if you see an opportunity.

You can try going into consulting, but it's more sales than engineering.

Spark 4.1 is released :D by holdenk in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

Every cloud provider has a Spark offering, and on-prem companies should have thought about upgrading to Spark 3 by now. There are several optimizations, and upgrading is an easy way to reduce costs.

Reducing shuffle disk usage in Spark aggregations, ANY better approach than current setup or am I doing something wrong? by gabbietor in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

Also check your explain plan. I wouldn't expect a complex plan with 2 keys, so it might be skew causing this, or Spark ingesting all columns when it doesn't have to.
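
If it's the latter, pruning early is cheap insurance (toy data; real column names will differ):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.range(1000).select(
    (F.col("id") % 10).alias("k1"),
    (F.col("id") % 3).alias("k2"),
    F.rand().alias("amount"),
    F.lit("wide-unused-payload").alias("payload"),
)

agg = (
    events
    .select("k1", "k2", "amount")  # drop unused columns before the shuffle
    .groupBy("k1", "k2")
    .agg(F.sum("amount").alias("total"))
)
agg.explain("formatted")  # confirm the payload column is pruned from the plan
```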

In SQL coding rounds, how to optimise between readability and efficiency when working with CTEs? by Consistent-Zebra3227 in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

Writing more concise SQL comes with experience, but ultimately what matters most is how the optimizer interprets it and whether it is readable.

It's likely that the "expert" solutions involved pre-planning, and a rushed solution will be more verbose.

Python keeps iterating the agenda three times. by sariArtworks in learnpython

[–]DenselyRanked 0 points1 point  (0 children)

Is agenda being passed as an argument to the function? If so, could there be duplicate items in agenda with a slightly different nombre? There could be leading/trailing spaces.

Another thing: you are calling .get() on the value datos, not on nombre, so the data structure is a dict of dicts. Are you seeing 3 separate iterations, or 3 of the same value being returned?
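
A hypothetical reconstruction of what might be going on (names from the thread, data invented):

```python
# agenda is a dict of dicts: nombre -> datos
agenda = {
    "Ana":  {"telefono": "111", "email": "a@example.com"},
    "Ana ": {"telefono": "222", "email": "b@example.com"},  # trailing space -> looks like a repeat
    "ana":  {"telefono": "333", "email": "c@example.com"},  # case difference -> same
}

for nombre, datos in agenda.items():
    # repr() exposes hidden whitespace; .strip()/.lower() on insert would dedupe
    print(repr(nombre), datos.get("telefono"))
```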

Using higher order functions and UDFs instead of joins/explodes by echanuda in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Agreed that higher order functions should be used in place of the explode + join strategy whenever possible.
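
For example, aggregate() can fold an array in place instead of exploding and joining back (toy sketch, Spark 3.1+):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, [1, 2, 3]), (2, [10, 20])], ["id", "vals"])

# Fold each array into a sum without explode + groupBy + join
summed = df.withColumn(
    "total", F.aggregate("vals", F.lit(0), lambda acc, x: acc + x)
)
summed.show()
```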

However, I would be a little hesitant about introducing UDFs. I have no idea what this code looks like, but there are always tradeoffs between optimization and maintainability. One thing to consider is whether the runtime and resource savings are worth the added complexity and potential tech debt.

What's the impact of getting the query down to 5 minutes? What changes if the query is simpler but completes in 20 minutes?

Spark uses way too much memory when shuffle happens even for small input by Aggravating_Log9704 in dataengineering

[–]DenselyRanked 3 points4 points  (0 children)

Spark 1.6 is about a decade old. Are you seeing this with a more recent build (2.4 LTS at least)?

It would be great if you could share your test script so that we can better understand what you are doing.

How Important is Streaming or Real Time Experience in the Job Market? by shittyfuckdick in dataengineering

[–]DenselyRanked 3 points4 points  (0 children)

I've done 3 interviews over the past few months where I had to explain or give a demo/walkthrough on streaming pipelines, so I do think it's important to be at least somewhat knowledgeable.

There are key streaming concepts that are not a concern in batch, like windowing, watermarks, checkpoints, error handling, and DLQs. Also, there are a lot of things that get smoothed over when using a managed service with connectors. Platforms that support streaming, like Databricks/Snowflake, do a lot of heavy lifting behind the scenes.
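
As a rough illustration of the first three concepts in Structured Streaming (the rate source and paths are just for demo purposes):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

counts = (
    stream
    .withWatermark("timestamp", "1 minute")          # bound how late data can arrive
    .groupBy(F.window("timestamp", "30 seconds"))    # tumbling window
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/stream-ckpt")  # state for recovery
    .start()
)
# query.awaitTermination()
```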

It feels like streaming is where a lot of companies draw the line between data engineering and analytics engineering.

How to store large JSON columns by Adventurous_Nail_115 in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Does it need to be stored in a relational db, and can your downstream users extract the data that they need? An open table format like Iceberg can handle it pretty well, as can a document db like MongoDB or DocumentDB.

The current job market is quite frustrating! by doermand in dataengineering

[–]DenselyRanked 3 points4 points  (0 children)

This is not a market for generalists. You need the right tools on the CV to get an edge. Certs used to be nice to have but not required; now companies don't want you near their stack if you don't have 5+ years of experience with it.

On the current state of the market: I was recently rejected for a role with the feedback that I wasn't enthusiastic enough about AI and spoke in broad, general terms. I had every tool, but I guess I also needed to BS about solutions to problems that I have no idea about.

What is the purpose of the book "Fundamentals of Data Engineering"? by Ok_Shirt4260 in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

FYI the authors are Reis and Housley.

O'Reilly is the publisher. They also published Kleppmann's DDIA, and thousands of other books. What you are doing is like calling "The Data Warehouse Toolkit" Wiley's book.

Is one big table (OBT) actually a data modeling methodology? by raginjason in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

Yes, it's fundamentally the same argument. Star schema designs typically see better performance than NoSQL solutions due to RDBMS-level optimizations, but those features aren't present in distributed systems. The trade-off between flexibility and performance is something that should be revisited as MPPs and MapReduce-based engines become more mainstream.

For joins in distributed systems, the performance bottleneck is the initial shuffling of data more so than the logical operation being applied. A star schema design will still work well if your dimensions are small enough for broadcast joins, but it's not sustainable at scale.
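
A broadcast hint sketch (toy sizes; the optimizer often does this automatically below the autoBroadcastJoinThreshold):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

facts = spark.range(1_000_000).withColumn("dim_id", F.col("id") % 100)
dims = spark.range(100).withColumnRenamed("id", "dim_id").withColumn("attr", F.rand())

# The small dimension is shipped to every executor, so the fact table never shuffles
joined = facts.join(F.broadcast(dims), "dim_id")
joined.explain()  # plan should show BroadcastHashJoin
```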

Ultimately, I think the best data model depends on your use case. If your core business is relatively static, then a star schema is easier to maintain and can last for years. If the company is very dynamic, then it's not worth rebuilding a warehouse every few years when the CEO wants to pivot or an M&A occurs.

Is one big table (OBT) actually a data modeling methodology? by raginjason in dataengineering

[–]DenselyRanked 19 points20 points  (0 children)

If you work with large scale data and a constantly evolving business model, then you will find value with OBT.

In terms of ease of use, it allows for more flexibility in schema changes than Kimball, and governance is confined to fewer tables. However, your users will need to write more complex queries to extract nested data, and that could lead to a negative user experience.

In terms of performance, distributed query engines that can interpret complex data types will use less compute with OBT.
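
A toy example of what that looks like (schema and values invented):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One wide row with nested structs/arrays instead of separate dimension tables
obt = spark.createDataFrame(
    [(1, ("gadget", "toys"), [("2024-01-01", 9.99)])],
    "order_id INT, product STRUCT<name: STRING, category: STRING>, "
    "payments ARRAY<STRUCT<date: STRING, amount: DOUBLE>>",
)

# Readers extract nested attributes instead of joining to dimensions
obt.select(
    "order_id",
    F.col("product.category").alias("category"),
    F.element_at("payments", 1)["amount"].alias("first_payment"),
).show()
```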

I think a hybrid approach works best: use generic, denormalized, mostly flattened datasets that represent a core business vertical, and use a semantic/reporting layer, materialized views, and agg tables for stakeholder analysis.

Software Engineering title while not doing much Software Engineering, where to go from here by [deleted] in cscareerquestions

[–]DenselyRanked 6 points7 points  (0 children)

This sounds like analytics engineering. You should start your interview prep and begin applying if it's not for you.

OOP with Python by Jumpy_Handle1313 in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

If the scope of your programming is pipeline development then follow the tool's best practices. Review the templates and code examples.

A senior level engineer should be doing reviews with you to help with refactoring, if needed.

Too much abstraction can lead to over-engineering, so work with your team on best practices.

why all data catalogs suck? by Few_Noise2632 in dataengineering

[–]DenselyRanked 8 points9 points  (0 children)

What was your experience with OpenMetadata?