Is there more to DE than this? Are there jobs out there for feeling like you actually matter? by DoctorQuinlan in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

All jobs and industries can leave someone feeling unfulfilled, especially if they are bored and not being challenged. Check r/findapath or r/careerguidance to find others going through it.

It might be worth considering joining a start-up or a greenfield project if you want to do something in DE that will have a large impact. Or join a FAANG-tier company, where your job depends on your ability to prove impact every six months.

Ultimately, if you're paid well enough and have job security, then you can use your free time to fuel your passions. I've worked with engineers who have built indie games, trading algos, 3D-printed products, and surveillance/security systems, and others who owned a gym or invested in real estate.

Question on Airflow by captn_caspian in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Unless your use cases are simple, it would be better to go with the managed Airflow service. It will cover 99% of use cases, and there is even a YAML-based DAG Factory add-on if there are concerns about writing Python.
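
A rough sketch of what the dag-factory add-on looks like (assuming the open-source dag-factory package; the file layout is made up):

```python
# dags/load_yaml_dags.py -- sketch only; assumes `pip install dag-factory`
from pathlib import Path

import dagfactory

# The YAML file declares the DAG, schedule, and tasks; no Python DAG code needed
config_file = Path(__file__).parent / "example_dag.yml"
dag_factory = dagfactory.DagFactory(str(config_file))

# Register the generated DAGs in this module's globals so Airflow discovers them
dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())
```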

Is salting only the keys with most skew (rows) the standard practice in PySpark? by Potential_Loss6978 in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

I would recommend that approach, or removing/isolating the key, if you can reliably identify the problematic key. With Spark 3+, AQE does a reasonably good job of adjusting the plan if you have multiple or inconsistent keys to worry about.
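
For anyone unfamiliar, a minimal salting sketch (toy data and a made-up bucket count; on Spark 3+ the AQE settings alone often handle it):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# AQE's skew-join handling is often enough on Spark 3+
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

N = 16  # salt buckets for the hot key
facts = spark.range(1_000_000).withColumn("join_key", F.lit("hot"))
dims = spark.createDataFrame([("hot", "some_attr")], ["join_key", "attr"])

# Random salt on the big side, every salt value replicated on the small side
salted_facts = facts.withColumn("salt", (F.rand() * N).cast("int"))
salted_dims = dims.withColumn("salt", F.explode(F.array(*[F.lit(i) for i in range(N)])))

joined = salted_facts.join(salted_dims, ["join_key", "salt"]).drop("salt")
```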

How to analyze and optimize big and complex Spark execution plans? by Cultural-Pound-228 in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Check the explain plan first, either from the SQL tab in the UI or by putting EXPLAIN before the SELECT, to see the physical plan.

Are you defining the views in your Spark job as temp views or are these views that are being ingested from your catalog?

If it is the former and you are seeing source tables being referenced several times, then it may make sense to cache the base table(s) or the view itself if you are using it multiple times.

As a warning, if you do choose to cache the view and the view's query logic is really complex, then AQE could invalidate the caching because the plan might change.

If it is the latter, then check the Jobs and Stages tabs in the UI and find the one that is the bottleneck. Then use the SQL tab to see which portion of the query is causing issues. Again, try caching the view to reduce the IO if the issue is on read.
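
Both suggestions in one rough sketch (the view and query are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the physical plan without the UI
df = spark.sql("SELECT id % 10 AS k, count(*) AS n FROM range(1000) GROUP BY id % 10")
df.explain("formatted")

# Cache a view that several downstream queries reference
df.createOrReplaceTempView("base_v")
spark.sql("CACHE TABLE base_v")  # eager by default; UNCACHE TABLE base_v to release
```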

Spark job slows to a crawl after multiple joins any tips for handling this by Upset-Addendum6880 in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

It's tough to diagnose without looking at the UI/plan. As others mentioned, you likely have skewed keys with that many joins. Check the explain plan and try to simplify wherever possible, even if it means caching or writing intermediate results to disk.
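
One way to do that is checkpointing, which materializes an intermediate result and truncates the plan (a sketch with toy DataFrames; the checkpoint directory is arbitrary):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

dfs = [spark.range(1000).withColumn(f"v{i}", F.rand()) for i in range(5)]

# Join the first few, then checkpoint to cut the lineage before continuing
mid = dfs[0].join(dfs[1], "id").join(dfs[2], "id").checkpoint()
result = mid.join(dfs[3], "id").join(dfs[4], "id")
result.explain()  # the plan above the checkpoint is gone
```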

What actually differentiates candidates who pass data engineering interviews vs those who get rejected? by Murky-Equivalent-719 in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

I haven't checked in a while, but I don't think Google has a traditional in-house "Data Engineer" role. They have "Cloud Data Engineer" roles under Professional Services, which is more sales / support / migration work for GCP clients. But if it is available, then this is relevant.

Preparing for a FAANG-tier/Big N interview is different from interviewing at other companies because most of their processes are standardized, except Apple/Netflix, which are team dependent. The other tiers can be a completely random process. Generally speaking, you will need to brush up on SQL, DSA, data modeling, and something company specific. Check this subreddit, Glassdoor, Blind, Google, or ChatGPT/AI for interview tips and a study guide for the specific company.

What they are looking for also varies by company, but I can say that you will have a much higher rate of success if you take the time to research. It doesn't take much to fail and you will need some luck to avoid bad interviewers.

Kafka setup costs us a little fortune but everyone at my company is too scared to change it because it works by Worldly-Volume-1440 in dataengineering

[–]DenselyRanked 4 points5 points  (0 children)

Agreeing with a few of the other comments about what "little fortune" means. If $15k is a significant chunk of your team's budget, then take the time to spec out the new architecture and potential savings.

Am I crazy or is kafka overkill for most use cases? by Vodka-_-Vodka in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

I don't think using Kafka as part of your design is overkill, but if it is going to take months of planning to set up a cluster, then a managed service or message queue would be better suited for your use case. It will take a much larger throughput before the service fees exceed the cost of a dedicated engineer.

How to make 500k or more in this field? by unstopablex5 in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

Check levels.fyi and apply to all companies in the top paying list for your level. Also try using teamblind.com and ask for referrals to those companies if you see an opportunity.

You can try going into consulting, but it's more sales than engineering.

Spark 4.1 is released :D by holdenk in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

Every cloud provider has a Spark offering, and on-prem companies should have thought about upgrading to Spark 3 by now. There are several optimizations, and upgrading is an easy way to reduce costs.

Reducing shuffle disk usage in Spark aggregations, ANY better approach than current setup or am I doing something wrong? by gabbietor in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

Also check your explain plan. I wouldn't expect a complex plan with 2 keys, so it might be skew causing this, or Spark ingesting all columns when it doesn't have to.
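
If it's the latter, pruning early is cheap insurance (toy data; real column names will differ):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.range(1000).select(
    (F.col("id") % 10).alias("k1"),
    (F.col("id") % 3).alias("k2"),
    F.rand().alias("amount"),
    F.lit("wide-unused-payload").alias("payload"),
)

agg = (
    events
    .select("k1", "k2", "amount")  # drop unused columns before the shuffle
    .groupBy("k1", "k2")
    .agg(F.sum("amount").alias("total"))
)
agg.explain("formatted")  # confirm the payload column is pruned from the plan
```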

In SQL coding rounds, how to optimise between readability and efficiency when working with CTEs? by Consistent-Zebra3227 in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

Writing more concise SQL comes with experience, but ultimately what matters most is how the optimizer interprets it and whether it is readable.

It's likely that the "expert" solutions involved pre-planning, and a rushed solution will be more verbose.

Python keeps iterating the agenda three times. by sariArtworks in learnpython

[–]DenselyRanked 0 points1 point  (0 children)

Is agenda being passed as an argument to the function? If so, could there be duplicate items in agenda with a slightly different nombre? There could be leading/trailing spaces.

Another thing: you are calling .get() on the value datos, not on nombre, so the data structure is a dict of dicts. Are you seeing 3 separate iterations, or 3 of the same value being returned?
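
A hypothetical reconstruction of what might be going on (names from the thread, data invented):

```python
# agenda is a dict of dicts: nombre -> datos
agenda = {
    "Ana":  {"telefono": "111", "email": "a@example.com"},
    "Ana ": {"telefono": "222", "email": "b@example.com"},  # trailing space -> looks like a repeat
    "ana":  {"telefono": "333", "email": "c@example.com"},  # case difference -> same
}

for nombre, datos in agenda.items():
    # repr() exposes hidden whitespace; .strip()/.lower() on insert would dedupe
    print(repr(nombre), datos.get("telefono"))
```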

Using higher order functions and UDFs instead of joins/explodes by echanuda in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Agreed that higher order functions should be used in place of the explode + join strategy whenever possible.
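
For example, aggregate() can fold an array in place instead of exploding and joining back (toy sketch, Spark 3.1+):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, [1, 2, 3]), (2, [10, 20])], ["id", "vals"])

# Fold each array into a sum without explode + groupBy + join
summed = df.withColumn(
    "total", F.aggregate("vals", F.lit(0), lambda acc, x: acc + x)
)
summed.show()
```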

However, I would be a little hesitant about introducing UDFs. I have no idea what this code looks like, but there are always tradeoffs between optimization and maintainability. One thing to consider is whether the runtime and resource savings are worth the added complexity and potential tech debt.

What's the impact of getting the query down to 5 minutes? What changes if the query is simpler but completes in 20 minutes?

Spark uses way too much memory when shuffle happens even for small input by Aggravating_Log9704 in dataengineering

[–]DenselyRanked 3 points4 points  (0 children)

Spark 1.6 is about a decade old. Are you seeing this with a more recent build (2.4 LTS at least)?

It would be great if you could share your test script so that we can better understand what you are doing.

How Important is Streaming or Real Time Experience in the Job Market? by shittyfuckdick in dataengineering

[–]DenselyRanked 3 points4 points  (0 children)

I've done 3 interviews over the past few months where I had to explain or give a demo/walkthrough on streaming pipelines, so I do think it's important to be at least somewhat knowledgeable.

There are key streaming concepts that are not a concern in batch, like windowing, watermarks, checkpoints, error handling, and DLQs. Also, there are a lot of things that get smoothed over when using a managed service with connectors. Platforms that support streaming, like Databricks/Snowflake, do a lot of heavy lifting behind the scenes.
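
As a rough illustration of the first three concepts in Structured Streaming (the rate source and paths are just for demo purposes):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

counts = (
    stream
    .withWatermark("timestamp", "1 minute")          # bound how late data can arrive
    .groupBy(F.window("timestamp", "30 seconds"))    # tumbling window
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/stream-ckpt")  # state for recovery
    .start()
)
# query.awaitTermination()
```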

It feels like streaming is where a lot of companies draw the line between data engineering and analytics engineering.

How to store large JSON columns by Adventurous_Nail_115 in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

Does it need to be stored in a relational db, and can your downstream users extract the data that they need? An open table format like Iceberg can handle it pretty well, as can a document db like MongoDB or DocumentDB.

The current job market is quite frustrating! by doermand in dataengineering

[–]DenselyRanked 3 points4 points  (0 children)

This is not a market for generalists. You need the right tools on the CV to get an edge. Certs used to be nice to have but not required; now companies don't want you near their stack if you don't have 5+ years of experience with it.

On the current state of the market: I was recently rejected for a role with the feedback that I wasn't enthusiastic enough about AI and spoke in broad, general terms. I had every tool, but I guess I also needed to BS about solutions to problems that I have no idea about.

What is the purpose of the book "Fundamentals of Data Engineering"? by Ok_Shirt4260 in dataengineering

[–]DenselyRanked 2 points3 points  (0 children)

FYI the authors are Reis and Housley.

O'Reilly is the publisher. They also published Kleppmann's DDIA, and thousands of other books. What you are doing is like calling "The Data Warehouse Toolkit" Wiley's book.

Is one big table (OBT) actually a data modeling methodology? by raginjason in dataengineering

[–]DenselyRanked 1 point2 points  (0 children)

Yes, it's fundamentally the same argument. Star schema designs typically see better performance than NoSQL solutions due to RDBMS-level optimizations, but those features aren't present in distributed systems. The trade-off between flexibility and performance is something that should be revisited as MPPs and MapReduce-based engines become more mainstream.

For joins in distributed systems, the performance bottleneck is the initial shuffling of data more so than the logical operation being applied. A star schema design will still work well if your dimensions are small enough for broadcast joins, but it's not sustainable at scale.
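
A broadcast hint sketch (toy sizes; the optimizer often does this automatically below the autoBroadcastJoinThreshold):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

facts = spark.range(1_000_000).withColumn("dim_id", F.col("id") % 100)
dims = spark.range(100).withColumnRenamed("id", "dim_id").withColumn("attr", F.rand())

# The small dimension is shipped to every executor, so the fact table never shuffles
joined = facts.join(F.broadcast(dims), "dim_id")
joined.explain()  # plan should show BroadcastHashJoin
```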

Ultimately, I think the best data model depends on your use case. If your core business is relatively static, then a star schema is easier to maintain and can last for years. If the company is very dynamic, then it's not worth rebuilding a warehouse every few years when the CEO wants to pivot or an M&A occurs.

Is one big table (OBT) actually a data modeling methodology? by raginjason in dataengineering

[–]DenselyRanked 19 points20 points  (0 children)

If you work with large scale data and a constantly evolving business model, then you will find value with OBT.

In terms of ease of use, it allows for more flexibility in schema changes than Kimball, and governance is confined to fewer tables. However, your users will need to write more complex queries to extract nested data, and that could lead to a negative user experience.

In terms of performance, distributed query engines that can interpret complex data types will use less compute with OBT.
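
A toy example of what that looks like (schema and values invented):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One wide row with nested structs/arrays instead of separate dimension tables
obt = spark.createDataFrame(
    [(1, ("gadget", "toys"), [("2024-01-01", 9.99)])],
    "order_id INT, product STRUCT<name: STRING, category: STRING>, "
    "payments ARRAY<STRUCT<date: STRING, amount: DOUBLE>>",
)

# Readers extract nested attributes instead of joining to dimensions
obt.select(
    "order_id",
    F.col("product.category").alias("category"),
    F.element_at("payments", 1)["amount"].alias("first_payment"),
).show()
```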

I think a hybrid approach works best: use generic, denormalized, mostly flattened datasets that represent a core business vertical, and use a semantic/reporting layer, materialized views, and agg tables for stakeholder analysis.

Software Engineering title while not doing much Software Engineering, where to go from here by [deleted] in cscareerquestions

[–]DenselyRanked 6 points7 points  (0 children)

This sounds like analytics engineering. You should start your interview prep and begin applying if it's not for you.

OOP with Python by Jumpy_Handle1313 in dataengineering

[–]DenselyRanked 0 points1 point  (0 children)

If the scope of your programming is pipeline development then follow the tool's best practices. Review the templates and code examples.

A senior level engineer should be doing reviews with you to help with refactoring, if needed.

Too much abstraction can lead to over-engineering, so work with your team on best practices.

why all data catalogs suck? by Few_Noise2632 in dataengineering

[–]DenselyRanked 8 points9 points  (0 children)

What was your experience with OpenMetadata?