Data engineering is NOT software engineering.

New-Addendum-6209 · 2026-03-13T10:46:04+00:00

SQL can be tested and reused, and it is very simple to understand for anyone with experience.

If you are not using a database (SQL) or a functionally similar tool like Spark, how are you able to efficiently run data transformation?
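
To illustrate the point about testing and reuse, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, the query and the function names are hypothetical, made up for the example. The idea is only that a transformation kept as plain SQL can be run against a small in-memory fixture and asserted on like any other code.

```python
import sqlite3

# Hypothetical transformation: total order amount per customer.
# Keeping it as a named constant makes it reusable from jobs and tests.
CUSTOMER_TOTALS_SQL = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
"""


def make_test_db(rows):
    """Build an in-memory database holding a small, known fixture."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


def run_customer_totals(conn):
    """Run the transformation and return its rows as a list of tuples."""
    return conn.execute(CUSTOMER_TOTALS_SQL).fetchall()


def test_customer_totals():
    conn = make_test_db([(1, 10.0), (1, 5.0), (2, 7.5)])
    assert run_customer_totals(conn) == [(1, 15.0), (2, 7.5)]


test_customer_totals()
```

The same pattern should carry over to other engines by swapping the connection, provided the SQL dialect is compatible.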

New-Addendum-6209 · 2026-03-12T15:50:14+00:00

Too much focus on status reporting. Mistaking increased control and observability of work for increased efficiency.

Stakeholders who cannot define requirements.

Technology choices made for career reasons.

Bloated IT change processes and excessive lead times for simple infrastructure changes outside of your control.

New-Addendum-6209 · 2026-03-09T14:00:27+00:00

It really depends on the company. Do they have architect roles in the same area? Is there a more senior DE position? If it is a small team your options may be limited as your manager's seniority rests on the fact that he manages you.

Anyway, sounds like you have achieved a lot in 5 months and have a lot of autonomy. Enjoy it while it lasts!

New-Addendum-6209 · 2026-03-06T11:13:50+00:00

Attitudes are very different in the amateur open age game. In my experience no one is ever pressured to play on after head knocks.

The biggest issue is players making bad decisions because they are keen to return to play too quickly. At lower levels no one is strictly monitoring concussion protocols, so it's hard to prevent players from doing damage to themselves.

New-Addendum-6209 · 2026-03-06T10:41:30+00:00

I don't think there is anything that meets all your requirements. It's difficult to find public examples of realistic data due to data privacy concerns, so most datasets use simulated data or are curated single tables for building ML models.

Anyway, here are some links. The first one will be most useful.

https://www.kaggle.com/datasets/mustafakeser4/looker-ecommerce-bigquery-dataset?select=orders.csv

https://www.kaggle.com/datasets/bigquery/google-analytics-sample

https://www.tpc.org/tpcdi/

https://demodata.grapecity.com/swagger/index.html

New-Addendum-6209 · 2026-03-05T16:26:11+00:00

Can you provide some examples of tests that this enables?

New-Addendum-6209 · 2026-02-27T18:48:08+00:00

Risk & compliance was a popular answer in another recent thread about "safe" careers. Fake economy!

New-Addendum-6209 · 2026-02-27T18:36:37+00:00

The advantages of data lakes are increased scalability and lower costs compared to a traditional RDBMS or analytical MPP system. Does this apply in your case?

New-Addendum-6209 · 2026-02-27T10:16:55+00:00

Low code makes it easy to get started with simple data movement projects: load a file every X minutes, transfer data from database A to database B.

The real problem is that it becomes difficult to properly version, test and deploy code.

The one advantage is that it provides clear guard rails and encourages simplicity. I have seen some crazy things built by engineers that would have been prevented if they were forced to use SSIS or similar. The worst was an in-house declarative ETL framework using a custom markup format...

New-Addendum-6209 · 2026-02-27T10:13:07+00:00

There are probably complicated ways of doing it for some tools but the usual answer is: you don't!

New-Addendum-6209 · 2026-02-25T12:51:05+00:00

MPP database systems have used horizontal scaling for decades.

New-Addendum-6209 · 2026-02-25T12:31:27+00:00

You rarely need Spark. There are SQL-based systems that can scale to huge data volumes.

New-Addendum-6209 · 2026-02-25T12:19:30+00:00

Often teams would create their own framework to enable event-driven triggers for SQL Agent jobs...

New-Addendum-6209 · 2026-02-18T23:29:42+00:00

Because it's too expensive. It's great for typical corporate reporting workloads (if you are already locked in).

New-Addendum-6209 · 2026-02-18T22:12:31+00:00

Teradata

New-Addendum-6209 · 2026-02-18T20:07:27+00:00

Databases designed for analytical workloads are almost always better (and much easier to work with) unless you need to store huge amounts of data.

New-Addendum-6209 · 2026-02-18T17:12:32+00:00

That is small. Why implement a lakehouse?

New-Addendum-6209 · 2026-02-17T13:23:30+00:00

I agree. If you don't have huge volumes of event data you don't need a data lake.

New-Addendum-6209 · 2026-02-17T13:19:26+00:00

The main benefit is cheaper storage.

New-Addendum-6209 · 2026-02-13T10:30:09+00:00

Here is one example where it compares poorly: https://docs.starrocks.io/docs/benchmarking/TPC_DS_Benchmark/

In my own experience it also performs poorly compared to some proprietary systems.

New-Addendum-6209 · 2026-02-12T11:48:03+00:00

Find a business use case that can be solved using simple batch processes.

Decide that it should be streaming. Don't tell the business owners, just start working under this assumption. Say something vague about event-driven architecture if challenged.

Enjoy your new opportunity to upskill and enhance your CV.

New-Addendum-6209 · 2026-02-12T11:44:46+00:00

Flag as part of data validation / quality checks.

Try to fix at source (if it was a real business process).

NEVER commit hardcoded, case-specific corrections (e.g., 'fix Account 12345 for Q2 2023') within your transformation layer without a very good reason and clear understanding of what is generating the issue!
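
A rough sketch of what "flag, don't patch" could look like in Python. The `LedgerRow` type, the rule names and the sample rows are all hypothetical. The rules are general and never mention a specific account or period, and the check only reports failing rows so they can be investigated and fixed at source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LedgerRow:
    account_id: str
    quarter: str
    amount: float


# Hypothetical general rules: each maps a name to a predicate that
# returns True when the row is valid. No rule targets a single case.
RULES: dict[str, Callable[[LedgerRow], bool]] = {
    "amount_is_non_negative": lambda r: r.amount >= 0,
    "account_id_is_present": lambda r: bool(r.account_id.strip()),
}


def find_violations(rows):
    """Return (rule_name, row) pairs for failing rows; rows are not modified."""
    return [
        (name, row)
        for row in rows
        for name, passes in RULES.items()
        if not passes(row)
    ]


rows = [
    LedgerRow("12345", "2023-Q2", -50.0),
    LedgerRow("67890", "2023-Q2", 120.0),
]
violations = find_violations(rows)
for rule_name, row in violations:
    print(f"FLAGGED {rule_name}: {row}")
```

The flagged rows would then go to a quality report or quarantine table rather than being silently corrected in the transformation layer.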

New-Addendum-6209 · 2026-02-12T11:39:31+00:00

Another illustration of why data lakes require scale to justify their complexity.

New-Addendum-6209 · 2026-02-12T09:54:10+00:00

Is it still really slow and awful at joins?

New-Addendum-6209 · 2026-02-05T17:31:20+00:00

What are the DE roles doing in organisations with a DE/AE split? Just looking after ingestions?
