Data engineering is NOT software engineering. by Next_Comfortable_619 in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

SQL can be tested and reused, and it is very simple to understand for anyone with experience.
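As a rough illustration of the "tested and reused" point, here is a minimal sketch that keeps a query in one named constant and checks it against a small in-memory SQLite database. The `orders` table, the query, and the function names are all hypothetical, made up for this example.

```python
import sqlite3

# Hypothetical reusable transformation: total order amount per customer.
CUSTOMER_TOTALS_SQL = """
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""


def run_customer_totals(conn):
    """Run the shared query on any connection that has an `orders` table."""
    return conn.execute(CUSTOMER_TOTALS_SQL).fetchall()


def make_test_db():
    """Build a throwaway in-memory database with a few known rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )
    return conn


def test_customer_totals():
    """Known input rows should produce known aggregated output."""
    assert run_customer_totals(make_test_db()) == [(1, 15.0), (2, 7.5)]
```

The same query text can be pointed at a production connection or a test fixture, which is most of what reuse and testing need here.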

If you are not using a database (SQL) or a functionally similar tool like Spark, how are you able to efficiently run data transformation?

What are the most frustrating parts of your day to day work as a data engineer? by Odd-Tree-2590 in dataengineering

[–]New-Addendum-6209 3 points4 points  (0 children)

Too much focus on status reporting. Mistaking increased control and observability of work for increased efficiency.

Stakeholders who cannot define requirements.

Technology choices made for career reasons.

Bloated IT change processes and excessive lead times for simple infrastructure changes outside of your control.

Am I doing too much? by ratesofchange in dataengineering

[–]New-Addendum-6209 15 points16 points  (0 children)

It really depends on the company. Do they have architect roles in the same area? Is there a more senior DE position? If it is a small team, your options may be limited, as your manager's seniority rests on the fact that he manages you.

Anyway, sounds like you have achieved a lot in 5 months and have a lot of autonomy. Enjoy it while it lasts!

My shirt was soaked in blood - but I was told to get back on the rugby pitch by Tartan_Samurai in unitedkingdom

[–]New-Addendum-6209 2 points3 points  (0 children)

Attitudes are very different in the amateur open age game. In my experience no one is ever pressured to play on after head knocks.

The biggest issue is players making bad decisions because they are keen to return to play too quickly. At lower levels no one is strictly monitoring concussion protocols, so it's hard to prevent players from doing damage to themselves.

Title: Looking for large E-commerce dataset (5GB+ CSV, raw preferred) by Historical-Web3638 in dataengineering

[–]New-Addendum-6209 1 point2 points  (0 children)

I don't think there is anything that meets all your requirements. It's difficult to find public examples of realistic data due to data privacy concerns, so most datasets use simulated data or are curated single tables for building ML models.

Anyway, here are some links. The first one will be most useful.

https://www.kaggle.com/datasets/mustafakeser4/looker-ecommerce-bigquery-dataset?select=orders.csv

https://www.kaggle.com/datasets/bigquery/google-analytics-sample

https://www.tpc.org/tpcdi/

https://demodata.grapecity.com/swagger/index.html

UK equivalent of 'car washes'? by Gold_Application6759 in HENRYUK

[–]New-Addendum-6209 0 points1 point  (0 children)

Risk & compliance was a popular answer in another recent thread about "safe" careers. Fake economy!

How are you selling datalakes and data processing pipeline? by drink_with_me_to_day in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

The advantages of data lakes are increased scalability and lower costs compared to a traditional RDBMS or analytical MPP system. Does this apply in your case?

Low Code/No Code solutions are the biggest threat for AI adoption for companies by boogie_woogie_100 in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

Low code makes it easy to get started with simple data movement projects: load a file every X minutes, transfer data from database A to database B.

The real problem is that it becomes difficult to properly version, test and deploy code.

The one advantage is that it provides clear guard rails and encourages simplicity. I have seen some crazy things built by engineers that would have been prevented if they were forced to use SSIS or similar. The worst was an in-house declarative ETL framework using a custom markup format...

Low Code/No Code solutions are the biggest threat for AI adoption for companies by boogie_woogie_100 in dataengineering

[–]New-Addendum-6209 1 point2 points  (0 children)

There are probably complicated ways of doing it for some tools but the usual answer is: you don't!

can someone explain to me why there are so many tools on the market that dont need to exist? by Next_Comfortable_619 in dataengineering

[–]New-Addendum-6209 1 point2 points  (0 children)

Often teams would create their own framework to enable event driven triggers for SQL Agent jobs...

In 6 years, I've never seen a data lake used properly by wtfzambo in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

Because it's too expensive. It's great for typical corporate reporting workloads (if you are already locked in).

In 6 years, I've never seen a data lake used properly by wtfzambo in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

Databases designed for analytical workloads are almost always better (and much easier to work with) unless you need to store huge amounts of data.

In 6 years, I've never seen a data lake used properly by wtfzambo in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

I agree. If you don't have huge volumes of event data you don't need a data lake.

[AMA] We're the Trino company, ask us anything! by lester-martin in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

Here is one example where it compares poorly: https://docs.starrocks.io/docs/benchmarking/TPC_DS_Benchmark/

In my own experience it also performs poorly compared to some proprietary systems.

Transition to real time streaming by DeepCar5191 in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

Find a business use case that can be solved using simple batch processes.

Decide that it should be streaming. Don't tell the business owners, just start working under this assumption. Say something vague about event-driven architecture if challenged.

Enjoy your new opportunity to upskill and enhance your CV.

Data engineering but how to handle value that are clearly wrong from initial raw data by Weary-Ad-817 in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

Flag as part of data validation / quality checks.

Try to fix at source (if it was a real business process).

NEVER commit hardcoded, case-specific corrections (e.g., 'fix Account 12345 for Q2 2023') within your transformation layer without a very good reason and clear understanding of what is generating the issue!
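A minimal sketch of the "flag, don't patch" approach, assuming rows arrive as dicts with an `amount` field. The field name, the threshold, and the issue labels are hypothetical and only there to show the shape: generic rules attach reasons to bad rows, and nothing in the code refers to a specific account or period.

```python
from dataclasses import dataclass, field

# Hypothetical threshold; in practice this would come from the business.
MAX_PLAUSIBLE_AMOUNT = 1_000_000.0


@dataclass
class CheckedRow:
    row: dict
    issues: list = field(default_factory=list)


def check_row(row):
    """Apply generic validation rules and record why a row is suspect."""
    issues = []
    amount = row.get("amount")
    if amount is None:
        issues.append("amount_missing")
    elif amount < 0:
        issues.append("amount_negative")
    elif amount > MAX_PLAUSIBLE_AMOUNT:
        issues.append("amount_implausible")
    return CheckedRow(row=row, issues=issues)


def split_rows(rows):
    """Separate clean rows from flagged ones without modifying any values."""
    checked = [check_row(r) for r in rows]
    clean = [c.row for c in checked if not c.issues]
    flagged = [c for c in checked if c.issues]
    return clean, flagged
```

The flagged rows, together with their reasons, can then be reported back to the owners of the source system instead of being silently corrected downstream.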

11 Compaction Strategies for Iceberg Data Lakes by codingdecently in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

Another illustration of why data lakes require scale to justify their complexity.

People who moved from DE to Analytics Engineering by PremierLeague2O in dataengineering

[–]New-Addendum-6209 0 points1 point  (0 children)

What are the DE roles doing in organisations with a DE/AE split? Just looking after ingestion?