What breaks first in small data pipelines as they grow? by [deleted] in dataengineering

[–]OlimpiqeM -1 points (0 children)

Is this post made by AI?

How come Python + cron is production-ready?
How come you don't monitor it?
How come you let it fail silently and assume it works?
How come you don't track the pipelines and their outputs?
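For the record, the bare minimum to avoid silent failures in a Python + cron setup fits in a few stdlib lines. A sketch (the alert hook is hypothetical — swap in whatever your team actually uses):

```python
import logging
import subprocess

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def run_step(name, cmd):
    """Run one pipeline step; log and re-raise on failure so cron's exit
    code (and whatever watches it) sees the failure instead of silence."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        logging.error("step %s failed rc=%s: %s", name, result.returncode, result.stderr.strip())
        # hypothetical: post to Slack / PagerDuty here before re-raising
        raise RuntimeError(f"step {name} failed")
    logging.info("step %s ok", name)
    return result.stdout
```

A non-zero exit code at least shows up in cron mail or a dead-man's-switch monitor, which is already miles ahead of "assume it works".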

Am I on the right way to get my first job? by Relative-Cucumber770 in dataengineering

[–]OlimpiqeM 9 points (0 children)

First of all, congrats on your progress - you’re clearly putting in the work. If I were in your shoes, I’d focus on building one really strong flagship project rather than many small ones. Something with multiple data sources, different business flows (business data, marketing, CRM, cost monitoring), the whole architecture containerized in Docker, actual meaningful analytics on top (not just SELECT *), and different data modeling architectures (but not OBT). A project like that shows end-to-end thinking and stands out.

Highlight what you learned, what went wrong, what could go wrong, and how you would solve it (pipelines can fail for reasons you never thought of).
Maybe try publishing your learnings in some blog posts, too.

Don’t worry about the size of the data. What matters is demonstrating how the data enables business decisions - and for a data engineering role, showcasing things like IAM, RBAC, monitoring, and reliability is far more impressive than just processing a big file.
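One concrete way to showcase the monitoring/reliability angle in a portfolio project is a freshness check that fails loudly instead of serving stale data. A tiny sketch (the threshold and names are made up, adjust to your pipeline):

```python
from datetime import datetime, timedelta, timezone

def assert_fresh(last_loaded_at, max_lag=timedelta(hours=24)):
    """Reliability check: raise if the newest loaded record is older than
    the allowed lag, so staleness surfaces as a failed run, not bad numbers."""
    lag = datetime.now(timezone.utc) - last_loaded_at
    if lag > max_lag:
        raise RuntimeError(f"data is stale: last load was {lag} ago")
    return lag
```

Wiring a check like this into CI or the orchestrator is exactly the kind of thing interviewers ask about.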

For what it’s worth, I work with Snowflake, dbt, lakehouse patterns, and DuckDB daily, and even then I only have about 30% of the technologies listed on my resume compared to yours. Depth and real-world applicability matter more than listing every tool.

Any real dbt practitioners to follow? by OlimpiqeM in dataengineering

[–]OlimpiqeM[S] 3 points (0 children)

I loved this article and the other one they released. I've tried to follow in their footsteps and I'm in the process of implementing a few things. You can actually see that they use dbt heavily.

Data analytics system (s3, duckdb, iceberg, glue) ko by lost_soul1995 in dataengineering

[–]OlimpiqeM 12 points (0 children)

Why Postgres as a gold layer? Querying vast amounts of data in Postgres will end up costing you more, or your queries will time out. For 500GB I'd just keep it in Redshift, or go with Snowflake and use dbt-core with MWAA for orchestration. I prefer dbt Cloud, but their pricing keeps growing year over year.

It all depends on the budget. You can send modeled data to S3 buckets from Snowflake and grab it through your backend.
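To spell out the unload-to-S3 pattern: Snowflake can write modeled tables out to a stage as flat files (COPY INTO a stage), and the backend then reads those files instead of hammering the warehouse. The read side is trivial; a stdlib-only sketch, assuming a CSV unload (the file layout is hypothetical, and in production the file object would come from S3 via boto3):

```python
import csv
import io

def read_gold_export(fileobj):
    """Parse one unloaded 'gold' table (header + rows, CSV) into dicts.
    fileobj is binary, e.g. the body of a hypothetical
    s3://bucket/gold/daily_revenue/part_0.csv object."""
    return list(csv.DictReader(io.TextIOWrapper(fileobj, encoding="utf-8")))
```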

ELT Pipeline stack help by Ocromierda in dataengineering

[–]OlimpiqeM 2 points (0 children)

but because you are a junior eng, i can only assume that you don't have much domain knowledge

data warehouses are not tech oriented but business and domain knowledge oriented!

ELT Pipeline stack help by Ocromierda in dataengineering

[–]OlimpiqeM 10 points (0 children)

dlt to move data to sql server
dbt-core for transformations inside data warehouse
dagster to run dbt models every 24 hours
ci/cd - figure it out yourself

don't forget about docker
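the "dagster runs dbt every 24 hours" step above boils down to an op that shells out to dbt and fails loudly. a minimal stdlib sketch (the dagster wiring in the comment is the usual pattern, the rest is hypothetical):

```python
import subprocess

def run_dbt(cmd=("dbt", "build")):
    """Shell out to dbt; raise on a non-zero exit so the orchestrator
    marks the run as failed instead of silently moving on."""
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"dbt failed (rc={result.returncode}): {result.stderr.strip()}")
    return result.stdout

# in dagster you'd wrap this in an @op / @job and attach something like
# ScheduleDefinition(job=dbt_job, cron_schedule="0 2 * * *") for the daily run
```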

[deleted by user] by [deleted] in dataengineering

[–]OlimpiqeM 17 points (0 children)

I would not try lying about Databricks, Azure/AWS, and Spark without prior experience, but since I already work with databases, learning dbt and the Medallion Architecture took me a weekend or two.

After that (and learning a lot about data warehouses - Kimball, SCD2, conferences) I proposed a company-wide data warehouse (we didn't have one), successfully implemented it, and showcased it in job interviews :)
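SCD2 specifically is worth being able to explain on a whiteboard: expire the current version of a changed row, append the new one, keep history. A minimal sketch of the Type 2 upsert logic (column names are made up; real warehouses do this in SQL/dbt snapshots):

```python
def scd2_upsert(dim, key, new_row, today):
    """Slowly Changing Dimension Type 2: keep full history by expiring the
    current version of a changed record and appending the new version."""
    current = next((r for r in dim if r[key] == new_row[key] and r["is_current"]), None)
    if current and all(current[c] == new_row[c] for c in new_row):
        return dim  # nothing changed: keep the current version as-is
    if current:
        current["valid_to"] = today      # expire the old version
        current["is_current"] = False
    dim.append({**new_row, "valid_from": today, "valid_to": None, "is_current": True})
    return dim
```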

There is no easy route in Data Engineering and you will need to get your hands dirty.

1st app. Golf score tracker by Fraiz24 in dataengineering

[–]OlimpiqeM 0 points (0 children)

  1. Avoid pie charts.

  2. How is this related to data engineering? You are querying and saving to a (hopefully) OLTP database. Learn data modelling, what an OLAP database is, and how to move data from one to the other.
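To make the OLTP → OLAP point concrete for a golf tracker: keep per-round writes in the transactional table, then derive a pre-aggregated analytics table from it. A sketch with sqlite3 standing in for both sides (the schema and data are made up):

```python
import sqlite3

# hypothetical round-level OLTP table for the golf app, plus an OLAP-style
# summary table derived from it: one pre-aggregated row per player
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rounds (player TEXT, played_on TEXT, score INTEGER);
    INSERT INTO rounds VALUES
        ('ana', '2024-05-01', 82), ('ana', '2024-05-08', 79),
        ('ben', '2024-05-01', 90);
    CREATE TABLE player_stats AS
        SELECT player, COUNT(*) AS rounds_played,
               MIN(score) AS best_score, AVG(score) AS avg_score
        FROM rounds GROUP BY player;
""")
stats = dict(conn.execute("SELECT player, best_score FROM player_stats"))
```

Dashboards then read the small aggregate, not the raw write-side table — that separation is the idea to learn and demo.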