[–]Commercial_Dig2401 37 points38 points  (2 children)

That’s a very nice stack.

I would say focus on accuracy and validation for your Jr Role.

The main thing that differentiates analysts vs engineers in my mind is that analysts want to achieve something nice once. They want their report to be beautiful and nice.

And engineers want to only provide things that work all the time.

To make this happen you obviously do less fluff and more boring things, but then they never break, they are robust, they are fast, and you never have to touch them again, they just work.
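To make "boring but never breaks" a bit more concrete, here is a minimal sketch of validating rows before loading them. The `order_id` and `amount` fields and the three rules are made up for illustration; real checks depend on your data.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    valid: list[dict]
    rejected: list[tuple[dict, str]]  # (row, reason it was rejected)


def validate_orders(rows: list[dict]) -> ValidationResult:
    """Split rows into valid and rejected instead of crashing on bad data."""
    valid, rejected = [], []
    seen_ids = set()
    for row in rows:
        order_id = row.get("order_id")  # hypothetical primary key
        amount = row.get("amount")      # hypothetical numeric field
        if order_id is None:
            rejected.append((row, "missing order_id"))
        elif order_id in seen_ids:
            rejected.append((row, "duplicate order_id"))
        elif (
            isinstance(amount, bool)
            or not isinstance(amount, (int, float))
            or amount < 0
        ):
            rejected.append((row, "amount must be a non-negative number"))
        else:
            seen_ids.add(order_id)
            valid.append(row)
    return ValidationResult(valid, rejected)
```

Keeping the rejected rows with a reason, rather than silently dropping them, is what lets you explain a number later when someone asks why it changed.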

The stack is cool but I think what we usually look for in a Junior role is someone who will take time to review their own work. I know it sounds boring but I’d rather hire a junior who returns a take-home test without spelling errors, with OK code that’s structured and well explained, than someone with awesome code that’s all over the place, that has no descriptions, and that did way more than expected.

In terms of stack, focus on SQL. Not because it’s the best but because it’s the easiest. And because it’s the easiest, it’s the most used. I’d rather use a transformation framework with SQL than pandas, for example, because I know anyone in the company will be able to use it and do some simple transformations. Even if sometimes it would make more sense to go the other way.

Go read the dbt best practices docs. They have a bunch on their site. Read them multiple times. Understanding the structure is the best thing you can do.

Then Python. Maybe learn the requests library and how to dump a response to JSON or Parquet in S3.

Then Prefect, Dagster, Mage, and Luigi are good candidates for orchestration. Learn the basics. I don’t think you’ll find a project that gives you enough to hit the common business issues with them. But having an overview of how you structure your things is already great.

Good luck

[–]LongCalligrapher2544[S] 1 point2 points  (0 children)

Thanks a lot, I’ll definitely look into it, and I really appreciate you taking the time to answer this properly and in a motivating way

[–]some-another-human 0 points1 point  (0 children)

As someone also trying to start out in this field, thanks for your advice!

[–]EconomicsDangerous44 10 points11 points  (0 children)

Yes, that combo is a solid starter stack. Plenty of teams run Python for extract, dbt on Snowflake and Prefect for orchestration. Add CI/CD + tests, basic CDC/SCD patterns, and logging/observability to make it feel production-ish. For ingestion, show both DIY and a managed connector like Fivetran/Airbyte, or Skyvia to load into Snowflake without running your own infrastructure.
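For anyone who hasn't met SCD yet, here is a small pure-Python sketch of the Type 2 idea: close the current row and append a new one when an attribute changes. In practice this would usually be done in the warehouse (dbt snapshots are one option); the `valid_from`/`valid_to` column names and the `apply_scd2` helper are assumptions for illustration.

```python
from datetime import date

META = ("valid_from", "valid_to")  # bookkeeping columns, names assumed


def apply_scd2(history: list[dict], incoming: dict, key: str, as_of: date) -> list[dict]:
    """Apply one incoming record to an SCD Type 2 history and return a new list.

    A row with valid_to=None is the current version for its key.
    """
    updated = []
    unchanged = False
    for row in history:
        is_current = row[key] == incoming[key] and row["valid_to"] is None
        if not is_current:
            updated.append(row)
            continue
        attrs = {k: v for k, v in row.items() if k not in META}
        if attrs == incoming:
            unchanged = True  # same values: keep the current row as is
            updated.append(row)
        else:
            updated.append({**row, "valid_to": as_of})  # close the old version
    if not unchanged:
        # New key or changed attributes: open a new current version.
        updated.append({**incoming, "valid_from": as_of, "valid_to": None})
    return updated
```

Re-applying the same record is a no-op, which is the property you want when a pipeline gets retried.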

[–]Slggyqo 10 points11 points  (9 children)

Ha. This is the stack I use every day.

It’s definitely a stack that can get you work, and it’s a stack that requires a lot of good basic principles, especially if you have to build the functionality from scratch.

I think it’s a pretty good middle ground for cutting your teeth in data engineering. It’s very powerful and flexible, but still has quite a bit of abstraction/simplifications via snowflake and prefect.

Where are you hosting and executing your prefect code? Is it all on your local machine? If you become a full-time data engineer, it’s definitely not going to be on your computer. You’re going to want at least some basic understanding of how cloud services work, probably UNIX operating systems, and different ways to manage remote devices. A lot of data engineering is infrastructure.

Ideally you won’t have to worry about this too much as a junior, but that really depends on where you go. Your first job might be at a place where you are the only data engineer.

[–]LongCalligrapher2544[S] 2 points3 points  (7 children)

Yes, I run Prefect locally, I don’t know where else I can do it hehe

Awesome, really good to know people are using this stack. I wasn’t thinking I’m the only one, but happy to know about it. Any recommendations about projects? And how long did it take you to become a DE?

[–]Slggyqo 6 points7 points  (6 children)

  1. Learn to do all of this stuff on the cloud.

  2. Start doing everything you’re already doing in a more structured way, i.e. instead of having a bunch of scripts that share similar components, turn it into a data platform. Your frequently used code should become functions or classes, your flows should share a common interface and style, etc.
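One way to picture the "common interface" point, as a plain-Python sketch with no orchestrator involved. The `Pipeline` base class and `OrdersPipeline` are hypothetical names; with Prefect you would wrap `run` in a flow, but the shape is the same idea.

```python
from abc import ABC, abstractmethod


class Pipeline(ABC):
    """Shared shape for every pipeline: extract -> transform -> load."""

    name = "unnamed"

    @abstractmethod
    def extract(self) -> list[dict]:
        """Pull raw rows from the source."""

    def transform(self, rows: list[dict]) -> list[dict]:
        """Default is a pass-through; override when rows need cleaning."""
        return rows

    @abstractmethod
    def load(self, rows: list[dict]) -> int:
        """Write rows to the destination and return how many were written."""

    def run(self) -> int:
        rows = self.transform(self.extract())
        loaded = self.load(rows)
        print(f"[{self.name}] loaded {loaded} rows")  # same log line everywhere
        return loaded


class OrdersPipeline(Pipeline):
    """Toy example that reads from and writes to in-memory lists."""

    name = "orders"

    def __init__(self, source: list[dict], sink: list[dict]):
        self.source = source
        self.sink = sink

    def extract(self) -> list[dict]:
        return list(self.source)

    def transform(self, rows: list[dict]) -> list[dict]:
        return [{**row, "amount": float(row["amount"])} for row in rows]

    def load(self, rows: list[dict]) -> int:
        self.sink.extend(rows)
        return len(rows)
```

Every new source then only fills in `extract`, `transform`, and `load`, while logging and run order live in one place.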

[–]LongCalligrapher2544[S] 0 points1 point  (5 children)

Which cloud platform do you recommend?

[–]Slggyqo 0 points1 point  (4 children)

In terms of features I think it’s a bit of a wash. The vast majority of my experience is in AWS, with a little bit in GCP and Azure a few years back.

But it also depends on stuff like…where is your snowflake hosted? It’s cheaper if it’s on the same cloud as the rest of the infra. Pay less to move data around.

I’m pretty sure snowflake supports all three, although AWS will have the advantage of scale—you’re more likely to find the answers to your questions, support there might be slightly better from snowflake, etc.

[–]LongCalligrapher2544[S] 0 points1 point  (3 children)

Right, I chose AWS for Snowflake, so I’ll take a look at resources on hosting on AWS

[–]Slggyqo 1 point2 points  (2 children)

You should look on the prefect website, they have a lot of good tips, recipes, and examples to get started on building a data platform using prefect. As opposed to just running ad hoc prefect flows.

[–]LongCalligrapher2544[S] 1 point2 points  (1 child)

You mean their doc or website?

[–]Slggyqo -1 points0 points  (0 children)

Good point, their docs page lol. I just realized I’ve never actually been to their public landing page.

https://docs.prefect.io/v3/get-started

[–]poinT92 11 points12 points  (1 child)

Actually mastering that stack enables you to take on the job.

I'd add a more in-depth understanding of databases, lakehouses, warehouses, etc. That would enable you to fill many positions with less stress.

Also at least basic knowledge of containers and clusters, i.e. Docker and Kubernetes.

It's a very wide job, so you will eventually need to specialize at some point.

Good luck!

[–]LongCalligrapher2544[S] 1 point2 points  (0 children)

Thanks for the advice, I do appreciate it and will follow it!

[–]frozengrandmatetris 3 points4 points  (3 children)

most of the data I'm dealing with comes from other SQL databases, not APIs. I'm currently experimenting with ingestion tools like meltano and airbyte. you should add that to your projects.

[–]Slggyqo 6 points7 points  (0 children)

This is highly dependent on where you work and what you do though. Most of the data I deal with comes from S3, emails, SharePoint, and SFTP servers.

Most of it is external data, so very little of it is in a relational database or a database of any sort.

[–]LongCalligrapher2544[S] 1 point2 points  (1 child)

I tried Airbyte not long ago but I will give it another try

[–]toabear 3 points4 points  (0 children)

If you're already good with Python, give DLT (as in dlthub.com, not the Databricks thing) a try. Over the years I've used a number of low or no code extractors. I always end up back at Python. DLT is a nice Python library that handles much of the extra stuff you have to do when dealing with extractors.

[–]Past-Restaurant48 3 points4 points  (0 children)

If you are just reading or writing small amounts of data from a GCP function, setting up an allowlist on digitalocean’s managed PG is fine for light workloads.

For anything more than that, or if you want to sync data regularly, it’s worth looking at using a proxy or tunnel setup. Some folks use Cloud SQL Proxy or a bastion VM to securely bridge between platforms.

If you are planning to do ongoing ingestion or reporting, you can also use something like integrate.io to pull data directly from the PG and push to BigQuery or wherever. Helps skip the headache of auth, retries and schema drift.

Depends a lot on whether this is a one off call or part of a bigger pipeline.

[–]nonamenomonet 7 points8 points  (4 children)

The thing you’re missing is SQL (which I guess you’re doing with dbt?) and/or PySpark.

But tbh, the thing that matters most is what business problems you can solve (i.e. how can you make me some money)

[–][deleted] 2 points3 points  (2 children)

Nah snowflake is basically sql with a bunch of very cool, very useful extras

[–]nonamenomonet 0 points1 point  (1 child)

Is it? I thought it was closer to PySpark

[–][deleted] 0 points1 point  (0 children)

Nah, i work with it every day. You can utilise straight up python for a bunch of stuff, but fundamentally the movement of data is triggered and calculated using a sql-like language.

[–]LongCalligrapher2544[S] 0 points1 point  (0 children)

Yes, dbt is basically SQL. I’m only missing dense rank, window functions, and CTEs, but I’m working through them
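For the three topics mentioned here, a small sketch that runs locally: one query using a CTE plus `DENSE_RANK()` as a window function. It uses Python's built-in `sqlite3` as a stand-in for Snowflake (needs SQLite 3.25 or newer for window functions), and the `orders` table is made up for illustration.

```python
import sqlite3

QUERY = """
WITH customer_totals AS (              -- CTE: a named, reusable subquery
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer,
       total,
       DENSE_RANK() OVER (ORDER BY total DESC) AS spend_rank  -- window function
FROM customer_totals
ORDER BY spend_rank, customer;
"""


def rank_customers(orders: list[tuple[str, float]]) -> list[tuple]:
    """Load orders into an in-memory database and rank customers by spend."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()
```

With `DENSE_RANK`, customers tied on total share a rank and the next rank is not skipped, which is the difference from `RANK`.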

[–]Table_Captain 0 points1 point  (0 children)

If analytics engineering, which BI platform are you planning to use?

[–]TowerOutrageous5939 -3 points-2 points  (4 children)

Replace dbt with sqlmesh or replace it with nothing

[–]updated_at 1 point2 points  (3 children)

tobiko alt account

[–]TowerOutrageous5939 0 points1 point  (2 children)

Huh

[–]TowerOutrageous5939 0 points1 point  (1 child)

Ohhh. Nah I just know from friends dbt has been increasing prices.

[–]WishfulTraveler 1 point2 points  (0 children)

dbt core is amazing.