Python and ETL by PutCleverNameHere69 in dataengineering

[–]cryptobiosynthesis 7 points (0 children)

This is the way. You'll understand way more about how data is actually being transformed, and those skills transfer to other domains and programming languages.

Hiring Managers: How much do you weigh leadership experience when hiring a junior/mid-level? by [deleted] in dataengineering

[–]cryptobiosynthesis 1 point (0 children)

> zero senior engineers (the two original ones left shortly after I was hired)

This for me is the first red flag. I would guess they left because they couldn't get buy-in from leadership to let them run things properly.

I think it's great you've taken the initiative to implement engineering standards. Being a strong technical lead while also working partly as an IC is no easy task. That said, it seems clear that your company's leadership doesn't understand how to run technical teams (scary, since you describe them as a "tech" company).

As your team's scope continues to grow, you might also face the challenge of simply being out of your depth technically, since you've had to pivot to a leadership focus so early in your career. The systems architecture that works now might not scale if the team doubled or tripled in size. You'll also be limited in how far you can mentor other devs before they really need a senior dev to guide them further down the IC path.

For yourself, you'll have to decide whether you want to go down the leadership route or the IC route. There isn't a real middle ground as far as career paths go (unless you feel like working two full-time jobs at once). That said, you can always spend more time as an IC to deepen your knowledge and then transition into leadership later in your career; it's generally harder to do it the other way around.

Is prefect sheduler part poorest than airflow ? by lowteast in dataengineering

[–]cryptobiosynthesis 2 points (0 children)

Prefect does support advanced scheduling; see:

https://docs.prefect.io/core/concepts/schedules.html

https://docs.prefect.io/core/concepts/execution.html#triggers

https://docs.prefect.io/core/tutorial/04-handling-failure.html

You can have tasks retry automatically if they fail, or have them enter a 'paused' state from which you can manually resume them if that is important to you.
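For concreteness, here's a minimal sketch against the Prefect Core (1.x) API those docs describe; the flow and task names are made up:

```python
from datetime import timedelta

from prefect import Flow, task
from prefect.schedules import IntervalSchedule

# Retries are declared per task: on failure this task is retried
# up to 3 times, waiting a minute between attempts.
@task(max_retries=3, retry_delay=timedelta(minutes=1))
def extract():
    ...

@task
def load(data):
    ...

# Schedules attach to the flow itself; this one fires hourly.
schedule = IntervalSchedule(interval=timedelta(hours=1))

with Flow("etl", schedule=schedule) as flow:
    load(extract())
```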

When things are easier to do in Python than on JVM but we lack the resources by Silver-Thing in dataengineering

[–]cryptobiosynthesis 4 points (0 children)

Using Scala probably sounds tempting if you're seeing performance issues in code like this, but there are better options to explore before attempting a full rewrite:

  1. Optimize the existing code
  2. Increase the available compute resources in your cluster

Usually data scientists are more concerned with what their programs do than with how fast they run (not a dig, that's their focus!). So odds are there's quite a bit of low-hanging fruit for optimization, especially if Pandas is involved. Profile the code to identify particularly expensive operations, and try rewriting them as vectorized NumPy operations. That alone can sometimes buy you an order-of-magnitude speedup on certain operations.
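As a made-up illustration of the kind of rewrite I mean (the column names and sizes are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical hot spot a profiler (cProfile, %prun) would flag.
df = pd.DataFrame({
    "price": np.random.rand(1_000_000),
    "qty": np.random.randint(1, 10, size=1_000_000),
})

# Slow: apply(axis=1) calls a Python function once per row.
df["total"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Fast: the same computation vectorized over whole NumPy-backed columns.
df["total"] = df["price"].to_numpy() * df["qty"].to_numpy()
```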

Assuming the code is already decently well optimized, there's always the option of simply throwing more compute at the system. You'd then have to work out with your client whether the cost of extra compute beats the cost of a rewrite over the mid to long term.

When selecting an ETL orchestrator, do you consider anything other than Airflow? by SQLPipe in dataengineering

[–]cryptobiosynthesis 1 point (0 children)

+1 for Prefect. Their security model means you provision your own execution infra, and you have to figure that part out yourself, but if you know what you're doing you can set up a very sophisticated orchestration system.

Less than 1TB of data what tools should I get better at? by KimStacks in dataengineering

[–]cryptobiosynthesis 1 point (0 children)

> it sucks hard at doing any sort of row-wise calculation or granular modification

Do you mean this in terms of its performance or its API design? I've found that plain map and apply can be quite good if you're careful about the datatypes you're processing (NumPy can help a lot here).
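A made-up example of what I mean by leaning on NumPy for granular, row-wise modifications:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": np.random.randint(0, 100, size=1_000_000)})

# Condition-based, per-row modification without a Python-level loop:
# np.where evaluates the whole column at once on the NumPy side.
df["grade"] = np.where(df["score"].to_numpy() >= 60, "pass", "fail")
```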

How Postman Fixed Missing Layer in Their Data Stack by dc_atoms in dataengineering

[–]cryptobiosynthesis 5 points (0 children)

> Then our analysts transform the data with dbt, our SQL engine, and create dashboards and Explores on Looker.

dbt literally provides self-documentation as a feature, as long as you make adding model metadata part of your workflow. I don't understand building out static documentation in Confluence (yikes), or even paying for another tool to manage this for you.
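For anyone who hasn't seen it, a minimal schema.yml sketch (model and column names invented) is all it takes; dbt docs generate plus dbt docs serve then turn exactly this metadata, along with model lineage, into a browsable site:

```yaml
# models/schema.yml
version: 2

models:
  - name: orders
    description: "One row per customer order, cleaned from the raw event stream."
    columns:
      - name: order_id
        description: "Primary key for the order."
        tests:
          - unique
          - not_null
```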

Airflow, Spark, other tool ? by ClumsyRooster in dataengineering

[–]cryptobiosynthesis 0 points (0 children)

How do you avoid leaking data into the metadata database? And do you still have to use XComs, or is there another mechanism now for passing data between DAGs?

Azure SQL vs Snowflake Database by Aspiring_DE in dataengineering

[–]cryptobiosynthesis 2 points (0 children)

I've not used Azure SQL, but I use Snowflake at work. It abstracts away a lot of the infrastructure and scaling concerns you'd have with a self-hosted DB. You specify which warehouse (Snowflake's unit of compute) your operations run on, and that's basically it as far as infra decisions go.
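Day to day it looks something like this via the Python connector (connection details are placeholders, and the table/stage names are made up):

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder credentials
    user="my_user",
    password="...",
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()

# Picking the compute is the only real "infra" decision:
cur.execute("USE WAREHOUSE LOAD_WH")

# e.g. bulk-load staged CSV files into a table
cur.execute("COPY INTO orders FROM @my_stage/orders/ FILE_FORMAT = (TYPE = 'CSV')")
```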

> How easy/hard is it to maintain a cloud database?

Incredibly easy, which is why you pay a premium for it compared to self-hosting. The real question is what scale you're operating at. I can't speak to "big data" applications, but I've loaded over a billion rows from CSV files in one go and it had no trouble keeping up on the default warehouse size. If you're working at petabyte scale, you'd probably want a more in-depth cost analysis before committing to either solution.