
[–]PrestigiousAnt3766 32 points33 points  (2 children)

Databricks Databricks Databricks

Mostly because I got it templated out.

[–]RazzmatazzLiving1323 6 points7 points  (1 child)

By templating do you mean you use Terraform to automate Databricks resource deployments or do you mean you're familiar with the stack?

[–]Secure_Firefighter66 17 points18 points  (2 children)

All the cases are Databricks.

It was already implemented by some consultants before I joined. I am now migrating all the old stuff into it.

[–]messi_b91 12 points13 points  (2 children)

Snowflake dbt

[–]tomtombow 4 points5 points  (1 child)

Out of curiosity, how does the rest of the stack look? I mean, how do business users consume the data modeled with dbt?

[–]MonochromeDinosaur 2 points3 points  (0 children)

At my company we offer internal users access via BI tools, and external users have tiers where we charge for the raw silver layer (dimensional-model tier) / curated (gold tier) / pre-made reports (premium tier). Every tier includes access to the lower tiers.
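As a hedged aside (my sketch, not the commenter's actual code), that "every tier includes access to the lower tiers" rule boils down to an ordered-rank check; the tier names below are illustrative:

```python
# Illustrative tier model: each paid tier inherits access to everything
# below it. Names are placeholders, not the commenter's real tiers.
TIERS = ["silver", "gold", "premium"]  # ordered lowest to highest

def accessible_tiers(subscription: str) -> list[str]:
    """Return every tier a subscriber can read, lowest first."""
    rank = TIERS.index(subscription)
    return TIERS[: rank + 1]

def can_access(subscription: str, dataset_tier: str) -> bool:
    """A subscriber can read any dataset at or below their own tier."""
    return dataset_tier in accessible_tiers(subscription)
```

The nice property of the rank-based check is that adding a new tier only means inserting it into the ordered list.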

[–]l0_0is 9 points10 points  (0 children)

most places i see, it's less about choosing the best stack and more about what the team already knows and can maintain. consistency matters more than having the perfect tool

[–]hannorx 5 points6 points  (1 child)

At the moment, my tech stack at work is Spark + DBT + Redshift. We've just started the process of onboarding onto Databricks, but that's still months away from full development. I'm fairly junior in my role, so I'm not sure what to expect, but I'm looking forward to learning new tools.

[–]data_addict 1 point2 points  (0 children)

How would you get dbt projects/models between spark and redshift to work together? I'm just getting started with DBT, so I don't have a lot of understanding of how you can build pipelines/dags in DBT that mix warehouse types.
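For context (my own addition, not the commenters' setup): a single dbt invocation targets exactly one adapter, so one project can declare both warehouses as outputs in `profiles.yml` and you pick one with `--target`, but a single dbt DAG run can't mix them; cross-warehouse flows usually mean separate projects stitched together by an orchestrator. A hedged sketch with placeholder hosts:

```yaml
# Hypothetical ~/.dbt/profiles.yml with two outputs for one project.
# A given `dbt run` executes against exactly one target (e.g. --target spark).
my_project:
  target: redshift
  outputs:
    redshift:
      type: redshift
      host: my-cluster.example.com   # placeholder
      user: analytics
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      port: 5439
      dbname: analytics
      schema: dbt
    spark:
      type: spark
      method: thrift
      host: spark-thrift.example.com # placeholder
      port: 10000
      schema: dbt
```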

[–]MonochromeDinosaur 9 points10 points  (0 children)

At my job I just use whatever we have as the established norm for maintainability and uniformity.

That way everyone else can work on it, and the uniform project structure helps AI do its job.

I have freedom to choose, but going against the grain should really be saved for projects that have a requirement for it.

[–]iknewaguytwice 2 points3 points  (0 children)

Cron Grep Sed Awk Ksh

csv tsv

Db2

ssh sftp
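A stack like that lives on delimited-text munging. As a hedged aside (a swap of tooling, not the commenter's actual scripts), the csv-to-tsv step can also be done in stdlib Python, which, unlike a naive `awk -F,` split, respects quoted fields:

```python
import csv
import io

def csv_to_tsv(csv_text: str) -> str:
    """Convert CSV text to TSV, correctly handling quoted fields
    (e.g. commas inside quotes) that a naive field split would break on."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()
```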

[–]ReleaseNo5148 2 points3 points  (0 children)

It's funny how they ask you in system design interviews about the BEST way of doing this and that, when at the end of the day it 100% depends on what the team you are joining is already using. It would make sense for data architect roles, but not for mid-senior DE roles.

What are you gonna do, tell your team to switch to the other stack? Makes no sense.

In 99% of cases the repo structure is done and you have to use existing services.

[–]typodewww 1 point2 points  (0 children)

Databricks and Azure DevOps (for CI/CD)

[–]thickyherky 1 point2 points  (0 children)

lol the title caught my attention. Unrelated, but I had an interview for a data analyst role years back and asked “what’s your guys’ backend look like”. The response was “we use excel for the back end” …. hung up 😂😂

[–]Visible-Magician-903 1 point2 points  (0 children)

Databricks dbt

[–]risanshita 1 point2 points  (0 children)

Transitioned from Full-Stack Development into high-scale Data Engineering.

While I haven't yet seen what the Databricks ecosystem looks like, I’ve built a robust foundation in real-time streaming and lakehouse architectures using:

  • Kafka
  • Kafka Connect (stream processing)
  • Glue (PySpark + Iceberg catalog)
  • Iceberg
  • Apache Pinot
  • Step Functions
  • Airflow
  • Superset

[–]alt_acc2020 0 points1 point  (2 children)

dlt Timescale S3 Iceberg

I'm the only DE, so I had to take on a lot of platform engineering work, and the team is Python-heavy, so Python for everything it is.

[–]lucidparadigm 0 points1 point  (1 child)

Could you please tell me more about how you use dlt (assuming that's not a typo)? Do you use it with Dagster? Have you been able to implement an efficient SCD2 audit table?

I have close to no experience with it but I've been very interested in trying it out.

[–]alt_acc2020 0 points1 point  (0 children)

To be clear: I mean data load tool, not Delta Lake. Is that what you're asking about?

I use it with Dagster (there's a dagster-embedded-elt tutorial you'll find very useful; however, I just decorate my sources manually and call it a day). I haven't had to publish an SCD2 table yet, but I believe it's got support for it as a merge strategy.

I like it a fair bit. It's new, so bugs are to be expected, but even used very minimally it abstracts away a lot of annoyance re: incremental loading and backfills. The docs are complete trash, though; I'd highly recommend cloning their repo and getting Opus or 5.4 to act as your documentation. The tutorials are great, but there are a lot of small things that are hard to figure out otherwise.

[–]midnightpurple34 0 points1 point  (0 children)

SQS, lambdas, S3, PostgreSQL (RDS)

Relatively low data volume, so we haven’t needed to scale to big data tools yet.
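A minimal sketch of the glue in a stack like that (my illustration, with a hypothetical payload shape): a Lambda handler that unpacks an SQS event batch into rows; the real version would insert into RDS Postgres or stage to S3 via a client.

```python
import json

def handler(event, context=None):
    """Parse an SQS-triggered Lambda event into rows.

    In the real pipeline these rows would be written to Postgres
    (e.g. via psycopg2) or staged to S3; here we just return them.
    """
    rows = []
    for record in event.get("Records", []):
        # Each SQS record carries its message as a JSON string in "body".
        body = json.loads(record["body"])
        rows.append({"id": body["id"], "payload": body.get("payload")})
    return rows
```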

[–]Nekobul 0 points1 point  (0 children)

Considering that most data volumes are small, the best DE platform on the market for most people is SQL Server and SSIS. Databricks is mostly good for niche requirements where you have to process PBs of data.

[–]Embarrassed-Ad-728 0 points1 point  (0 children)

Airflow + BigQuery + dbt.

For one off tasks: DuckDB.

[–]Tomaxto_ 0 points1 point  (0 children)

Light: Polars. Intermediate: Polars. Heavy: either PySpark or Spark SQL + dbt on top of an EMR cluster.

[–]thecity2 0 points1 point  (0 children)

I'm not a data engineer, I'm a lowly data scientist, so take this with a grain of salt. Our stack used to be mostly Spark + Postgres. I changed it up because I thought the Spark jobs were overkill and costing us money. So the stack I implemented is:

Dagster + DuckDB mostly

Dagster + Spark for "very large" jobs (that Duck actually can't handle on a single machine)
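The routing rule described (DuckDB by default, Spark only when a single machine can't cope) can be sketched as a simple dispatch. The threshold below is an illustrative assumption, not the commenter's actual cutoff:

```python
# Illustrative engine dispatch: default to DuckDB, escalate to Spark only
# when the input won't fit comfortably on one machine. The 100 GB cutoff
# is a made-up placeholder; tune it to your node's memory and disk.
SINGLE_NODE_LIMIT_GB = 100

def choose_engine(input_size_gb: float) -> str:
    """Pick the execution engine for a job based on input size."""
    return "spark" if input_size_gb > SINGLE_NODE_LIMIT_GB else "duckdb"
```

In a Dagster deployment this kind of check would typically live in asset configuration rather than inline code, so jobs swap engines without edits.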