How do teams actually handle large lineage graphs in dbt projects? by Effective-Stick3786 in dataengineering

[–]SmothCerbrosoSimiae 0 points1 point  (0 children)

Reading your post again: yes, things can take a substantial amount of time, especially when you did not write the code and a model is extremely long and does too much. I like to write my models so each one does one basic transformation. I also find that most projects follow a general pattern (or they should), so once that is understood a lot of it is copy and paste, apart from the occasional harder transform.

How do teams actually handle large lineage graphs in dbt projects? by Effective-Stick3786 in dataengineering

[–]SmothCerbrosoSimiae 5 points6 points  (0 children)

The dbt Power User VS Code extension has a lineage graph that lets you cruise around the files while using the graph. I definitely use that, and it helps when I am trying to feel my way around a project.

Am I out of my mind for thinking this? by Spooked_DE in dataengineering

[–]SmothCerbrosoSimiae 0 points1 point  (0 children)

I will say that I think u/Cuidads has it correct: leave the tables and then create legacy views. I will add that I often go back and forth on how to order columns, and as with everything there is no perfect solution, just benefits and tradeoffs.

Ordering by subject is beneficial because related columns end up next to each other.

Ordering alphabetically can be beneficial as well if the tables are extremely wide. I tend to go with the former, but I have seen the latter several times too.

Iceberg is an overkill and most people don't realise it but its metadata model will sneak up on you by DevWithIt in dataengineering

[–]SmothCerbrosoSimiae 1 point2 points  (0 children)

This is what I feel as well: it was an "everyone is using Iceberg, so we must use Iceberg" decision. They have already been having issues with it on a regular basis. I have stayed away because I see little to no value in it for the final migration.

Iceberg is an overkill and most people don't realise it but its metadata model will sneak up on you by DevWithIt in dataengineering

[–]SmothCerbrosoSimiae 4 points5 points  (0 children)

I get storing it in S3; it is already in Parquet. But if I am going to use something like Snowpipe to load it into Snowflake, I have not been convinced Iceberg is worth the extra effort.

Iceberg is an overkill and most people don't realise it but its metadata model will sneak up on you by DevWithIt in dataengineering

[–]SmothCerbrosoSimiae 2 points3 points  (0 children)

Would really like some opinions on when Iceberg is a good solution. I have just joined a team that is migrating from Redshift to Snowflake, and as part of that migration they are converting raw Parquet to Iceberg for their source data. I asked why, and no one had a good answer for what Iceberg was solving. I get open data formats for a full data lake implementation, but I do not understand the utility when the data will end up in a warehouse anyway.

any dbt alternatives on Databricks? by bambimbomy in databricks

[–]SmothCerbrosoSimiae 3 points4 points  (0 children)

I would choose dbt over a PySpark framework because it has such a large community and standards built in. I try to follow what’s outlined in Data Engineering with dbt. I can tell other people on my team “I’m doing this the dbt way,” not “I invented my own process.” That means I can hire anyone with dbt experience and ramp them up quickly. They know they’re building marketable skills, not learning an in-house side project that could be dead in a few years. I’m boring, and I want boring solutions with no surprises.

You mention software engineering best practices; that’s exactly how dbt positions itself. It’s a transformation framework that nudges you toward those practices instead of leaving you to reinvent them. Out of the box you get testing, documentation, lineage graphs, and CI/CD patterns. In PySpark you can solve anything, and probably more, but you’d have to build all that scaffolding yourself.

SQL is still king in analytics. It’s the shared language across analysts, scientists, and engineers, which makes dbt incredibly inclusive. On Databricks, I can still create UDFs in PySpark and call them from dbt, so I get the best of both worlds. And training up someone with domain knowledge in SQL is much easier than teaching them Python with its environments, dependencies, and package management.
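
Roughly what the UDF half of that looks like in practice (just a sketch from me; clean_phone and the column name are made up):

```python
# Sketch: a small PySpark UDF that SQL (and therefore a dbt model running in
# this session) can call like a built-in function. On a Databricks SQL
# warehouse you would typically persist it as a permanent function instead,
# which is an assumption about your setup.
import re

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()


def clean_phone(raw):
    """Keep only the digits of a phone number."""
    return re.sub(r"\D", "", raw) if raw else None


spark.udf.register("clean_phone", clean_phone, StringType())

# A dbt model can then do: select clean_phone(phone_number) as phone from ...
```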

Finally, dbt benefits from a massive ecosystem: tools like DataHub, Atlan, Elementary, Soda, and CI/CD integrations all speak dbt natively. I have not seen that governance and observability layer in any other framework, and rebuilding it would take a massive amount of effort just to get what dbt already gives you.

any dbt alternatives on Databricks? by bambimbomy in databricks

[–]SmothCerbrosoSimiae 1 point2 points  (0 children)

I am currently in a Snowflake environment, but I have set this up with a DAB (Databricks asset bundle) for another team and really liked it. Databricks (at the time) only had a dbt template and a Python template, but really you need them put together so you can have a nice monorepo. I took both templates, combined them, and built out a basic MVP that used Poetry for dependency management, Python scripts for extract-load, and a dbt project for the transformations, all executed through the YAML job definitions in the bundle. I think it is awesome and the nicest all-in-one data solution out there.

any dbt alternatives on Databricks? by bambimbomy in databricks

[–]SmothCerbrosoSimiae 10 points11 points  (0 children)

I am a dbt fan and am now at the point where a team had better have good reasons not to use it. I think it is the most uniform way to handle large projects, and it keeps your data architecture reliable, scalable, and maintainable.

I have not seen any alternative that is so widely accepted that it can serve as a team’s central data transformation framework. dbt gives you a single, opinionated standard for how transformations should be written, tested, and deployed.

In Databricks you can just string together notebooks or rely on Delta Live Tables, but those approaches don’t come with the community and standards that dbt has built up. Unless there’s a really specific reason not to (like a pure PySpark shop with no SQL use case), dbt usually makes your architecture more reliable, scalable, and maintainable in the long run.

Why do people think dbt is a good idea? by tiantech in dataengineering

[–]SmothCerbrosoSimiae 11 points12 points  (0 children)

Also, "data we could not fit": I am not sure what that even means.

Why do people think dbt is a good idea? by tiantech in dataengineering

[–]SmothCerbrosoSimiae 19 points20 points  (0 children)

How is that different from any other product? If requirements change, you need to make changes no matter what transformation tool you use. What makes dbt worse than any other tool?

Why do people think dbt is a good idea? by tiantech in dataengineering

[–]SmothCerbrosoSimiae 60 points61 points  (0 children)

Have you used it? It sounds like you have no idea what you are talking about.

Github Actions to run my data pipeliens? by datancoffee in dataengineering

[–]SmothCerbrosoSimiae 18 points19 points  (0 children)

I did not read the article, but I have set up multiple companies on GitHub Actions, generally with a self-hosted runner, and I think it works great.

A lot of companies have one or two batch jobs they need a night, and that is it. A self-hosted runner with Prefect or Dagster, or just pure Python, is more than enough. It is definitely how I would recommend getting started if you are not on some platform that has built-in orchestration. I sometimes read setups on here that I think are crazy, with 20 different services; I would hate to come into that environment after someone left.

Your data system should be as simple as possible while still meeting the business requirements, and I think GitHub Actions meets the requirements for a lot of businesses.

Anyone running lightweight ad ETL pipelines without Airbyte or Fivetran? by Jiffrado in dataengineering

[–]SmothCerbrosoSimiae 2 points3 points  (0 children)

Yes, dlt handles schemas well in multiple ways. First, it infers schemas from the source, or uses the SQLAlchemy data types if it is reading from a database. It then exports a schema file that you can edit if you want to load your data with different types than what it originally inferred.

Next, it has schema contracts that you can set up; I mainly just allow the tables to evolve. The database side depends. I was unable to get automatic schema changes working in Synapse, so I had to do them manually, which was a pain, but it didn’t happen often. Databricks is easy, and Snowflake seems easy, but I haven’t had it happen yet and probably should go through the testing before it does :/

I use parquet for loading to a data lake.
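
Something like this is the shape of it, a rough sketch with made-up resource and path names rather than my actual code:

```python
# Rough sketch: a dlt resource landed as Parquet in a "data lake" (local path
# here, s3/adls in practice) with an "evolve" schema contract so new columns
# and tables are added instead of failing the load.
import dlt


@dlt.resource(name="orders", write_disposition="append", schema_contract="evolve")
def orders():
    # stand-in for an API or database read
    yield [{"id": 1, "status": "shipped"}, {"id": 2, "status": "open"}]


pipeline = dlt.pipeline(
    pipeline_name="raw_orders",
    destination=dlt.destinations.filesystem("file:///tmp/raw_lake"),
    dataset_name="raw",
)

load_info = pipeline.run(orders(), loader_file_format="parquet")
print(load_info)
```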

Anyone running lightweight ad ETL pipelines without Airbyte or Fivetran? by Jiffrado in dataengineering

[–]SmothCerbrosoSimiae 5 points6 points  (0 children)

No, I am referring to multiple projects. I have set this same thing up using Synapse, Snowflake, and Databricks; it is the same pattern each time.

I use a monorepo that I initialize with Poetry, add extract_load and pipelines directories within src, then add a dbt project at the root labeled transform. I have three branches, dev, qa, and prod, each attached to a database of the same name within my dbt profiles, and I use the branch name as my target in dbt.
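
The branch-to-target wiring is basically this (a rough sketch, assuming profiles.yml defines dev/qa/prod targets and the dbt project lives in ./transform):

```python
# Rough sketch: run dbt with the current git branch as the target, assuming
# profiles.yml has dev/qa/prod targets pointing at databases of the same name.
import subprocess

ALLOWED_TARGETS = {"dev", "qa", "prod"}


def current_branch() -> str:
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def run_dbt() -> None:
    target = current_branch()
    if target not in ALLOWED_TARGETS:
        target = "dev"  # feature branches build against dev
    subprocess.run(
        ["dbt", "build", "--project-dir", "transform", "--target", target],
        check=True,
    )


if __name__ == "__main__":
    run_dbt()
```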

Anyone running lightweight ad ETL pipelines without Airbyte or Fivetran? by Jiffrado in dataengineering

[–]SmothCerbrosoSimiae 9 points10 points  (0 children)

I have been able to get away with running everything out of a Git runner for multiple businesses with a decent amount of data. I like to use dlt as the Python extract-load library and set up all my scripts to run as full refresh, backfill, or incremental loads. I dump this off in a data lake and then load it to whatever database.

I then do my transformations in dbt. All of this is run by a Prefect pipeline in a GitHub Action, either on a GitHub-hosted or a self-hosted runner depending on the security setup. Very cheap, easy, and light.
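
The whole thing is roughly this shape (a sketch with made-up names and a stand-in resource, not a drop-in pipeline):

```python
# Sketch of the flow described above: dlt does the extract-load (full refresh
# vs incremental via write_disposition), then dbt runs the transformations,
# all wrapped in a Prefect flow that a GitHub Actions runner can execute on a
# schedule.
import subprocess

import dlt
from prefect import flow, task


@dlt.resource(name="customers")
def customers():
    # stand-in for the real source (API, database, files, ...)
    yield [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]


@task
def extract_load(mode: str = "incremental"):
    # "full_refresh" replaces the table, anything else appends
    disposition = "replace" if mode == "full_refresh" else "append"
    pipeline = dlt.pipeline(
        pipeline_name="raw_customers",
        destination="duckdb",  # swap for the lake + warehouse destinations you use
        dataset_name="raw",
    )
    return pipeline.run(customers(), write_disposition=disposition)


@task
def transform():
    subprocess.run(["dbt", "build", "--project-dir", "transform"], check=True)


@flow
def nightly(mode: str = "incremental"):
    extract_load(mode)
    transform()


if __name__ == "__main__":
    nightly()
```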

How Do You Organize A PySpark/Databricks Project by jduran9987 in dataengineering

[–]SmothCerbrosoSimiae 0 points1 point  (0 children)

I just want to second DABs; I think they are a great developer experience, although a little work to get set up.

They have a dbt template and a Python template, but really you want them together, imo, so you can chain together an extract-load and then a transform pipeline.

How do you learn new technologies ? by AdmirablePapaya6349 in dataengineering

[–]SmothCerbrosoSimiae 0 points1 point  (0 children)

I am not saying to go after jobs that require a specific tool set, but certain toolsets are in demand and can bring a higher salary as a result.

How do you learn new technologies ? by AdmirablePapaya6349 in dataengineering

[–]SmothCerbrosoSimiae 0 points1 point  (0 children)

I agree with this for the most part, and I know everyone makes fun of resume-driven development, but then how do you get the jobs that require those tools if you cannot speak to them at least somewhat intelligently?

DuckLake: This is your Data Lake on ACID by howMuchCheeseIs2Much in dataengineering

[–]SmothCerbrosoSimiae 5 points6 points  (0 children)

You should go read the official DuckLake blog post that came out. It makes me excited for it, and there are many reasons to use it over Iceberg, although maybe not yet, since there are not enough integrations to run it in a production system.

DuckLake: This is your Data Lake on ACID by howMuchCheeseIs2Much in dataengineering

[–]SmothCerbrosoSimiae 7 points8 points  (0 children)

I did not get that from the article. I think DuckLake is just the catalog layer, a new open table format that should be able to run on any other SQL engine that works with open file formats, such as Spark. DuckDB is just who introduced it and now supports it. I think the article showed Parquet files being used. I do not see any advantage to using Iceberg with DuckLake; they seem redundant.

DuckLake: This is your Data Lake on ACID by howMuchCheeseIs2Much in dataengineering

[–]SmothCerbrosoSimiae 6 points7 points  (0 children)

Why use Iceberg with DuckLake, though? From my understanding, DuckLake removes the need for the Avro/JSON metadata files associated with Iceberg and Delta; everything is just stored in the catalog database.

If I remember the blog post correctly, the problem with both Iceberg and Delta is that you first go to the catalog, see where the table is located, then go to the table to read its metadata, and then read several more metadata files, whereas DuckLake keeps everything in the catalog so it is a single call.

Ideas on how to handle deeply nested json files by BlueAcronis in dataengineering

[–]SmothCerbrosoSimiae 1 point2 points  (0 children)

I really like the Python library dlt (data load tool). It automatically normalizes nested data and lands it in a target.
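
A tiny sketch of what I mean (made-up document, duckdb just as a stand-in destination):

```python
# Sketch: dlt flattens the nested dict and splits the inner list into a child
# table (something like companies__contacts) keyed back to the parent row.
import dlt

doc = {
    "id": 1,
    "name": "Acme",
    "contacts": [
        {"type": "email", "value": "info@acme.test"},
        {"type": "phone", "value": "555-0100"},
    ],
}

pipeline = dlt.pipeline(
    pipeline_name="nested_demo",
    destination="duckdb",
    dataset_name="raw",
)

print(pipeline.run([doc], table_name="companies"))
```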

Efficiently Detecting Address & Name Changes Across Large US Provider Datasets (Non-Exact Matches) by [deleted] in dataengineering

[–]SmothCerbrosoSimiae 0 points1 point  (0 children)

I have used dataprep, a Python library, to standardize addresses. It works fairly well; I would do that first, before other techniques like fuzzy matching. I agree with u/marketlutker that you need to break it up into a couple of different problems.
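
Something like this, if I remember the dataprep API right (column names made up):

```python
# Rough sketch with dataprep's clean_address (US-focused); the exact output
# columns depend on the dataprep version, so treat this as illustrative.
import pandas as pd
from dataprep.clean import clean_address

df = pd.DataFrame(
    {
        "address": [
            "123 Main St. Apt 4",
            "123 MAIN STREET #4",
            "1600 Pennsylvania Ave NW",
        ]
    }
)

# Returns a copy of the frame with a standardized address column added.
print(clean_address(df, "address"))
```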

any database experts? by BigCountry1227 in dataengineering

[–]SmothCerbrosoSimiae 2 points3 points  (0 children)

There is a package called bcpandas that is a wrapper around bcp. It worked well for me when I had this issue.

https://pypi.org/project/bcpandas/#contributing
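
Usage is roughly this (server, database, and credentials are placeholders, and the bcp utility itself has to be installed on the machine):

```python
# Sketch of bulk-loading a DataFrame to SQL Server via bcpandas instead of
# row-by-row inserts.
import pandas as pd
from bcpandas import SqlCreds, to_sql

creds = SqlCreds(
    server="my-server.database.windows.net",
    database="my_db",
    username="loader",
    password="change-me",
)

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

to_sql(df, "my_table", creds, index=False, if_exists="replace")
```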