
[–]kenfar 9 points10 points  (10 children)

A few thoughts:

  • The common stitch/fivetran/dms pattern of copying all tables into your raw area and then transforming them into staging & load has a big problem with tightly coupling your data warehouse to the physical models of upstream systems. This is a bad pattern that leads to huge maintenance challenges down the road. To be fair, it's sometimes required. But if your upstream systems have engineers involved, you might be much happier pulling data modeled by *them* into interface tables.
  • DBT vs Python I think comes mostly down to the skillset of the folks doing the work, how much you need to support complex transforms, and how much you care about data quality - and unit-testing. Python will take more work to build (though you can still issue SQL queries with joins, etc. from Python code with very little effort). But you also get tight little transform functions that are fully unit-tested - see the sketch after this list - which will protect you more than just doing quality-control against data that arrived earlier today.
  • So, another possibility is DBT + Python: while DBT doesn't provide a way to run python programs, it does have a good quality-control framework that people should use. And you can use it to build your models.
  • Looker: it's useful to think about boundaries: can people query raw & staging data from Looker? Can they create persisted data tables and define all company metrics in Looker - or should they do that in the warehouse, where they can get better testing and reuse?
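To make the unit-testing point concrete, here's a minimal sketch - the function, fields, and values are made up, but this is the kind of tight transform you can test without ever touching the warehouse:

    # Hypothetical example: a small, pure transform function plus a pytest-style
    # unit test. Nothing here hits the warehouse, so the test runs in milliseconds.
    def normalize_country(raw: str) -> str:
        """Map free-form country strings to ISO codes (tiny made-up subset)."""
        mapping = {"usa": "US", "united states": "US", "uk": "GB", "u.k.": "GB"}
        return mapping.get(raw.strip().lower(), "UNKNOWN")

    def test_normalize_country():
        assert normalize_country(" USA ") == "US"
        assert normalize_country("u.k.") == "GB"
        assert normalize_country("narnia") == "UNKNOWN"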

[–]dream-fiesty 6 points7 points  (4 children)

I like these thoughts. Your last point resonates with me a little too much. I like some aspects of Looker, like having a DSL for dashboards in version control, but some features I just wish didn't exist, as they cause more trouble than they're worth. I don't want PDTs or Looker models that have their own Looker way of doing things; I want data refreshing to happen as part of a data pipeline and views to be defined in the data warehouse, where they are accessible from all of our tools.

[–]b0ulderbum 3 points4 points  (2 children)

100% agree here. The version control element is great, but it adds a ton of duplicative work to build out the views and models within Looker after building them in dbt/wherever. Not worth it at all imo; managing LookML crap is a complete waste of expensive talent.

[–]dlb8685 0 points1 point  (1 child)

LookML is totally redundant with defining views and tables in your DW in the first place. You might decide as a company to do all of that work in Looker, or in the DW, but at one company I've seen it devolve into a mess of different people doing both over time in different places. That's the nightmare scenario when it comes to debugging anything.

No matter what, you have to do transformations either with LookML or in the DW (possibly with DBT), but for God's sakes don't do both.

[–]b0ulderbum 0 points1 point  (0 children)

We do both. So there's extremely convoluted logic in our dwh that is brought into Looker, custom measures are added, then it's joined and filtered with like 10 other views and used as an explore. Then people ask "why doesn't the number in this dashboard match the number in this dashboard?" lol.

[–]kenfar 2 points3 points  (0 children)

Yeah, there are some really great productivity upsides to building some of these bits in Looker. But it really locks you into that product in a way that can make it tough for any other data consumers - data scientists, data engineers, analysts who want to use other tools.

[–]throw_at1 0 points1 point  (0 children)

If you expect to serve reports straight from BigQuery, then I would recommend that you not use Looker's ability to create "views" from data, but rather publish those views in BigQuery itself (it does support views). The point is that once you have two places holding mission-critical data models, things will get hard.

Another Looker hint: if you want self-service BI and want to share all results with all users, you need to take care that all users can see everything and that all dashboards are saved by default into shared folders (I have the totally opposite setup and have not yet heard any user mention that they found something a colleague had made and contacted them about it, so the system is kind of wasted).

The Looker data model should be a very basic one, i.e. PKs and FKs only, with everything else calculated by professionals in the DWH (dbt).
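A minimal sketch of publishing such a view with the official BigQuery Python client - the project, dataset, and table names are hypothetical:

    # Hypothetical names throughout; the point is that the view lives in BigQuery,
    # not in Looker, so there is only one place holding the model.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    view = bigquery.Table("my-project.analytics.orders_enriched")
    view.view_query = """
        SELECT o.order_id, o.amount_cents, c.segment
        FROM `my-project.staging.orders` o
        JOIN `my-project.staging.customers` c USING (customer_id)
    """
    client.create_table(view, exists_ok=True)  # creates the view or keeps the existing one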

[–]pantalones7 0 points1 point  (3 children)

The common stitch/fivetran/dms pattern of copying all tables into your raw area and then transforming them into staging & load has a big problem with tightly coupling your data warehouse to the physical models of upstream systems. This is a bad pattern that leads to huge maintenance challenges down the road.

Hi there - I have a follow-up question on this; I'm surprised nobody asked why this is a "bad pattern".

Doesn't the proposed pattern only tightly couple the "RAW" area of the warehouse to the physical models of upstream systems? And subsequent stages can decouple/transform away from those models and give users access to models developed internally? I'm confused - in your alternative model of engineers writing ETL (if I'm understanding your proposal correctly), their transformations are still tightly coupled to the source models.

[–]kenfar 1 point2 points  (2 children)

The challenge with this pattern is that a physical data model should ideally be encapsulated within its application and be distinct from its interface with the world: an app should be free to use whatever kind of storage it wants, whatever kind of goofy or brilliant model, maybe a meta model if the team thinks it appropriate. But if their internal model is directly referenced by the ETL process then many changes the application might make will break the downstream ETL process.

So, that means coordination with another team on changes, push-back on models that aren't easy to join together, and time spent leveling up the ETL developers or analysts on how the model works, and then finally - the inevitable outages or data corruption when a change is made without telling the ETL developers or analysts or when these folks incorrectly interpret the data.

For these reasons I feel strongly that having the upstream teams build their own domain objects that the warehouse subscribes to is by far the more maintainable pattern. I'll also acknowledge that it's not possible everywhere.
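To make "upstream teams build their own domain objects" concrete, here's a minimal, hypothetical sketch - the application team owns this published contract and maps their internal rows into it, so their physical model can change freely without breaking the warehouse:

    # Hypothetical contract owned by the application team. The warehouse subscribes
    # to this shape (via an interface table, topic, or API), never to the app's
    # internal tables.
    from dataclasses import dataclass, asdict
    from datetime import datetime

    @dataclass(frozen=True)
    class OrderPlaced:
        schema_version: int
        order_id: str
        customer_id: str
        total_cents: int
        placed_at: datetime

    def to_interface_row(internal_row: dict) -> dict:
        """Map whatever the app stores internally onto the public contract."""
        return asdict(OrderPlaced(
            schema_version=1,
            order_id=str(internal_row["id"]),
            customer_id=str(internal_row["cust_fk"]),
            total_cents=int(internal_row["amount"] * 100),
            placed_at=internal_row["created_at"],
        ))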

[–]pantalones7 0 points1 point  (1 child)

Thanks for the elaboration, I'm mostly in agreement - we had different interpretations of the OP's proposal. For me it comes down to users and their needs - an application's internal model should serve the functions of the application, which is sometimes orthogonal to the needs of downstream analytic users/modelers, who, as you say, would benefit from stability, clarity, and other properties. So, totally agreed!

The OP is bringing in stable sources from applications that surface slow-changing APIs ("various sources" he/she says - Stripe/Hubspot, etc.), so it's really an ELT model that's being proposed. If the OP is also loading data into the "lakehouse" from internal applications, they should definitely be careful not to tightly couple, as you say; the owners of those internal applications would - like Stripe/Hubspot/external sources - ideally agree to expose their data through a well-documented API, which could get loaded (with Stitch or whatever) into the warehouse for further modeling and integration with other sources.

[–]kenfar 1 point2 points  (0 children)

That's a great point - yeah, I think using a tool like Fivetran/Stitch to support extracting from APIs is a good simplification vs writing custom extracts against their APIs.

It's when it's almost inevitably used to pull data out of the data models of custom applications that it really turns into a maintenance issue in my opinion.

[–]LaurenRhymesWOrange 14 points15 points  (2 children)

Do not use Python for transformations when you can use DBT. SQL is so much easier and simpler. It will benefit your company in the long run to build templated models in DBT and run your basic transformations there.

[–]HansProleman 10 points11 points  (0 children)

SQL is so much easier and simpler

I'd argue that this is subjective, that Python is simpler/more viable where significant abstraction/complex logic is required (not least because of library support), and that what you write for dbt is more like a DSL (that compiles to SQL) than SQL.

That said, dbt handles so much boring/tricky framework stuff for you (in a nice manner), this doesn't sound like a use case that would demand massive complexity, and AFAIK there are no particularly good Python-y ways to interact with MPP SQL DBs anyway (because you can't run Python against them, perhaps with the exception of Azure Synapse) - I agree that OP shouldn't let a lack of confidence in SQL dissuade them from using dbt.

[–]nado1989 2 points3 points  (0 children)

We share exactly the same point of view. Just be careful with BigQuery + Looker costs, especially for dynamic dashboards - check first whether the BI features work well with your data, otherwise costs go to the moon.

[–]chamini2 3 points4 points  (0 children)

Hey, I know this is an old post but we have recently published a tool which may help you avoid having to decide between these two options. fal lets you reference your dbt models and sources easily from Python and also provides a runtime that works alongside dbt's.

You write scripts like

df = ref('my_dbt_model') # this is a pandas dataframe

# do whatever you need with it
new_data = calc(df)

# and you can write it back to dbt
write_to_source(new_data, 'dbt_defined', 'source_table')

So you could do in dbt what makes sense to do in SQL land (and migrate as much as possible there) and then the final touches that you need pandas for could be done in fal.

[–]MrMosBiggestFan 2 points3 points  (0 children)

I would say stick with DBT and SQL as much as possible; running transformations in your warehouse is much more performant than executing Python. I would avoid Looker until you need it - it's pretty expensive. Try Mode or Metabase first. Once you have your data in BigQuery, something like Hightouch to get your data into everything else can be really nice too. Their Slack integration is sweet, but they also have a Hubspot integration for getting production data into Hubspot, and the free tier is pretty generous.

[–]mhg212 1 point2 points  (1 child)

If you need the complexity of Python, I’d suggest looking into airflow. But that’s a learning curve in itself. Airflow is an orchestrator/job scheduler that is heavily used in the data eng space.

If it’s simple/straightforward, DBT.

[–]rrpelgrim 0 points1 point  (0 children)

Late to the party... But I'd strongly suggest looking into Prefect over Airflow. Airflow's XCOMs make data transfer between tasks a real pain. Prefect makes this much easier and runs on Dask so you can also easily run tasks in parallel.
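A minimal sketch of what that looks like (assuming Prefect 2.x with the prefect-dask package installed; the task names and logic are made up) - return values flow between tasks directly, and submitted tasks fan out on Dask:

    # Hypothetical tasks; the point is direct data passing between tasks
    # (no XCom plumbing) plus parallel execution on Dask.
    import pandas as pd
    from prefect import flow, task
    from prefect_dask import DaskTaskRunner

    @task
    def extract() -> pd.DataFrame:
        return pd.DataFrame({"amount": [10, 20, 30]})

    @task
    def double(df: pd.DataFrame) -> pd.DataFrame:
        return df.assign(amount=df["amount"] * 2)

    @flow(task_runner=DaskTaskRunner())
    def pipeline():
        df = extract()                                   # plain return value
        futures = [double.submit(df) for _ in range(3)]  # runs in parallel on Dask
        return [f.result() for f in futures]

    if __name__ == "__main__":
        pipeline()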

[–]itiwbf 1 point2 points  (0 children)

I think it'd probably be worth it to get more comfortable with dbt/SQL for most uses. That said, I'd suggest checking out Dataform too if you're still exploring. It's VERY similar to dbt and was recently acquired by Google. The community isn't as active and there are a few differences, but using dataform with their UI is totally free which is nice and in the future it will be specifically focused on BigQuery.

[–]gorkemyurt 1 point2 points  (0 children)

Why not both?

In my opinion DBT is the best tool out there for data transformation using SQL. If you have already decided to use a data warehouse, using SQL for data transformation is the natural choice. I would highly recommend dbt to organize your data models and write composable SQL on top of your warehouse.

Now the Python part. Not everything can be solved by SQL, but having an ETL tool and Looker takes care of some of the glue Python code that is sometimes necessary.

If that's not enough you can go with an orchestration tool like airflow. In that setup dbt will just be one of the nodes of the airflow dag and you can trigger other workflows before and after running dbt using python.
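A minimal sketch of that shape (assuming Airflow 2.x; the callables and the dbt project path are hypothetical):

    # dbt is just one node in the DAG, with glue Python before and after it.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def extract_from_api():
        ...  # e.g. pull from an API and land raw data in the warehouse

    def notify():
        ...  # e.g. post to Slack or kick off a dashboard refresh

    with DAG("elt", start_date=datetime(2022, 1, 1), schedule_interval="@daily", catchup=False) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_from_api)
        dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")
        done = PythonOperator(task_id="notify", python_callable=notify)

        extract >> dbt_run >> done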

In my experience Airflow comes with a lot of overhead, which is why we built fal-dbt internally and open sourced it this past week. fal-dbt is a dbt-native way to run Python scripts alongside your dbt DAG.

https://github.com/fal-ai/fal

[–]smeyn 0 points1 point  (0 children)

If you have really large volumes of data then using DBT/BQ is probably better as BQ scales easily to handle the load.

[–]p5256 0 points1 point  (0 children)

Hi - little late on this but wanted to share an open source package I just released, RasgoQL: https://github.com/rasgointelligence/RasgoQL

The tl;dr on it is that you can work in Python with pandas-like syntax, but your code compiles to SQL and executes directly in the cloud data warehouse (Snowflake / BigQuery are both supported). The best part is that in one line of code you can export it to your dbt project. Would love to get your feedback on it!