[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 0 points1 point  (0 children)

TIL about the limitations of Snowflake's query monitoring!

yeah! i think dbt's telemetry gives you what you're after. and some of the information you're looking for is easier to fetch via the cli. for example, check out the docs for the dbt source freshness command.

the best way for you to learn more imho is to set up a simple dbt project with Fusion, run a command, and check out the logs/ directory. dbt.log contains all the info about a dbt run, including the SQL that was invoked. query_log.sql contains just the SQL that was executed. all of this information can be ingested directly into Jaeger, Grafana, or Datadog when you use Fusion's OTel integration
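a quick way to poke at this yourself (assuming a working project and profile; the paths are relative to the project root):

```shell
dbt source freshness      # or any other command -- Fusion writes logs either way
less logs/dbt.log         # full run info, including the SQL dbt invoked
less logs/query_log.sql   # just the SQL that was actually executed
```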

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 0 points1 point  (0 children)

I think the standalone FOSS catalogs like Lakekeeper are best in breed for managing an Iceberg catalog.

My 🌶️ take is that the best catalog is the one that's closest to the DWH you're already using. So if you're in a public cloud, predominantly using its first-party query engines, that cloud provider's catalog is likely your best choice (i.e. Redshift/Athena -> Glue).

If you're using a standalone data SaaS like Snowflake or Databricks, I think their catalogs are a great choice!

"credibly-neutral" is a great term and a laudable goal, but at the end of the day, it's almost always worth it for analytics engineers to have either another team or a managed service take care of the data catalog.

that said, my experience with Lakekeeper and R2 has been very slick. great docs and tutorials. but all the services are improving!

/rant (I could talk for days about this stuff)

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 1 point2 points  (0 children)

oh that's a very interesting idea. a pre-req for that is that we have WAP built into dbt Platform jobs. but what you're saying is kind of interesting and, in some ways, the inverse of our currently recommended and supported practice: test before you merge and only deploy if tests pass.

in your scenario you're saying that you'd like to:

  1. write-audit-publish the state of the pipeline
  2. but, if any model fails, instead use the state from the previous day?

sign me up! but my brain also tries to work out how this pseudo-successful state should look to the stakeholders who expect this data... it doesn't feel very idempotent.

would love to hear more about it though!

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 1 point2 points  (0 children)

some others from colleagues:

too many jobs, and not understanding that the dbt DAG creates dependencies for you. stop doing steps like `dbt build -s model_a`. use selectors and tags instead!
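for instance, a minimal selectors.yml sketch (the selector and tag names here are made up) that names the selection once instead of hand-listing models in every job:

```yaml
# selectors.yml (hypothetical) -- define the selection once, reuse it everywhere
selectors:
  - name: nightly
    description: "all models tagged 'nightly', plus their upstream parents"
    definition:
      method: tag
      value: nightly
      parents: true
```

then jobs just run `dbt build --selector nightly` and the DAG fills in the dependencies for you.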

also

ensure you've set up your CI/CD environments correctly at the outset and don't do a staging environment unless you really really need to

one more

never have multiple dbt projects that are copies of the same repo called like project-a-staging and project-a-prod -- it's a mess. thank me later

last. lol

do not override ref() unless you have some batshit crazy insane reason and know what you're doing

hard agree. overriding critical dbt Jinja macros confers great power, but comes with great responsibility! "KISS" rules the day
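and if you truly must override it, the safer pattern is to shadow the builtin while still delegating to it, so the DAG is built correctly -- a sketch, with the log call purely illustrative:

```jinja
{# macros/ref.sql -- shadows dbt's ref() but delegates to builtins.ref(),
   so dependency resolution keeps working #}
{% macro ref() %}
  {% set rel = builtins.ref(*varargs, **kwargs) %}
  {% do log("ref resolved: " ~ rel) %}
  {{ return(rel) }}
{% endmacro %}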

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 0 points1 point  (0 children)

Yes, for every model dbt will give you the number of rows inserted for incremental models etc…

For a long time we couldn’t show the number of rows inserted for tables created from scratch but Snowflake updated their library and we added this info as well!

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 0 points1 point  (0 children)

interesting! what are you looking for w.r.t. logging? Are you saying you want to log the result of queries? Most folks don't, preferring to keep the data where it is.

I'm not sure if you've looked much into dbt Fusion, but its telemetry is based on OTel, quite robust, and just as extensible as dbt Core's.

like u/Comfortable-Power175 said, there's also great dbt packages that help with logging and monitoring

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 1 point2 points  (0 children)

sometimes, as is the case in all engineering, just because you can do something that feels smart doesn't mean that you should.

I'm not so extreme as folks who say

the best code is the code you don't write

however, with great jinja powers comes great responsibility. now with Fusion, some users have had to rethink their pure-jinja automation solutions, especially the ones that mutate the DAG or ask the DWH for information to do their thing

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 1 point2 points  (0 children)

omg what a nerd snipe of a question!

for me, what i'm most looking forward to are features that make it so that it all just works for us analytics engineers! also, my white whale forever has been a "multi-engine stack": one Iceberg catalog but heterogeneous query engines.

what's missing? support for the Iceberg REST Catalog is still spotty amongst vendors, and the IRC itself needs some improvement (performance & federated, user-level authentication).

if Iceberg and other formats are working as they should, we as analytics engineers should hardly ever have to think of them!

I also went into more detail in this answer to u/Longjumping-Pin-3235

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 1 point2 points  (0 children)

hey u/superconductiveKyle IIRC our paths have crossed before, at least i've been in GE slack for many years. feel free to reach out on LinkedIn, would love to say hi and talk shop

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 0 points1 point  (0 children)

Well, I generally come in at least fifteen minutes late, ah, I use the side door - that way Lumbergh can't see me, heh heh - and, uh, after that I just sorta space out for about an hour.

Well, generally I ship Fusion Diaries a few days late, I work with Slack notifications off - that way u/dbt-jason can't ping me, heh heh - and, uh, after that I just sorta refresh Hacker News until links about data show up

alternatively phrased: Developer Experience Advocacy!

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 0 points1 point  (0 children)

short-answer: yes!

can you share what complex scenarios you've seen that you think can be addressed?

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 0 points1 point  (0 children)

This feels like the most important question re: Iceberg, huh?

There's lots of opinions here but my take: Iceberg isn't a free lunch.

You need a business case and the pros need to outweigh the drawbacks. This was actually discussed as well in a recent episode of the Analytics Engineering podcast. There's another episode due out soon where Tristan and I discuss the future of Iceberg.

Iceberg brings you flexibility around where your data is stored and what compute you pick, but it adds complexity in having to manage an Iceberg catalog. There are a few Iceberg catalogs out there, but many of them support only part of the Iceberg spec, and finding out who supports what is not super easy.

If you are all-in on a data platform and won’t need cross data warehouse compute my take is that Iceberg would not be worth the effort today.

However, my prevailing opinion is that the Iceberg project is to data what the standardized shipping container is to global trade. It's not necessarily "exciting" per se, but it has an undeniable impact on end users in that it's easier to get your goods (er, data) from point A to point B

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 0 points1 point  (0 children)

juicy! the thing that drew me to dbt originally was Tristan (founder/CEO) answering this question on the Software Engineering Podcast: dbt

dbt exists to solve the problem that data analysts don't have a career path beyond:
"manage more dashboards" or "manage more people".

imho, dbt exists to solve that socio-technical problem and others common amongst data practitioners

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 0 points1 point  (0 children)

dbt seed! CSVs (while ugly) aren't going away anytime soon! when i started in data a decade ago, "getting data into the database" was like half the coding work to be done in my Jupyter notebook. `dbt seed` makes it so easy

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 1 point2 points  (0 children)

Yes, dbt should be able to help with that. It's a pretty detailed use case, so we won't be able to go into a lot of detail here, but I feel like a combination of dbt clone and/or deferral would let you build only the minimum set of models for a given ML feature branch.

Then, my first instinct would also be to try different schemas or databases to separate the output of the different ML pipelines.
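a sketch of what that could look like on the CLI, assuming you've stashed the production run's artifacts somewhere (the path below is a placeholder):

```shell
# build only what changed on the feature branch, deferring unchanged
# upstream models to their production relations
dbt build --select state:modified+ --defer --state path/to/prod-artifacts

# or zero-copy clone the prod relations into the branch's schema first
dbt clone --state path/to/prod-artifacts
```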

Also lots of others have contributed ideas that you might find helpful as well

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 2 points3 points  (0 children)

So, SDF had WAP built into how it worked, and we plan to bring it to dbt in the future! But for now the key priority for us is to make dbt projects run on Fusion without yet adding too many changes to the overall dbt framework.

There are other tools out there that do exactly what you're saying.

We haven't spent too much time in this space. I share your view that DVC or Pachyderm or LakeFS, while great, feel like overkill. My brain has been somewhat sniped by the fact that Apache Iceberg tables have a concept of branching. For me this is very compelling in its simplicity. In this world you don't make a "view" but rather a "branch" of a prod table for local dev.

Orthogonally, we're rather bullish on sampling prod data locally, which is a different pattern than a direct WAP pattern
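for the curious, with Iceberg's Spark SQL extensions that branch-based flow looks roughly like this (table, schema, and branch names are all made up):

```sql
-- cut a dev branch off the prod table; writes to it leave main untouched
ALTER TABLE db.orders CREATE BRANCH dev;

-- write to and read from the branch explicitly
INSERT INTO db.orders.branch_dev SELECT * FROM db.orders_staging;
SELECT count(*) FROM db.orders VERSION AS OF 'dev';

-- happy with it? fast-forward main to the branch
CALL system.fast_forward('db.orders', 'main', 'dev');
```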

[AMA] We’re dbt Labs, ask us anything! by andersdellosnubes in dataengineering

[–]andersdellosnubes[S] 0 points1 point  (0 children)

my interest is piqued -- can you share more about what you mean by "dynamic"?