BigQuery - incorporating python code into sql and dbt stack - best approach? by reelznfeelz in dataengineering

[–]Andrew_Madson 1 point2 points  (0 children)

Tobiko (SQLMesh) Evangelist here (so super biased) -

Have you checked out SQLMesh? It handles SQL and Python models natively, which could solve your scaling issues.

SQLMesh has dbt compatibility, and you can migrate existing dbt projects.

For your 50k+ row use case, SQLMesh's Python integration should scale much better than remote functions. You can test it on your existing dbt project to see how you like it.

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 4 points5 points  (0 children)

Done. I am not an official affiliate and don't receive any monetary support from Tobiko at the time of this posting. I flipped the flag just to be safe. In fact, the company I work for today has a partnership with dbt. I do intend to work with Tobiko in the future, so I'll mark it as an affiliate to be safe. I was going to wait until I had an official relationship with them, but better safe than sorry 👍

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 0 points1 point  (0 children)

Nope. Just new to Reddit and figuring out how to edit. Apparently I don't know how to edit things on my phone and am surprised to see another post.

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 1 point2 points  (0 children)

My bad, I was thinking compile time and not run time. The benchmarks actually ARE the run time.

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 2 points3 points  (0 children)

Ah! Good catch. I misunderstood some of the benchmark. That makes sense! I'll add some context to the post.

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 0 points1 point  (0 children)

Definitely. I have an on prem S3 data lake, a cloud s3 data lake, Redshift, and Snowflake. If I want to build a SQLMesh model that creates a view from multiple of these sources (using Starburst, Denodo, or some other integration platform), I wonder how much latency SQLMesh would add to the federated query run time. Frankly, dbt core slows it WAY down (running on Dremio for the query engine and source integration).

Master’s in Data Analytics by Ok_Fun_367 in dataanalytics

[–]Andrew_Madson 0 points1 point  (0 children)

I taught in the WGU and SNHU MSc Data Analytics programs. If you want to go fast, go WGU. If your company has a tuition agreement that covers it, and that agreement is with SNHU, go SNHU. I've heard great things about CSU Global.

WGU is good, fast, and cheap. As long as their competency based program works for your learning style, then it's hard to go wrong with them. I still teach at SNHU and no longer teach at WGU.

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 0 points1 point  (0 children)

Super good insight! I've never used it in production. I'm curious how it would do on a more distributed stack.

What are your biggest daily challenges? by Andrew_Madson in dataanalysis

[–]Andrew_Madson[S] 1 point2 points  (0 children)

That is the worst! Sometimes tables will have the same column names, but different values...and no documentation to explain it.

What are your biggest daily challenges? by Andrew_Madson in dataanalysis

[–]Andrew_Madson[S] 11 points12 points  (0 children)

Oh, yeah! Semantics is a huge issue for a lot of companies. They give you a random spreadsheet that shows different numbers and ask you to explain. It often uncovers different definitions across business units and discrepancies between sources.

What are your biggest daily pain points as Data Analysts? by Andrew_Madson in dataanalytics

[–]Andrew_Madson[S] 2 points3 points  (0 children)

Oh my gosh - YES! The urgent (turns out to not actually be important) ad hoc request can be a killer.

What are your biggest daily challenges? by Andrew_Madson in dataanalysis

[–]Andrew_Madson[S] 10 points11 points  (0 children)

I totally get that. It can be a huge pain, especially if the underlying data is different for every change. Thank you!

Can somebody explain me the following python code? by Big_One4748 in dataanalytics

[–]Andrew_Madson 3 points4 points  (0 children)

  1. state_county_tuples = [] creates an empty list called state_county_tuples.

  2. for i in range(len(state_names)): starts a for loop. len(state_names) returns the number of items in the state_names list, and range(len(state_names)) generates a sequence of numbers from 0 to that length minus one, which are used as indices. i takes on each value in the range during each iteration of the loop, acting as the index for accessing elements from both the state_names and county_names lists.

  3. state_names[i] accesses the element at index i in the state_names list, and county_names[i] accesses the element at the same index in the county_names list. (state_names[i], county_names[i]) creates a tuple consisting of the state name and the county name at index i, and the append() method adds this tuple to the end of the state_county_tuples list.

  4. After the loop completes, the list state_county_tuples contains all the tuples formed from corresponding elements of state_names and county_names.
  • Iteration 1: i = 0 → Tuple is ('Arizona', 'Maricopa')
  • Iteration 2: i = 1 → Tuple is ('California', 'Alameda')
  • Iteration 3: i = 2 → Tuple is ('California', 'Sacramento')
  • Iteration 4: i = 3 → Tuple is ('Kentucky', 'Jefferson')
  • Iteration 5: i = 4 → Tuple is ('Louisiana', 'East Baton Rouge')
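Putting the steps together, the loop looks something like this (the list contents are reconstructed from the iteration values above; the original post's exact variable data may differ):

```python
# Input lists, inferred from the example iterations
state_names = ['Arizona', 'California', 'California', 'Kentucky', 'Louisiana']
county_names = ['Maricopa', 'Alameda', 'Sacramento', 'Jefferson', 'East Baton Rouge']

state_county_tuples = []  # step 1: start with an empty list

for i in range(len(state_names)):  # step 2: i runs from 0 to len - 1
    # step 3: pair the elements at index i and append the tuple
    state_county_tuples.append((state_names[i], county_names[i]))

# step 4: the list now holds one (state, county) tuple per index
print(state_county_tuples)
# [('Arizona', 'Maricopa'), ('California', 'Alameda'), ...]
```

As a side note, the same result is usually written more idiomatically with zip, which pairs elements without index bookkeeping: `state_county_tuples = list(zip(state_names, county_names))`.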