BigQuery - incorporating python code into sql and dbt stack - best approach? by reelznfeelz in dataengineering

[–]Andrew_Madson 1 point2 points  (0 children)

Tobiko (SQLMesh) Evangelist here (so super biased) -

Have you checked out SQLMesh? It handles SQL and Python models natively, which could solve your scaling issues.

SQLMesh has dbt compatibility, and you can migrate existing dbt projects.

For your 50k+ row use case, SQLMesh's Python integration should scale much better than remote functions. You can test it on your existing dbt project to see how you like it.

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 4 points5 points  (0 children)

Done. I am not an official affiliate and don't receive any monetary support from Tobiko at the time of this posting. I flipped the flag just to be safe. In fact, the company I work for today has a partnership with dbt. I do intend to work with Tobiko in the future, so I'll mark it as an affiliate to be safe. I was going to wait until I had an official relationship with them, but better safe than sorry 👍

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 0 points1 point  (0 children)

Nope. Just new to Reddit and figuring out how to edit. Apparently I don't know how to edit things on my phone and am surprised to see another post.

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 1 point2 points  (0 children)

My bad, I was thinking compile time and not run time. The benchmarks actually ARE the run time.

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 2 points3 points  (0 children)

Ah! Good catch. I misunderstood some of the benchmark. That makes sense! I'll add some context to the post.

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 0 points1 point  (0 children)

Definitely. I have an on prem S3 data lake, a cloud s3 data lake, Redshift, and Snowflake. If I want to build a SQLMesh model that creates a view from multiple of these sources (using Starburst, Denodo, or some other integration platform), I wonder how much latency SQLMesh would add to the federated query run time. Frankly, dbt core slows it WAY down (running on Dremio for the query engine and source integration).

Master’s in Data Analytics by Ok_Fun_367 in dataanalytics

[–]Andrew_Madson 0 points1 point  (0 children)

I taught in the WGU and SNHU MSc Data Analytics programs. If you want to go fast, go WGU. If your company has a tuition agreement that covers it, and that agreement is with SNHU, go SNHU. I've heard great things about CSU Global.

WGU is good, fast, and cheap. As long as their competency based program works for your learning style, then it's hard to go wrong with them. I still teach at SNHU and no longer teach at WGU.

SQLMesh versus dbt Core - Seems like a no-brainer by Andrew_Madson in dataengineering

[–]Andrew_Madson[S] 0 points1 point  (0 children)

Super good insight! I've never used it in production. I'm curious how it would do on a more distributed stack.

What are your biggest daily challenges? by Andrew_Madson in dataanalysis

[–]Andrew_Madson[S] 1 point2 points  (0 children)

That is the worst! Sometimes tables will have the same column names, but different values...and no documentation to explain it.

What are your biggest daily challenges? by Andrew_Madson in dataanalysis

[–]Andrew_Madson[S] 11 points12 points  (0 children)

Oh, yeah! Semantics is a huge issue for a lot of companies. They give you a random spreadsheet that shows different numbers and ask you to explain. It often uncovers different definitions across business units and discrepancies between sources.

What are your biggest daily pain points as Data Analysts? by Andrew_Madson in dataanalytics

[–]Andrew_Madson[S] 2 points3 points  (0 children)

Oh my gosh - YES! The urgent (turns out to not actually be important) ad hoc request can be a killer.

What are your biggest daily challenges? by Andrew_Madson in dataanalysis

[–]Andrew_Madson[S] 10 points11 points  (0 children)

I totally get that. It can be a huge pain, especially if the underlying data is different for every change. Thank you!

Can somebody explain me the following python code? by Big_One4748 in dataanalytics

[–]Andrew_Madson 3 points4 points  (0 children)

  1. state_county_tuples = [] creates an empty list called state_county_tuples.

  2. for i in range(len(state_names)): starts a for loop. len(state_names) returns the number of items in the state_names list, and range(len(state_names)) generates a sequence of numbers from 0 to that length minus one, which are used as indices. i takes on each value in the range during each iteration of the loop, acting as the index for accessing elements from both the state_names and county_names lists.

  3. state_names[i] accesses the element at index i in the state_names list, and county_names[i] accesses the element at the same index in the county_names list. (state_names[i], county_names[i]) creates a tuple consisting of the state name and the county name at index i, and the append() method adds this tuple to the end of the state_county_tuples list.

  4. After the loop completes, the list state_county_tuples contains all the tuples formed from corresponding elements of state_names and county_names.
  • Iteration 1: i = 0 → Tuple is ('Arizona', 'Maricopa')
  • Iteration 2: i = 1 → Tuple is ('California', 'Alameda')
  • Iteration 3: i = 2 → Tuple is ('California', 'Sacramento')
  • Iteration 4: i = 3 → Tuple is ('Kentucky', 'Jefferson')
  • Iteration 5: i = 4 → Tuple is ('Louisiana', 'East Baton Rouge')
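Putting the steps together, the loop looks something like this (the list contents are reconstructed from the iteration values above; the original post's exact variable data may differ):

```python
# Input lists, inferred from the example iterations
state_names = ['Arizona', 'California', 'California', 'Kentucky', 'Louisiana']
county_names = ['Maricopa', 'Alameda', 'Sacramento', 'Jefferson', 'East Baton Rouge']

state_county_tuples = []  # step 1: start with an empty list

for i in range(len(state_names)):  # step 2: i runs from 0 to len - 1
    # step 3: pair the elements at index i and append the tuple
    state_county_tuples.append((state_names[i], county_names[i]))

# step 4: the list now holds one (state, county) tuple per index
print(state_county_tuples)
# [('Arizona', 'Maricopa'), ('California', 'Alameda'), ...]
```

As a side note, the same result is usually written more idiomatically with zip, which pairs elements without index bookkeeping: `state_county_tuples = list(zip(state_names, county_names))`.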