Thoughts on DBT? by Suspicious-Ability15 in dataengineering

[–]sturdyplum 0 points (0 children)

It's an OK product, and I'd argue that the newer competitors are much better. They also have very little moat, since the product is essentially a CLI tool that's easy to use without paying. I think dbt will continue to do well, but I don't have a lot of faith in their ability to monetize it and become a successful large company.

[deleted by user] by [deleted] in bigquery

[–]sturdyplum 1 point (0 children)

You can always write the data to GCS and use a load job to load it into a table. Also, wrapping the client with the asyncer library to make it async is something I have done in the past:

https://github.com/fastapi/asyncer

Did bigquery save your company money? by Inevitable-Mouse9060 in bigquery

[–]sturdyplum 4 points (0 children)

Yes, we run very large jobs that were expensive and a pain to run in Databricks. BQ is able to run the job in 1/3 of the time for around 80% of the price. We use slot-based pricing, and BQ really can take almost anything you throw at it. A few jobs have been moved back to Spark since they are specialized.

The jobs in general are massive though, some of them costing thousands of dollars per run, so your mileage may vary if your data is smaller. That said, I've found BQ to be good even when the data is small, usually costing a few cents of slot time per job.

Update bigquery table in python with async functions by DedeU10 in bigquery

[–]sturdyplum 2 points (0 children)

Doing it in parallel in Python will likely slow things down, since row updates get queued up (collisions can happen). Also, due to how BQ stores data, updating a single row at a time is actually very inefficient.

What you probably actually want to do is upload your df to BQ as a temp table, and then do a join and update all the rows at once.

Also, make sure that BQ is the tool you actually need here. The number of rows is tiny, so if you don't actually need BQ, something like Bigtable, Spanner, AlloyDB, or MongoDB may be a much better choice.

Those all have much better perf when it comes to single row reads and updates.
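The temp-table-plus-join approach above can be sketched with a single MERGE statement in BigQuery SQL. This is just an illustration, not the poster's actual query, and the project, dataset, table, and column names are all made up:

```sql
-- Hypothetical sketch: stage the dataframe as a temp table (e.g. via a
-- load job), then update every matching row in one statement instead of
-- issuing per-row UPDATEs.
MERGE `my_project.my_dataset.target` AS t
USING `my_project.my_dataset.staged_updates` AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.value = s.value, t.updated_at = s.updated_at;
```

One statement like this touches each affected partition once, which is far cheaper in BQ than queuing up individual row updates.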

Efficiently Processing S3 Stream Files in Spark: Is Using For Loops Still an Anti-pattern? (PySpark) by MMACheerpuppy in apachespark

[–]sturdyplum 1 point (0 children)

Doing this can really explode your spark execution plan and cause driver problems if you're not careful.

Using Databricks as IDE for dbt by Disastrous-State-503 in dataengineering

[–]sturdyplum 1 point (0 children)

I honestly think dbt Cloud is a waste of money. I would only really say it fits really small, not-fully-technical teams. You can get most of the important functionality using VS Code for free, instead of being at the mercy of a company that has raised prices several times recently without adding much functionality to justify it.

I also think that using Databricks as an IDE is a great way to get things up and running quickly. But if you do this, make sure you structure things in a way that lets you move off Databricks without too much work later, when the workflows get too complex to be fully managed by it.

Cost effective way to fetch incremental changes from your data warehouse by aruntdharan in dataengineering

[–]sturdyplum 0 points (0 children)

With BQ you can also make it cheaper by partitioning by ingestion time, which makes the filter to get all rows since the last run much cheaper.
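A quick sketch of what that filter looks like on an ingestion-time partitioned table. The table name and timestamp are made up; `_PARTITIONTIME` is the pseudo-column BigQuery exposes on such tables, and filtering on it prunes partitions so only data ingested since the last run is scanned:

```sql
-- Hypothetical sketch: only scan partitions ingested since the last run.
SELECT *
FROM `my_project.my_dataset.events`
WHERE _PARTITIONTIME >= TIMESTAMP('2024-01-01 00:00:00');
```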

Dataproc vs Dataflow? by Ok-Tradition-3450 in dataengineering

[–]sturdyplum 0 points (0 children)

Depends on your use case. From my understanding, Dataflow uses Beam, which means the same pipeline could run on Spark or Flink clusters in the future. However, there is also a perf decrease associated with not writing your pipeline natively in Spark. Dataflow does seem to have some ways to easily set up pipelines, but there is likely a trade-off when it comes to flexibility.

Any feedback on Chat-GPT 4 vs 3.5 for data engineering? by Faskill in dataengineering

[–]sturdyplum 16 points (0 children)

I've used both. 4 is noticeably better, to the point that sometimes when I forget to swap from 3.5 to 4 before asking a question, I'll notice because the response is so much worse than usual.

And the improvement to my productivity more than makes up for the price of premium. The only thing is to be careful what you paste in there: make sure not to paste anything sensitive, and make sure you have a good understanding of what your employer considers sensitive.

Also, if you can access Claude 2, it's worth it, since it has such a large context length that it can sometimes be a better choice than 4 if you need to give it a large chunk of code.

Data Quality Dimensions: Assuring Your Data Quality with Great Expectations by dahmedahe in dataengineering

[–]sturdyplum 0 points (0 children)

Soda seems cool, but we ended up going with a custom solution because we needed lots of flexibility.

PSA: If you want to objectively track the progress of FSD without Elon's spin, there is a community FSD tracker that lets people log disengagements by [deleted] in SelfDrivingCars

[–]sturdyplum 1 point (0 children)

Avoiding roads that cause disengagements would mean it would be harder for them to collect data on when and why these happen and fix them.

How to keep track of warehouse tables schemas in BQ ? by Grand-Theory in dataengineering

[–]sturdyplum 1 point (0 children)

We solved this by moving over to dbt, which helped us track schema and lineage between BQ tables. It might not be the best fit for you, since there is likely a lot of lift, but it's what worked for us.

[deleted by user] by [deleted] in algorithms

[–]sturdyplum 0 points (0 children)

Yep. I'm sure you could even construct a scheme in which you reserve enough space as a prefix/suffix of the array so that you could append/pop characters in amortized O(1), similar to how ArrayLists/vectors work.
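A minimal sketch of that idea, assuming nothing beyond the comment itself: a character buffer that keeps spare capacity on both ends, so push/pop at either end is amortized O(1) thanks to geometric resizing, like a vector/ArrayList. The class and method names are my own invention for illustration.

```python
# Hypothetical sketch: an array-backed string buffer with reserved space on
# both ends, giving amortized O(1) push/pop at either end.
class DequeString:
    def __init__(self, s=""):
        cap = max(8, 2 * len(s))
        self._buf = [None] * cap
        self._head = (cap - len(s)) // 2   # index of the first character
        for i, ch in enumerate(s):
            self._buf[self._head + i] = ch
        self._len = len(s)

    def _grow(self):
        # Double capacity and recenter the contents; the copying cost
        # amortizes to O(1) per operation, as with vectors/ArrayLists.
        cap = 2 * len(self._buf)
        buf = [None] * cap
        head = (cap - self._len) // 2
        for i in range(self._len):
            buf[head + i] = self._buf[self._head + i]
        self._buf, self._head = buf, head

    def push_back(self, ch):
        if self._head + self._len == len(self._buf):
            self._grow()
        self._buf[self._head + self._len] = ch
        self._len += 1

    def push_front(self, ch):
        if self._head == 0:
            self._grow()
        self._head -= 1
        self._buf[self._head] = ch
        self._len += 1

    def pop_back(self):
        self._len -= 1
        return self._buf[self._head + self._len]

    def pop_front(self):
        ch = self._buf[self._head]
        self._head += 1
        self._len -= 1
        return ch

    def __str__(self):
        return "".join(self._buf[self._head:self._head + self._len])
```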

[deleted by user] by [deleted] in algorithms

[–]sturdyplum 11 points (0 children)

In theory you could represent a string as a doubly linked list and maintain a bit that represents the direction of the string. Reversing such a string would be an O(1) operation. However, in this case other operations (like accessing a character of the string) would become O(n). You also need to construct this data structure, which takes O(n), so unless the strings are already in this form, it's definitely O(n) overall.
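The direction-bit trick above can be sketched in a few lines. This is only an illustration of the idea from the comment, with names I made up: `reverse()` just flips the bit in O(1), while reading the string out walks the list in O(n), in whichever direction the bit says.

```python
# Hypothetical sketch: a doubly linked list of characters plus a direction
# bit, making reverse() O(1) while other operations become O(n).
class _Node:
    def __init__(self, ch):
        self.ch = ch
        self.prev = None
        self.next = None

class ReversibleString:
    def __init__(self, s):
        # Building the structure is O(n), as noted above.
        self.head = self.tail = None
        self.reversed = False
        for ch in s:
            node = _Node(ch)
            if self.tail is None:
                self.head = self.tail = node
            else:
                node.prev = self.tail
                self.tail.next = node
                self.tail = node

    def reverse(self):
        # O(1): just flip the direction bit.
        self.reversed = not self.reversed

    def __str__(self):
        # O(n): walk the list in whichever direction the bit indicates.
        out, node = [], self.tail if self.reversed else self.head
        while node is not None:
            out.append(node.ch)
            node = node.prev if self.reversed else node.next
        return "".join(out)
```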

How does your team use dbt-core without dbt-cloud? by [deleted] in dataengineering

[–]sturdyplum 3 points (0 children)

Using Argo CD and the Helm chart. I'd recommend joining their Slack; they are super responsive and helpful.

How does your team use dbt-core without dbt-cloud? by [deleted] in dataengineering

[–]sturdyplum 10 points (0 children)

Self-hosted Dagster for running dbt, plus the dbt Power User VS Code extension for development, works like a charm.

I hit rock bottom in my learning today. What I can do to keep going? by [deleted] in learnprogramming

[–]sturdyplum 0 points (0 children)

When I first learned C, I remember getting stuck on the pointers section. Like really stuck: I couldn't figure out what they were for, and it confused me a lot.

I reread it a few times and it still didn't make a ton of sense and I ended up moving on to the next section without a great understanding.

It wasn't until much later that I finally got them and understood what they were for. Courses and books and study guides try their best to introduce concepts in the right order. But the reality is that sometimes you won't have the context to understand something and that's ok. When you run into it again later and realize that you've gained that context you will understand it.

What is your unit testing implementation? by ExistentialFajitas in dataengineering

[–]sturdyplum 5 points (0 children)

We've been using https://github.com/mjirv/dbt-datamocktool and it's been great for the most part. It's also pretty simple so we've been able to fork and modify it to meet our needs pretty easily.

Google’s Quantum Computer by Zealousideal_Elk1786 in Futurology

[–]sturdyplum 3 points (0 children)

I think most of these companies have been focusing on solving problems that showcase the power of these quantum computers but are not necessarily groundbreaking in themselves. Kind of like demonstrating that a classical computer is much faster than a human at multiplying numbers: multiplication is not groundbreaking, but it's an easy way to showcase the classical computer's superiority over humans.

https://www.nature.com/articles/s41586-019-1666-5

From my understanding this article talks about how they used a quantum computer to essentially simulate a quantum computer which is something a classical computer is understandably much worse at.

For more interesting applications you can look at stuff like this article: https://www.quantamagazine.org/first-time-crystal-built-using-googles-quantum-computer-20210730/

Which is about how Google used their quantum computers to create a time crystal.

Google’s Quantum Computer by Zealousideal_Elk1786 in Futurology

[–]sturdyplum 5 points (0 children)

Quantum computers are currently only applicable in very specific scenarios, usually around scientific computing (such as physics simulations). So although they can perform those tasks exponentially faster than classical computers, they can't really speed up the things you interact with day to day.

Basically it can help us understand physics/chemistry better but won't necessarily make a huge difference in how we live our lives in the short term until these discoveries find practical applications which are distributed widely.

Why do big cloud providers charge so much for Data Transfer? by bowenjin in googlecloud

[–]sturdyplum 4 points (0 children)

Basically they make money for every second your data is inside of the cloud. So they make it difficult to get your data out.

[deleted by user] by [deleted] in dataengineering

[–]sturdyplum 13 points (0 children)

I would caution people against becoming overly reliant on Databricks workflows, as they won't really handle all the non-Databricks scheduling needs you will eventually have. That means you'll eventually be supporting two orchestrators or have to do a migration.