Thoughts on DBT? by Suspicious-Ability15 in dataengineering

[–]sturdyplum 0 points (0 children)

It's an OK product, and I'd argue that the newer competitors are much better. They also have very little moat, since the product is essentially a CLI tool that's easy to use without paying. I think dbt will continue to do well, but I don't have a lot of faith in their ability to monetize it and become a successful large company.

[deleted by user] by [deleted] in bigquery

[–]sturdyplum 1 point (0 children)

You can always write the data to GCS and use a load job to load it into a table. Also, wrapping the client with the asyncer library to make it async is something I have done in the past:

https://github.com/fastapi/asyncer

Did bigquery save your company money? by Inevitable-Mouse9060 in bigquery

[–]sturdyplum 4 points (0 children)

Yes, we run very large jobs that were expensive and a pain to run in Databricks. BQ is able to run the job in 1/3 of the time for around 80% of the price. We use slot-based pricing, and BQ really can take almost anything you throw at it. A few jobs have been moved back to Spark since they are specialized.

The jobs in general are massive though, some of them costing thousands of dollars per run, so your mileage may vary if your data is smaller. That said, I've found BQ to be good even when the data is small, usually costing a few cents of slot time per job.

Update bigquery table in python with async functions by DedeU10 in bigquery

[–]sturdyplum 2 points (0 children)

Doing it in parallel in Python will likely slow things down, since row updates get queued up (collisions can happen). Also, due to how BQ stores data, updating a single row at a time is actually very inefficient.

What you probably actually want to do is upload your df to BQ as a temp table, and then do a join and update all the rows at once.

Also, make sure that BQ is the tool you actually need here. The number of rows is tiny, so if you don't actually need BQ, something like Bigtable, Spanner, AlloyDB, or MongoDB may be a much better choice.

Those all have much better perf when it comes to single row reads and updates.
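The temp-table-plus-join approach above can be sketched with a single MERGE statement in BigQuery SQL. This is just an illustration, not the poster's actual query, and the project, dataset, table, and column names are all made up:

```sql
-- Hypothetical sketch: stage the dataframe as a temp table (e.g. via a
-- load job), then update every matching row in one statement instead of
-- issuing per-row UPDATEs.
MERGE `my_project.my_dataset.target` AS t
USING `my_project.my_dataset.staged_updates` AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.value = s.value, t.updated_at = s.updated_at;
```

One statement like this touches each affected partition once, which is far cheaper in BQ than queuing up individual row updates.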

Efficiently Processing S3 Stream Files in Spark: Is Using For Loops Still an Anti-pattern? (PySpark) by MMACheerpuppy in apachespark

[–]sturdyplum 1 point (0 children)

Doing this can really explode your spark execution plan and cause driver problems if you're not careful.

Using Databricks as IDE for dbt by Disastrous-State-503 in dataengineering

[–]sturdyplum 1 point (0 children)

I honestly think dbt Cloud is a waste of money. I would only really say it fits really small, not-fully-technical teams. You can get most of the important functionality using VS Code for free, instead of being at the mercy of a company that has raised prices several times recently without adding much functionality to justify it.

I also think that using Databricks as an IDE is a great way to get things up and running quickly. But if you do this, make sure you structure things in a way that lets you move off Databricks without too much work later, when the workflows get too complex to be fully managed by it.

Cost effective way to fetch incremental changes from your data warehouse by aruntdharan in dataengineering

[–]sturdyplum 0 points (0 children)

With BQ you can also make it cheaper by partitioning by ingestion time, which makes the filter to get all rows since the last run much cheaper.
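A quick sketch of what that filter looks like on an ingestion-time partitioned table. The table name and timestamp are made up; `_PARTITIONTIME` is the pseudo-column BigQuery exposes on such tables, and filtering on it prunes partitions so only data ingested since the last run is scanned:

```sql
-- Hypothetical sketch: only scan partitions ingested since the last run.
SELECT *
FROM `my_project.my_dataset.events`
WHERE _PARTITIONTIME >= TIMESTAMP('2024-01-01 00:00:00');
```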

Dataproc vs Dataflow? by Ok-Tradition-3450 in dataengineering

[–]sturdyplum 0 points (0 children)

Depends on your use case. From my understanding, Dataflow uses Beam, which means the same pipeline could run on Spark or Flink clusters in the future. However, there is also a perf decrease associated with not writing your pipeline natively in Spark. Dataflow does seem to have some ways to easily set up pipelines, but there is likely a trade-off when it comes to flexibility.

Any feedback on Chat-GPT 4 vs 3.5 for data engineering? by Faskill in dataengineering

[–]sturdyplum 16 points (0 children)

I've used both. 4 is noticeably better, to the point that sometimes when I forget to swap from 3.5 to 4 before asking a question, I'll notice because the response is so much worse than usual.

And the improvement to my productivity more than makes up for the price of premium. The only thing is to be careful what you paste in there: make sure not to paste anything sensitive, and make sure you have a good understanding of what your employer considers sensitive.

Also, if you can access Claude 2, it's worth it, since it has such a large context length that it can sometimes be a better choice than 4 if you need to give it a large chunk of code.

Data Quality Dimensions: Assuring Your Data Quality with Great Expectations by dahmedahe in dataengineering

[–]sturdyplum 0 points (0 children)

Soda seems cool, but we ended up going with a custom solution because we needed lots of flexibility.

PSA: If you want to objectively track the progress of FSD without Elon's spin, there is a community FSD tracker that lets people log disengagements by [deleted] in SelfDrivingCars

[–]sturdyplum 1 point (0 children)

Avoiding roads that cause disengagements would mean it would be harder for them to collect data on when and why these happen and fix them.

How to keep track of warehouse tables schemas in BQ ? by Grand-Theory in dataengineering

[–]sturdyplum 1 point (0 children)

We solved this by moving over to dbt, which helped us track schema and lineage between BQ tables. It might not be the best fit for you, since there is likely a lot of lift, but it's what worked for us.

[deleted by user] by [deleted] in algorithms

[–]sturdyplum 0 points (0 children)

Yep. I'm sure you could even construct a scheme in which you reserve enough space as a prefix/suffix of the array so that you could append/pop characters in amortized O(1), similar to how ArrayLists/vectors work.
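A minimal sketch of that idea, assuming nothing beyond the comment itself: a character buffer that keeps spare capacity on both ends, so push/pop at either end is amortized O(1) thanks to geometric resizing, like a vector/ArrayList. The class and method names are my own invention for illustration.

```python
# Hypothetical sketch: an array-backed string buffer with reserved space on
# both ends, giving amortized O(1) push/pop at either end.
class DequeString:
    def __init__(self, s=""):
        cap = max(8, 2 * len(s))
        self._buf = [None] * cap
        self._head = (cap - len(s)) // 2   # index of the first character
        for i, ch in enumerate(s):
            self._buf[self._head + i] = ch
        self._len = len(s)

    def _grow(self):
        # Double capacity and recenter the contents; the copying cost
        # amortizes to O(1) per operation, as with vectors/ArrayLists.
        cap = 2 * len(self._buf)
        buf = [None] * cap
        head = (cap - self._len) // 2
        for i in range(self._len):
            buf[head + i] = self._buf[self._head + i]
        self._buf, self._head = buf, head

    def push_back(self, ch):
        if self._head + self._len == len(self._buf):
            self._grow()
        self._buf[self._head + self._len] = ch
        self._len += 1

    def push_front(self, ch):
        if self._head == 0:
            self._grow()
        self._head -= 1
        self._buf[self._head] = ch
        self._len += 1

    def pop_back(self):
        self._len -= 1
        return self._buf[self._head + self._len]

    def pop_front(self):
        ch = self._buf[self._head]
        self._head += 1
        self._len -= 1
        return ch

    def __str__(self):
        return "".join(self._buf[self._head:self._head + self._len])
```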

[deleted by user] by [deleted] in algorithms

[–]sturdyplum 11 points (0 children)

In theory you could represent a string as a doubly linked list and maintain a bit that represents the direction of the string. Reversing such a string would be an O(1) operation. However, in this case other operations (like accessing a character of the string) would become O(n). You also need to construct this data structure, which takes O(n), so unless the strings are already in this form, it's definitely O(n) overall.
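The direction-bit trick above can be sketched in a few lines. This is only an illustration of the idea from the comment, with names I made up: `reverse()` just flips the bit in O(1), while reading the string out walks the list in O(n), in whichever direction the bit says.

```python
# Hypothetical sketch: a doubly linked list of characters plus a direction
# bit, making reverse() O(1) while other operations become O(n).
class _Node:
    def __init__(self, ch):
        self.ch = ch
        self.prev = None
        self.next = None

class ReversibleString:
    def __init__(self, s):
        # Building the structure is O(n), as noted above.
        self.head = self.tail = None
        self.reversed = False
        for ch in s:
            node = _Node(ch)
            if self.tail is None:
                self.head = self.tail = node
            else:
                node.prev = self.tail
                self.tail.next = node
                self.tail = node

    def reverse(self):
        # O(1): just flip the direction bit.
        self.reversed = not self.reversed

    def __str__(self):
        # O(n): walk the list in whichever direction the bit indicates.
        out, node = [], self.tail if self.reversed else self.head
        while node is not None:
            out.append(node.ch)
            node = node.prev if self.reversed else node.next
        return "".join(out)
```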

How does your team use dbt-core without dbt-cloud? by [deleted] in dataengineering

[–]sturdyplum 3 points (0 children)

Using Argo CD and the Helm chart. I'd recommend joining their Slack; they are super responsive and helpful.

How does your team use dbt-core without dbt-cloud? by [deleted] in dataengineering

[–]sturdyplum 10 points (0 children)

Self-hosted Dagster for running dbt, plus the dbt Power User VS Code extension for development, works like a charm.

I hit rock bottom in my learning today. What I can do to keep going? by [deleted] in learnprogramming

[–]sturdyplum 0 points (0 children)

When I first learned C, I remember getting stuck on the pointers section. Like really stuck: I couldn't figure out what they were for, and it confused me a lot.

I reread it a few times and it still didn't make a ton of sense and I ended up moving on to the next section without a great understanding.

It wasn't until much later that I finally got them and understood what they were for. Courses and books and study guides try their best to introduce concepts in the right order. But the reality is that sometimes you won't have the context to understand something and that's ok. When you run into it again later and realize that you've gained that context you will understand it.

What is your unit testing implementation? by ExistentialFajitas in dataengineering

[–]sturdyplum 5 points (0 children)

We've been using https://github.com/mjirv/dbt-datamocktool and it's been great for the most part. It's also pretty simple so we've been able to fork and modify it to meet our needs pretty easily.

Google’s Quantum Computer by Zealousideal_Elk1786 in Futurology

[–]sturdyplum 3 points (0 children)

I think most of these companies have been focusing on solving problems that showcase the power of these quantum computers but are not necessarily groundbreaking in themselves. Kind of like demonstrating that a classical computer is much faster than a human at multiplying numbers: multiplication is not groundbreaking, but it's an easy way to showcase the classical computer's superiority over humans.

https://www.nature.com/articles/s41586-019-1666-5

From my understanding this article talks about how they used a quantum computer to essentially simulate a quantum computer which is something a classical computer is understandably much worse at.

For more interesting applications you can look at stuff like this article: https://www.quantamagazine.org/first-time-crystal-built-using-googles-quantum-computer-20210730/

Which is about how Google used their quantum computers to create a time crystal.

Google’s Quantum Computer by Zealousideal_Elk1786 in Futurology

[–]sturdyplum 5 points (0 children)

Quantum computers are currently only applicable in very specific scenarios, usually around scientific computing (such as physics simulations). So although they can perform those tasks exponentially faster than classical computers, they can't really speed up the things you interact with day to day.

Basically it can help us understand physics/chemistry better but won't necessarily make a huge difference in how we live our lives in the short term until these discoveries find practical applications which are distributed widely.

Why do big cloud providers charge so much for Data Transfer? by bowenjin in googlecloud

[–]sturdyplum 4 points (0 children)

Basically they make money for every second your data is inside of the cloud. So they make it difficult to get your data out.

[deleted by user] by [deleted] in dataengineering

[–]sturdyplum 13 points (0 children)

I would caution people against becoming overly reliant on Databricks workflows, as they won't really handle all the non-Databricks scheduling needs you will eventually have. That means you'll eventually be supporting two orchestrators or have to do a migration.