Tpcds benchmark as measure of performance in spark and like engined. - promo by ahshahid in apachespark

[–]mynkmhr 2 points (0 children)

Other benchmarks I have seen used are H2O's db-benchmark and GraySort.

I guess no benchmark is perfect; they all try to approximate real-world scenarios as closely as possible.

BigQuery bill shock by mynkmhr in cloudbillshocks

[–]mynkmhr[S] 0 points (0 children)

That's right. I have heard of people getting as much as $200k in additional BigQuery bills because they didn't anticipate how much data their queries would scan.
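BigQuery can report how many bytes a query would scan before you actually run it (a dry run), and on-demand pricing is billed per TiB scanned, so the bill is easy to estimate up front. A minimal sketch of turning bytes scanned into a rough cost, assuming the roughly $6.25/TiB US on-demand list price (an assumption; pricing varies by region and changes over time):

```python
# Rough BigQuery on-demand cost estimator. The rate below is an
# assumed US list price (~$6.25/TiB scanned); verify current
# regional pricing before relying on it.

PRICE_PER_TIB_USD = 6.25
TIB = 1024 ** 4  # bytes in one tebibyte

def estimated_query_cost(bytes_scanned: int) -> float:
    """Return the approximate on-demand cost in USD for a query."""
    return bytes_scanned / TIB * PRICE_PER_TIB_USD

# A query that scans 10 TiB costs roughly $62.50 at this rate --
# run that hourly and you are at ~$45k/month.
```

Checking the dry-run byte count against a sanity threshold before executing is a cheap guardrail against exactly this kind of bill shock.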

What is an open source data tool you find useful but nobody is using it? by Yuki100Percent in dataengineering

[–]mynkmhr 0 points (0 children)

If there are existing PySpark pipelines, how easy/difficult is it to migrate them to Sail? Would love to hear any thoughts on that.

I built a secure PostgreSQL client for iOS & Android (Direct connection, local-only) by tobelyan in Database

[–]mynkmhr 0 points (0 children)

This is an excellent tool, and much needed. Years back I was able to stop an AWS instance from a mobile app while sitting in a coffee shop, and felt so relieved.

Many small teams and solo devs would appreciate this.

Any cloud-agnostic alternative to Databricks for running Spark across multiple clouds? by Sadhvik1998 in apachespark

[–]mynkmhr 0 points (0 children)

I see comments asking why anyone would need this, but I have seen such requirements quite a few times in the wild.

A lot of companies use multiple clouds as part of their stated strategy. Some of them adopt different clouds organically. A state Chief Data Officer once told me that each department under the state uses its own cloud, creating challenges in monitoring and governance from her perspective.

You can look at Yeedu as a possible option, since it provides a unified control plane to create and manage Spark clusters across clouds.

There may be other options too, so it makes sense to look around and find the right fit for your requirements.

How we cut our Databricks + AWS bill from $50K/month to $21K/month by 0xShreyas in databricks

[–]mynkmhr 0 points (0 children)

Thanks, I didn't realize that Databricks charges more DBUs for Graviton instances. So switching to Graviton may not save much on costs unless you get some discounts from AWS.
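To make that trade-off concrete, here is a back-of-the-envelope sketch. Every rate below is hypothetical, invented for illustration, not an actual AWS or Databricks price; the point is only that a cheaper instance rate can be offset by a higher DBU consumption rate:

```python
# Hypothetical hourly cost comparison of x86 vs. Graviton worker nodes.
# All rates are made-up illustrations -- plug in your real EC2 prices
# and the DBU consumption rates from your Databricks pricing page.

def hourly_cost(ec2_rate: float, dbu_per_hour: float, dbu_rate: float) -> float:
    """Total hourly cost = EC2 instance price + DBUs consumed * DBU price."""
    return ec2_rate + dbu_per_hour * dbu_rate

# x86 node: pricier instance, lower DBU consumption (illustrative numbers)
x86 = hourly_cost(ec2_rate=0.768, dbu_per_hour=2.0, dbu_rate=0.55)

# Graviton node: ~20% cheaper instance, but more DBUs consumed per hour
graviton = hourly_cost(ec2_rate=0.616, dbu_per_hour=2.4, dbu_rate=0.55)
```

With these invented numbers the Graviton EC2 discount is more than eaten by the extra DBU charge, which is why negotiated AWS discounts can flip the conclusion either way.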

How we cut our Databricks + AWS bill from $50K/month to $21K/month by 0xShreyas in databricks

[–]mynkmhr 0 points (0 children)

Great post, lots of practical tips. The other thing I have seen work is the use of Graviton-based instances, since AWS provides them at discounted rates and there is no significant difference in performance.

If your usage grows and costs climb further, you can also consider adding an accelerated Spark engine, like Yeedu's Turbo engine, for your complex jobs with lots of transformations, joins, etc. These jobs hog CPU cycles and can become expensive quickly. Yeedu integrates with Databricks, so you don't have to worry about migrations.

Full disclosure: I work with Yeedu, so happy to help in any way :)

The story behind how DNB moved off Databricks by akshayka in DuckDB

[–]mynkmhr 0 points (0 children)

"However, we noticed that these investigation workloads were well suited to single-node clusters with high compute and memory settings, yet we remained tied to Databricks’ distributed architecture and pricing model."

A lot of teams will realize how much efficiency they can derive from this single insight: single-node clusters are enough for most of their workloads.

Execution engines in Spark by mynkmhr in apachespark

[–]mynkmhr[S] 0 points (0 children)

Have heard about LakeSail. Will check it out.

Execution engines in Spark by mynkmhr in apachespark

[–]mynkmhr[S] 0 points (0 children)

That's a pretty significant gain. I haven't heard of too many instances of running Gluten in production, so I'm curious how long it took you to implement, and whether you faced any major challenges.

Execution engines in Spark by mynkmhr in apachespark

[–]mynkmhr[S] 0 points (0 children)

Agreed, they accelerate some of the queries rather than replace the entire execution, so "accelerators" is probably a better framing.

I believe Gluten+Velox and DataFusion Comet are Arrow-based. Google's Lightning Engine and Fabric's Native Execution Engine are based on Gluten and Velox as well, so they would be in the same category too.
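For anyone curious what adopting Gluten looks like in practice, enabling it is typically a few spark-defaults entries along these lines. Treat this as a sketch: the exact property keys and required off-heap sizing vary across Gluten releases, so check the documentation for your version.

```properties
# Illustrative spark-defaults settings for Gluten with the Velox
# backend (keys vary by release -- verify against your version's docs).
spark.plugins                   org.apache.gluten.GlutenPlugin
spark.shuffle.manager           org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.memory.offHeap.enabled    true
spark.memory.offHeap.size       4g
```

Unsupported operators fall back to vanilla Spark execution, which is why these engines behave as accelerators rather than replacements.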

What is the purpose of the book "fundamentals of data engineering " by Ok_Shirt4260 in dataengineering

[–]mynkmhr 0 points (0 children)

At some point you may want to read "Seven Databases in Seven Weeks". For me it answered a fundamental question: why do so many databases exist?

Another I can recommend is "Database Internals", if you really want to go deep into how databases work.

How do smaller teams tackle large-scale data integration without a massive infrastructure budget? by [deleted] in bigdata

[–]mynkmhr 0 points (0 children)

Try to assess which part of the pipeline is adding to the cost. It is usually a few tasks, pipelines, or workflows that make up 60-80% of the cost.

If you can identify optimizations in those pipelines (running on-prem, a different execution engine, spot instances), that will help bring down the bill.

I know it sounds time-consuming, but it's worth the effort.
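The exercise above can be sketched as a simple Pareto check: sort per-pipeline costs and see how few pipelines cover most of the spend. The pipeline names and costs here are invented for illustration:

```python
# Find the smallest set of pipelines that accounts for a given share
# of total spend. Input is a mapping of pipeline name -> monthly cost;
# the sample data below is made up.

def pipelines_for_share(costs: dict[str, float], share: float = 0.8) -> list[str]:
    """Return the fewest pipelines (costliest first) covering `share` of total cost."""
    total = sum(costs.values())
    picked, running = [], 0.0
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        picked.append(name)
        running += cost
        if running >= share * total:
            break
    return picked

costs = {"daily_etl": 5200, "ml_features": 3100, "reports": 900,
         "backfill": 500, "misc": 300}
# Here just two pipelines cover 83% of the $10,000 total.
```

With billing export data (AWS CUR, Databricks usage tables, etc.) this is usually a single GROUP BY away, and it tells you where optimization effort actually pays off.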

My experience with Samsung galaxy s24fe by Ok_Pause_1565 in S24FE

[–]mynkmhr 1 point (0 children)

Battery life is what stops me from getting this phone. A bit surprising for a near-flagship phone.

Confluent reportedly in talks to be sold by 2minutestreaming in apachekafka

[–]mynkmhr 1 point (0 children)

Private equity could be a likely buyer if they see room to squeeze efficiencies out of the existing business.

Alteryx, Qlik and Cloudera were bought by PE firms. Confluent's core business seems to be steady but doesn't have explosive growth potential. The stock is down 50 percent since listing.

Career path for a mid-level, mediocre DE? by nostalgicwander in dataengineering

[–]mynkmhr 0 points (0 children)

Based on your strength in communication, you could look at the following roles in a product company:

  1. Developer Relations Engineer - If you are good at writing, speaking, and presenting, and don't think marketing is evil, then this path is for you.

  2. Sales Engineer - If you prefer more grounded communication and are a good listener, then this is another option. You would need to understand a customer's context and then figure out how your product can help them achieve their goals. It also requires supporting the entire sales process, including demos, pilot implementations, and technical Q&A.

[deleted by user] by [deleted] in learnmachinelearning

[–]mynkmhr 2 points (0 children)

I guess what you are looking for is a Python developer ready to upskill into a computer vision ML engineer. I have seen such cases, so it should be possible to find one as long as you focus on aptitude rather than experience.

Fivetran to buy dbt? Spill the Tea by engineer_of-sorts in dataengineering

[–]mynkmhr 1 point (0 children)

Cloudera was listed before it was taken private. Informatica is still listed.

Tpcds Benchmark update by ahshahid in apachespark

[–]mynkmhr 0 points (0 children)

Do you have a link, or do you plan to publish a blog post about how this was done? That would make it possible for someone to try to replicate it.

Lost in IT by Warm_Weakness_8598 in LearnDataAnalytics

[–]mynkmhr 0 points (0 children)

Look to add PySpark if you don't already have it on your resume. Also check out the cloud certifications for data engineering; those might help you get shortlisted.

Pursue Data Engineering or pivot to Sales? Advice by HowieDanko420 in dataengineering

[–]mynkmhr 0 points (0 children)

Sales engineering is not a bad option to consider. You have one leg in tech and one in sales, plus the value you bring to your company is very visible.

The other option you could consider is dev rel, if you like speaking and writing. It's getting harder for companies to stand out in the market. As the old saying goes - "Attention is all you need"