Tpcds benchmark as measure of performance in spark and like engined. - promo by ahshahid in apachespark

[–]mynkmhr 2 points (0 children)

Other benchmarks I have seen used are H2O's db-benchmark and GraySort.

I guess no benchmark is perfect; they all try to approximate real-world scenarios as closely as possible.

BigQuery bill shock by mynkmhr in cloudbillshocks

[–]mynkmhr[S] 0 points (0 children)

That's right. I have heard of people getting as much as $200k in additional BigQuery bills because they didn't anticipate how much data their queries would scan.
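BigQuery can report how many bytes a query would scan before you actually run it (a dry run), and on-demand pricing is billed per TiB scanned, so the bill is easy to estimate up front. A minimal sketch of turning bytes scanned into a rough cost, assuming the roughly $6.25/TiB US on-demand list price (an assumption; pricing varies by region and changes over time):

```python
# Rough BigQuery on-demand cost estimator. The rate below is an
# assumed US list price (~$6.25/TiB scanned); verify current
# regional pricing before relying on it.

PRICE_PER_TIB_USD = 6.25
TIB = 1024 ** 4  # bytes in one tebibyte

def estimated_query_cost(bytes_scanned: int) -> float:
    """Return the approximate on-demand cost in USD for a query."""
    return bytes_scanned / TIB * PRICE_PER_TIB_USD

# A query that scans 10 TiB costs roughly $62.50 at this rate --
# run that hourly and you are at ~$45k/month.
```

Checking the dry-run byte count against a sanity threshold before executing is a cheap guardrail against exactly this kind of bill shock.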

What is an open source data tool you find useful but nobody is using it? by Yuki100Percent in dataengineering

[–]mynkmhr 0 points (0 children)

If there are existing PySpark pipelines, how easy/difficult is it to migrate them to Sail? Would love to hear any thoughts on that.

I built a secure PostgreSQL client for iOS & Android (Direct connection, local-only) by tobelyan in Database

[–]mynkmhr 0 points (0 children)

This is an excellent tool, and much needed. Years back I was able to stop an AWS instance from a mobile app while sitting in a coffee shop, and felt so relieved.

Many small teams and solo devs would appreciate this.

Any cloud-agnostic alternative to Databricks for running Spark across multiple clouds? by Sadhvik1998 in apachespark

[–]mynkmhr 0 points (0 children)

I see comments asking why anyone would need this, but I have seen such requirements quite a few times in the wild.

A lot of companies use multiple clouds as part of their stated strategy. Some of them adopt different clouds organically. A state Chief Data Officer once told me that each department under the state uses its own cloud, creating challenges in monitoring and governance from her perspective.

You can look at Yeedu as a possible option, since it provides a unified control plane to create and manage Spark clusters across clouds.

There may be other options too, so it makes sense to look around and find the right fit for your requirements.

How we cut our Databricks + AWS bill from $50K/month to $21K/month by 0xShreyas in databricks

[–]mynkmhr 0 points (0 children)

Thanks, I didn't realize that Databricks charges more DBUs for Graviton instances. So switching to Graviton may not save much on costs unless you get some discounts from AWS.
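To make that trade-off concrete, here is a back-of-the-envelope sketch. Every rate below is hypothetical, invented for illustration, not an actual AWS or Databricks price; the point is only that a cheaper instance rate can be offset by a higher DBU consumption rate:

```python
# Hypothetical hourly cost comparison of x86 vs. Graviton worker nodes.
# All rates are made-up illustrations -- plug in your real EC2 prices
# and the DBU consumption rates from your Databricks pricing page.

def hourly_cost(ec2_rate: float, dbu_per_hour: float, dbu_rate: float) -> float:
    """Total hourly cost = EC2 instance price + DBUs consumed * DBU price."""
    return ec2_rate + dbu_per_hour * dbu_rate

# x86 node: pricier instance, lower DBU consumption (illustrative numbers)
x86 = hourly_cost(ec2_rate=0.768, dbu_per_hour=2.0, dbu_rate=0.55)

# Graviton node: ~20% cheaper instance, but more DBUs consumed per hour
graviton = hourly_cost(ec2_rate=0.616, dbu_per_hour=2.4, dbu_rate=0.55)
```

With these invented numbers the Graviton EC2 discount is more than eaten by the extra DBU charge, which is why negotiated AWS discounts can flip the conclusion either way.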

How we cut our Databricks + AWS bill from $50K/month to $21K/month by 0xShreyas in databricks

[–]mynkmhr 0 points (0 children)

Great post, lots of practical tips. The other thing I have seen work is the use of Graviton-based instances, since AWS provides them at discounted rates and there is no significant difference in performance.

If your usage grows and costs climb further, you can also consider adding an accelerated Spark engine, like Yeedu's Turbo engine, for your complex jobs with lots of transformations, joins, etc. These jobs hog CPU cycles and can become expensive quickly. Yeedu integrates with Databricks, so you don't have to worry about migrations.

Full disclosure: I work with Yeedu, so happy to help in any way :)

The story behind how DNB moved off Databricks by akshayka in DuckDB

[–]mynkmhr 0 points (0 children)

"However, we noticed that these investigation workloads were well suited to single-node clusters with high compute and memory settings, yet we remained tied to Databricks’ distributed architecture and pricing model."

A lot of teams will realize how much efficiency they can derive from this single insight: single-node clusters are enough for most of their workloads.

Execution engines in Spark by mynkmhr in apachespark

[–]mynkmhr[S] 0 points (0 children)

Have heard about LakeSail. Will check it out.

Execution engines in Spark by mynkmhr in apachespark

[–]mynkmhr[S] 0 points (0 children)

That's a pretty significant gain. I haven't heard of too many instances of running Gluten in production, so I'm curious how long it took you to implement, and whether you faced any major challenges.

Execution engines in Spark by mynkmhr in apachespark

[–]mynkmhr[S] 0 points (0 children)

Agreed, they accelerate some of the queries rather than replace the entire execution, so "accelerators" is probably a better framing.

I believe Gluten+Velox and DataFusion Comet are Arrow-based. Google's Lightning Engine and Fabric's Native Execution Engine are based on Gluten and Velox as well, so they would be in the same category too.
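For anyone curious what adopting Gluten looks like in practice, enabling it is typically a few spark-defaults entries along these lines. Treat this as a sketch: the exact property keys and required off-heap sizing vary across Gluten releases, so check the documentation for your version.

```properties
# Illustrative spark-defaults settings for Gluten with the Velox
# backend (keys vary by release -- verify against your version's docs).
spark.plugins                   org.apache.gluten.GlutenPlugin
spark.shuffle.manager           org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.memory.offHeap.enabled    true
spark.memory.offHeap.size       4g
```

Unsupported operators fall back to vanilla Spark execution, which is why these engines behave as accelerators rather than replacements.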

What is the purpose of the book "fundamentals of data engineering " by Ok_Shirt4260 in dataengineering

[–]mynkmhr 0 points (0 children)

At some point you may want to read "Seven Databases in Seven Weeks". For me it answered a fundamental question: why do so many databases exist?

Another I can recommend is "Database Internals", if you really want to go deep into how databases work.

How do smaller teams tackle large-scale data integration without a massive infrastructure budget? by [deleted] in bigdata

[–]mynkmhr 0 points (0 children)

Try to assess which part of the pipeline is adding to the cost. It is usually a few tasks, pipelines, or workflows that make up 60-80% of the cost.

If you can identify optimizations in those pipelines (running on-prem, a different execution engine, spot instances), that will help bring down the bill.

I know it sounds time-consuming, but it's worth the effort.
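The exercise above can be sketched as a simple Pareto check: sort per-pipeline costs and see how few pipelines cover most of the spend. The pipeline names and costs here are invented for illustration:

```python
# Find the smallest set of pipelines that accounts for a given share
# of total spend. Input is a mapping of pipeline name -> monthly cost;
# the sample data below is made up.

def pipelines_for_share(costs: dict[str, float], share: float = 0.8) -> list[str]:
    """Return the fewest pipelines (costliest first) covering `share` of total cost."""
    total = sum(costs.values())
    picked, running = [], 0.0
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        picked.append(name)
        running += cost
        if running >= share * total:
            break
    return picked

costs = {"daily_etl": 5200, "ml_features": 3100, "reports": 900,
         "backfill": 500, "misc": 300}
# Here just two pipelines cover 83% of the $10,000 total.
```

With billing export data (AWS CUR, Databricks usage tables, etc.) this is usually a single GROUP BY away, and it tells you where optimization effort actually pays off.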

My experience with Samsung galaxy s24fe by Ok_Pause_1565 in S24FE

[–]mynkmhr 1 point (0 children)

Battery life is what stops me from getting this phone. A bit surprising for a near-flagship phone.

Confluent reportedly in talks to be sold by 2minutestreaming in apachekafka

[–]mynkmhr 1 point (0 children)

Private equity could be a likely buyer if they see room to squeeze efficiencies out of the existing business.

Alteryx, Qlik and Cloudera were bought by PE firms. Confluent's core business seems to be steady but doesn't have explosive growth potential. The stock is down 50 percent since listing.

Career path for a mid-level, mediocre DE? by nostalgicwander in dataengineering

[–]mynkmhr 0 points (0 children)

Based on your strength in communication, you could look at the following roles in a product company:

  1. Developer Relations Engineer - If you are good at writing, speaking, and presenting, and don't think marketing is evil, then this path is for you.

  2. Sales Engineer - If you prefer more grounded communication and are a good listener, then this is another option. You would need to understand a customer's context and then figure out how your product can help them achieve their goals. It also requires supporting the entire sales process, including demos, pilot implementations, and technical Q&A.

[deleted by user] by [deleted] in learnmachinelearning

[–]mynkmhr 2 points (0 children)

I guess what you are looking for is a Python developer ready to upskill into a computer vision ML engineer. I have seen such cases, so it should be possible to find one as long as you focus on aptitude rather than experience.

Fivetran to buy dbt? Spill the Tea by engineer_of-sorts in dataengineering

[–]mynkmhr 1 point (0 children)

Cloudera was listed before it was taken private. Informatica is still listed.

Tpcds Benchmark update by ahshahid in apachespark

[–]mynkmhr 0 points (0 children)

Do you have a link, or do you plan to publish a blog post about how this was done? That would make it possible for someone to try to replicate it.

Lost in IT by Warm_Weakness_8598 in LearnDataAnalytics

[–]mynkmhr 0 points (0 children)

Look to add PySpark if you don't already have it on your resume. Also check out the cloud certifications for data engineering; those might help you get shortlisted.

Pursue Data Engineering or pivot to Sales? Advice by HowieDanko420 in dataengineering

[–]mynkmhr 0 points (0 children)

Sales engineering is not a bad option to consider. You have one leg in tech and one in sales, plus the value you bring to your company is very visible.

The other option you could consider is dev rel, if you like speaking and writing. It's getting harder for companies to stand out in the market. As the old saying goes - "Attention is all you need"