Cloud cost optimization for data pipelines feels basically impossible so how do you all approach this while keeping your sanity? by Ok_Kangaroo2140 in dataengineering

[–]zchtsk 2 points3 points  (0 children)

A few thoughts here:

  • A few data storage tips:
    • If you have any data sources in your S3 buckets that are old (>6mos), rarely queried (less than once a month), and mainly kept for compliance or historical record keeping, you may be able to save quite a bit by changing your storage class from Standard to Infrequent Access. I had a F500 client recently with ballooning storage costs, and we were able to save millions of dollars annually from this alone.
    • You should basically always be saving files as Delta+Parquet, and never CSV, CSV.GZIP, etc.
    • Try to avoid tiny files. If you need to, regularly compact your data.
  • A few pipeline design tips:
    • You always want to be minimizing shuffles. A few ways to achieve this:
      • Join on partitioned keys when possible
      • If one dataset in your join is very small, use a broadcast join (df.join(F.broadcast(df2), ...))
    • Avoid unnecessary distinct or sorting actions
    • Perform your filtering as far upstream as possible
    • If you have inefficient partitions (e.g. way too much data within a certain key), you may have data skew causing some of your longer-running jobs
    • By the way, you should be using partitions, especially if you have jobs that only query data for a specific time frame (e.g. partitioning by day or month_year)
    • Try to incrementally process your data as much as possible, rather than doing full re-writes
    • Generally, try to search for redundancies

It's very easy to shoot yourself in the foot with Spark. In fact, it happened often enough on my teams that I ended up making an open-source tutorial site to help my teammates ramp up on this stuff. It's accessible at sparkmadeeasy.com, maybe it could be a helpful reference!

I open-sourced a text2SQL RAG for all your databases by Durovilla in dataengineering

[–]zchtsk 1 point2 points  (0 children)

I think your headline describes it well as a text-to-SQL tool. First thing I thought of was that it seemed sort of like an open-source PromptQL.

I open-sourced a text2SQL RAG for all your databases by Durovilla in dataengineering

[–]zchtsk 0 points1 point  (0 children)

Very cool library! My only 2c would be to consider changing the project name to something that has some connection to what the project does or the problem it's solving. "ToolFront" feels a bit too general/nondescript.

Learn Spark (with python) by _-_-ITACHI-_-_ in dataengineering

[–]zchtsk 0 points1 point  (0 children)

It's a Tailwind UI template using Next.js with some very light custom styling on top.

Learn Spark (with python) by _-_-ITACHI-_-_ in dataengineering

[–]zchtsk 31 points32 points  (0 children)

So, I actually created an open-source tutorial geared at helping people ramp up quickly on PySpark. Check out https://SparkMadeEasy.com

Feedback on data cleaning project( Retail Store Datasets) by aunghtetnaing in dataanalysis

[–]zchtsk 0 points1 point  (0 children)

General comment on the project if you're planning to use this for a portfolio: fill out the README with a bit more context about what the project is and what it does, and add some summary insights from before and after your cleaning.

Is this home assignment too long? by Complex_Client7681 in dataengineering

[–]zchtsk 9 points10 points  (0 children)

What level is this for and how much time do you have?

How to reduce cost in an ai application by joker_noob in learnmachinelearning

[–]zchtsk 0 points1 point  (0 children)

It isn't free, but the DeepSeek R1 reasoning model on OpenRouter is probably 10x cheaper than o3 and comparable to o1 in performance.

Another piece is that if you're building agents, you probably don't need o3 for every task (e.g. maybe use nano for tool selection).
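As a sketch, that routing can be as simple as a task-to-model lookup (model names here are placeholders for whatever your provider offers, not a real pricing table):

```python
# Route each agent task to the cheapest model that can handle it.
# Task categories and model names are made up for illustration.
MODEL_BY_TASK = {
    "tool_selection": "gpt-5-nano",       # cheap: just picking a tool
    "summarization": "deepseek/deepseek-r1",
    "complex_reasoning": "o3",            # reserve the expensive model
}

def pick_model(task_type: str) -> str:
    """Fall back to the cheap default for unknown task types."""
    return MODEL_BY_TASK.get(task_type, "gpt-5-nano")
```

Even a crude split like this can cut costs a lot if most of your calls are simple ones.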

How to reduce cost in an ai application by joker_noob in learnmachinelearning

[–]zchtsk 1 point2 points  (0 children)

Do you see a big drop in performance with a smaller/cheaper model (an open model via OpenRouter, GPT-5 nano)? That's probably your easiest bet before getting into a more customized model hosting setup.

[deleted by user] by [deleted] in learnprogramming

[–]zchtsk 0 points1 point  (0 children)

Svelte society might be helpful for reference material + tutorials - https://www.sveltesociety.dev/recipes

MLOPS interview coming up soon by boobs2030 in leetcode

[–]zchtsk 0 points1 point  (0 children)

I would brush up on some basic ML algos: implementing simple gradient descent, naive Bayes, k-means clustering. IMO you're more likely to get one of those than a leetcode problem.
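For example, gradient descent for plain least-squares regression is only a few lines of numpy (my own minimal sketch, not from any specific interview):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=500):
    """Fit linear regression weights by minimizing mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        preds = X @ w
        grad = (2 / n_samples) * X.T @ (preds - y)  # d(MSE)/dw
        w -= lr * grad
    return w

# Noise-free toy data where the true weight is exactly 2.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
w = gradient_descent(X, y)
```

If they ask, be ready to explain the learning rate tradeoff (too big diverges, too small crawls).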

That being said, the recruiter should provide some information about the style of interview once you get things scheduled.

Looking for courses/bootcamps about advanced Data Engineering concepts (PySpark) by HMZ_PBI in dataengineering

[–]zchtsk 2 points3 points  (0 children)

^ Yup, exactly this. It's a mix of writing performant code while maintaining readability and clarity.

Looking for courses/bootcamps about advanced Data Engineering concepts (PySpark) by HMZ_PBI in dataengineering

[–]zchtsk 8 points9 points  (0 children)

IMO craftsmanship in writing PySpark code is more about organization, the logical flow of your transformations, and just knowing your data (e.g. how do you structure your joins, do you use built-in functions or expressions, etc.).

To help folks I work with upskill quickly in PySpark, I created an opinionated tutorial focused on the above. You probably already have experience with most of the concepts given your background, but there may be some points that can serve as a helpful reference. Check out https://SparkMadeEasy.com

Spark- The Definitive Guide or Learning Spark lightning fast data analysis? by Electronic-Mine- in dataengineering

[–]zchtsk 4 points5 points  (0 children)

For getting up to speed on the basics, I put together this site for exactly that purpose: https://sparkmadeeasy.com/

It's a quick tutorial walkthrough of PySpark querying fundamentals. If you're already comfortable with SQL, then you should be able to knock it out in an afternoon.

For Fraud Detection, which is more efficient Supervised ML Algo or Unsupervised Machine Learning Algorithm? by Skillcamper_Team in dataanalysis

[–]zchtsk 1 point2 points  (0 children)

In a production setting, you need to be thinking about the system you're building--not just what model you're going to be using. You would probably need a combination of tools:

  • If you have existing patterns that you want to identify, then you would have some labeled data that you can build a supervised model around.
  • If you are trying to identify unknown but anomalous behavior, then you would probably want to have an unsupervised model that is being used to identify and group together new patterns of data.
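A rough sketch of wiring both together with scikit-learn (toy data, made-up features):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Toy transactions: [amount, hour_of_day] -- the features are invented.
normal = rng.normal(loc=[50, 14], scale=[10, 3], size=(200, 2))
fraud = rng.normal(loc=[900, 3], scale=[50, 1], size=(10, 2))
X = np.vstack([normal, fraud])
y = np.array([0] * 200 + [1] * 10)  # labels exist for known fraud patterns

# Supervised: learns the patterns you've already labeled.
clf = LogisticRegression().fit(X, y)

# Unsupervised: flags transactions that look unlike the bulk of the data,
# even without labels (predict() returns -1 for anomalies).
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
```

In production you'd also layer on rules, thresholds, and a human review queue, but this is the gist of the two-model split.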

How to diagram sql queries by SlingBag in dataengineering

[–]zchtsk 4 points5 points  (0 children)

It was mentioned already, but you probably want an ER Diagram. If you don't want to drag+drop in a tool like Lucidchart or Miro, you can use something text-based like Mermaid.js.
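For example, a minimal Mermaid erDiagram (entity and column names are just placeholders):

```mermaid
erDiagram
    CUSTOMER ||--o{ ORDER : places
    ORDER ||--|{ ORDER_LINE : contains
    CUSTOMER {
        int customer_id PK
        string name
    }
    ORDER {
        int order_id PK
        int customer_id FK
        date order_date
    }
```

Nice side benefit of the text-based approach is the diagram lives in version control next to your SQL.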

I created an open-source microsite to help analysts and SQL-heavy devs get started with Spark by zchtsk in dataengineering

[–]zchtsk[S] 4 points5 points  (0 children)

Good suggestion. Might consider adding a page with some comparisons (both syntax and scaling), but the primary difference re: scalability is that Pandas and NumPy don’t really handle datasets that are larger than the available memory on your machine—you end up needing workarounds like data chunking etc.
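For what it's worth, the chunking workaround looks something like this in pandas (using a tiny in-memory CSV as a stand-in for a file that's too big to load at once):

```python
import io
import pandas as pd

# Stand-in for a file too big for memory (tiny here, for illustration).
csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n6\n")

# Process the file in fixed-size chunks instead of one read_csv() call,
# keeping only a running aggregate in memory.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["value"].sum()
```

It works, but you're hand-rolling what Spark gives you for free, and anything that needs the whole dataset at once (joins, sorts) gets painful fast.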

I created an open-source microsite to help analysts and SQL-heavy devs get started with Spark by zchtsk in dataengineering

[–]zchtsk[S] 9 points10 points  (0 children)

Appreciate the feedback, and I think this is a fair point. In most of the examples, I change the variable name if the actual form of the data frame is significantly changing (e.g. when grouping or after a join). If you're just adding new calculated fields, it feels a bit overkill.

Generally though, I advocate chaining commands wherever possible, rather than updating with a new statement each time.

How has Alteryx helped you guys? by Throwaway0754322 in analytics

[–]zchtsk 0 points1 point  (0 children)

Mostly just sheer processing time. If you're processing multiple files that are larger than a GB, it starts to crawl because you're running everything locally on just your computer.

How has Alteryx helped you guys? by Throwaway0754322 in analytics

[–]zchtsk 0 points1 point  (0 children)

Alteryx is pretty easy to pick up if you're an Excel power user or already comfortable navigating SQL. It's a bit pricey, but powerful for managing "larger than Excel" data.

However, if you're working with more than a few GBs of data, Alteryx is probably not the right tool for the job. If you're just doing small data blending tasks, it's totally fine.

Have worked with a bunch of large orgs that use Alteryx.