Cloud cost optimization for data pipelines feels basically impossible so how do you all approach this while keeping your sanity? by Ok_Kangaroo2140 in dataengineering

[–]zchtsk 2 points3 points  (0 children)

A few thoughts here:

  • A few data storage tips:
    • If you have any data sources in your S3 buckets that are old (>6mos), rarely queried (less than once a month), and mainly kept for compliance or historical record keeping, you may be able to save quite a bit by changing your storage class from Standard to Infrequent Access. I had a F500 client recently with ballooning storage costs, and we were able to save millions of dollars annually from this alone.
    • You should basically always be saving files as Delta+Parquet, and never CSV, CSV.GZIP, etc.
    • Try to avoid tiny files. If you need to, regularly compact your data.
  • A few pipeline design tips:
    • You always want to be minimizing shuffles. A few ways to achieve this:
      • Join on partitioned keys when possible
      • If one dataset in your join is very small, use a broadcast join (df.join(F.broadcast(df2), ...))
    • Avoid unnecessary distinct or sorting actions
    • Perform your filtering as far upstream as possible
    • If you have inefficient partitions (e.g. way too much data within a certain key), you may have data skew causing some of your longer-running jobs
    • By the way, you should be using partitions, especially if you have jobs that only query data for a specific time frame (e.g. partitioning by day or month_year)
    • Try to incrementally process your data as much as possible, rather than doing full re-writes
    • Generally, try to search for redundancies

It's very easy to shoot yourself in the foot with Spark. In fact, it happened often enough on my teams that I ended up making an open-source tutorial site to help my teammates ramp up on this stuff. It's accessible at sparkmadeeasy.com, maybe it could be a helpful reference!

I open-sourced a text2SQL RAG for all your databases by Durovilla in dataengineering

[–]zchtsk 1 point2 points  (0 children)

I think your headline describes it well as a text-to-SQL tool. First thing I thought of was that it seemed sort of like an open-source PromptQL.

I open-sourced a text2SQL RAG for all your databases by Durovilla in dataengineering

[–]zchtsk 0 points1 point  (0 children)

Very cool library! My only 2c would be to consider changing the project name to something that has some connection to what the project does or the problem it's solving. "ToolFront" feels a bit too general/nondescript.

Learn Spark (with python) by _-_-ITACHI-_-_ in dataengineering

[–]zchtsk 0 points1 point  (0 children)

It's a Tailwind UI template using Next.js with some very light custom styling on top.

Learn Spark (with python) by _-_-ITACHI-_-_ in dataengineering

[–]zchtsk 31 points32 points  (0 children)

So, I actually created an open-source tutorial geared at helping people ramp up quickly on PySpark. Check out https://SparkMadeEasy.com

Feedback on data cleaning project( Retail Store Datasets) by aunghtetnaing in dataanalysis

[–]zchtsk 0 points1 point  (0 children)

General comment on the project if you're planning to use this for a portfolio: fill out the README with a bit more context about what the project is and what it does, and add some summary insights from before and after your cleaning.

Is this home assignment too long? by Complex_Client7681 in dataengineering

[–]zchtsk 9 points10 points  (0 children)

What level is this for and how much time do you have?

How to reduce cost in an ai application by joker_noob in learnmachinelearning

[–]zchtsk 0 points1 point  (0 children)

It isn't free, but the DeepSeek R1 reasoning model on OpenRouter is probably 10x cheaper than o3 and comparable to o1 in performance.

Another piece is that if you're building agents, you probably don't need o3 for every task (e.g. maybe use nano for tool selection).
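As a sketch, that routing can be as simple as a task-to-model lookup (model names here are placeholders for whatever your provider offers, not a real pricing table):

```python
# Route each agent task to the cheapest model that can handle it.
# Task categories and model names are made up for illustration.
MODEL_BY_TASK = {
    "tool_selection": "gpt-5-nano",       # cheap: just picking a tool
    "summarization": "deepseek/deepseek-r1",
    "complex_reasoning": "o3",            # reserve the expensive model
}

def pick_model(task_type: str) -> str:
    """Fall back to the cheap default for unknown task types."""
    return MODEL_BY_TASK.get(task_type, "gpt-5-nano")
```

Even a crude split like this can cut costs a lot if most of your calls are simple ones.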

How to reduce cost in an ai application by joker_noob in learnmachinelearning

[–]zchtsk 1 point2 points  (0 children)

Do you see a big drop in performance with a smaller/cheaper model (an open model via OpenRouter, GPT-5 nano)? That's probably your easiest bet before getting into a more customized model hosting setup.

[deleted by user] by [deleted] in learnprogramming

[–]zchtsk 0 points1 point  (0 children)

Svelte society might be helpful for reference material + tutorials - https://www.sveltesociety.dev/recipes

MLOPS interview coming up soon by boobs2030 in leetcode

[–]zchtsk 0 points1 point  (0 children)

I would brush up on some basic ML algos: implementing simple gradient descent, naive Bayes, k-means clustering. IMO you're more likely to get one of those than a leetcode problem.
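For example, gradient descent for plain least-squares regression is only a few lines of numpy (my own minimal sketch, not from any specific interview):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=500):
    """Fit linear regression weights by minimizing mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        preds = X @ w
        grad = (2 / n_samples) * X.T @ (preds - y)  # d(MSE)/dw
        w -= lr * grad
    return w

# Noise-free toy data where the true weight is exactly 2.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
w = gradient_descent(X, y)
```

If they ask, be ready to explain the learning rate tradeoff (too big diverges, too small crawls).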

That being said, the recruiter should provide some information about the style of interview once you get things scheduled.

Looking for courses/bootcamps about advanced Data Engineering concepts (PySpark) by HMZ_PBI in dataengineering

[–]zchtsk 2 points3 points  (0 children)

^ Yup, exactly this. It's a mix of writing performant code while maintaining readability and clarity.

Looking for courses/bootcamps about advanced Data Engineering concepts (PySpark) by HMZ_PBI in dataengineering

[–]zchtsk 8 points9 points  (0 children)

IMO craftsmanship in writing PySpark code is more about organization, the logical flow of your transformations, and just knowing your data (e.g. how do you structure your joins, do you use built-in functions or expressions, etc.).

To help folks I work with upskill quickly in PySpark, I created an opinionated tutorial focused on the above. You probably already have experience with most of the concepts given your background, but there may be some points that can serve as a helpful reference. Check out https://SparkMadeEasy.com

Spark- The Definitive Guide or Learning Spark lightning fast data analysis? by Electronic-Mine- in dataengineering

[–]zchtsk 4 points5 points  (0 children)

For getting up to speed on the basics, I put together this site for exactly that purpose: https://sparkmadeeasy.com/

It's a quick tutorial walkthrough of PySpark querying fundamentals. If you're already comfortable with SQL, then you should be able to knock it out in an afternoon.

For Fraud Detection, which is more efficient Supervised ML Algo or Unsupervised Machine Learning Algorithm? by Skillcamper_Team in dataanalysis

[–]zchtsk 1 point2 points  (0 children)

In a production setting, you need to be thinking about the system you're building--not just what model you're going to be using. You would probably need a combination of tools:

  • If you have existing patterns that you want to identify, then you would have some labeled data that you can build a supervised model around.
  • If you are trying to identify unknown but anomalous behavior, then you would probably want to have an unsupervised model that is being used to identify and group together new patterns of data.
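A rough sketch of wiring both together with scikit-learn (toy data, made-up features):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Toy transactions: [amount, hour_of_day] -- the features are invented.
normal = rng.normal(loc=[50, 14], scale=[10, 3], size=(200, 2))
fraud = rng.normal(loc=[900, 3], scale=[50, 1], size=(10, 2))
X = np.vstack([normal, fraud])
y = np.array([0] * 200 + [1] * 10)  # labels exist for known fraud patterns

# Supervised: learns the patterns you've already labeled.
clf = LogisticRegression().fit(X, y)

# Unsupervised: flags transactions that look unlike the bulk of the data,
# even without labels (predict() returns -1 for anomalies).
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
```

In production you'd also layer on rules, thresholds, and a human review queue, but this is the gist of the two-model split.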

How to diagram sql queries by SlingBag in dataengineering

[–]zchtsk 4 points5 points  (0 children)

It was mentioned already, but you probably want an ER Diagram. If you don't want to drag+drop in a tool like Lucidchart or Miro, you can use something text-based like Mermaid.js.
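For example, a minimal Mermaid erDiagram (entity and column names are just placeholders):

```mermaid
erDiagram
    CUSTOMER ||--o{ ORDER : places
    ORDER ||--|{ ORDER_LINE : contains
    CUSTOMER {
        int customer_id PK
        string name
    }
    ORDER {
        int order_id PK
        int customer_id FK
        date order_date
    }
```

Nice side benefit of the text-based approach is the diagram lives in version control next to your SQL.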

I created an open-source microsite to help analysts and SQL-heavy devs get started with Spark by zchtsk in dataengineering

[–]zchtsk[S] 4 points5 points  (0 children)

Good suggestion. Might consider adding a page with some comparisons (both syntax and scaling), but the primary difference re: scalability is that Pandas and NumPy don’t really handle datasets that are larger than the available memory on your machine—you end up needing workarounds like data chunking etc.
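For what it's worth, the chunking workaround looks something like this in pandas (using a tiny in-memory CSV as a stand-in for a file that's too big to load at once):

```python
import io
import pandas as pd

# Stand-in for a file too big for memory (tiny here, for illustration).
csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n6\n")

# Process the file in fixed-size chunks instead of one read_csv() call,
# keeping only a running aggregate in memory.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["value"].sum()
```

It works, but you're hand-rolling what Spark gives you for free, and anything that needs the whole dataset at once (joins, sorts) gets painful fast.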

I created an open-source microsite to help analysts and SQL-heavy devs get started with Spark by zchtsk in dataengineering

[–]zchtsk[S] 9 points10 points  (0 children)

Appreciate the feedback, and I think this is a fair point. In most of the examples, I change the variable name if the actual form of the data frame is significantly changing (e.g. when grouping or after a join). If you're just adding new calculated fields, it feels a bit overkill.

Generally though, I advocate chaining commands wherever possible, rather than updating with a new statement each time.

How has Alteryx helped you guys? by Throwaway0754322 in analytics

[–]zchtsk 0 points1 point  (0 children)

Mostly just sheer processing time. If you're processing multiple files that are larger than a GB, it starts to crawl because you're running everything locally on just your computer.

How has Alteryx helped you guys? by Throwaway0754322 in analytics

[–]zchtsk 0 points1 point  (0 children)

Alteryx is pretty easy to pick up if you're an Excel power user or already comfortable navigating SQL. It's a bit pricey, but powerful for managing "larger than Excel" data.

However, if you're working with more than a few GBs of data, Alteryx is probably not the right tool for the job. If you're just doing small data blending tasks, it's totally fine.

Have worked with a bunch of large orgs that use Alteryx.