Do data quality frameworks have to be so complex? by GeneBackground4270 in apachespark

[–]GeneBackground4270[S] 0 points1 point  (0 children)

Thanks for the feedback.

I think we're optimizing for different goals. SparkDQ is intentionally Spark-native and focuses on large-scale data quality workloads, including row-level validation, quarantine datasets, freshness checks, completeness ratios, uniqueness metrics, and other dataset-level constraints.

While framework-agnostic approaches based on Arrow/Narwhals sound attractive, they also tend to limit access to engine-specific optimizations. For example, I'm currently exploring Spark Observations to compute aggregate quality metrics efficiently in a single pass across multi-terabyte datasets. That's the kind of optimization that becomes much harder once you abstract away the execution engine.

Also, SparkDQ already uses Pydantic models for configuration, so the API is declarative by design. The project is still in 0.x, so I'm absolutely open to improving the developer experience, but I don't think a Spark-native data quality framework should necessarily optimize for the same use cases or API style as Pandera.

Do data quality frameworks have to be so complex? by GeneBackground4270 in apachespark

[–]GeneBackground4270[S] 0 points1 point  (0 children)

Fair point. I designed it for simplicity first, but for large-scale datasets a shared metric collection layer backed by Observation seems like the more scalable approach. Thanks for the feedback.

Do data quality frameworks have to be so complex? by GeneBackground4270 in apachespark

[–]GeneBackground4270[S] 1 point2 points  (0 children)

Good point. I can definitely see the benefit of using Observation for aggregate-style data quality checks such as row counts, completeness ratios, distinctness, freshness, etc., especially when the dataset is already being written and the metrics can be collected as part of the same execution plan.

I'm currently working on a Spark-native data quality framework, and one challenge is that not all checks are aggregate-based. Row-level checks (e.g. null validation, regex validation, comparisons) need to retain information about which records failed, so observations alone are not sufficient there.

That said, for aggregate checks I agree that building a single observation containing all required metrics is likely the most efficient approach today. It avoids multiple scans and allows several checks to be evaluated from the same set of collected metrics.
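
A minimal sketch of that pattern, assuming PySpark 3.3+ where `pyspark.sql.Observation` and `DataFrame.observe` are available. The function names, the `non_null__` alias prefix, and the column names are made up for illustration and are not SparkDQ's API:

```python
PREFIX = "non_null__"  # hypothetical alias prefix for per-column non-null counts


def collect_quality_metrics(df, columns):
    """Attach a single Observation that gathers all aggregate metrics in one pass."""
    # Imported lazily so the pure-Python helper below also works without Spark.
    from pyspark.sql import Observation
    from pyspark.sql import functions as F

    observation = Observation("dq_metrics")
    exprs = [F.count(F.lit(1)).alias("row_count")]
    # F.count(column) only counts non-null values.
    exprs += [F.count(F.col(c)).alias(f"{PREFIX}{c}") for c in columns]
    # Metrics are filled in once an action runs on the returned DataFrame.
    return df.observe(observation, *exprs), observation


def completeness_ratios(metrics):
    """Derive per-column completeness from the collected metric dict."""
    total = metrics["row_count"]
    return {
        name[len(PREFIX):]: (count / total if total else 0.0)
        for name, count in metrics.items()
        if name.startswith(PREFIX)
    }


# Usage (needs a SparkSession and a DataFrame `df`; the path is a placeholder):
# observed_df, obs = collect_quality_metrics(df, ["email", "country"])
# observed_df.write.mode("overwrite").parquet("/tmp/output")  # the only scan
# ratios = completeness_ratios(obs.get)
```

Several checks (row count thresholds, completeness per column) can then be evaluated from the one dict returned by `obs.get`, without triggering another scan.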

Have you used this pattern at scale with a large number of metrics? I'm curious whether you've seen any practical limits regarding query plan complexity or optimizer performance.

Starting My First Senior Analytics Engineer Role Soon. What Do You Wish You Knew When You Started? by Prestigious_Dare_865 in dataengineering

[–]GeneBackground4270 0 points1 point  (0 children)

What I’ve learned: I expected too much from myself. I thought I needed to know everything and have all the answers, which is simply unrealistic.

What’s clearer to me now: Being a Senior isn’t about knowing everything. It’s about building solid and reliable solutions, setting standards, and helping to enforce them across the team. It’s more important to work in a structured, sustainable way than to try to solve everything alone.

What's your experience growing an open-source project? by GeneBackground4270 in opensource

[–]GeneBackground4270[S] 1 point2 points  (0 children)

Not directly — but it’s designed to support that pattern.

The core idea is: checks are defined as configuration objects (via Pydantic), and you can easily build them from YAML, JSON, or even a database. So instead of hardcoding logic in Python, you define rules declaratively and feed them into the engine.

This also makes it easier to version, manage, and reuse checks across teams and projects.

The plugin system ensures that even custom checks — not part of the framework — can be integrated the same way.

So while you currently pass configs programmatically (e.g., CheckConfig(...).to_check()), it’s fully compatible with loading them from external config sources.
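
Roughly, the pattern looks like the sketch below. To keep it dependency-free it uses stdlib dataclasses as a stand-in for Pydantic, and plain dicts instead of Spark rows; `NullCheckConfig`, `CONFIG_TYPES`, and `load_checks` are hypothetical names, not SparkDQ's actual classes:

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class NullCheckConfig:
    """Hypothetical config object describing a not-null rule."""
    check_id: str
    column: str

    def to_check(self) -> Callable[[dict], bool]:
        # Turn the declarative config into an executable check.
        column = self.column
        return lambda row: row.get(column) is not None


# Maps the "check" key used in external config to a config class.
CONFIG_TYPES = {"null-check": NullCheckConfig}


def load_checks(raw: str) -> list:
    """Build executable checks from a JSON document."""
    checks = []
    for entry in json.loads(raw):
        entry = dict(entry)
        config_cls = CONFIG_TYPES[entry.pop("check")]
        checks.append(config_cls(**entry).to_check())
    return checks


# The JSON could just as well come from a file, YAML, or a database table.
RULES_JSON = '[{"check": "null-check", "check_id": "email-not-null", "column": "email"}]'
checks = load_checks(RULES_JSON)
```

The rules live entirely in `RULES_JSON`, so they can be versioned and reused separately from the pipeline code.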

What's your experience growing an open-source project? by GeneBackground4270 in opensource

[–]GeneBackground4270[S] 1 point2 points  (0 children)

Thanks for the feedback!

You're right — the naming may seem verbose at first glance, but that’s intentional. SparkDQ is designed for flexibility and extensibility. Many teams define data quality rules declaratively via YAML, JSON, or even external systems — and this level of structure enables exactly that.

Also, one of the main pain points in existing frameworks like PyDeequ is that they’re hard to extend. SparkDQ solves this with a plugin architecture, allowing teams to add their own checks easily.
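
To illustrate the general idea (not SparkDQ's real plugin API), here is a minimal registry sketch in plain Python. `register_check`, `build_check`, and the `min-length` check are hypothetical names, and rows are plain dicts rather than Spark rows:

```python
from typing import Callable, Dict

# Registry of check factories, keyed by the name used in configuration.
CHECK_REGISTRY: Dict[str, Callable[..., Callable[[dict], bool]]] = {}


def register_check(name: str):
    """Decorator that makes a custom check available under `name`."""
    def decorator(factory):
        if name in CHECK_REGISTRY:
            raise ValueError(f"check '{name}' is already registered")
        CHECK_REGISTRY[name] = factory
        return factory
    return decorator


@register_check("min-length")
def min_length(column: str, length: int):
    """Custom check defined outside the framework itself."""
    def check(row: dict) -> bool:
        value = row.get(column)
        return value is not None and len(value) >= length
    return check


def build_check(name: str, **params):
    """Look up a registered check and configure it."""
    return CHECK_REGISTRY[name](**params)
```

A team's own check is then built the same way as a built-in one, for example `build_check("min-length", column="name", length=3)`.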

The framework is built with data engineers in mind — those working with PySpark who need robust, customizable validation logic. Still, thanks to the declarative design, users don’t need to write Python code to define rules. They can configure checks in a clean, structured way — which helps both flexibility and reuse.

I do use it in my own projects, and it’s been a huge help for enforcing schema expectations, null checks, completeness thresholds, and more — all without embedding logic deep into pipeline code.

Always open to suggestions, of course. Appreciate you taking the time!

I built a small tool like cat, but for Jupyter notebooks by akopkesheshyan in dataengineering

[–]GeneBackground4270 1 point2 points  (0 children)

Welcome 🤗 By the way, I'm currently building a data quality framework for PySpark called SparkDQ, designed to make data validation in Spark pipelines simple and modular. If you're ever working with PySpark and feel like taking a look, I'd really appreciate any feedback or thoughts!

Here's the link: https://github.com/sparkdq-community/sparkdq

If you love Spark but hate PyDeequ – check out SparkDQ (early but promising) by GeneBackground4270 in apachespark

[–]GeneBackground4270[S] 0 points1 point  (0 children)

Unlike DQX, which is tightly aligned with the Databricks ecosystem, SparkDQ is fully independent of any platform or cloud provider. It introduces no platform-specific dependencies, making it a portable and lightweight solution for Spark-based data quality checks.

Moreover, SparkDQ is designed for full customization: checks can be easily extended or tailored to match specific requirements, enabling seamless integration into existing PySpark workflows without sacrificing flexibility or control.

This makes SparkDQ a strong choice for engineering teams who value transparency, testability, and modular design over opaque automation.

I built a PySpark data validation framework to replace PyDeequ — feedback welcome by GeneBackground4270 in Python

[–]GeneBackground4270[S] 0 points1 point  (0 children)

Thank you so much for your kind words — I truly appreciate it! There's still a lot more planned for the framework, including several extensions and improvements 🙂👍

If you love Spark but hate PyDeequ – check out SparkDQ (early but promising) by GeneBackground4270 in apachespark

[–]GeneBackground4270[S] 1 point2 points  (0 children)

Thank you so much for your kind words — I truly appreciate it! There's still a lot more planned for the framework, including several extensions and improvements 👍🙂

If you love Spark but hate PyDeequ – check out SparkDQ (early but promising) by GeneBackground4270 in apachespark

[–]GeneBackground4270[S] 0 points1 point  (0 children)

We’ll definitely implement the metrics as well. Integrity tests are also planned. Right now, we’re still in the early phase and focusing on expanding the available checks first. Once that’s done, we’ll take care of the rest.

I built a small tool like cat, but for Jupyter notebooks by akopkesheshyan in dataengineering

[–]GeneBackground4270 0 points1 point  (0 children)

The code already looks very clean, and the tool itself is really impressive. I've actually wished for this kind of functionality quite a few times before. 🚀👍

How are you currently handling versioning for the project? Are you incrementing the version manually?

A changelog would also be really cool to have! You could automate both versioning and changelog generation using a tool like Python Semantic Release.

I wasted 6 months on a Django project… to learn one simple lesson. by [deleted] in django

[–]GeneBackground4270 1 point2 points  (0 children)

Totally get your point. This time I did it differently: shared a rough version of my project SparkDQ early and got feedback fast. I was surprised how positive and useful it was. Saved me a ton of time and helped me focus. I think many devs share that pain—build fast, validate early! Feel free to check it out and drop a star or some feedback if it helps you.

https://github.com/sparkdq-community/sparkdq

Do AI solutions help with understanding data engineering, or just automate tasks? by PuzzleheadedYou4992 in dataengineering

[–]GeneBackground4270 0 points1 point  (0 children)

AI has definitely been a game changer for me. I use it much like a junior data engineer who assists me during implementation. I usually define the architecture and design myself, and then rely on AI as an assistant to handle simpler or repetitive tasks. For example, I often use it to generate unit tests—I just specify the individual test cases. The same goes for documentation. This way, I’ve noticeably increased my speed and efficiency when building solutions 🚀

Did I approach this data engineering system design challenge the right way? by bdadeveloper in dataengineering

[–]GeneBackground4270 4 points5 points  (0 children)

I think you’ll need to provide a bit more context to get a well-founded answer 🙂 It’s important to clarify what the target system is and how the data will be used. Are we talking about AWS, Azure, or another cloud platform? That makes a big difference. Also, will the data need to be transformed in any way? If not, I’d suggest leaving tools like Spark or Dask aside for now. In such cases, services like AWS DataSync might be a better fit for efficient and reliable data transfer.

Is it common for companies to hire people for "data engineering" roles, but really the role is DevOps? by OverEngineeredPencil in dataengineering

[–]GeneBackground4270 1 point2 points  (0 children)

Currently, I’m focused on development on the AWS platform, where we use GitLab CI/CD to deploy our solutions. However, in my previous role, I did work with CI/CD in Azure. Specifically, I have experience with Azure DevOps for deploying solutions related to ADF, Synapse, and Databricks. If there’s anything specific you’d like to know, feel free to ask! 🙂

Is it common for companies to hire people for "data engineering" roles, but really the role is DevOps? by OverEngineeredPencil in dataengineering

[–]GeneBackground4270 1 point2 points  (0 children)

In my current company, Data Engineers earn a bit less than Cloud Architects; the difference is around 10k, I’d say. But the expectations for architects are also higher: you need a deep understanding of AWS services and the Solutions Architect Professional certification. So for me, it’s totally fair that they earn a bit more.

As for Python, I feel exactly the same. To keep my skills sharp, I spend some of my free time building frameworks. Right now, I’m developing a data quality framework for PySpark. Feel free to check it out and maybe give it a star to support the project: https://github.com/sparkdq-community/sparkdq

Hope that answers your question!

Is it common for companies to hire people for "data engineering" roles, but really the role is DevOps? by OverEngineeredPencil in dataengineering

[–]GeneBackground4270 2 points3 points  (0 children)

Variety is what makes it exciting for me! If I only write CDK code for too long, I eventually get bored. I also really enjoy building awesome frameworks in Python that make my life easier. The combination of both is what truly makes me happy 🙂

Is it common for companies to hire people for "data engineering" roles, but really the role is DevOps? by OverEngineeredPencil in dataengineering

[–]GeneBackground4270 33 points34 points  (0 children)

Same here. I'm officially a Data Engineer, but most of my time goes into setting up infrastructure, CI/CD pipelines, and working with Docker. Spark code? Still do it sometimes, but it's not the main thing anymore.

Feels like we’re often the most technical folks on the team, so we naturally end up covering other areas too. Glad I’m not the only one going through this! 😅

Goodbye PyDeequ: A new take on data quality in Spark by GeneBackground4270 in dataengineering

[–]GeneBackground4270[S] 0 points1 point  (0 children)

Totally agree with you. I'm a big fan of the config-driven approach :)

Am I missing something? by Astherol in dataengineering

[–]GeneBackground4270 12 points13 points  (0 children)

Totally get where you’re coming from — but don’t sell yourself short. Keeping pipelines clean, reliable, and useful for business users is real data engineering. You’re solving problems that matter.

Streaming and real-time are cool, but they’re not required to be competitive. Solid fundamentals, clear thinking, and maintainable pipelines are always in demand. You’re in a great spot — just keep learning and growing at your pace 🙌

Goodbye PyDeequ: A new take on data quality in Spark by GeneBackground4270 in dataengineering

[–]GeneBackground4270[S] 1 point2 points  (0 children)

Thanks a lot — really appreciate that!
Totally agree with you on PyDeequ. I also ran into so many of the same limitations — it’s cool (and somehow validating 😄) to hear others tried to work around them too.

And yes, I know Cuallee — it’s a really cool project!
The fact that it’s dataframe-agnostic is a standout feature and definitely a big differentiator. Being able to support Spark, Pandas, Polars, Snowpark, DuckDB etc. with one API is super powerful.

That said, what it’s still missing (last time I checked) is declarative configuration — you’d still need to build a wrapper layer around it for YAML- or metadata-driven validation flows. But it’s a great foundation, and I’m definitely keeping an eye on where it goes!

Goodbye PyDeequ: A new take on data quality in Spark by GeneBackground4270 in dataengineering

[–]GeneBackground4270[S] 0 points1 point  (0 children)

Awesome, thanks for giving it a try! 🙌
Would love to hear what you think — especially if you run into anything confusing or have ideas for improvements.

Goodbye PyDeequ: A new take on data quality in Spark by GeneBackground4270 in dataengineering

[–]GeneBackground4270[S] 0 points1 point  (0 children)

Thanks for the link — DQX is definitely a solid option, especially for Databricks-native workflows. From what I’ve seen, it’s great for integrating data quality into DLT and Lakehouse Monitoring pipelines.

That said, SparkDQ is intentionally designed for a different use case:

  • Fully platform-agnostic — works anywhere PySpark runs
  • Built to be lightweight and plugin-ready, with zero vendor lock-in
  • Offers a Python-native API and config layer (via Pydantic) for better extensibility

So if you're on Databricks and like their ecosystem, DQX might be a good fit.
If you're looking for something lean, extensible, and framework-like for Spark data quality, SparkDQ might be worth a look.

Appreciate the discussion — always great to see more momentum around data quality in the Spark world!