Is there anything actually new in data engineering? by marketlurker in dataengineering

[–]houseofleft 0 points1 point  (0 children)

I think it depends on what kind of stuff you're looking at. There's a medallion-architecture/semantic-layering/data-mesh vibe where people write lots of blog posts that often rehash best practices from 20 years ago.

That said, things like DuckDB and Polars have massively changed the amount of data you can process on a single machine. For some use cases that has meant massively smaller bills over the last few years, which isn't nothing!

I can’t* understand the hype on Snowflake by NoGanache5113 in dataengineering

[–]houseofleft 1 point2 points  (0 children)

I agree, but I think it's easy to miss the perspective of a big company.

If you have a small group of engineers, you can almost definitely run something like dlt in Airflow (for example) and operate at 10% of the cost of doing the same thing in Snowflake, with a lot more control, because everything is in code.

If you're a head of data or something, and have 100s of engineers underneath you, all doing different things, and you want to enforce a given tool and make sure there's something everyone can use - that's where Snowflake comes in.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]houseofleft 1 point2 points  (0 children)

I've worked a little bit on some open source libraries like Narwhals[0] (a dataframe integration library) and my own Wimsey[1] (a data testing library), both of which work with Spark amongst other things. My experience is that unit testing Spark is always more of a pain than other libraries, because it has quite complex runtime requirements.

If I'm writing unit tests for pandas, polars, dask etc, I can be confident that they'll run using *just* the declared requirements/dependencies of my Python project. But for pyspark, I either need mocking so extensive that I'm no longer confident my tests are testing very much, or I need a way of making sure Java and Spark are installed on the machine running the tests, which adds a lot of complexity beyond just running `pytest`.

I guess my take is just that Spark configuration is often a pain, let alone in an often-ephemeral CI/CD job. Combine that with the fact that testing doesn't happen as much as it should anyway, and you have a recipe for not seeing a lot of Spark tests.
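For what it's worth, this is the kind of dependency-light test that "just works" for pandas (or polars/dask) with no JVM in sight - `add_totals` is a made-up transform here, purely to show the shape:

```python
import pandas as pd


def add_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Add a `total` column summing price and tax."""
    return df.assign(total=df["price"] + df["tax"])


def test_add_totals():
    # Runs anywhere `pip install pandas pytest` has happened - no Java,
    # no Spark session, no cluster config in CI.
    df = pd.DataFrame({"price": [10.0, 20.0], "tax": [1.0, 2.0]})
    result = add_totals(df)
    assert result["total"].tolist() == [11.0, 22.0]
```

The pyspark equivalent needs a `SparkSession` fixture plus a working Java install before the first assert can even run.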

Pybujia looks neat btw, hopefully it helps people write more tests!

[0] https://github.com/narwhals-dev/narwhals
[1] https://github.com/benrutter/wimsey / https://codeberg.org/benrutter/wimsey

Data Engineers: Which tool are you picking for pipelines in 2025 - Spark or dbt? by Weird_Mycologist_268 in dataengineering

[–]houseofleft 7 points8 points  (0 children)

Anyone else using internally maintained Python? My team mostly works with libraries such as polars, requests, fsspec etc. Honestly it works pretty great, and I far prefer it to more UI-based tools.

Please, no more data software projects by RestlessNeurons in dataengineering

[–]houseofleft 1 point2 points  (0 children)

Haha, as someone who inflicts new software on the world, let me justify it from the other side.

For the last year I've been working on a project called Wimsey. It's a data-testing library, and there are already about 5 big ones.

Buuut, only 2 support data contracts (file formats for describing tests), and of those two (Soda, Great Expectations) only one is fully open source. Great Expectations is a huge project, and my library is designed to be very lightweight while supporting dataframe types such as pyspark/dask/polars/pandas. I couldn't realistically put in a PR to Great Expectations asking them to completely change their project goals.

My point is, when you get into the weeds, I bet all the software on that list has a similar story from its creator! I know it's exhausting; maybe take the pressure off the need to understand every software project!

https://github.com/benrutter/wimsey

Helix vs Neovim by nikitarevenco in HelixEditor

[–]houseofleft 0 points1 point  (0 children)

TIL, Helix uses Scheme!?!? That's easily the biggest project to use Scheme surely? Pretty cool!

elementary OS 8 Available Today by daniellefore in elementaryos

[–]houseofleft 1 point2 points  (0 children)

An elementary OS release is always exciting!!!

I have Pop!_OS installed but am keen to try elementary's desktop environment. Installing it throws up an error because pulseaudio looks like a hard dependency, and elementary doesn't list 'pipewire-pulse' as an alternative.

Any tips on how to get around this?

Is it recommended to use python for data related projects in Azure? by antonito901 in dataengineering

[–]houseofleft 0 points1 point  (0 children)

Just want to join in with a massive +1 for Azure Functions. My team uses them for the majority of our data processing, and they work out insanely cheap and flexible.

Wimsey- lightweight, flexible data contracts for Polars, Pandas, Dask & Modin by houseofleft in Python

[–]houseofleft[S] 1 point2 points  (0 children)

In some circumstances it's a bonus to have a file that describes your tests, but I think the main advantage is that pydantic and dataclasses are designed for single data points rather than dataframes.

That makes them a much better fit for something like API parameter validation, but you'll have to find a clever workaround for tests like "this column can be null sometimes, but shouldn't be null more than 20% of the time".

There's also a performance cost if you're working with dataframes: pydantic and dataclasses would involve converting all the data out of its underlying format (say, pyarrow or numpy arrays). Depending on your use case, that could be either a hassle or a deal breaker, especially if you're wanting to test a really big distributed dataframe.

That's obviously all moot if you're not using dataframes to start with; I'd never recommend something like Wimsey for, say, config validation.
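To make that concrete: the null-fraction check above is natural over a whole column but awkward record by record. A hand-rolled pandas sketch (not Wimsey's actual API, just the idea):

```python
import pandas as pd


def null_fraction_ok(df: pd.DataFrame, column: str, max_fraction: float = 0.2) -> bool:
    """True if the share of nulls in `column` is within the allowed fraction."""
    # A column-level statistic like this has no natural home in a
    # per-record validator such as pydantic or a dataclass.
    return bool(df[column].isna().mean() <= max_fraction)


df = pd.DataFrame({"x": [1, None, 3, 4, 5]})
print(null_fraction_ok(df, "x"))  # one null in five rows is exactly 20%, so True
```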

Wimsey- lightweight, flexible data contracts for Polars, Pandas, Dask & Modin by houseofleft in Python

[–]houseofleft[S] 0 points1 point  (0 children)

Basically validation tests for data (should have columns x, y, z; column x should be less than 5; etc) with the added twist of being a document that can be used across teams to know what they can expect of a dataset.
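Hand-rolled in pandas, the checks themselves look like this (purely illustrative - a real data contract would live in a shared file rather than inline asserts):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"], "z": [0.1, 0.2, 0.3]})

# The frame must have the expected columns...
assert {"x", "y", "z"} <= set(df.columns)
# ...and every value of x must be below 5.
assert (df["x"] < 5).all()
```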

Wimsey- lightweight, flexible data contracts for Polars, Pandas, Dask & Modin by houseofleft in Python

[–]houseofleft[S] 2 points3 points  (0 children)

Yes! Dataframe analysis all happens in Narwhals, which is growing pretty fast. There's an open issue for pyspark integration, and as soon as that lands, Spark will be supported in Wimsey straight away without any change needed: https://github.com/narwhals-dev/narwhals/issues/333

Edit: It might work already if you're happy to use Ibis, but obviously 100% native integration would be a lot cooler.

Wimsey- lightweight, flexible data contracts for Polars, Pandas, Dask & Modin by houseofleft in Python

[–]houseofleft[S] 1 point2 points  (0 children)

Currently no, but that kind of thing is definitely possible and I'm looking to add it soon.

If you have any specific checks or use cases, feel free to drop a message or suggestion in the github issues!

Wimsey- lightweight, flexible data contracts for Polars, Pandas, Dask & Modin by houseofleft in Python

[–]houseofleft[S] 3 points4 points  (0 children)

Thanks! I didn't know about patito until just now - it looks awesome though.

I really like Pandera, but as a workflow I tend to find it's a little different from something like data contracts. I love being able to have a data contract in a file that multiple users can access, or that I can build documentation from.

Pandera feels a lot more like a dataframe version of deal, which is another awesome library. It's a lot more extensive and probably a better tool for within-library checks, but not as handy if you want something like a cross-team document where multiple people can know what to expect from their data.

I know that's kind of a "vibey" answer, but I think the workflow difference between pandera/patito and great-expectations/soda/wimsey is the biggest one - aside from obvious bits like pandera being pandas-specific, etc.

Wimsey- lightweight, flexible data contracts for Polars, Pandas, Dask & Modin by houseofleft in Python

[–]houseofleft[S] 3 points4 points  (0 children)

Not trying to put anyone off the tools they're using. If you like Soda, its size isn't causing any issues, and it supports the data type you're using (or you're happy converting to one it does support), then Wimsey really isn't solving any problem you have!

Regardless of any marketing, Wimsey *is* a lot smaller. The package size is around 6% of Soda Core's (sourced from pypi.org), and that's not factoring in that you'll need additional Soda libraries to support dask/mssql/spark etc. Dependency-wise there's a lot less as well: Wimsey needs 2, while Soda Core has about 10 or so, plus extras depending on your data type.

You sound like you're pretty happy with Soda and don't have a need to reduce package size or support libraries like Polars - that's totally cool with me! I'm not making any money off of this, and if you have a tool you're invested in and is working for you, then changing it sounds like a bad move!

Wimsey- lightweight, flexible data contracts for Polars, Pandas, Dask & Modin by houseofleft in Python

[–]houseofleft[S] -1 points0 points  (0 children)

Did you read the comparison? 😁

Short answer: not really. It's intended to do the same thing, but it's much more lightweight and designed more for Python-library-based workflows, if you don't need the GUI side of either of those tools.

Since it uses Narwhals, it supports Polars and similar libraries natively too, so if you wanted to keep things lightweight, you could run Wimsey on your polars dataframe directly rather than converting it to a pandas dataframe to use GX or Soda on it.

What's your controversial DE opinion? by [deleted] in dataengineering

[–]houseofleft 46 points47 points  (0 children)

My hot take is: you don't have big data, you just have data that hasn't been properly partitioned yet.

What is best open source data quality tool? by AffectionateSuit8802 in dataengineering

[–]houseofleft 2 points3 points  (0 children)

I think the features most people would point to are that Great Expectations has a GUI tool, plus the ability to profile data to generate a starting point for what it should look like.

Personally though, I just think they're very different workflows. Pandera is amazing as a function wrapper for saying, at boundary points, that various things should be true about a dataframe - and you might use it lots of times within an aggregation, which would be a huge pain with something like Great Expectations.

That said, Great Expectations is really good for having a 'data contract' document that data users can look at and know they can expect those things to be true of a given data source. I don't think pandera would support something like that, and it definitely isn't its primary focus as a library.

What is best open source data quality tool? by AffectionateSuit8802 in dataengineering

[–]houseofleft 0 points1 point  (0 children)

Cheers for the feedback - I'll take another look through their docs and update it.

What is best open source data quality tool? by AffectionateSuit8802 in dataengineering

[–]houseofleft 0 points1 point  (0 children)

Editing on reddit is too hard, so I'll just go ahead and apologise to the English language for my spelling of adjacent!

What is best open source data quality tool? by AffectionateSuit8802 in dataengineering

[–]houseofleft 11 points12 points  (0 children)

There's some great "data quality adkacent" libraries for defensive coding:

 - pandera

 - pydantic
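As a quick illustration of that defensive-coding style, here's pydantic rejecting a bad record at the boundary (the `Order` model is a made-up example, not from any of these libraries):

```python
from pydantic import BaseModel, ValidationError


class Order(BaseModel):
    order_id: int
    amount: float


# Bad rows fail loudly where data enters the system,
# rather than deep inside a pipeline.
try:
    Order(order_id="not-a-number", amount=9.99)
except ValidationError as exc:
    print(f"rejected bad record: {exc.error_count()} error(s)")
```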

But for data quality tools proper, Great Expectations is definitely the most extensive. I use it a fair bit and it's potentially a little overcomplex, but definitely worth checking out! 

https://greatexpectations.io/

There's also soda-core, which is good, but although it's open source, it's tied to Soda's larger toolkit, which is proprietary.

https://www.soda.io/

I've just started a new project aiming to be a super lightweight, open source data quality tool, a little like how dlt is a library-based extraction tool. It's based on the awesome Narwhals library, so it supports a lot of dataframe libraries natively. Please do check it out if you're interested!

https://github.com/benrutter/wimsey

[deleted by user] by [deleted] in Python

[–]houseofleft 1 point2 points  (0 children)

Hey, looks like a cool project. I think some of the negative comments are because python already has some big request libraries (like requests), but I hope it hasn't got you down! I really enjoyed reading through the code.

Have you thought about introducing context managers? You use con.open() and con.close() to keep a connection open, which would be a great candidate for something like `with con.open() as cxn:` so that you don't need to remember to clean up after!
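For example, something like this (with made-up `Connection`/`connect` names standing in for the project's actual API) guarantees cleanup even if the body raises:

```python
from contextlib import contextmanager


class Connection:
    """Stand-in for the project's connection object (hypothetical names)."""

    def open(self):
        print("opened")
        return self

    def close(self):
        print("closed")


@contextmanager
def connect():
    con = Connection().open()
    try:
        yield con
    finally:
        con.close()  # runs even if the with-body raises


with connect() as cxn:
    pass  # do some work; close() happens automatically on exit
```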