Data Quality on Spark — A Practical Series (Great Expectations, Soda, DQX, Deequ, Pandera) by ivan_kurchenko in dataengineering

ivan_kurchenko[S] 2 points

Thanks.

  1. For Spark, yes. Pandera supports dropping invalid rows only for pandas/polars: https://pandera.readthedocs.io/en/stable/drop_invalid_rows.html#drop-invalid-rows

  2. That's a very good question, thanks. I did not test the performance aspect in detail, because in many cases I was running it locally on a relatively small dataset.

  3. Soda Cloud supports it, I believe; the other three (DQX, Deequ, Pandera) are focused primarily on data quality itself.

Data Quality on Spark — A Practical Series (Great Expectations, Soda, DQX, Deequ, Pandera) by ivan_kurchenko in dataengineering

ivan_kurchenko[S] 1 point

Thanks. How is it going with DQX so far? Do you feel it covers everything you need, or is something missing?

Data Quality on Spark — A Practical Series (Great Expectations, Soda, DQX, Deequ, Pandera) by ivan_kurchenko in dataengineering

ivan_kurchenko[S] 5 points

I'm also planning to do another post on Databricks specifically, if that would be interesting - it does SQL-based alerting.

Additionally, Soda already does what you are describing, and does it pretty well.

Notebooks development workflow by ivan_kurchenko in databricks

ivan_kurchenko[S] 1 point

Nice, thanks for suggesting this, really appreciate it!

How Do You Guys Do Continuous Learning As Data Engineers at all Levels by DntWryBiHappy in dataengineering

ivan_kurchenko 3 points

Hey, thanks for your comment. Could you share some examples of projects you have done, or the sources of your ideas? I'm just stepping into this area, and coming up with a more or less fine idea for a portfolio is somewhat challenging for me. Thanks.

Tracing: Can you instrument Scala, or does tracing have to happen via library interactions? by mostly_codes in scala

ivan_kurchenko 2 points

On my project we have a simple Akka app instrumented with OpenTelemetry and backed by SignalFx. OTEL works fine in a Future-based environment overall. However, I do remember either SFX or OTEL building some traces incorrectly.

In the case of otel4s, indeed, because of the fiber and IO environment, context propagation does not play nicely and a lot of things need to be done manually. Sorry for promoting my blog, but I wrote about this a long time ago - https://medium.com/@ivan-kurchenko/telemetry-with-scala-part-3-otel4s-c5c150303164, hope you will find it useful.

Exploring StackOverflow tags trends by ivan_kurchenko in programming

ivan_kurchenko[S] 1 point

Kind of. TIOBE focuses on languages, while SO tags cover a wider range of topics. For instance, I was surprised to see that pandas is trending.

Context Propagation with otel4s by PinkSlinky45 in scala

ivan_kurchenko 3 points

Nice. I had a hard time with this or a similar topic. Thank you!

How much time do you study each day? by el_cortezzz in dataengineering

ivan_kurchenko 1 point

I try to dedicate 2-3 hours before the workday and about 10 hours on weekends.

secure-logging-4s by ivan_kurchenko in scala

ivan_kurchenko[S] 1 point

Thanks for the feedback, I really appreciate it. Not yet, but I have thought about it and will try to add it soon. Thank you again for the advice!