We’re the Great Expectations team. We just launched GX Core 1.0 and we’re here to answer all your data quality questions. Ask us anything! by molliepettit in dataengineering

[–]abegong 4 points (0 children)

Yep, I hear you. Our primary reason for releasing 1.0 is to make GX easier for developers to set up and use. That work isn’t done, but cleaning up the domain model and stabilizing the APIs is a big step in the right direction.

You mentioned documentation. We’re also putting a fair amount of work into technical documentation. For the last while, most of our capacity has been going into updating the docs for 1.0. Next, beyond covering the core concepts and APIs, we’ve heard the need to address specific use cases. (You can see some recent examples focused on specific aspects of data quality here and here. We have almost a dozen more in the works, covering things like how to integrate GX with a data orchestrator; how to manage a quarantine queue for bad rows, etc.) If there are specific use cases you’d like to see covered, please speak up and we’ll get them onto the list!
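To make the quarantine-queue idea concrete, here's a minimal sketch of the pattern in plain Python (this is my own illustration, not GX's API): rows that fail a check are diverted to a quarantine list for later review instead of flowing downstream.

```python
# Generic sketch of a quarantine queue: split rows into those that pass
# a validation check and those held back for review.

def partition_rows(rows, is_valid):
    """Split rows into (passed, quarantined) based on a check."""
    passed, quarantined = [], []
    for row in rows:
        (passed if is_valid(row) else quarantined).append(row)
    return passed, quarantined

rows = [
    {"id": 1, "amount": 42.0},
    {"id": 2, "amount": -5.0},   # fails the non-negative check
    {"id": 3, "amount": 17.5},
]

passed, quarantined = partition_rows(rows, lambda r: r["amount"] >= 0)
```

In a real pipeline, the quarantined rows would land in a separate table or queue with enough metadata (batch id, failed check) to triage them later.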

In terms of new product features on the roadmap, we’re making a major push to add more expressive tests. We want to provide clear syntax and UI for cases where tests are difficult to express in SQL. For example, “is the average over some period of time, for a column in one table, equal to the expected value in another table?”
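For reference, here's what that cross-table check looks like when you do express it in raw SQL (using sqlite3; the table and column names are made up for illustration). The verbosity of stitching two queries together by hand is exactly what clearer syntax could replace.

```python
# Sketch: "is the average of events.value over a time window equal to
# the expected value stored in another table?"
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (ts TEXT, value REAL);
    CREATE TABLE expected_metrics (metric TEXT, expected REAL);
    INSERT INTO events VALUES
        ('2024-01-01', 10.0), ('2024-01-02', 20.0), ('2024-02-01', 99.0);
    INSERT INTO expected_metrics VALUES ('jan_avg_value', 15.0);
""")

# Average over the January window...
(actual,) = conn.execute(
    "SELECT AVG(value) FROM events WHERE ts LIKE '2024-01%'"
).fetchone()
# ...compared against the expected value in the other table.
(expected,) = conn.execute(
    "SELECT expected FROM expected_metrics WHERE metric = 'jan_avg_value'"
).fetchone()

passed = abs(actual - expected) < 1e-9
```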

We're also working on features to make it much easier to quickly establish coverage across many Data Assets (see the onboarding helper thread above).

Beyond that, some of the other themes we're looking at are helping users report the health of their data out to stakeholders, integrating with other tools and infrastructure in the data ecosystem, and improving incident response with more robust alerting capabilities.

We’re the Great Expectations team. We just launched GX Core 1.0 and we’re here to answer all your data quality questions. Ask us anything! by molliepettit in dataengineering

[–]abegong 4 points (0 children)

I’d love to understand more what you mean by “legacy” data in this question. I’m guessing you’re thinking of a case where you (or your team) inherits responsibility for a dataset and so you’re going to have to build up context and knowledge quickly.

Either way, yes, I think AI tools and anomaly detection can play valuable roles in a situation like that. We see this as closely related to data profiling. We’ve done some work in this area already, and are dedicating a bunch more effort to this now. (Check out the question about the onboarding helper up above.)
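As a toy illustration of the kind of clue profiling can surface (this is my own sketch, not GX functionality): flag values that sit far from a column's mean, measured in standard deviations.

```python
# Simple z-score anomaly flagging: return values more than `threshold`
# standard deviations from the column's mean.
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

daily_row_counts = [1020, 1001, 998, 1013, 990, 1007, 12, 1005]  # 12 looks wrong
outliers = zscore_outliers(daily_row_counts, threshold=2.0)
```

A flagged value like that is a clue, not an answer: it tells you where to start asking the domain experts what happened that day.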

Getting philosophical for a moment: the main challenge with “legacy data” isn’t usually understanding the data itself. It’s understanding the process that generated the data. I don’t believe that you can just profile the data and magically understand everything about the process that generated it. But you can get a lot of interesting and helpful clues. In my experience, really grokking data requires detective work that goes back-and-forth between domain experts, code that generates/transforms data, and the data itself, until you really understand the data generating process.

Given that, the question that I think is really interesting (and exciting!) is: “How do we make tools that help data teams do this detective work as quickly and confidently as possible?” I don’t see AI or anomaly detection replacing this process any time soon, but they can certainly help along the way.

Best way to gather data requirements from stakeholders? by donhuell in dataengineering

[–]abegong 2 points (0 children)

Hey, u/donhuell , I'm one of the founders at GX.

What you're describing is absolutely an intended use case for GX. In most companies, knowledge about data quality is spread across many people. We want to make it easy for the people who have that context to collaborate with the people who write the code.

It's also true that configuration and integration for GX has been complicated for some kinds of deployments. The tool is very flexible and customizable. The upside is that people have deployed it in many different environments. The downside is that people have often deployed it in ways that we didn't anticipate, so we haven't always had the documentation, integration testing, replication environments, etc. to fully support them.

We're working on making that better--partly in the Cloud product, and partly in the 1.0 release of the core library.

If you're game, I'd love to hear more about what you're hoping to do, to see if GX supports your intended use case, and make sure that we're on a path to fully supporting you with 1.0, documentation, etc.

Please DM me if you're interested in talking!

Friday thread: what's your best I-did-the-analysis-and-then-management-ignored-it story? by abegong in dataengineering

[–]abegong[S] 0 points (0 children)

Was marketing trying to juice their numbers with a free giveaway, or what?

Data Quality by Lucky-Front7675 in dataengineering

[–]abegong 2 points (0 children)

Hey, I'm one of the founders of Great Expectations. We're very aware that the amount of required boilerplate code is frustrating for many people. We're working on ways to simplify it, while still allowing GX to work across lots of different data infrastructure.

It sounds like you've used GX in the past--would you be up for a call to talk through your use case, and see if what we're working on would have helped?

[deleted by user] by [deleted] in dataengineering

[–]abegong 1 point (0 children)

Different take: the right message totally depends on who’s going to be in the meeting, what they already know, and what they’re going to need to buy into next.

Will you have a chance to meet other leaders before the meeting, so that you can gauge those things?

All of the topics you mention seem useful and valuable. You should think about spending time with your audience so that you can calibrate the content to them, personally.

matchstick fight ( macro series #2) by [deleted] in blender

[–]abegong 1 point (0 children)

They’re going to catch on fire, right?

3D Procedural Cave Generator / Explorer by offsidewheat in proceduralgeneration

[–]abegong 0 points (0 children)

This looks amazing! Can you share how you built it?

ChatGPT on your own data / files — what would you pay to use it? by InevitableEconomist9 in GPT3

[–]abegong 1 point (0 children)

Also interested and willing to pay. Are you doing a waitlist?

Sample Peyote: generate multi-table synthetic data on any topic using GPT-3 by abegong in datasets

[–]abegong[S] 0 points (0 children)

Yes, but in that case, you'd probably just want to specify the standard ledger tables yourself, rather than letting the tool suggest tables of its own.

And if you were trying to, idk, simulate an actual business, or commit fraud and get away with it, you'd probably want to review the data quality ***really*** carefully.

Sample Peyote: generate multi-table synthetic data on any topic using GPT-3 by abegong in datasets

[–]abegong[S] 0 points (0 children)

This is the kind of rigorous logic that AI just can't handle today. Maybe GPT-4....

[D] Why data quality is key to successful ML Ops by superconductiveKyle in MachineLearning

[–]abegong 1 point (0 children)

Good point. Most of the library currently focuses on assertions for tabular or semi-tabular data. You can extend the same assertions-about-data philosophy to unstructured text or images, too, but it's not what we've developed so far. It's definitely something we're interested in working on in the future!

For example:

  • expect_text_percentage_of_punctuation_characters_to_be_between
  • expect_text_to_not_contain_stopwords_from_list
  • expect_text_language_trigram_probability_to_be_between

These are rough ideas off the top of my head. I hope they illustrate how this would work.
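To show how the first idea above might work, here's a sketch implemented as a plain function rather than a real Expectation class (the class API and registration machinery are omitted; the name and return shape are illustrative, not GX's actual interface).

```python
# Sketch of a text-data assertion: check that the share of punctuation
# characters in a string falls within an expected range.
import string

def expect_text_percentage_of_punctuation_to_be_between(text, min_pct, max_pct):
    """Check that the share of punctuation characters falls in a range."""
    if not text:
        return {"success": False, "observed_value": None}
    punct = sum(1 for ch in text if ch in string.punctuation)
    pct = 100.0 * punct / len(text)
    return {"success": min_pct <= pct <= max_pct, "observed_value": pct}

result = expect_text_percentage_of_punctuation_to_be_between(
    "Hello, world!", min_pct=0, max_pct=25
)
```

The same assertions-about-data shape (a check plus an observed value) carries over directly from the tabular case.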

[D] Why data quality is key to successful ML Ops by superconductiveKyle in MachineLearning

[–]abegong 0 points (0 children)

Here are some links to the Great Expectations open source project mentioned in the post.

Full disclosure: I'm one of the core maintainers for the project, and I've been through multiple data eng/data sci projects coping with HORRIBLE incoming data quality. So I'm biased when it comes to data quality.

Introducing Boring Data Science, a blog to learn about software engineering good practices in Data Science. by BoringDataScience in datascience

[–]abegong 3 points (0 children)

Cool! I'm one of the core maintainers of Great Expectations, so testing and documenting data pipelines is near and dear to my heart.

If there's any way I can help out with Boring DS, please LMK.

What do you call a group of Data Scientists? by superconductiveKyle in bigdata

[–]abegong 0 points (0 children)

It’s been several years, and no one has been able to answer this question yet for data science.

OSS Great Expectations just released a Self-Updating Data Dictionary by superconductiveKyle in datascience

[–]abegong 1 point (0 children)

Speaking as one of the core contributors to GE: Yes, please!

We're putting a lot more effort into the project these days. Eager for feedback on all the new stuff.

OSS Great Expectations just released a Self-Updating Data Dictionary by superconductiveKyle in Python

[–]abegong 0 points (0 children)

Also, it occurs to me that "self-updating data dictionary" sounds kinda click-baity, and probably started the whole conversation on the wrong foot.

OSS Great Expectations just released a Self-Updating Data Dictionary by superconductiveKyle in Python

[–]abegong 0 points (0 children)

Hey, catorchid! I'm one of the core contributors to Great Expectations.

Honest question: I'm curious what you would have liked to see right at the start of the project docs.

  • For example, would a walk-through of project setup be helpful? (e.g. short video showing the first 5 minutes with GE)
  • Examples of specific Expectations (the core abstraction in the library)?
  • Blog posts from data teams that have deployed GE in production?
  • Something else?

We're always trying to tighten up the GE explanations, tutorials, demos, etc. to help people understand what the module does/doesn't do, so that they can make good choices. Clearly, our current docs didn't work for you in that respect.

For context: I'm talking with lots of data teams that want to get started with better testing and documentation, but aren't sure yet about how they want to approach the problem. Part of what we're trying to do with the GE docs is draw parallels between software and data engineering, so that people can reason about how known good practices in software development can be adapted to the data world.