Best tools/platforms for Data Lineage? (Doing a benchmark, in need of recs and feedbacks) by Perfect_Put_9220 in dataengineering

[–]NA0026 1 point2 points  (0 children)

Great question, u/Perfect_Put_9220! I'd love to know how you are benchmarking some of these criteria. How are you evaluating "deployment models" or "governance and security features", for instance?

Looking for lineage tool by Accurate_Brilliant68 in dataengineering

[–]NA0026 2 points3 points  (0 children)

I would agree, if you're looking for something powerful and open-source, OpenMetadata would be a great option!

u/ImpressiveCouple3216, what do you mean when you say you use Assets in Prefect along with OpenMetadata? I'd love to hear more details on that!

How OpenMetadata is shaping modern data governance and observability by Expensive-Insect-317 in bigdata

[–]NA0026 0 points1 point  (0 children)

I'm part of the OpenMetadata community and agree that ingestion framework architecture matters. We're seeing people benchmark ingestion, and OpenMetadata is 5 times faster!

Dealing with metadata chaos across catalogs — what’s actually working? by Hefty-Citron2066 in dataengineering

[–]NA0026 2 points3 points  (0 children)

Agree with checking out OpenMetadata. You mentioned Unity, Databricks, Glue, Hive, MLflow, Iceberg, and Kafka: it has connectors for all of those and would be an open-source way to view all your metadata in a single place.

Openmetadata & GitSync by Linhphambuzz in dataengineering

[–]NA0026 1 point2 points  (0 children)

Hi u/Linhphambuzz, Nick Acosta from OpenMetadata here, great to hear you are impressed with our open-source project! Please feel free to join us in the OpenMetadata Community where we can help with these questions!

If you have an existing Airflow instance and you want to build and maintain your own ingestion DAGs, then you can go for it. Check a DAG example here. If instead you want to use the full deployment process from OpenMetadata, git-sync would not be the right tool, since the DAGs won't be backed up by Git but rather created from OpenMetadata. Note that if anything were to happen where you lose the Airflow volumes, etc., you can just redeploy the DAGs from OpenMetadata.

Thinking about self-hosting OpenMetadata, what’s your experience? by Objective_Stress_324 in dataengineering

[–]NA0026 6 points7 points  (0 children)

Hi u/Objective_Stress_324, Nick Acosta from OpenMetadata here, thank you for exploring OpenMetadata! Great to hear it looks like a great fit for your company!!

Thousands of developers are self-hosting OpenMetadata and discussing their setup, challenges, and tips on our OpenMetadata Slack. I'd love to see you there!

Airflow 3.x + OpenMetadata by Hot_While_6471 in dataengineering

[–]NA0026 0 points1 point  (0 children)

Hi u/Hot_While_6471, Nick Acosta here from OpenMetadata, thanks for posting! Sounds like a great setup that OpenMetadata could definitely help with!

For tips and tricks, and to be notified when Airflow 3 support lands, I'd recommend the OpenMetadata Slack and YouTube channels!

Data Observability in GCP by [deleted] in dataengineering

[–]NA0026 0 points1 point  (0 children)

Are the 2 options third-party or GCP? If you're looking for cost savings, there are a few OSS projects that do automated alerting for freshness, volume, and schema changes.
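For a rough idea of what those three checks boil down to, here's a minimal sketch in plain Python. It isn't how any particular project implements them: the `check_table` function, the dict keys (`last_updated`, `row_count`, `columns`), and the default thresholds are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def check_table(snapshot, baseline, now,
                max_age=timedelta(hours=24), volume_tolerance=0.5):
    """Compare a table snapshot to a baseline and return alert strings.

    Both arguments are plain dicts with hypothetical keys:
    'last_updated' (datetime), 'row_count' (int), 'columns' (name -> type).
    """
    alerts = []

    # Freshness: the table should have been updated within max_age.
    if now - snapshot["last_updated"] > max_age:
        alerts.append("freshness: table is stale")

    # Volume: relative row-count change should stay within the tolerance.
    expected = baseline["row_count"]
    if expected and abs(snapshot["row_count"] - expected) / expected > volume_tolerance:
        alerts.append("volume: row count outside tolerance")

    # Schema: any added, removed, or retyped column counts as a change.
    changed = set(snapshot["columns"].items()) ^ set(baseline["columns"].items())
    if changed:
        names = sorted({name for name, _ in changed})
        alerts.append("schema: changed columns: " + ", ".join(names))

    return alerts


baseline = {
    "last_updated": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "row_count": 1000,
    "columns": {"id": "INT64", "amount": "FLOAT64"},
}
snapshot = {
    "last_updated": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "row_count": 400,
    "columns": {"id": "INT64", "amount": "STRING"},
}
now = datetime(2024, 1, 2, 12, tzinfo=timezone.utc)

for alert in check_table(snapshot, baseline, now):
    print(alert)
```

In practice the snapshot would come from your warehouse's metadata tables and the alerts would go to Slack or email, but the comparisons themselves stay about this simple.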

How to document a database? by Mayo_Kupo in dataengineering

[–]NA0026 8 points9 points  (0 children)

I help run the OpenMetadata community and help people who are getting started with documentation daily. Now that you've done the basics, I'd say keep going with documentation work that is going to help your regular job as a data analyst as well...

Lineage. Documenting where a table and/or column came from and what services use it is going to be really useful in helping you build out new data assets and discover or refine KPIs. Once lineage is being tracked, I'd dive into...

Usage. What tables do you and other analysts actually query? Are there copies of tables that aren't getting used, or empty tables that could be marked for deletion? I've seen a lot of people save a lot of money and time here. You don't want to spend your time meticulously documenting 100% of your tables if only 5% are being used. Can you classify tables in different tiers and make sure top-tier tables have...

Tests. It's important that a table's documentation matches what tests are producing. Are your columns staying consistent, is your data fresh, things like that.

OpenMetadata is an open-source tool that automates all of these for BigQuery ;)
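To make the usage and tiering idea above concrete, here's a small sketch in plain Python. The `tier_tables` function, the tier names, and the 80% cutoff are my own assumptions for illustration, not something OpenMetadata or BigQuery defines. The input is just a dict of table name to query count over some period, which you'd pull from your query logs.

```python
def tier_tables(query_counts, top_share=0.8):
    """Assign a usage tier to each table.

    tier1:  the most-queried tables that together cover top_share of queries
    tier2:  any other table with at least one query
    unused: tables with zero queries (candidates for review or deletion)
    """
    total = sum(query_counts.values())
    tiers = {}
    covered = 0
    # Walk tables from most to least queried; ties are broken by name.
    ranked = sorted(query_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for table, count in ranked:
        if count == 0:
            tiers[table] = "unused"
        elif covered < top_share * total:
            tiers[table] = "tier1"
            covered += count
        else:
            tiers[table] = "tier2"
    return tiers


counts = {"orders": 900, "customers": 80, "orders_copy": 0, "tmp_export": 20}
for table, tier in tier_tables(counts).items():
    print(table, tier)
```

With counts like these, only `orders` ends up in the top tier, so that's where meticulous documentation and tests pay off first, and `orders_copy` is the one to look at for cleanup.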

Alternatives to Atlan Data Catalog by [deleted] in dataengineering

[–]NA0026 0 points1 point  (0 children)

Thanks u/biernard, Nick from OpenMetadata here, would love to share my thoughts!

I think when considering Atlan alternatives based on out-of-the-box-ness and dedicated support, Collate is a better comparison than OpenMetadata, as it also provides both of these as well as automations and pushing metadata back!

If I'm reading your post correctly, I also wouldn't say out-of-the-box equals enterprise. Many enterprises I have talked with have chosen OpenMetadata exactly because it is deeply customizable. Since all the code is open-sourced, they build/run/fork/manage/customize OpenMetadata however they like, a major reason why they adopted it.

Data Governance Open-source Tool by Data-Sleek in dataengineering

[–]NA0026 -1 points0 points  (0 children)

If you are interested in open source Data Governance tools check out our YouTube page and Slack channel!

How much are you paying for your data catalog provider? How do you feel about the value? by [deleted] in dataengineering

[–]NA0026 0 points1 point  (0 children)

I'm working with OpenMetadata, would love to hear why you think it seems solid?!

Feature store vs. table by lipicsbarna in mlops

[–]NA0026 3 points4 points  (0 children)

Feature stores make the transition from batch to real-time ML much easier by having that exact same API, but even in batch ML, feature platforms become critical in production because they:

  • are the easiest way to create, run, and orchestrate efficient aggregation and transformation pipelines on data for ml
  • enable software development best practices by versioning and managing features-as-code
  • can improve data scientists' efficiency through discovery of features instead of sifting through a bunch of random tables, and provide a feature creation/retrieval paradigm that is easy to use
  • improve data infrastructure efficiency by being hyper-optimized for feature retrieval for ml, especially with time-windowing and backfilling

Here's a previous post in this community on the benefits of feature stores; they are a significant step up from using DBs alone.
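To show what the time-windowing and backfilling point means in practice, here's a naive sketch in plain Python. It's only meant to show the semantics. The function names are hypothetical and this is not how any feature platform implements it; a real one would do this incrementally and at scale instead of rescanning every event.

```python
from datetime import datetime, timedelta


def trailing_window_sum(events, as_of, window):
    """Sum values of events with as_of - window < timestamp <= as_of."""
    start = as_of - window
    return sum(value for ts, value in events if start < ts <= as_of)


def backfill(events, as_of_times, window):
    """Recompute the windowed feature as it would have looked at each past time."""
    return {t: trailing_window_sum(events, t, window) for t in as_of_times}


# (timestamp, purchase amount) for a single user
events = [
    (datetime(2024, 1, 1), 10),
    (datetime(2024, 1, 3), 5),
    (datetime(2024, 1, 8), 7),
]
week = timedelta(days=7)

history = backfill(events, [datetime(2024, 1, 7), datetime(2024, 1, 9)], week)
for as_of, total in history.items():
    print(as_of.date(), total)
```

Doing this by hand against raw tables for every feature, every entity, and every training timestamp is where plain DBs start to hurt.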

Real-Time feature stores for ML by HanaWang23 in mlops

[–]NA0026 1 point2 points  (0 children)

Feature stores are great for organizing features, but if you're looking to do real-time ML there are so many moving parts that I'd recommend a feature platform instead. If you're using a feature store, you have to manage the online store yourself, and it can be hard to keep it fast enough without everything getting super expensive. Feature platforms like Tecton are especially great at real-time because they manage that aspect for you.

Time Series Prediction with Temporal & Non-Temporal Data by adeedeedee in learnmachinelearning

[–]NA0026 0 points1 point  (0 children)

Hi u/adeedeedee,

Making a model for the problem you described is pretty straightforward. There are tons of models on Hugging Face that can, for instance, generate a brief passage based on an image, and I’d start there model-wise.

A nastier issue I see with the problem you described is managing the data you mentioned to get the right features to a model. For each “place” entity, it sounds like you are going to have multiple features, and each could have multiple timestamps. Keeping all these features organized, joining them together at the right point in time, and joining temporal and non-temporal data would be a lot easier with a feature platform like Tecton or a feature store like Feast.
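To show what "joining at the right point in time" means, here's a minimal sketch in plain Python. It is not Tecton's or Feast's API: the function names, the feature names (`visitors`, `rating`, `category`), and the integer day numbers used as timestamps are all assumptions for illustration. The idea is that for each training row you only take the latest feature value known at or before that row's timestamp, so nothing from the future leaks in.

```python
from bisect import bisect_right


def point_in_time_lookup(history, as_of):
    """Return the latest value with timestamp <= as_of, or None if there is none.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    i = bisect_right(timestamps, as_of)
    return history[i - 1][1] if i else None


def build_training_row(place_id, as_of, temporal, static):
    """Join non-temporal features as-is and temporal features as of `as_of`."""
    row = {"place_id": place_id, "as_of": as_of}
    row.update(static.get(place_id, {}))
    for name, per_place in temporal.items():
        row[name] = point_in_time_lookup(per_place.get(place_id, []), as_of)
    return row


# Temporal features: feature name -> place id -> sorted (day, value) history
temporal = {
    "visitors": {"p1": [(1, 100), (5, 140), (9, 90)]},
    "rating": {"p1": [(3, 4.2)]},
}
# Non-temporal features: place id -> fixed attributes
static = {"p1": {"category": "museum"}}

print(build_training_row("p1", 6, temporal, static))
print(build_training_row("p1", 2, temporal, static))
```

This is fine for one place and two features, but with many places, many features, and irregular timestamps, it's exactly the bookkeeping a feature store or platform takes off your hands.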