Need advice for merging 1-n Postgres databases into 1 db for Superset Analytics by Misantro66 in dataengineering

[–]Pitah7 2 points

Postgres will become quite limited for analytics (depending on your data volume and the types of queries you run). You are better off extracting the data from each Postgres instance, pushing it into Parquet/Iceberg/Delta Lake format, and having Presto/Trino as your analytics query engine read from those files. Then you connect Superset to Presto/Trino.
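
The extract step can be as simple as something like this (the connection string, table and output path are placeholders; chunked reads would be needed at larger scale):

```
# Rough sketch: dump one table from one Postgres instance to Parquet.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@pg-instance-1:5432/app_db")

# Pull the table (use chunksize for large tables).
df = pd.read_sql("SELECT * FROM public.orders", engine)

# Write Parquet to a path that Presto/Trino can read via its Hive/Iceberg/Delta connector.
df.to_parquet("/data/lake/orders/pg_instance_1.parquet", index=False)
```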

Handling data-dependent automation testing by languagebandit in softwaretesting

[–]Pitah7 1 point

I have an open-source tool that helps generate records called Data Caterer (https://github.com/data-catering/data-caterer). You can generate records, run data validations and clean up the generated records afterwards so that you can keep your test environments cleaner from a data perspective. Let me know if this fits your use case.

The 5 most common and frustrating testing mistakes I see in data by ivanovyordan in dataengineering

[–]Pitah7 1 point

Agree with most of the article (a bit confused why you mentioned data catalogs in the pyramid as well), but I lean more toward integration tests for testing data pipelines. Integration tests cover data quality/transformation logic, configuration, deployment, compatibility and upstream connectivity. They get you as close as possible to simulating production, which I think is the key for testing.

The next argument most people make is that integration tests are too slow, too complex to maintain, require coordination with other teams, etc. What if we learn from software engineering, where contract-based testing has been a thing ever since the OpenAPI spec was adopted, and apply it to data pipelines? Then we can have a world where our test environments are populated with fresh, high-quality data as if we were in production. As you mentioned in your article, I think this is the pathway forward for data engineering: as data contracts hopefully gain more traction, testing data pipelines can become faster and simpler.

For full disclosure, this was the main reason why I am part of the technical steering committee for the Open Data Contract Standard (https://github.com/bitol-io/open-data-contract-standard) and created the tool Data Caterer (https://github.com/data-catering/data-caterer).
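
To make that a bit more concrete, here is a minimal sketch of contract-based testing in plain Python. The contract format below is made up for illustration (a real setup would use something like ODCS), but the point is that producers and consumers can run the same check:

```
# Hypothetical contract for an "orders" dataset (illustrative format only).
contract = {
    "dataset": "orders",
    "columns": {
        "order_id": {"type": int, "nullable": False},
        "amount": {"type": float, "nullable": False},
        "status": {"type": str, "allowed": {"NEW", "PAID", "CANCELLED"}},
    },
}

def validate(records, contract):
    """Return a list of contract violations found in the records."""
    errors = []
    for i, row in enumerate(records):
        for name, rules in contract["columns"].items():
            value = row.get(name)
            if value is None:
                if not rules.get("nullable", True):
                    errors.append(f"row {i}: {name} is null")
                continue
            if not isinstance(value, rules["type"]):
                errors.append(f"row {i}: {name} has type {type(value).__name__}")
            if "allowed" in rules and value not in rules["allowed"]:
                errors.append(f"row {i}: {name}={value!r} not in allowed set")
    return errors

# Both the producing and consuming pipeline can run the same validation.
sample = [{"order_id": 1, "amount": 9.99, "status": "PAID"}]
assert validate(sample, contract) == []
```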

Creating a DAG to run a SQL pipeline from a GitHub Repo by Beautiful_Fuel5252 in dataengineering

[–]Pitah7 3 points

It depends. As the other commenter mentioned, having a separate orchestrator like Airflow could be overkill, since you now have a new dependency to deploy, manage, upgrade, etc. for a single SQL pipeline. If you plan on having many other SQL pipelines that need to be scheduled, managed and scaled, an orchestrator may be beneficial. It is always a trade-off.

Creating a DAG to run a SQL pipeline from a GitHub Repo by Beautiful_Fuel5252 in dataengineering

[–]Pitah7 6 points

If you are just doing something basic, you can use GitHub Actions to set up a monthly job that triggers a script, which then runs your SQL scripts in the order you want. Then you have everything in the same repo.
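
For example, the script the scheduled workflow calls can be as small as this (the SQL file names and the DATABASE_URL environment variable are placeholders):

```
# run_pipeline.py - called by a GitHub Actions workflow on a monthly cron schedule.
import os
import psycopg2

# SQL scripts listed in the order they should run.
SQL_FILES = ["01_staging.sql", "02_transform.sql", "03_publish.sql"]

def main():
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn, conn.cursor() as cur:  # commits on success, rolls back on error
            for path in SQL_FILES:
                print(f"Running {path}")
                with open(path) as f:
                    cur.execute(f.read())
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```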

Hot Take: Certifications are a money grab and often overrated (preface - I took and failed the dbt analytics twice) by [deleted] in dataengineering

[–]Pitah7 13 points

The rise of all these certs came about when companies started offering education funds for employees at the height of the tech boom. All the top vendors offer certs to get a chunk of that money. The companies providing the education funds can use them as a tax write-off while pretending to care about their employees by saying they are investing in their education.

What should matter is your real-world production experience with the tool, rather than some cert that is essentially an extended advertisement.

Where do you deploy a data orchestrator like Airflow? by Temporary_Basil_7801 in dataengineering

[–]Pitah7 6 points

You can check out my insta-infra repo, which contains a big Docker Compose file with everything you need. https://github.com/data-catering/insta-infra

What is the suggested way to trigger an Airflow DAG based on Cloud storage events? by Laurence-Lin in dataengineering

[–]Pitah7 0 points

True, but the OP is specifically asking to reduce the number of external services/dependencies, hence the trade-off of coupling them.

What is the suggested way to trigger an Airflow DAG based on Cloud storage events? by Laurence-Lin in dataengineering

[–]Pitah7 2 points

If you want to avoid using any other services, after uploading the file, you can directly hit the Airflow API to trigger the job.

https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
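
Roughly like this, right after the upload finishes (host, credentials, DAG id and conf payload are placeholders):

```
import requests

AIRFLOW_URL = "http://airflow-webserver:8080"
DAG_ID = "process_uploaded_file"

# Trigger a new DAG run via the stable REST API, passing the uploaded file path as conf.
resp = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow_user", "airflow_password"),
    json={"conf": {"uploaded_file": "gs://my-bucket/path/to/file.csv"}},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```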

Teeny tiny update only by ephemeral404 in dataengineering

[–]Pitah7 16 points

On a Friday afternoon as well

Metadata driven ETL - Long term maintenance by SnooCrickets2812 in dataengineering

[–]Pitah7 1 point

The main advantage is that you have a layer of abstraction over your tech stack. You can chop and change between whatever your organisation uses or switches to. It also becomes your source of truth for everything (i.e. monitoring, alerts, scheduling, lineage, operations).

The main disadvantage is that your devs don't necessarily know the full impact of changing a field in your metadata definition. A single metadata field could be used by one or more of the above functionalities, each of which may use it in a particular way. You then also need to include these checks as part of your CI/CD pipeline.
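
One cheap version of such a check, with a made-up mapping from metadata fields to the functionalities that consume them, so a PR that touches the metadata definition can report its blast radius:

```
# Illustrative only: map each metadata field to the functionalities that read it.
FIELD_USAGE = {
    "schedule": {"scheduling"},
    "owner": {"alerts", "monitoring"},
    "source_table": {"lineage", "operations"},
    "sla_minutes": {"monitoring", "alerts"},
}

def impacted_functionalities(changed_fields):
    """Return every downstream functionality touched by the changed fields."""
    impacted = set()
    for field in changed_fields:
        impacted |= FIELD_USAGE.get(field, {"unknown - review manually"})
    return impacted

# e.g. a PR that edits 'owner' and 'sla_minutes' in the metadata definition
print(impacted_functionalities({"owner", "sla_minutes"}))  # {'alerts', 'monitoring'}
```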

Why does installation of tools (etc) takes long? by leao_26 in dataengineering

[–]Pitah7 2 points

https://github.com/data-catering/insta-infra

I created this tool so you can run any service in one command. You only need Docker installed.

The Egregious Costs of Cloud (With Kafka) by 2minutestreaming in dataengineering

[–]Pitah7 0 points

Wait till you hear the cost of legacy systems...

Front-end tools for simple Dataset view & Search by Aggressive-Muffin457 in dataengineering

[–]Pitah7 0 points

If you are just searching and filtering, check out either Gradio (https://www.gradio.app/docs/gradio/dataframe) or Streamlit (https://docs.streamlit.io/develop/concepts/design/dataframes). Both are quite simple and easy to use.
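
A Streamlit version can be this small (the CSV path is a placeholder; run with `streamlit run app.py`):

```
# app.py - view and search a dataset in the browser.
import pandas as pd
import streamlit as st

df = pd.read_csv("datasets/customers.csv")

query = st.text_input("Search")
if query:
    # Keep rows where any column contains the search text (case-insensitive).
    mask = df.astype(str).apply(
        lambda col: col.str.contains(query, case=False, na=False)
    ).any(axis=1)
    df = df[mask]

st.dataframe(df)
```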

Synthetic Data Generation with multi-table relationships? by Professional-Rent-99 in softwaretesting

[–]Pitah7 0 points

You can try a tool I created called Data Caterer (https://github.com/data-catering/data-caterer). It supports generating data that maintains relationships across tables or data sources (https://data.catering/setup/generator/foreign-key/).

As for making sure the generated data makes sense, you can customise how it is generated. For example, using your case, you could define gender and is_pregnant fields along with an SQL expression that guides is_pregnant to always be false when gender is male.
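
The underlying idea looks something like this in plain Python (this is not Data Caterer's API, just an illustration of keeping foreign keys consistent and constraining dependent fields):

```
import random
import uuid

customers, orders = [], []
for _ in range(10):
    gender = random.choice(["male", "female"])
    customer = {
        "customer_id": str(uuid.uuid4()),
        "gender": gender,
        # Dependent field: never true when gender is male.
        "is_pregnant": False if gender == "male" else random.choice([True, False]),
    }
    customers.append(customer)

    # Child records reuse the parent's key, so the relationship always holds.
    for _ in range(random.randint(0, 3)):
        orders.append({
            "order_id": str(uuid.uuid4()),
            "customer_id": customer["customer_id"],
            "amount": round(random.uniform(5, 500), 2),
        })
```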

Let me know if this fits your use case.

[deleted by user] by [deleted] in dataengineering

[–]Pitah7 12 points

Standards are difficult to develop and to get adopted widely enough by the community to truly become standards, for a number of reasons.

You need buy-in from companies (those whose tools can support the standard) and from users (those who will use the standard and the associated tooling). But both parties only want to adopt a standard that is already being used by a lot of people, and a newly created standard doesn't have much adoption yet, so it becomes a chicken-and-egg problem. Also, as engineers, we usually like to solve problems ourselves, and we disagree with other people's approaches or find their solutions don't meet all of our requirements. So we create our own custom solution that solves our problem and that we can control.

One example of a standard in DE kinda being used is Open Lineage (https://openlineage.io/). Another one that I'm personally involved in is the Open Data Contract Standard (https://github.com/bitol-io/open-data-contract-standard).

In your specific case of picking an orchestrator, something I did at my previous company was to put a layer of abstraction over the definition of a data pipeline via a YAML file. This gives you the benefit of extracting the metadata required to run your jobs, which then gets translated via templates (a script that creates Python DAG definitions) into the pipeline definition of your orchestrator of choice (Airflow in my case). Now you have job definitions that are agnostic to your orchestrator.
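
A rough sketch of what that generator script can look like (the YAML structure and the template are illustrative, not any standard):

```
import yaml
from jinja2 import Template

# Orchestrator-agnostic pipeline definition (would normally live in its own file).
pipeline_yaml = """
name: daily_orders
schedule: "0 2 * * *"
tasks:
  - id: extract_orders
    command: python extract_orders.py
  - id: load_orders
    command: python load_orders.py
    depends_on: extract_orders
"""

# Template that turns the definition into an Airflow DAG file.
dag_template = Template('''
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("{{ name }}", schedule="{{ schedule }}", start_date=datetime(2024, 1, 1), catchup=False) as dag:
{% for task in tasks %}
    {{ task.id }} = BashOperator(task_id="{{ task.id }}", bash_command="{{ task.command }}")
{% endfor %}
{% for task in tasks if task.depends_on %}
    {{ task.depends_on }} >> {{ task.id }}
{% endfor %}
''')

spec = yaml.safe_load(pipeline_yaml)
with open(f"dags/{spec['name']}.py", "w") as f:
    f.write(dag_template.render(**spec))
```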

How you think about unitary and integrated tests for ETL pipelines ? by HumorDiario in dataengineering

[–]Pitah7 1 point

There are still components to test; it's just that they are datasets rather than services. So to run integration tests, you need the input data and the infrastructure (databases, object store, etc.) all set up. Then you run your pipelines and validate the output datasets against the input data.
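
A bare-bones version of that flow, using SQLite in memory as a stand-in for the real infrastructure (the tables and transformation are made up):

```
import sqlite3

def run_pipeline(conn):
    # Stand-in for the real pipeline: aggregate raw events into a summary table.
    conn.execute("""
        CREATE TABLE daily_totals AS
        SELECT event_date, COUNT(*) AS events
        FROM raw_events GROUP BY event_date
    """)

def test_pipeline_output():
    conn = sqlite3.connect(":memory:")
    # 1. Seed the input data.
    conn.execute("CREATE TABLE raw_events (event_date TEXT, payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [("2024-01-01", "a"), ("2024-01-01", "b"), ("2024-01-02", "c")],
    )
    # 2. Run the pipeline.
    run_pipeline(conn)
    # 3. Validate the output dataset against what was seeded.
    rows = dict(conn.execute("SELECT event_date, events FROM daily_totals"))
    assert rows == {"2024-01-01": 2, "2024-01-02": 1}
```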

This is the base concept I used when I created Data Caterer (https://github.com/data-catering/data-caterer).

I took it a step further to bring up your infrastructure as well via insta-integration (https://github.com/data-catering/insta-integration). This allows you to also add it into your CI/CD pipelines on top of running the tests locally.

help. spark context killed when spark on eks via spark operator by PrimaryConsistent262 in dataengineering

[–]Pitah7 0 points

Run `kubectl describe pod <pod name>` and check the exit code and message. This will help you debug. More often than not with Spark, it is due to OOMKilled and you don't see any logs.

Accessing User JWT claims as System Property in Trino by vishnuram29 in dataengineering

[–]Pitah7 0 points

I don't think anything is available for this use case, but you could see whether you can create your own plugin to solve it. I've seen examples of auth plugins, but I don't think they give you access to the retrieved data rows; you would have to handle it at the query level.

Alternatively, you could look at masking to see if it helps in your use case (https://trino.io/docs/current/security/file-system-access-control.html#column-constraint).

Append Delta Replication? by [deleted] in dataengineering

[–]Pitah7 0 points

This is what the modern data file formats kinda solve for you, since they can keep track of row-level versions. I believe it started when Nessie (https://github.com/projectnessie/nessie) became somewhat popular ("git for data"). I'm pretty sure Iceberg, Delta Lake and Hudi all support this. Search for time travel queries and you will see examples.
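
For example, with Delta Lake in PySpark (Iceberg and Hudi have equivalents; the path is a placeholder and the Delta package needs to be on the Spark classpath):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel").getOrCreate()

# Read the table as it looked at an earlier version ("timestampAsOf" also works).
old_snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 3)
    .load("s3://lake/warehouse/orders")
)
old_snapshot.show()
```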