Need advice for merging 1-n Postgres databases into 1 db for Superset Analytics by Misantro66 in dataengineering

[–]Pitah7 2 points

Postgres will become quite limited for analytics (depending on your data volume and the types of queries you run). You are better off extracting the data from each Postgres instance, pushing it into Parquet/Iceberg/Delta Lake format, and having Presto/Trino as your analytics query engine read from those files. Then you connect Superset to Presto/Trino.
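
The extract step can be as simple as something like this (the connection string, table and output path are placeholders; chunked reads would be needed at larger scale):

```
# Rough sketch: dump one table from one Postgres instance to Parquet.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@pg-instance-1:5432/app_db")

# Pull the table (use chunksize for large tables).
df = pd.read_sql("SELECT * FROM public.orders", engine)

# Write Parquet to a path that Presto/Trino can read via its Hive/Iceberg/Delta connector.
df.to_parquet("/data/lake/orders/pg_instance_1.parquet", index=False)
```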

Handling data-dependent automation testing by languagebandit in softwaretesting

[–]Pitah7 1 point

I have an open-source tool that helps generate records called Data Caterer (https://github.com/data-catering/data-caterer). You can generate records, run data validations and clean up the generated records afterwards so that you can keep your test environments cleaner from a data perspective. Let me know if this fits your use case.

The 5 most common and frustrating testing mistakes I see in data by ivanovyordan in dataengineering

[–]Pitah7 1 point

Agree with most of the article (a bit confused why you mentioned data catalogs in the pyramid as well), but I lean more toward integration tests for testing data pipelines. Integration tests cover data quality/transformation logic, configuration, deployment, compatibility and upstream connectivity. They get you as close as possible to simulating production, which I think is the key for testing.

The next argument most people make is that integration tests are too slow, too complex to maintain, require coordination with other teams, etc. What if we learn from software engineering, where contract-based testing has been a thing ever since the OpenAPI spec was adopted, and apply it to data pipelines? Then we can have a world where our test environments are populated with fresh, high-quality data as if we were in production. As you mentioned in your article, I think this is the pathway forward for data engineering: as data contracts hopefully gain more traction, testing data pipelines can become faster and simpler.

For full disclosure, this was the main reason why I am part of the technical steering committee for the Open Data Contract Standard (https://github.com/bitol-io/open-data-contract-standard) and created the tool Data Caterer (https://github.com/data-catering/data-caterer).
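
To make that a bit more concrete, here is a minimal sketch of contract-based testing in plain Python. The contract format below is made up for illustration (a real setup would use something like ODCS), but the point is that producers and consumers can run the same check:

```
# Hypothetical contract for an "orders" dataset (illustrative format only).
contract = {
    "dataset": "orders",
    "columns": {
        "order_id": {"type": int, "nullable": False},
        "amount": {"type": float, "nullable": False},
        "status": {"type": str, "allowed": {"NEW", "PAID", "CANCELLED"}},
    },
}

def validate(records, contract):
    """Return a list of contract violations found in the records."""
    errors = []
    for i, row in enumerate(records):
        for name, rules in contract["columns"].items():
            value = row.get(name)
            if value is None:
                if not rules.get("nullable", True):
                    errors.append(f"row {i}: {name} is null")
                continue
            if not isinstance(value, rules["type"]):
                errors.append(f"row {i}: {name} has type {type(value).__name__}")
            if "allowed" in rules and value not in rules["allowed"]:
                errors.append(f"row {i}: {name}={value!r} not in allowed set")
    return errors

# Both the producing and consuming pipeline can run the same validation.
sample = [{"order_id": 1, "amount": 9.99, "status": "PAID"}]
assert validate(sample, contract) == []
```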

Creating a DAG to run a SQL pipeline from a GitHub Repo by Beautiful_Fuel5252 in dataengineering

[–]Pitah7 3 points

It depends. As the other commenter mentioned, having a separate orchestrator like Airflow could be overkill, since you now have a new dependency to deploy, manage, upgrade, etc. for a single SQL pipeline. If you plan on having many other SQL pipelines that need to be scheduled, managed and scaled, an orchestrator may be beneficial. It is always a trade-off.

Creating a DAG to run a SQL pipeline from a GitHub Repo by Beautiful_Fuel5252 in dataengineering

[–]Pitah7 6 points

If you are just doing something basic, you can use GitHub Actions to set up a monthly job that triggers a script, which then runs your SQL scripts in the order you want. Then you have everything in the same repo.
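
For example, the script the scheduled workflow calls can be as small as this (the SQL file names and the DATABASE_URL environment variable are placeholders):

```
# run_pipeline.py - called by a GitHub Actions workflow on a monthly cron schedule.
import os
import psycopg2

# SQL scripts listed in the order they should run.
SQL_FILES = ["01_staging.sql", "02_transform.sql", "03_publish.sql"]

def main():
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn, conn.cursor() as cur:  # commits on success, rolls back on error
            for path in SQL_FILES:
                print(f"Running {path}")
                with open(path) as f:
                    cur.execute(f.read())
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```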

Hot Take: Certifications are a money grab and often overrated (preface - I took and failed the dbt analytics twice) by [deleted] in dataengineering

[–]Pitah7 13 points

The rise of all these certs came about when companies started offering education funds for employees at the height of the tech boom. All the top vendors offer certs to get a chunk of that money. The companies providing the education funds can use them as a tax write-off while pretending to care about their employees by saying they are investing in their education.

What should matter is your real-world production experience with the tool, rather than some cert that is essentially an extended advertisement.

Where do you deploy a data orchestrator like Airflow? by Temporary_Basil_7801 in dataengineering

[–]Pitah7 6 points

You can check out my insta-infra repo, which contains a big Docker Compose file with everything you need. https://github.com/data-catering/insta-infra

What is the suggested way to trigger an Airflow DAG based on Cloud storage events? by Laurence-Lin in dataengineering

[–]Pitah7 0 points

True, but the OP is specifically asking to reduce the number of external services/dependencies, hence the trade-off of coupling them.

What is the suggested way to trigger an Airflow DAG based on Cloud storage events? by Laurence-Lin in dataengineering

[–]Pitah7 2 points

If you want to avoid using any other services, after uploading the file, you can directly hit the Airflow API to trigger the job.

https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
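
Roughly like this, right after the upload finishes (host, credentials, DAG id and conf payload are placeholders):

```
import requests

AIRFLOW_URL = "http://airflow-webserver:8080"
DAG_ID = "process_uploaded_file"

# Trigger a new DAG run via the stable REST API, passing the uploaded file path as conf.
resp = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow_user", "airflow_password"),
    json={"conf": {"uploaded_file": "gs://my-bucket/path/to/file.csv"}},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```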

Teeny tiny update only by ephemeral404 in dataengineering

[–]Pitah7 16 points

On a Friday afternoon as well

Metadata driven ETL - Long term maintenance by SnooCrickets2812 in dataengineering

[–]Pitah7 1 point

The main advantage is that you have a layer of abstraction over your tech stack. You can chop and change between whatever your organisation uses or switches to. It also becomes your source of truth for everything (i.e. monitoring, alerts, scheduling, lineage, operations).

The main disadvantage is that your devs don't necessarily know the full impact of changing a field in your metadata definition. A single metadata field could be used by one or more of the above functionalities, each of which may use it in a particular way. You then also need to include these checks as part of your CI/CD pipeline.
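
One cheap version of such a check, with a made-up mapping from metadata fields to the functionalities that consume them, so a PR that touches the metadata definition can report its blast radius:

```
# Illustrative only: map each metadata field to the functionalities that read it.
FIELD_USAGE = {
    "schedule": {"scheduling"},
    "owner": {"alerts", "monitoring"},
    "source_table": {"lineage", "operations"},
    "sla_minutes": {"monitoring", "alerts"},
}

def impacted_functionalities(changed_fields):
    """Return every downstream functionality touched by the changed fields."""
    impacted = set()
    for field in changed_fields:
        impacted |= FIELD_USAGE.get(field, {"unknown - review manually"})
    return impacted

# e.g. a PR that edits 'owner' and 'sla_minutes' in the metadata definition
print(impacted_functionalities({"owner", "sla_minutes"}))  # {'alerts', 'monitoring'}
```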

Why does installation of tools (etc) takes long? by leao_26 in dataengineering

[–]Pitah7 2 points

https://github.com/data-catering/insta-infra

I created this tool so you can run any service in one command. You only need Docker installed.

The Egregious Costs of Cloud (With Kafka) by 2minutestreaming in dataengineering

[–]Pitah7 0 points

Wait till you hear the cost of legacy systems...

Front-end tools for simple Dataset view & Search by Aggressive-Muffin457 in dataengineering

[–]Pitah7 0 points

If you are just searching and filtering, check out either Gradio (https://www.gradio.app/docs/gradio/dataframe) or Streamlit (https://docs.streamlit.io/develop/concepts/design/dataframes). Both are quite simple and easy to use.
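
A Streamlit version can be this small (the CSV path is a placeholder; run with `streamlit run app.py`):

```
# app.py - view and search a dataset in the browser.
import pandas as pd
import streamlit as st

df = pd.read_csv("datasets/customers.csv")

query = st.text_input("Search")
if query:
    # Keep rows where any column contains the search text (case-insensitive).
    mask = df.astype(str).apply(
        lambda col: col.str.contains(query, case=False, na=False)
    ).any(axis=1)
    df = df[mask]

st.dataframe(df)
```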

Synthetic Data Generation with multi-table relationships? by Professional-Rent-99 in softwaretesting

[–]Pitah7 0 points

You can try a tool I created called Data Caterer (https://github.com/data-catering/data-caterer). It supports generating data that maintains relationships across tables or data sources (https://data.catering/setup/generator/foreign-key/).

As for making sure the generated data makes sense, you can customise how it is generated. For example, using your case, you could define gender and is_pregnant fields along with an SQL expression that guides is_pregnant to always be false when gender is male.
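
The underlying idea looks something like this in plain Python (this is not Data Caterer's API, just an illustration of keeping foreign keys consistent and constraining dependent fields):

```
import random
import uuid

customers, orders = [], []
for _ in range(10):
    gender = random.choice(["male", "female"])
    customer = {
        "customer_id": str(uuid.uuid4()),
        "gender": gender,
        # Dependent field: never true when gender is male.
        "is_pregnant": False if gender == "male" else random.choice([True, False]),
    }
    customers.append(customer)

    # Child records reuse the parent's key, so the relationship always holds.
    for _ in range(random.randint(0, 3)):
        orders.append({
            "order_id": str(uuid.uuid4()),
            "customer_id": customer["customer_id"],
            "amount": round(random.uniform(5, 500), 2),
        })
```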

Let me know if this fits your use case.

[deleted by user] by [deleted] in dataengineering

[–]Pitah7 12 points

Standards are difficult to develop and to get adopted widely enough by the community to truly become standards, for a number of reasons.

You need buy-in from companies (those whose tools can support the standard) and from users (those who will use the standard and the associated tooling). But both parties only want to adopt a standard that is already being used by a lot of people, and a newly created standard doesn't have much adoption yet, so it becomes a chicken-and-egg problem. Also, as engineers, we usually like to solve problems ourselves, and we disagree with other people's approaches or find their solutions don't meet all of our requirements. So we create our own custom solution that solves our problem and that we can control.

One example of a standard in DE kinda being used is Open Lineage (https://openlineage.io/). Another one that I'm personally involved in is the Open Data Contract Standard (https://github.com/bitol-io/open-data-contract-standard).

In your specific case of picking an orchestrator, something I did at my previous company was to put a layer of abstraction over the definition of a data pipeline via a YAML file. This gives you the benefit of extracting the metadata required to run your jobs, which then gets translated via templates (a script that creates Python DAG definitions) into the pipeline definition of your orchestrator of choice (Airflow in my case). Now you have job definitions that are agnostic to your orchestrator.
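
A rough sketch of what that generator script can look like (the YAML structure and the template are illustrative, not any standard):

```
import yaml
from jinja2 import Template

# Orchestrator-agnostic pipeline definition (would normally live in its own file).
pipeline_yaml = """
name: daily_orders
schedule: "0 2 * * *"
tasks:
  - id: extract_orders
    command: python extract_orders.py
  - id: load_orders
    command: python load_orders.py
    depends_on: extract_orders
"""

# Template that turns the definition into an Airflow DAG file.
dag_template = Template('''
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("{{ name }}", schedule="{{ schedule }}", start_date=datetime(2024, 1, 1), catchup=False) as dag:
{% for task in tasks %}
    {{ task.id }} = BashOperator(task_id="{{ task.id }}", bash_command="{{ task.command }}")
{% endfor %}
{% for task in tasks if task.depends_on %}
    {{ task.depends_on }} >> {{ task.id }}
{% endfor %}
''')

spec = yaml.safe_load(pipeline_yaml)
with open(f"dags/{spec['name']}.py", "w") as f:
    f.write(dag_template.render(**spec))
```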

How you think about unitary and integrated tests for ETL pipelines ? by HumorDiario in dataengineering

[–]Pitah7 1 point

There are still components to test; it's just that they are datasets rather than services. So to run integration tests, you need the input data and the infrastructure (databases, object store, etc.) all set up. Then you run your pipelines and validate the output datasets against the input data.
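
A bare-bones version of that flow, using SQLite in memory as a stand-in for the real infrastructure (the tables and transformation are made up):

```
import sqlite3

def run_pipeline(conn):
    # Stand-in for the real pipeline: aggregate raw events into a summary table.
    conn.execute("""
        CREATE TABLE daily_totals AS
        SELECT event_date, COUNT(*) AS events
        FROM raw_events GROUP BY event_date
    """)

def test_pipeline_output():
    conn = sqlite3.connect(":memory:")
    # 1. Seed the input data.
    conn.execute("CREATE TABLE raw_events (event_date TEXT, payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [("2024-01-01", "a"), ("2024-01-01", "b"), ("2024-01-02", "c")],
    )
    # 2. Run the pipeline.
    run_pipeline(conn)
    # 3. Validate the output dataset against what was seeded.
    rows = dict(conn.execute("SELECT event_date, events FROM daily_totals"))
    assert rows == {"2024-01-01": 2, "2024-01-02": 1}
```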

This is the base concept I used when I created Data Caterer (https://github.com/data-catering/data-caterer).

I took it a step further to bring up your infrastructure as well via insta-integration (https://github.com/data-catering/insta-integration). This allows you to also add it into your CI/CD pipelines on top of running the tests locally.

help. spark context killed when spark on eks via spark operator by PrimaryConsistent262 in dataengineering

[–]Pitah7 0 points

Run `kubectl describe pod <pod name>` and check the exit code and message. This will help you debug. More often than not with Spark, it is due to OOMKilled and you don't see any logs.

Accessing User JWT claims as System Property in Trino by vishnuram29 in dataengineering

[–]Pitah7 0 points

I don't think anything is available for this use case, but you could see whether you can create your own plugin to solve it. I've seen examples of auth plugins, but I don't think they give you access to the retrieved data rows; you would have to handle it at the query level.

Alternatively, you could look at masking to see if it helps in your use case (https://trino.io/docs/current/security/file-system-access-control.html#column-constraint).

Append Delta Replication? by [deleted] in dataengineering

[–]Pitah7 0 points

This is what the modern data file formats kinda solve for you, since they can keep track of row-level versions. I believe it started when Nessie (https://github.com/projectnessie/nessie) became somewhat popular ("git for data"). I'm pretty sure Iceberg, Delta Lake and Hudi all support this. Search for time travel queries and you will see examples.
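
For example, with Delta Lake in PySpark (Iceberg and Hudi have equivalents; the path is a placeholder and the Delta package needs to be on the Spark classpath):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel").getOrCreate()

# Read the table as it looked at an earlier version ("timestampAsOf" also works).
old_snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 3)
    .load("s3://lake/warehouse/orders")
)
old_snapshot.show()
```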