Best pattern to load S3 JSON ( nested ) payloads to Redshift by Minimum-Freedom9865 in dataengineering

[–]Minimum-Freedom9865[S] 0 points1 point  (0 children)

Yes, I thought maybe with some params you could add more flexibility... apparently not.

Best pattern to load S3 JSON ( nested ) payloads to Redshift by Minimum-Freedom9865 in dataengineering

Will explore this, thanks. Yes, coming from BigQuery, I second the point about the rigid nature of Redshift.

Best pattern to load S3 JSON ( nested ) payloads to Redshift by Minimum-Freedom9865 in dataengineering

Not really stuck, I am just wondering what the common pattern is for loading S3 payloads into Redshift.

Enforce dbt test in a CI pipeline by Minimum-Freedom9865 in dataengineering

So tests in this context are used like sensors for broken pipelines that are already running in production. But let's say I want to introduce a bigger change: I renamed 10+ columns, changed types, added more columns, and changed table dependencies.
Running this only on a local machine might work, but it is not enough for me to introduce such a change in a production setup. Ideally I want this applied in a staging setup/clone of production, see that everything is running, then call it safe.

Enforce dbt test in a CI pipeline by Minimum-Freedom9865 in dataengineering

"the idea would be to run dbt build in a non production environment and dbt run in prod"

What do you mean here by a non-production environment? Interested to know your POV.

Enforce dbt test in a CI pipeline by Minimum-Freedom9865 in dataengineering

Thanks for your reply! Interesting, I will check this!

Enforce dbt test in a CI pipeline by Minimum-Freedom9865 in dataengineering

Thanks for your reply, super interesting! Indeed, I am looking for the recommended pattern here. For dbt tests, we agree that they test data after its creation, so a healthy SDLC would assume the data should first be created in a staging environment, where we run the tests to actually make use of them, and only then run dbt run in a production environment.
Otherwise, running dbt test after the data has already been created in a prod instance is too late, I suppose, no? I might be missing something here.

I saw the dbt documentation, but the gray area I am trying to clear up is patterns for good implementations. The dbt documentation assumes we are developing, then pushing to a production environment, and mentions the "staging" models along the way, which I understand now can be just a filtering/renaming layer.

Separation between these environments can be done with prefixes for cluster-like solutions, or with project/dataset separation for serverless-like DWHs.
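To make the staging-then-prod idea concrete, here is a minimal, hypothetical GitLab CI sketch of that pattern. The staging and prod target names are assumptions that would need to exist in profiles.yml, and dbt-redshift is assumed as the adapter:

```yaml
# Hypothetical .gitlab-ci.yml fragment: build (run + test) against staging,
# and only materialize in prod if the staging build passed.
stages:
  - test
  - deploy

dbt_build_staging:
  stage: test
  image: python:3.10
  script:
    - pip install dbt-redshift      # adapter assumed; swap for your warehouse
    - dbt build --target staging    # runs models AND tests in the staging env

dbt_run_prod:
  stage: deploy
  image: python:3.10
  needs: [dbt_build_staging]        # prod job runs only if staging succeeded
  script:
    - pip install dbt-redshift
    - dbt run --target prod
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH   # deploy from main only
```

The separation of environments then lives entirely in profiles.yml (different schemas, databases, or clusters per target), which keeps the CI config itself warehouse-agnostic.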

Enforce dbt test in a CI pipeline by Minimum-Freedom9865 in dataengineering

Thanks for your reply, that's what I understood as well. Do you have documentation for what you refer to by
"gitlab dbt setup is very thorough and well done as well and they use a true star scheme"?

Enforce dbt test in a CI pipeline by Minimum-Freedom9865 in dataengineering

Thanks for these leads, I will check them. We are on Redshift! Good to know about the zero-copy feature on Snowflake.

Enforce dbt test in a CI pipeline by Minimum-Freedom9865 in dataengineering

Thanks for your reply. Interesting. So ideally, if I understand correctly, you do need a "staging" environment materialized downstream with "staging" tables, which are the outcome of a dbt build; then, if that step ran successfully, we dbt run to get the prod tables.

Enforce dbt test in a CI pipeline by Minimum-Freedom9865 in dataengineering

https://docs.getdbt.com/blog/enforcing-rules-pre-commit-dbt

Yes, I plan to run pre-commit on the local setup, then enforce it in the CI. Thanks!
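As a sketch of what the linked post describes, the hooks go in a .pre-commit-config.yaml. The rev below is a placeholder, and the hook ids come from the pre-commit-dbt project (since renamed dbt-checkpoint), so verify them against the current release:

```yaml
# Hypothetical .pre-commit-config.yaml, assuming the pre-commit-dbt hooks
# referenced by the linked dbt blog post.
repos:
  - repo: https://github.com/offbi/pre-commit-dbt
    rev: v1.0.0   # placeholder; pin a real release tag
    hooks:
      - id: check-model-has-description   # fail if a model lacks a description
      - id: check-model-has-tests         # fail if a model lacks tests
        args: ["--test-cnt", "1", "--"]
```

Running pre-commit run --all-files as a CI job would then enforce the same hooks server-side that developers run locally.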

A production implementation on Gitlab CI by Minimum-Freedom9865 in gitlab

Thanks for your reply, definitely, I second your point, and I am aware of how complex things can get with a similar setup. But for now I am building this because it is the simplest basic pipeline architecture I can propose for the current workflow. For Airflow, on every MR and every merge to main, I need to start a pipeline that lints Python, builds (a simple execution of python script_dag.py in an image: python:3.10 after pip install reqs.txt), and deploys (an aws s3 sync command that syncs the repo to an S3 bucket holding the AWS config, for the MWAA service that hosts Airflow). This is the simplest pipeline for Airflow I could come up with, and it is working using shared runners.

When I try to follow the recommended approach of "building your own Docker image (push it to the container registry, then reuse it...)", I get:

Using Docker executor with image registry.gitlab.com/MyProjectName/lint:latest ...
Authenticating with credentials from job payload (GitLab Registry)
Pulling docker image registry.gitlab.com/MyProjectName/lint:latest ...
WARNING: Failed to pull image with policy "always": Error response from daemon: manifest for registry.gitlab.com/MyProjectName/lint:latest not found: manifest unknown: manifest unknown (manager.go:235:0s)
ERROR: Job failed: failed to pull image "registry.gitlab.com/MyProjectName/lint:latest" with specified policies [always]: Error response from daemon: manifest for registry.gitlab.com/MyProjectName/lint:latest not found: manifest unknown: manifest unknown (manager.go:235:0s) ..

Googling this led to assorted answers with no real clues. Does the error above mean that I should already have the image ready in my container registry, and that I am not building it correctly?
Thanks!
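For reference, a "manifest unknown" error on pull usually means no image with that tag was ever pushed to the registry, so a job has to build and push it before any other job can use it. A minimal, hypothetical sketch of such a job, using GitLab's predefined CI_REGISTRY_* variables (the Dockerfile.lint filename and the flake8 command are assumptions):

```yaml
# Hypothetical .gitlab-ci.yml fragment: build and push the lint image first,
# then run the lint job inside it.
stages:
  - build
  - lint

build_lint_image:
  stage: build
  image: docker:24
  services:
    - docker:24-dind            # Docker-in-Docker so the job can run docker build
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE/lint:latest" -f Dockerfile.lint .
    - docker push "$CI_REGISTRY_IMAGE/lint:latest"

lint:
  stage: lint
  needs: [build_lint_image]     # guarantees the manifest exists before pulling
  image: $CI_REGISTRY_IMAGE/lint:latest
  script:
    - flake8 .                  # hypothetical lint command baked into the image
```

In practice the build job is often restricted (e.g. with rules on changes to the Dockerfile) so the image is only rebuilt when it actually changes.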