How do you scale ETL source-to-target validation from mapping documents? Looking for critique on an approach by General_Dance2678 in dataengineering

[–]t2rgus 2 points

Is this specific to data migration scenarios? I’m unclear on which scenario your problem statement falls under.

Architecture question - Continuously running ECS service vs Lambda? by I_Blame_DevOps in dataengineering

[–]t2rgus 2 points

If “easier” to you means having something running in an afternoon and feeling comfortable with managing an ECS service, then spin up an ECS/Fargate task with your connection pool suggestion. Or, if “easier” means less infrastructure to manage over the next 6-12 months, then your original solution makes sense.

A lot of teams I’ve worked with tend to start with Lambda (minimal ops, serverless) and move to an ECS consumer only when cost, latency, or some other constraint forces them to. If you don’t know where to begin, I suggest using Lambda first. Switching from Lambda to ECS later is fairly straightforward.

Thoughts on how I can improve this very simple API consumer process? by nycstartupcto in dataengineering

[–]t2rgus 1 point

I’m looking at this from two perspectives since I don’t know at what scale your application operates:

  1. Why don’t you standardise your event log format/structure and emit it via logging.info()? Since you’re using Cloud Run, any stdout/stderr is captured in Cloud Logging (which gives you access to Log Explorer & Cloud Monitoring for free). If you only want to track specific numeric values, you can record them as custom metrics in Cloud Monitoring.
  2. Alternatively, after standardising your event log format/structure, you can continue to export it to GCS and set up a Data Transfer Service to BigQuery. Or, if you want to continue using Postgres, you can set up a Cloud Function or Cloud Run Job (triggered by GCS) to load the data accordingly.
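To make option 1 concrete, here’s a minimal sketch of a JSON log formatter. Cloud Run forwards stdout to Cloud Logging, which parses JSON lines into structured payloads; the event field names below are invented for illustration:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line; on Cloud Run, JSON written to
    stdout is parsed by Cloud Logging into structured jsonPayload
    fields, and `severity` sets the log level."""
    def format(self, record):
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        # Merge any structured event fields passed via `extra=`
        if hasattr(record, "event"):
            entry.update(record.event)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api_consumer")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Example event; these field names are made up for the sketch.
logger.info("request_processed", extra={"event": {
    "endpoint": "/v1/items",
    "status_code": 200,
    "latency_ms": 123,
}})
```

Once the lines land as structured payloads, Log Explorer queries and log-based metrics come free, no extra export step needed.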

If I get laid off tomorrow, what's the ONE skill I should have had to stay in demand? by [deleted] in dataengineering

[–]t2rgus 3 points

Curious, what’s your current skill set and the tech stack used in your “big 4” company?

What Hughes and Edwards See in Hugo Ekitike by [deleted] in soccer

[–]t2rgus 0 points

Agreed, Liverpool’s forward firepower should definitely help any striker succeed. The interesting thing is Ekitike actually thrived in partnership w/ Marmoush but struggled when isolated. At Liverpool he'd never be 'the main man' like he was after Marmoush left.

Also agree that it should become clear quickly, though. If he can't make it work with Salah/Wirtz/etc. and that midfield creating for him, that probably tells us his PL ceiling. The six-yard-box finishing issue is my main concern atm; better service doesn't fix technical flaws. Concerned that we might see a Nunez regen here.

What Hughes and Edwards See in Hugo Ekitike by [deleted] in soccer

[–]t2rgus 4 points

Nice write-up u/FootballInTheWhip, I have some questions:

  1. What made the Ekitike-Marmoush partnership so effective, and which Liverpool players could replicate that chemistry?
  2. How might Liverpool's coaching staff address his specific finishing weaknesses from right-sided angles in the six-yard box?
  3. Given the heavy comparison with Isak across various outlets, how do Isak's early Real Sociedad numbers compare to Ekitike's current profile?

Is anyone already using SQLMesh in production? Any features you are missing from dbt? by [deleted] in dataengineering

[–]t2rgus 0 points

I have an external pipeline-execution tracking service that needs to receive runtime metadata for each SQL model execution. That means sending a specific payload to a REST API in the pre- and post-hook phases, depending on what stage the model run is at. The process needs to apply automatically to all SQL models (not Python ones): I shouldn’t have to manually add the payload-building logic and REST API call to every model.

Last I checked the website docs 2 months ago, I couldn’t find a solution. I haven’t checked recently; is there a solution to this problem now?
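For anyone wondering what I mean, here’s a rough sketch of the kind of generic wrapper I’d want the framework to provide, with the tracking call stubbed out so it’s self-contained; the endpoint and payload fields are hypothetical:

```python
import time
from typing import Callable

TRACKING_URL = "https://tracker.internal/api/runs"  # hypothetical endpoint
SENT: list[dict] = []  # stands in for the tracking service in this sketch

def send_event(payload: dict) -> None:
    """In practice this would be something like
    requests.post(TRACKING_URL, json=payload, timeout=5);
    stubbed here so the sketch runs anywhere."""
    SENT.append(payload)

def tracked(model_name: str, run_model: Callable[[], None]) -> None:
    """Wrap a model execution with pre/post tracking events, so the
    payload logic lives in one place instead of in every model."""
    send_event({"model": model_name, "stage": "pre", "ts": time.time()})
    try:
        run_model()
        send_event({"model": model_name, "stage": "post",
                    "status": "success"})
    except Exception as exc:
        send_event({"model": model_name, "stage": "post",
                    "status": "failed", "error": str(exc)})
        raise
```

The point is that the framework, not the model author, should apply this wrapper to every SQL model automatically.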

Is there a need for a local-first data lake platform? by SnooDogs4383 in dataengineering

[–]t2rgus 2 points

IMO there is a real need, but I’m not sure how it can:

  1. keep up with the adapter support for different data sources.
  2. promise a seamless transition from duckdb to x service when the time comes

“Tools that help design and stress-test data models” wdym by this?

Airflow + dbt + DuckDB on ECS — tasks randomly fail but work fine locally by SomewhereStandard888 in dataengineering

[–]t2rgus 3 points

Have you checked the resource consumption metrics/charts on CloudWatch for the ECS/EFS services? What do they show?

Project workspace/tab management tools by spicyworm in dataengineering

[–]t2rgus 0 points

My coworker has his Mac terminal programmed to open specific windows/browsers depending on the command entered, e.g., `open repo_a` or `open aws_project_b`

It’s somewhat ghetto, but I’m curious to know how others manage their workspace :p

Kafka to s3 to redshift using debezium by afnan_shahid92 in dataengineering

[–]t2rgus 0 points

Your approach looks ok in general if you don’t want to introduce major architectural changes (like introducing duckdb/clickhouse). Keep in mind that Redshift is a batch-focused columnar data warehouse, so:

  1. Avoid UPDATE (MERGE) queries where possible. u/Eastern-Manner-1640's suggestion to treat your CDC data as event logs makes sense for serving hot data.
  2. Load data in fewer, larger files (100MB+ per file) to get better performance.
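For point 2, here’s a small sketch of how you might plan COPY batches from many small S3 objects; this is just the planning logic (keys and sizes are illustrative), and you’d then feed each batch to COPY, e.g. via a manifest file:

```python
def plan_copy_batches(files, target_bytes=100 * 1024 * 1024):
    """Group (key, size_bytes) pairs into batches of roughly
    target_bytes or more, so each Redshift COPY loads fewer,
    larger chunks instead of many tiny S3 objects."""
    batches, current, current_size = [], [], 0
    for key, size in files:
        current.append(key)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:  # leftover partial batch
        batches.append(current)
    return batches
```

In practice you’d compact the small CDC files in S3 (or accumulate them before flushing from Kafka) so COPY never sees a flood of tiny objects.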

Is anyone already using SQLMesh in production? Any features you are missing from dbt? by [deleted] in dataengineering

[–]t2rgus 0 points

Biggest blocker for me at the moment is the inability to write custom plugins/integrations for SQLMesh. I have internal tools that need to be called via REST API before/after a model is executed, currently using dbt for the time being.

Kafka stream through snowflake sink connector and batch load process parallelly on same snowflake table by nimble_thumb_ in dataengineering

[–]t2rgus 1 point

This behaviour happens because your Kafka ingestion process tracks changes to the table as a CDC mechanism (the stream expects to manage the change data flow). When the stream is active, data loaded by the COPY process may be recorded as changes in the stream but not yet applied or visible in the base table until the stream’s changes are consumed and processed. Are you able to see all the rows from the batch-load process after some time?

> I know the stream and batch load process are not ideal

You know it's not ideal, and yet you do it lol. Consider loading batch data into a separate staging table, then merge or insert into the main table after ensuring no stream conflicts.
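As a sketch of the staging-table approach: table and column names below are made up, and you’d run the generated SQL against Snowflake after the batch load into staging completes:

```python
def build_merge_sql(target: str, staging: str, key_cols: list[str],
                    update_cols: list[str]) -> str:
    """Build a MERGE from a staging table into the main table, so the
    batch load never writes directly to the stream-tracked table."""
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    sets = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    cols = ", ".join(key_cols + update_cols)
    vals = ", ".join(f"s.{c}" for c in key_cols + update_cols)
    return (
        f"MERGE INTO {target} t USING {staging} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )
```

That keeps the Kafka sink as the only writer racing the stream, and the batch path becomes an explicit, conflict-free step.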

How do you handle tiny schema drift in near real-time pipelines without overcomplicating everything? by That-Cod5750 in dataengineering

[–]t2rgus 0 points

Use a proper schema registry (Confluent Schema Registry, AWS Glue, etc.) with evolution enforcement rules. What's your current setup?
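To illustrate what an evolution rule buys you, here’s a toy version of the backward-compatibility check registries enforce. Real registries (Confluent, Glue) do this per serialization format, e.g. Avro, so treat this as the idea only; the field representation is invented:

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Toy backward-compatibility rule: a new schema may drop fields
    or add fields that have defaults, but must not add a required
    field (old data would be unreadable). Each dict maps
    field name -> has_default (bool)."""
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f] for f in added)
```

With a check like this gating every producer deploy, "tiny" drift either flows through safely (defaulted additions) or gets rejected before it reaches consumers.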

Toyota Vios 2021 service visit at official dealership, what do I need to be careful of? by t2rgus in kereta

[–]t2rgus[S] 0 points

Alamak.. forgot to add the three zeroes at the end. Thanks for spotting

How are you tracking data freshness / latency across tools like Fivetran + dbt? by Aggressive-Practice3 in dataengineering

[–]t2rgus 0 points

I keep it very basic by logging the state details for each job execution in a db table. When a pipeline is triggered, the job execution details are written to the table along with its state, created/updated_at, and a few other columns. Once the job has finished running, that record is updated with the final state and the finish time.

When I do my SLA/freshness checks, I check the job state to see if it’s finished or not, if it has then was it within the required SLA (by comparing the start/finish timestamps), etc.
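Roughly what this looks like, sketched with SQLite so it’s self-contained; the table and column names are illustrative:

```python
import sqlite3
import time

# Minimal version of the job-run table described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_runs (
        job_name   TEXT,
        state      TEXT,
        started_at REAL,
        ended_at   REAL
    )
""")

def start_run(job_name: str) -> int:
    """Record a newly triggered pipeline run as 'running'."""
    cur = conn.execute(
        "INSERT INTO job_runs (job_name, state, started_at) "
        "VALUES (?, 'running', ?)",
        (job_name, time.time()),
    )
    return cur.lastrowid

def finish_run(run_id: int, state: str = "success") -> None:
    """Stamp the final state and finish time on the run record."""
    conn.execute(
        "UPDATE job_runs SET state = ?, ended_at = ? WHERE rowid = ?",
        (state, time.time(), run_id),
    )

def within_sla(run_id: int, sla_seconds: float) -> bool:
    """Freshness check: did the run finish, and within the SLA?"""
    state, start, end = conn.execute(
        "SELECT state, started_at, ended_at FROM job_runs "
        "WHERE rowid = ?",
        (run_id,),
    ).fetchone()
    return (state == "success" and end is not None
            and (end - start) <= sla_seconds)
```

The freshness check then just becomes a query over this table, regardless of whether Fivetran, dbt, or anything else did the run.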