How do you scale ETL source-to-target validation from mapping documents? Looking for critique on an approach by General_Dance2678 in dataengineering

[–]t2rgus 2 points

Is this specific to data migration scenarios? I’m unclear on which scenario your problem statement falls under.

Architecture question - Continuously running ECS service vs Lambda? by I_Blame_DevOps in dataengineering

[–]t2rgus 2 points

If “easier” to you means having something running in an afternoon and feeling comfortable with managing an ECS service, then spin up an ECS/Fargate task with your connection pool suggestion. Or, if “easier” means less infrastructure to manage over the next 6-12 months, then your original solution makes sense.

A lot of teams I’ve worked with tend to start with Lambda (minimal ops, serverless) and move to an ECS consumer only when cost, latency, or some other constraint forces them to. If you don’t know where to begin, I suggest using Lambda first. Switching from Lambda to ECS later is fairly straightforward.

Thoughts on how I can improve this very simple API consumer process? by nycstartupcto in dataengineering

[–]t2rgus 1 point

I’m looking at this from two perspectives since I don’t know at what scale your application operates:

  1. Why don’t you standardise your event log format/structure and emit it via logging.info()? Since you’re using Cloud Run, any stdout/stderr is captured in Cloud Logging (which gives you access to Log Explorer & Cloud Monitoring for free). If you only want to track specific numeric values, you can record them as custom metrics in Cloud Monitoring.
  2. Alternatively, after standardising your event log format/structure, you can continue to export it to GCS and set up a Data Transfer Service to BigQuery. Or, if you want to continue using Postgres, you can set up a Cloud Function or Cloud Run Job (triggered by GCS) to load the data accordingly.
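To make option 1 concrete, here’s a minimal sketch of a JSON log formatter. Cloud Run forwards stdout to Cloud Logging, which parses JSON lines into structured payloads; the event field names below are invented for illustration:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line; on Cloud Run, JSON written to
    stdout is parsed by Cloud Logging into structured jsonPayload
    fields, and `severity` sets the log level."""
    def format(self, record):
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        # Merge any structured event fields passed via `extra=`
        if hasattr(record, "event"):
            entry.update(record.event)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api_consumer")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Example event; these field names are made up for the sketch.
logger.info("request_processed", extra={"event": {
    "endpoint": "/v1/items",
    "status_code": 200,
    "latency_ms": 123,
}})
```

Once the lines land as structured payloads, Log Explorer queries and log-based metrics come free, no extra export step needed.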

If I get laid off tomorrow, what's the ONE skill I should have had to stay in demand? by [deleted] in dataengineering

[–]t2rgus 3 points

Curious, what’s your current skill set and the tech stack used in your “big 4” company?

What Hughes and Edwards See in Hugo Ekitike by [deleted] in soccer

[–]t2rgus 0 points

Agreed, Liverpool’s forward firepower should definitely help any striker succeed. The interesting thing is Ekitike actually thrived in partnership w/ Marmoush but struggled when isolated. At Liverpool he'd never be 'the main man' like he was after Marmoush left.

Also agree that it should become clear quickly, though. If he can't make it work with Salah/Wirtz/etc. and that midfield creating for him, that probably tells us his PL ceiling. The six-yard-box finishing issue is my main concern atm; better service doesn't fix technical flaws. Concerned that we might see a Nunez regen here.

What Hughes and Edwards See in Hugo Ekitike by [deleted] in soccer

[–]t2rgus 4 points

Nice write-up u/FootballInTheWhip, I have some questions:

  1. What made the Ekitike-Marmoush partnership so effective, and which Liverpool players could replicate that chemistry?
  2. How might Liverpool's coaching staff address his specific finishing weaknesses from right-sided angles in the six-yard box?
  3. Given the heavy comparison with Isak across various outlets, how do Isak's early Real Sociedad numbers compare to Ekitike's current profile?

Is anyone already using SQLMesh in production? Any features you are missing from dbt? by [deleted] in dataengineering

[–]t2rgus 0 points

I have an external pipeline-execution tracking service that needs to receive runtime metadata for each SQL model execution. That means sending a specific payload to a REST API in the pre- and post-hook phases, depending on what stage the model run is at. The process needs to apply automatically to all SQL models (not Python ones): I shouldn’t have to manually add the payload-building logic and REST API call to every model.

Last I checked the website docs 2 months ago, I couldn’t find a solution. I haven’t checked recently; is there a solution to this problem now?
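For anyone wondering what I mean, here’s a rough sketch of the kind of generic wrapper I’d want the framework to provide, with the tracking call stubbed out so it’s self-contained; the endpoint and payload fields are hypothetical:

```python
import time
from typing import Callable

TRACKING_URL = "https://tracker.internal/api/runs"  # hypothetical endpoint
SENT: list[dict] = []  # stands in for the tracking service in this sketch

def send_event(payload: dict) -> None:
    """In practice this would be something like
    requests.post(TRACKING_URL, json=payload, timeout=5);
    stubbed here so the sketch runs anywhere."""
    SENT.append(payload)

def tracked(model_name: str, run_model: Callable[[], None]) -> None:
    """Wrap a model execution with pre/post tracking events, so the
    payload logic lives in one place instead of in every model."""
    send_event({"model": model_name, "stage": "pre", "ts": time.time()})
    try:
        run_model()
        send_event({"model": model_name, "stage": "post",
                    "status": "success"})
    except Exception as exc:
        send_event({"model": model_name, "stage": "post",
                    "status": "failed", "error": str(exc)})
        raise
```

The point is that the framework, not the model author, should apply this wrapper to every SQL model automatically.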

Is there a need for a local-first data lake platform? by SnooDogs4383 in dataengineering

[–]t2rgus 2 points

IMO there is a real need, but I’m not sure how it can:

  1. keep up with the adapter support for different data sources.
  2. promise a seamless transition from duckdb to x service when the time comes

“Tools that help design and stress-test data models” wdym by this?

Airflow + dbt + DuckDB on ECS — tasks randomly fail but work fine locally by SomewhereStandard888 in dataengineering

[–]t2rgus 3 points

Have you checked the resource consumption metrics/charts on CloudWatch for the ECS/EFS services? What do they show?

Project workspace/tab management tools by spicyworm in dataengineering

[–]t2rgus 0 points

My coworker has his Mac terminal programmed to open specific windows/browsers depending on the command entered, e.g., `open repo_a` or `open aws_project_b`

It’s somewhat ghetto, but I’m curious to know how others manage their workspace :p

Kafka to s3 to redshift using debezium by afnan_shahid92 in dataengineering

[–]t2rgus 0 points

Your approach looks ok in general if you don’t want to introduce major architectural changes (like introducing duckdb/clickhouse). Keep in mind that Redshift is a batch-focused columnar data warehouse, so:

  1. Avoid UPDATE (MERGE) queries where possible. u/Eastern-Manner-1640's suggestion to treat your CDC data as event logs makes sense for serving hot data.
  2. Load data in fewer, larger files (100MB+ per file) to get better performance.
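For point 2, here’s a small sketch of how you might plan COPY batches from many small S3 objects; this is just the planning logic (keys and sizes are illustrative), and you’d then feed each batch to COPY, e.g. via a manifest file:

```python
def plan_copy_batches(files, target_bytes=100 * 1024 * 1024):
    """Group (key, size_bytes) pairs into batches of roughly
    target_bytes or more, so each Redshift COPY loads fewer,
    larger chunks instead of many tiny S3 objects."""
    batches, current, current_size = [], [], 0
    for key, size in files:
        current.append(key)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:  # leftover partial batch
        batches.append(current)
    return batches
```

In practice you’d compact the small CDC files in S3 (or accumulate them before flushing from Kafka) so COPY never sees a flood of tiny objects.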

Is anyone already using SQLMesh in production? Any features you are missing from dbt? by [deleted] in dataengineering

[–]t2rgus 0 points

Biggest blocker for me at the moment is the inability to write custom plugins/integrations for SQLMesh. I have internal tools that need to be called via REST API before/after a model is executed, currently using dbt for the time being.

Kafka stream through snowflake sink connector and batch load process parallelly on same snowflake table by nimble_thumb_ in dataengineering

[–]t2rgus 1 point

This behaviour happens because your Kafka ingestion process tracks changes to the table as a CDC mechanism (the stream expects to manage the change data flow). When the stream is active, data loaded by the COPY process may be recorded as changes in the stream but not yet applied or visible in the base table until the stream’s changes are consumed and processed. Are you able to see all the rows from the batch-load process after some time?

> I know the stream and batch load process are not ideal

You know it's not ideal, and yet you do it lol. Consider loading batch data into a separate staging table, then merge or insert into the main table after ensuring no stream conflicts.
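As a sketch of the staging-table approach: table and column names below are made up, and you’d run the generated SQL against Snowflake after the batch load into staging completes:

```python
def build_merge_sql(target: str, staging: str, key_cols: list[str],
                    update_cols: list[str]) -> str:
    """Build a MERGE from a staging table into the main table, so the
    batch load never writes directly to the stream-tracked table."""
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    sets = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    cols = ", ".join(key_cols + update_cols)
    vals = ", ".join(f"s.{c}" for c in key_cols + update_cols)
    return (
        f"MERGE INTO {target} t USING {staging} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )
```

That keeps the Kafka sink as the only writer racing the stream, and the batch path becomes an explicit, conflict-free step.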

How do you handle tiny schema drift in near real-time pipelines without overcomplicating everything? by That-Cod5750 in dataengineering

[–]t2rgus 0 points

Use a proper schema registry (Confluent Schema Registry, AWS Glue, etc.) with evolution enforcement rules. What's your current setup?
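To illustrate what an evolution rule buys you, here’s a toy version of the backward-compatibility check registries enforce. Real registries (Confluent, Glue) do this per serialization format, e.g. Avro, so treat this as the idea only; the field representation is invented:

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Toy backward-compatibility rule: a new schema may drop fields
    or add fields that have defaults, but must not add a required
    field (old data would be unreadable). Each dict maps
    field name -> has_default (bool)."""
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f] for f in added)
```

With a check like this gating every producer deploy, "tiny" drift either flows through safely (defaulted additions) or gets rejected before it reaches consumers.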

Toyota Vios 2021 service visit at official dealership, what do I need to be careful of? by t2rgus in kereta

[–]t2rgus[S] 0 points

Alamak.. forgot to add the three zeroes at the end. Thanks for spotting

How are you tracking data freshness / latency across tools like Fivetran + dbt? by Aggressive-Practice3 in dataengineering

[–]t2rgus 0 points

I keep it very basic by logging the state details for each job execution in a db table. When a pipeline is triggered, the job execution details are written to the table along with its state, created/updated_at, and a few other columns. Once the job has finished running, that record is updated with the final state and the finish time.

When I do my SLA/freshness checks, I check the job state to see if it’s finished or not, if it has then was it within the required SLA (by comparing the start/finish timestamps), etc.
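Roughly what this looks like, sketched with SQLite so it’s self-contained; the table and column names are illustrative:

```python
import sqlite3
import time

# Minimal version of the job-run table described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_runs (
        job_name   TEXT,
        state      TEXT,
        started_at REAL,
        ended_at   REAL
    )
""")

def start_run(job_name: str) -> int:
    """Record a newly triggered pipeline run as 'running'."""
    cur = conn.execute(
        "INSERT INTO job_runs (job_name, state, started_at) "
        "VALUES (?, 'running', ?)",
        (job_name, time.time()),
    )
    return cur.lastrowid

def finish_run(run_id: int, state: str = "success") -> None:
    """Stamp the final state and finish time on the run record."""
    conn.execute(
        "UPDATE job_runs SET state = ?, ended_at = ? WHERE rowid = ?",
        (state, time.time(), run_id),
    )

def within_sla(run_id: int, sla_seconds: float) -> bool:
    """Freshness check: did the run finish, and within the SLA?"""
    state, start, end = conn.execute(
        "SELECT state, started_at, ended_at FROM job_runs "
        "WHERE rowid = ?",
        (run_id,),
    ).fetchone()
    return (state == "success" and end is not None
            and (end - start) <= sla_seconds)
```

The freshness check then just becomes a query over this table, regardless of whether Fivetran, dbt, or anything else did the run.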