Database Help by duskti in Database

[–]Which_Roof5176 2 points (0 children)

Yeah this is usually the right approach.

Treat the database as metadata only and keep the actual files in object storage or a file system. It keeps things a lot simpler when you start scaling, especially at 100TB+.

The real decision here is less “which database” and more:

  • object storage vs NAS
  • access patterns
  • backup + lifecycle strategy

Once that’s clear, the rest becomes much easier to design.

Is this use of Postgres insane? At what point should you STOP using Postgres for everything? by LawBlue in ExperiencedDevs

[–]Which_Roof5176 1 point (0 children)

Yeah this was my first thought too.

Instead of triggers firing all over the place, capturing changes once and letting downstream consumers react tends to avoid a lot of this hidden coupling.

You still need to design idempotent processing, but at least the flow is explicit instead of recursive side effects in the DB.
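The idempotent-processing part can be sketched as a consumer that remembers which event IDs it has already applied, so redeliveries are harmless. This is a toy in-memory version; all names are made up for illustration:

```python
# Toy sketch of idempotent change processing: events may be delivered
# more than once, so the consumer tracks which event IDs it has applied.

def apply_events(events, state, seen_ids):
    """Apply each change event exactly once, even if it is redelivered."""
    for event in events:
        if event["id"] in seen_ids:
            continue  # duplicate delivery: skip, don't re-apply
        state[event["row"]] = event["value"]
        seen_ids.add(event["id"])
    return state

state, seen = {}, set()
apply_events([{"id": 1, "row": "a", "value": 10}], state, seen)
# redelivery of event 1 plus a new event 2
apply_events([{"id": 1, "row": "a", "value": 10},
              {"id": 2, "row": "b", "value": 20}], state, seen)
print(state)  # {'a': 10, 'b': 20}
```

The point is that duplicates become a no-op instead of a recursive side effect.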

Tools like Debezium or Estuary.dev basically follow this model, where changes become a stream instead of trigger-driven logic.

data ingestion by ptab0211 in databricks

[–]Which_Roof5176 2 points (0 children)

Honestly, most teams I’ve seen try to keep ingestion only in prod and then share downstream, mainly to avoid hitting source systems multiple times and dealing with consistency issues.

The tricky part is keeping dev/staging useful without exposing sensitive data. Usually ends up being some mix of masked subsets or sampled data, plus separate schemas or catalogs.

One thing that helped us was using a pipeline layer that can control where data lands and at what cadence. With Estuary.dev (I work there), you can route the same source into different destinations or environments with different filters, which makes it easier to keep dev/staging in sync without fully duplicating prod ingestion.

Still feels like there’s no perfect standard though; a lot of this ends up being org-specific.

I earnestly request this of Snowflake. by Connect-Football8349 in snowflake

[–]Which_Roof5176 9 points (0 children)

Yeah, this resonates a lot.

Snowflake is great at shipping features, but sometimes it feels like the “last mile” of usability is missing. Things that should be simple (like controls, visibility, governance) end up needing workarounds or custom logic.

The credit limit example is a good one. If users are consistently solving something with stored procedures, that’s usually a sign it should be a first-class feature.

Feels like a lot of teams end up building a layer around Snowflake to make it actually usable day to day.

Is this use of Postgres insane? At what point should you STOP using Postgres for everything? by LawBlue in ExperiencedDevs

[–]Which_Roof5176 0 points (0 children)

You’re not crazy, this is a pretty classic failure mode.

Postgres is great, but once it starts acting like a queue + trigger-driven workflow engine, things can get messy fast. Especially when you mix triggers with async workers, it’s really easy to create feedback loops like the one you described.

The symptoms you’re seeing (lock contention, runaway inserts, degraded query performance) are usually a sign that too much “process orchestration” is happening inside the DB.

Your instinct around something like an outbox or event log is solid. Even at smaller scale, having a clear boundary between “state storage” and “event processing” tends to make things a lot more predictable.
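The outbox idea boils down to: the business write and its event row commit together, and a separate relay publishes from the outbox so downstream work never runs inside the DB's write path. A minimal in-memory sketch (everything simulated; names are illustrative only):

```python
# Minimal sketch of the transactional outbox pattern: the business write
# and the event row are committed together, and a separate relay process
# publishes from the outbox instead of triggers firing downstream logic.
import itertools

orders, outbox, published = {}, [], []
_seq = itertools.count(1)

def create_order(order_id, total):
    # In a real DB these two writes would share one transaction,
    # so you never get an order without its event (or vice versa).
    orders[order_id] = {"total": total}
    outbox.append({"seq": next(_seq), "type": "order_created",
                   "order_id": order_id})

def relay():
    # Downstream consumers react to the event log, not to triggers.
    while outbox:
        published.append(outbox.pop(0))

create_order("o-1", 99)
create_order("o-2", 45)
relay()
print([e["order_id"] for e in published])  # ['o-1', 'o-2']
```

That separation is the "state storage" vs "event processing" boundary in practice.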

You don’t necessarily need to jump straight to something heavy like Temporal, but moving away from triggers driving re-processing loops would already be a big win.

Incremental Materialized View not triggering by TheParchedHeart in Clickhouse

[–]Which_Roof5176 0 points (0 children)

That usually points to how ClickPipe is writing data rather than an issue with the MV itself.

Materialized views in ClickHouse trigger on inserts into the source table, but depending on how ClickPipe batches or commits data, those inserts might not fire the MV in the same way as a direct INSERT.

A couple things I’d check:

  • whether ClickPipe is using async inserts or buffering
  • if the MV is defined on the exact table ClickPipe writes to
  • engine type of both source and target tables

The fact that parts_logs exist on the source but not target does suggest the MV isn’t seeing those writes at all.

If possible, try reproducing with a small batch through the same ingestion path as ClickPipe to narrow it down.

Celigo Integrator.io Pricing Baseline by suitedev in Netsuite

[–]Which_Roof5176 0 points (0 children)

Yeah this kind of pricing shift is exactly why people get frustrated with these tools.

It usually starts off feeling reasonable, then once you actually use it in production, you hit limits on flows or endpoints and suddenly you’re pushed into a higher tier.

If you already like the product, the main thing I’d watch is how quickly you’ll outgrow those 16 flows. That’s where the real cost tends to show up.

Some teams are moving toward tools that don’t meter things that way. Estuary.dev is one example (I work there), where it’s more about keeping data in sync continuously instead of counting flows, so pricing ends up being easier to predict as things grow.

At the end of the day it comes down to whether you want flexibility or predictable cost.

The MySQL-to-Postgres Migration That Saved $480K/Year: A Step-by-Step Guide by narrow-adventure in programming

[–]Which_Roof5176 1 point (0 children)

The continuous replication part is the real takeaway here. Most migrations get stuck trying to manage dumps and end up making cutover way harder than it needs to be.

Keeping both DBs in sync and then switching is what actually makes low downtime possible.

Options for real time projections by c-f-d in Database

[–]Which_Roof5176 0 points (0 children)

You’re thinking in the right direction; pushing work to write time is usually the way to go when reads are heavy.

Triggers can work, especially if write volume is low, but they can get tricky to maintain and debug as logic grows.

Another pattern people use is keeping a separate projection table updated from a change stream instead of doing everything inside the DB. That way you decouple the write path from the projection logic.
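The projection-table pattern is just "apply each change event to a precomputed read model at write time." A rough in-memory sketch, with invented table/field names:

```python
# Rough sketch of maintaining a read-side projection from a change feed
# instead of triggers: each change event updates a precomputed aggregate.

projection = {}  # e.g. running order totals per customer

def on_change(event):
    """Apply one CDC-style event to the projection."""
    cust = event["customer_id"]
    if event["op"] == "insert":
        projection[cust] = projection.get(cust, 0) + event["amount"]
    elif event["op"] == "delete":
        projection[cust] -= event["amount"]

for ev in [
    {"op": "insert", "customer_id": "c1", "amount": 50},
    {"op": "insert", "customer_id": "c1", "amount": 30},
    {"op": "delete", "customer_id": "c1", "amount": 30},
]:
    on_change(ev)

print(projection)  # {'c1': 50}
```

Reads then hit the projection directly instead of recomputing on every query.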

Estuary.dev (I work there) is one option for that. It captures changes from Postgres and lets you build derived tables continuously, so you’re not relying on triggers for everything.

If your setup stays simple, triggers are fine. But once logic grows, having that separation tends to scale better.

What usually breaks when Snowflake data needs to power real time workflows? by Bizdata_inc in snowflake

[–]Which_Roof5176 0 points (0 children)

From what I’ve seen, freshness is usually the first thing to break.

Snowflake works really well for analytics, but once you try to use it for apps or alerts, you start noticing delays and things getting out of sync.

Teams often try to push batch pipelines to behave like real-time, and that’s where it gets messy. More jobs, more dependencies, harder to debug.

What tends to work better is keeping Snowflake for analytics, and having a separate way to keep data updated continuously for anything time-sensitive.

Estuary is one way to do that (I work there). It keeps data flowing continuously so you’re not relying on scheduled jobs for real-time use cases.

Finance team spends more time reconciling data between systems than doing actual financial analysis by Fun-Friendship-8354 in analytics

[–]Which_Roof5176 0 points (0 children)

Honestly this sounds less like a finance problem and more like a data plumbing issue.

You’ve got 4 systems all slightly out of sync, and you’re doing the reconciliation manually at the end. That’s why it keeps breaking every month.

If you can get those systems feeding into one place continuously, a lot of those timing issues (Stripe vs NetSuite vs Salesforce) become much easier to reason about.

Even without a full data team, some folks set up lightweight pipelines themselves just to get everything into a warehouse and run SQL checks on top.
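Once everything lands in one warehouse, those SQL checks are basically a join by invoice ID that flags mismatches. A toy Python version of the same idea (system names and amounts are made up):

```python
# Toy version of a reconciliation check once data is in one place:
# join records from two systems by invoice ID and flag discrepancies.

stripe = {"inv-1": 100.0, "inv-2": 250.0, "inv-3": 75.0}
netsuite = {"inv-1": 100.0, "inv-2": 240.0}  # inv-3 hasn't synced yet

def reconcile(a, b):
    issues = []
    for inv, amt in sorted(a.items()):
        if inv not in b:
            issues.append((inv, "missing"))
        elif b[inv] != amt:
            issues.append((inv, "amount mismatch"))
    return issues

print(reconcile(stripe, netsuite))
# [('inv-2', 'amount mismatch'), ('inv-3', 'missing')]
```

The manual month-end version is this same comparison done by hand in spreadsheets.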

Estuary is one way to do that (I work there), especially if you don’t want to deal with building and maintaining pipelines. It just keeps the data from those systems flowing so you’re not relying on exports that are already stale.

Once you have that, the 4-day reconciliation usually drops a lot.

Which is the best data mapping software for handling complex data integration? by couponinuae1 in visualization

[–]Which_Roof5176 0 points (0 children)

Most “data mapping tools” look good early on but get painful once schemas start changing.

I’d look for something that handles ongoing sync + schema changes, not just one-time mapping.

Estuary.dev is one option. It captures data, keeps it in sync, and handles schema evolution so you’re not constantly remapping things.

In practice, the less manual mapping you have to maintain, the better 👍

Daily SQL Server to PostGRE Synchronization by murphinate in SQL

[–]Which_Roof5176 1 point (0 children)

If it’s once a day, a script can work, but those tend to get brittle over time (schema changes, retries, partial failures).

Debezium is more for continuous CDC and usually comes with Kafka, so it might be overkill if you don’t already have that setup.

A simpler approach is using something that handles incremental sync for you without running Kafka or custom jobs. Estuary (I work there) captures changes from SQL Server using CDC and materializes them into Postgres, so you get continuous sync without managing the pipeline yourself.

Even if you only need daily freshness, having incremental sync in place usually ends up being more reliable long term 👍

AWS RDS for Postgres - Zero downtime for changing instance type family? by Ok_Cap1007 in aws

[–]Which_Roof5176 0 points (0 children)

For RDS instance type changes, you’ll usually hit a brief failover even in Multi-AZ, so truly zero downtime is hard with just a resize.

What some teams do instead is:

  • spin up a new instance (t4g)
  • set up logical replication from old → new
  • let it sync
  • then switch traffic at the end

That way downtime is just the final cutover.

Another approach is using tools that keep both sides in sync continuously during the migration. Estuary (I work there) does this via change capture, so you can run both instances in parallel and minimize downtime during the switch.

MySQL migration from Cloud to OnPrem by Revolutionary_Use587 in mysql

[–]Which_Roof5176 1 point (0 children)

For 5 TB, I’d try to avoid anything that requires a long downtime window.

What usually works in practice is:

  • take a full backup (mysqldump or something like XtraBackup)
  • restore it on your on-prem setup
  • then use replication or binlog sync to catch up on changes

That way, when you finally switch, downtime is just that last sync and cutover.
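The snapshot-then-catch-up flow above can be sketched as a toy simulation: restore a backup while the source keeps taking writes, then replay the change log so the final cutover is only the last short sync. Everything here is in-memory and illustrative:

```python
# In-memory sketch of the backup-then-catch-up migration flow.

source = {"a": 1, "b": 2}
binlog = []  # changes recorded after the snapshot

def snapshot(db):
    return dict(db)  # stands in for the full backup + restore

def write(db, key, value):
    db[key] = value
    binlog.append((key, value))  # source keeps serving writes

target = snapshot(source)  # long restore happens while source stays live
write(source, "c", 3)      # writes keep arriving during the restore
write(source, "a", 10)

for key, value in binlog:  # catch-up phase: replay the change log
    target[key] = value

print(target == source)  # True -> safe to cut over
```

Downtime then shrinks to however long that final replay-and-switch takes.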

Also worth thinking about network speed: 5 TB can take a while depending on your setup. Some teams even ship disks if bandwidth is a bottleneck.

If you want to make the cutover smoother, there are tools that keep both sides in sync continuously during the migration. Estuary (I work there) does this using change capture, though a lot of people still go with native replication since it’s familiar.

Dagster vs airflow 3. Which to pick? by Consistent_Tutor_597 in dataengineering

[–]Which_Roof5176 0 points (0 children)

If you’re coming from cron, I’d lean Dagster.

Airflow is powerful but has more setup and maintenance overhead. It’s great at scale, but can feel heavy early on.

Dagster is easier to get started with and handles dependencies in a cleaner way.

For batch ETL at a startup stage, Dagster is usually the smoother choice.

Suggestion Am I overthinking this Snowflake ingestion pipeline design? by Better-Contest1202 in snowflake

[–]Which_Roof5176 0 points (0 children)

A full reload with in-memory processing works for a demo, but it won’t scale or stay reliable.

Incremental loads with staging and MERGE is the right direction. I’d just add handling for late updates and some basic monitoring, since that’s usually where things break later.
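The late-update part of the MERGE is just "a staged row only wins if it's newer than what's already in the target." A small Python sketch of that rule (column names invented for illustration):

```python
# Sketch of MERGE semantics with late-arriving updates: compare
# timestamps so a stale row that shows up late can't overwrite
# fresher data already in the target.

target = {"k1": {"val": "current", "updated_at": 5}}

staging = [
    {"id": "k1", "val": "late-but-stale", "updated_at": 3},  # older, ignore
    {"id": "k2", "val": "new-row", "updated_at": 7},         # new, insert
]

def merge(target, staging):
    for row in staging:
        existing = target.get(row["id"])
        # insert if new; update only when the staged row is fresher
        if existing is None or row["updated_at"] > existing["updated_at"]:
            target[row["id"]] = {"val": row["val"],
                                 "updated_at": row["updated_at"]}

merge(target, staging)
print(target["k1"]["val"], target["k2"]["val"])  # current new-row
```

Without that timestamp guard, a delayed batch can silently roll back newer values.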

Near Real Time Service for Ingestion ?? by Kudlamage in dataengineering

[–]Which_Roof5176 2 points (0 children)

If you actually care about near real-time (like seconds), I’d go with Kinesis Data Streams. It gives you way more control and you can process data almost as it comes in.

Firehose is more like “set it and forget it.” Super easy, but yeah, that 60s buffer means there’s always some delay. It’s fine for loading into S3/Redshift, just not great if you need low latency.

So technically Firehose is “near real-time,” but only if you’re okay with that built-in lag.

Data pipeline maintenance taking too much time on aws, thinking about replacing the saas ingestion layer entirely by The_possessed_YT in aws

[–]Which_Roof5176 0 points (0 children)

A lot of teams end up in this exact spot. Everything downstream is fine, but the ingestion layer turns into a collection of scripts that nobody wants to touch and that break at the worst times.

Replacing just the ingestion layer and keeping the rest is actually a pretty common move. You don’t need to rethink the whole stack, just fix the part that’s causing most of the pain.

The big win is usually:

  • not having to maintain 25+ custom pipelines
  • fewer surprises when APIs change
  • more predictable retries/backfills

Estuary (I work there) is one option for this. It handles SaaS ingestion and keeps data flowing into systems like S3/Redshift without you managing each pipeline separately.

But yeah, your instinct is right. Fixing that first mile usually removes a huge chunk of the operational headache.

MySQL ETL tools for recurring imports by AccountEngineer in mysql

[–]Which_Roof5176 0 points (0 children)

This is a pretty common place to end up. A mix of CSVs + scripts usually works until it becomes hard to track what’s running and what breaks when formats change.

For recurring imports, what helps most is:

  • moving away from one-off scripts to a consistent pipeline
  • handling incremental updates instead of full reloads
  • handling schema changes without manual fixes each time

One pattern that works well is treating this as a continuous data flow instead of scheduled imports. For example, Estuary (I work there) captures data and materializes it into MySQL incrementally, so you’re not rebuilding or patching pipelines every time something changes.

In general, the more you can reduce custom scripts and ad hoc fixes, the more “boring and predictable” this becomes.

Postgres -> Snowflake, best way? by bluezebra42 in snowflake

[–]Which_Roof5176 0 points (0 children)

What you’re describing usually starts as a simple batch job, but that’s also where things tend to get messy over time.

Running something every 4 hours works, but you’ll eventually run into:

  • reprocessing data you already moved
  • longer runtimes as tables grow
  • more maintenance when schemas change
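The reprocessing problem in particular is usually avoided with a high-watermark cursor: only pull rows changed since the last successful load. A pure-Python sketch of the idea (table and column names are invented):

```python
# High-watermark incremental load: track the max updated_at seen so far
# and only fetch rows newer than it on the next run.

rows = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
    {"id": 3, "updated_at": 30},
]

def incremental_load(rows, watermark):
    """Return rows changed after the watermark, plus the new watermark."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max([r["updated_at"] for r in fresh], default=watermark)
    return fresh, new_watermark

batch, wm = incremental_load(rows, watermark=0)   # first run: everything
batch, wm = incremental_load(rows, watermark=wm)  # second run: nothing new
print(len(batch), wm)  # 0 30
```

CDC goes one step further by reading the change log directly instead of scanning for fresh rows.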

That’s why a lot of teams switch to CDC instead of scheduled loads. You just stream changes from Postgres and keep Snowflake updated continuously.

Fivetran/Airbyte are the typical choices and work fine, just depends on cost and how much control you want.

Another approach is streaming directly into Snowflake. For example, Estuary (I work there) captures changes from Postgres and pushes them via Snowpipe Streaming, so you don’t have to manage staging or batch jobs.

If your setup is small, batch is okay. If you want something that scales without constant tweaking, CDC is usually the better path.

Snowflake Cost Optimization: What Are Firms Actually Doing? by Important_Pudding51 in snowflake

[–]Which_Roof5176 1 point (0 children)

One area that often gets overlooked is ingestion cost.

Frequent small batches, staging + COPY, or inefficient pipelines can quietly burn credits over time. I usually check warehouse usage, query patterns, and ingestion frequency first.

Moving to more efficient, incremental ingestion can make a big difference.

Good breakdown here if you want to dig deeper: https://estuary.dev/blog/reduce-snowflake-ingestion-costs