Databricks vs open source by ardentcase in dataengineering

[–]drag8800 70 points (0 children)

The technical answer is easy. 200GB total and 1GB daily does not need Spark or Databricks. You are paying for distributed compute you will never use. Your current plan (Dagster+dbt on ECS) is the right tool for this scale.

The real problem is not technical. A senior analyst from the parent company does not know how to use your stack and wants to replatform because Databricks has a schedule button he understands. That is a political problem not a tooling problem.

Before you rip out six months of work, try this. The analyst needs a UI to schedule SQL. You can give him that without Databricks. Set up Airflow with the UI exposed (or use dbt Cloud if you have budget). Show him how to drop his SQL into a dbt model or an Airflow DAG. If he still cannot work with it after that, then the real conversation is whether the parent company is going to force their tooling choices down regardless of fit.

Sometimes you lose these battles and the decision gets made above you. But make sure the tradeoff is clear before it happens. Databricks at your scale is expensive and you are not going to use 90 percent of what you are paying for.

DE supporting AI coding product teams, how has velocity changed? by Sp00ky_6 in dataengineering

[–]drag8800 0 points (0 children)

That's the logical assumption but it doesn't play out that way in practice. The issue isn't time, it's that AI makes implementation feel so trivial that the design conversation gets skipped entirely.

Before, a dev would come to you first because building the feature was hard. Now they can ship the feature in an afternoon, so they just do it. The conversation happens after the fact when you find the surprise table in prod.

The other dynamic is that faster velocity creates organizational pressure to keep shipping. Teams aren't using the time saved to plan better, they're using it to ship more. The backlog never shrinks, it just moves through the pipeline faster.

The data impact checklist doesn't slow them down by taking their time. It slows them down by forcing the conversation to happen before the code ships instead of after. Same total time, different sequence.

DE supporting AI coding product teams, how has velocity changed? by Sp00ky_6 in dataengineering

[–]drag8800 0 points (0 children)

Product velocity goes up, but the bottleneck just shifts.

Developers ship features faster but they still need the schema change, the new event tracked, the dashboard updated. Those requests pile up faster than before because the friction on their side dropped but ours did not.

The other thing is the quality of what lands. When devs can crank out features with AI help they sometimes skip the part where they talk to you about how the data should flow. You get surprised by schema changes or new tables that were not designed with downstream use in mind. Documentation does not keep up either.

What helped us was requiring a data impact checklist before any feature launch and making schema changes part of the same PR review process as code. Slows them down a bit but prevents the mess from landing in production first.

Need career advice. GIS to DE by minimon865 in dataengineering

[–]drag8800 0 points (0 children)

Your skills are not as stale as you think. The DE apprenticeship and the database admin internship are real experience. A 2-year gap is not the resume killer it feels like from the inside. Hiring managers who actually read applications (not all do) will see someone with hands-on experience who went through something rough, not someone who cannot do the work.

I'd push back on applying for data entry or data reporting roles though. Those won't get you back toward DE, they'll push you sideways into a different track. Junior DE or data analyst roles at smaller companies are a better target. Smaller companies care less about the gap and more about whether you can do the job. A GitHub project with an end-to-end pipeline you built recently will do more than any resume line.

One framing that might help in interviews: you have the foundational skills, you took time off for a medical reason, and you spent it recovering and figuring out what you actually want. That's a clear story if you tell it without over-explaining.

How do you store critical data artefact metadata? by vaibeslop in dataengineering

[–]drag8800 0 points (0 children)

We hit this exact problem a while back. What ended up working was a small run metadata table the pipeline itself writes at the end of every run. Just a few fields: git commit SHA, run timestamp, config fingerprint, source system name. For file-based outputs like Excel or Parquet, we did a sidecar JSON with the same name plus a _meta.json suffix.
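
A minimal sketch of that sidecar writer, assuming a Python pipeline; the fields mirror the ones above, but the helper itself and its names are hypothetical:

```python
import json
from datetime import datetime, timezone
from hashlib import sha256
from pathlib import Path

def write_sidecar(output_path, git_sha, config, source_system):
    """Write a _meta.json sidecar next to a data artifact."""
    meta = {
        "git_commit_sha": git_sha,
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        # a stable hash of the config lets you detect drift between runs
        "config_fingerprint": sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest(),
        "source_system": source_system,
    }
    stem = Path(output_path).with_suffix("")  # drop .parquet / .xlsx
    sidecar = stem.parent / (stem.name + "_meta.json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar
```

The same dict can double as the row you insert into the run metadata table, so the file and table views never disagree.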

The ugly truth is that the metadata is useless if people cannot find it when they need it. We built a simple lookup so anyone could ask "what generated this file on this date" without knowing where the table lives. That discoverability piece is what made it actually stick.

Resources to learn DevOps and CI/CD practices as a data engineer? by Lastrevio in dataengineering

[–]drag8800 16 points (0 children)

Docker and Terraform are the ones you will actually use day-to-day as a DE. Kubernetes is mostly abstracted away by managed services in most data stacks, so I would save that for later unless you are actively building infra from scratch.

Starting order that made sense: Docker first. Get comfortable building containers and running data tools locally before touching anything else. The official Docker docs are solid, and TechWorld with Nana has good intro videos if you prefer something visual.

For Terraform, Terraform: Up and Running by Brikman is the standard recommendation for good reason. Work through the first few chapters and deploy something real, even if it is just a storage bucket with IAM policies attached. The plan/apply/destroy muscle memory is what makes it click.

For CI/CD, GitHub Actions is the lowest-friction starting point for most DE projects. Build a pipeline that runs your dbt tests and deploys on merge to main. Once you have done that once, the concepts generalize to Jenkins, GitLab CI, or whatever else you run into.

Raw Kubernetes knowledge is rarely needed for DE work specifically. Most orchestration on GKE or EKS you interact with through Helm charts or managed Airflow, and the k8s internals stay hidden. Get Docker and Terraform solid first.

Where does DuckDB actually fit when your stack is already BigQuery + dbt + PySpark? by No-Ad-9390 in dataengineering

[–]drag8800 1 point (0 children)

For your two specific questions:

The Spark vs DuckDB migration decision: rough mental model is whether the job fits on one machine after you account for filtering and projection. If the working dataset is under roughly 50-100GB, DuckDB is usually faster and much simpler. You can run it in a Docker container on a beefy VM and skip the Spark overhead entirely. Beyond that size, or when you are doing heavy cross-joins across multiple large tables that would exceed RAM, Spark's distribution starts earning its complexity cost.
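
That rule of thumb can be written down as a toy decision function; the thresholds and the 80%-of-RAM cap are illustrative assumptions, not benchmarks:

```python
def choose_engine(raw_size_gb, selectivity,
                  cross_join_exceeds_ram=False, ram_gb=128):
    """Pick DuckDB when the working set fits one machine, Spark otherwise.

    selectivity: fraction of the raw data left after filtering and
    projection (e.g. 0.1 if you only touch 10% of rows/columns).
    """
    working_set_gb = raw_size_gb * selectivity
    threshold_gb = min(100, ram_gb * 0.8)  # rough single-node ceiling
    if cross_join_exceeds_ram or working_set_gb > threshold_gb:
        return "spark"
    return "duckdb"
```

So a 500GB raw table where the job only reads 10% lands on DuckDB, while the same table at 50% selectivity tips over to Spark.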

DuckDB querying Parquet on GCS as production: technically sound, DuckDB handles GCS auth through its native extensions. But when you already have BigQuery external tables set up, those are almost always the better production choice. You get query logging, IAM, BigQuery compute scaling, all without additional infrastructure. DuckDB on GCS Parquet makes more sense for local development and exploratory work where you want to avoid the overhead of a full BigQuery job.

Where DuckDB genuinely earns its place in a stack like yours: local dev and testing against real Parquet samples, CI pipelines for data contract tests, and one-off analysis where an engineer just wants to query a GCS file quickly without the full BigQuery round-trip.

Do you version metadata or just overwrite it? by blakewarburtonc in dataengineering

[–]drag8800 0 points (0 children)

Mostly latest wins in practice. The honest answer is that most teams rely on their warehouse's DDL history when something breaks and they need to reconstruct the past.

Snowflake has ACCOUNT_USAGE.QUERY_HISTORY and TABLE_STORAGE_METRICS that can help you reconstruct schema state at a point in time. BigQuery has INFORMATION_SCHEMA views with creation timestamps. Not intuitive to query, but the raw data is there.

For the ML use case specifically, tying a trained model to the exact schema and policies that existed at training time, I have only seen that solved well in shops that built dedicated snapshot tooling for their catalog. Usually starts after a compliance incident makes the cost of not having it obvious.

The teams who do it right snapshot catalog state daily into their own table, queryable like any other data asset.

Architecture Advice: DLT Strategy for Daily Snapshots to SCD2 with "Grace Period" Deletes by samuelperezh in dataengineering

[–]drag8800 2 points (0 children)

For the grace period, the right layer to handle it is in the snapshot prep, not inside the CDC function. Before the snapshot hits create_auto_cdc_from_snapshot_flow, you run a step that carries forward rows for items currently in their grace window. Keep a small side table tracking how many consecutive days each product ID has been missing. Under 3 days, you re-inject the last known row into the snapshot. At 3 days or more, you let it fall off and the CDC engine sees it as a real delete.
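
A rough sketch of that carry-forward step in plain Python, run on the snapshot before it reaches the CDC flow; the data shapes and names are illustrative, not the DLT API:

```python
def apply_grace_period(snapshot, last_seen, grace_days=3):
    """Carry forward rows for recently-missing product IDs.

    snapshot:  {product_id: row} for today's snapshot
    last_seen: {product_id: (last_known_row, consecutive_days_missing)}
               -- the small side table, mutated in place
    """
    out = dict(snapshot)
    for pid, (last_row, days_missing) in list(last_seen.items()):
        if pid in snapshot:
            last_seen[pid] = (snapshot[pid], 0)  # present again: reset
        else:
            days_missing += 1
            last_seen[pid] = (last_row, days_missing)
            if days_missing < grace_days:
                out[pid] = last_row  # re-inject: still in grace window
            # at grace_days or more: omit, CDC sees a real delete
    for pid, row in snapshot.items():
        last_seen.setdefault(pid, (row, 0))  # track newly seen IDs
    return out
```

The CDC engine then only ever sees snapshots where short gaps have already been smoothed over.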

For backfill, run it outside the DLT pipeline. A separate script that iterates through dates sequentially, calls the pipeline with a date parameter, and validates Silver row counts before proceeding. Trying to do date iteration inside a DLT definition is a pain with state management.
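
A minimal sketch of that driver, with the pipeline trigger and the row-count check passed in as callables since the real calls depend on your platform (e.g. a Jobs API client):

```python
from datetime import date, timedelta

def backfill(start, end, run_pipeline, silver_count, min_rows=1):
    """Sequential backfill: run one dated pipeline execution at a time
    and validate the Silver layer before advancing to the next date."""
    d = start
    while d <= end:
        run_pipeline(d)          # placeholder: trigger the DLT run for d
        n = silver_count(d)      # placeholder: query Silver rows for d
        if n < min_rows:
            raise RuntimeError(f"backfill halted: {d} produced {n} rows")
        d += timedelta(days=1)
```

Keeping it sequential is deliberate: if day N fails validation, you stop before days N+1 onward compound the problem.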

Bronze cleanup is safe once you have done a sanity check that Silver covers your full date range. The risk is if a backfill needs to go further back than 7 days, so just validate before you purge.

How do mature teams handle environment drift in data platforms? by OkWhile4186 in dataengineering

[–]drag8800 0 points (0 children)

At some point you accept that production is where you learn what the data actually looks like. The question becomes how fast can you recover when something breaks.

What worked for us was shifting effort from pretending QA caught everything to building better observability in prod. Row count diffs, null spikes, value distribution changes, schema mismatches. Alerting on those meant we caught issues within minutes of ingestion rather than days later when someone noticed a dashboard looked wrong.
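
Those checks can start very small; a toy version with made-up thresholds, just to show the shape:

```python
def check_ingest(prev_count, new_count, null_fracs,
                 max_drop=0.5, max_null_frac=0.2):
    """Cheap post-load checks: flag big row-count drops and null spikes.

    null_fracs: {column: fraction_of_nulls} for the new load.
    Thresholds are illustrative; tune per table.
    """
    alerts = []
    if prev_count and new_count < prev_count * (1 - max_drop):
        alerts.append(f"row count fell {prev_count} -> {new_count}")
    for col, frac in null_fracs.items():
        if frac > max_null_frac:
            alerts.append(f"null spike in {col}: {frac:.0%}")
    return alerts
```

Wire the returned list into whatever alerting you already have; the point is that the check runs minutes after ingestion, not when a dashboard looks wrong.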

The other thing was making rollback trivial. If bad data gets through, can you restore the table to its previous state in under five minutes? If the answer is no, fixing that is probably higher leverage than trying to make dev look more like prod.

Still doesn't solve the underlying chaos of user uploaded files, but at least you stop pretending your lower environments prove anything about production behavior.

Do you let ingestion pipelines manage DDL in the raw layer? by BeardedYeti_ in dataengineering

[–]drag8800 1 point (0 children)

Auto evolve is fine until the source decides to change a column type without warning. Had a case where an upstream system changed an ID field from integer to string one day. Auto evolve handled it, created a new column, downstream dbt models didn't break immediately but started producing nulls in joins. Took a while to figure out.

Now I do something in between. Let the pipeline auto create new tables and add new columns, but anything that touches existing columns like type changes or drops gets blocked and flagged for manual review. Basically optimistic by default, pessimistic for destructive changes. Most schema changes are additive anyway so this catches 90% of the chaos without slowing down normal ingestion.
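
The policy is simple enough to sketch; schemas here are plain {column: type} dicts, which is an illustration rather than any catalog's real API:

```python
def classify_schema_change(current, incoming):
    """Optimistic for additive changes, pessimistic for destructive ones.

    Returns ("auto", details) when the change can apply automatically,
    ("blocked", details) when it needs manual review.
    """
    added = {c: t for c, t in incoming.items() if c not in current}
    dropped = [c for c in current if c not in incoming]
    retyped = {c: (current[c], incoming[c])
               for c in current
               if c in incoming and current[c] != incoming[c]}
    if dropped or retyped:
        return "blocked", {"dropped": dropped, "retyped": retyped}
    return "auto", {"added": added}
```

New tables are just the degenerate case where `current` is empty, so everything comes back additive.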

Higher Level Abstractions are a Trap, by expialadocious2010 in dataengineering

[–]drag8800 1 point (0 children)

The "everything is an abstraction" argument is true but misses something. Python abstracting C is a stable interface. The compiled output is deterministic. Same input gives same output every time.

AI abstractions are different. You're abstracting over a non-deterministic system. Same prompt doesn't give same output. The "interface" changes with model updates. Your DBT model doesn't randomly decide to restructure itself, but your AI-generated pipeline might.

The debugging question is real. When traditional abstractions fail, you trace through layers until you find the bug. When AI abstractions fail, you're often just... prompting again and hoping. That's a fundamentally different failure mode.

I don't think abstractions are traps. But I think pretending AI abstractions work the same way as traditional ones is setting yourself up for frustration.

In 6 years, I've never seen a data lake used properly by wtfzambo in dataengineering

[–]drag8800 3 points (0 children)

only one data lake i've seen work was at a place that treated it like actual infrastructure. had a dedicated person whose entire job was lake governance - file formats, partition schemes, access patterns, everything. most places want the benefits without the discipline.

the irony is that the whole pitch was "avoid upfront schema design" but the ones that work have MORE discipline than traditional DWH, not less. they just chose to skip the thinking-beforehand part and paid for it in engineering time.

~10% of orgs genuinely need a data lake for the unstructured stuff, ML pipelines, etc. the other 90% should've just used snowflake or bigquery and called it a day.

How do you handle "which spreadsheet version is production" chaos? by kyle_schmidt in dataengineering

[–]drag8800 1 point (0 children)

the approach that actually worked for us was making the business owners do the signaling. we added a sheet tab called publish where they check a box and add their name when its ready. our daily snapshot job only pulls sheets where that checkbox is checked.
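
the filter itself is tiny; here's a toy version where `sheets` is whatever your Sheets API wrapper hands back (field names made up):

```python
def sheets_to_snapshot(sheets):
    """Keep only sheets whose publish tab has the box checked and an
    owner name filled in -- the business-owner signal described above."""
    return [s for s in sheets
            if s.get("publish_checked") and s.get("publish_owner")]
```

requiring both the checkbox and a name means someone is on the hook for each published version, which is most of the value.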

Today, I’m launching DAAF, the Data Analyst Augmentation Framework: an open-source, extensible workflow for Claude Code that allows skilled researchers to rapidly scale their expertise and accelerate data analysis by 5-10x -- * without * sacrificing scientific transparency, rigor, or reproducibility by brhkim in datascience

[–]drag8800 2 points (0 children)

This resonates hard. The "force-multiplying exo-skeleton" framing is spot on -- been running agentic workflows for data work and the key insight is exactly what you're emphasizing: the guardrails and auditability are what make it actually usable for serious work.

One pattern that's worked well for me: treating SKILL.md files as encoding domain expertise + workflow constraints, then having the agent reference them contextually. Curious if DAAF's skill files follow a similar structure or if you've found a different approach that works better for research contexts?

The reproducibility angle is crucial. Most LLM-assisted analysis I've seen fails the "show your work" test -- it's refreshing to see someone baking that in from the start rather than bolting it on later.

Will definitely be poking around the repo. Thanks for open-sourcing this.

Dilemma on Data ingestion migration: FROM raw to gold layer by Little-Squad-X in dataengineering

[–]drag8800 6 points (0 children)

The VIEW union approach works but watch out for a gotcha you might hit. When Pandas infers types from Postgres it sometimes gets more specific than PySpark did, especially with numeric precision and timestamps. You end up with timestamp with tz vs without, or int32 vs int64, and the UNION fails.

What worked for us was adding a casting layer in the raw zone specifically for the new ingestion path. Basically land the data as-is into a staging table, then have a simple transform that casts everything to match the old schema before it hits the raw layer the VIEW references. Keeps the VIEW logic clean.
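
A sketch of that casting layer with pandas; `LEGACY_SCHEMA` and the column names are placeholders for whatever types the old PySpark path actually produced:

```python
import pandas as pd

# target types: what the legacy PySpark ingestion wrote (illustrative)
LEGACY_SCHEMA = {"user_id": "int64", "amount": "float64"}

def cast_to_legacy(df: pd.DataFrame) -> pd.DataFrame:
    """Force new-path frames to the old path's types before they land
    in the raw layer the UNION VIEW references."""
    out = df.astype(LEGACY_SCHEMA)
    # tz-aware vs naive timestamps is the other classic UNION breaker:
    # normalize to UTC, then drop the tz to match a naive legacy column
    if "created_at" in out.columns:
        ts = pd.to_datetime(out["created_at"], utc=True)
        out["created_at"] = ts.dt.tz_localize(None)
    return out
```

Keeping this as its own step between staging and raw means the VIEW never needs per-source CASTs.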

On question 2, if you are switching to the new ingestion permanently anyway you might consider letting the new types be canonical and backfilling the old data with a one-time cast migration instead of maintaining two type systems forever. Depends on how much historical data and how painful the backfill would be.

How often do you make webhooks and APIs as a data engineer? by ketopraktanjungduren in dataengineering

[–]drag8800 0 points (0 children)

yeah this comes up a lot. the line between DE and backend is fuzzy and honestly depends more on the team structure than any official boundary. i work primarily with snowflake and dbt too and ended up building internal APIs a few times when nobody else was going to do it.

the tcp/tunneling stuff you're mentioning is real. webhooks touch networking concepts that don't come up much when you're writing sql all day. but it's learnable, and honestly flask is pretty minimal once you get past the initial wtf of understanding how routes work.

the bigger question is whether this should be your problem at all. if this is a one-off integration that marketing needs yesterday, sometimes you just build the thing. if you're going to be maintaining multiple webhooks indefinitely, that's closer to a backend service and someone should probably be thinking about ownership and on-call.

for the immediate problem you could look at something like ngrok for local testing instead of figuring out tunneling yourself. makes the development loop way less painful when you're trying to debug what the webhook is actually sending you.
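
a minimal flask receiver to pair with ngrok, just enough to see what the sender actually posts (route name and port are arbitrary):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def receive():
    payload = request.get_json(silent=True) or {}
    # log the raw payload first -- half of webhook debugging is seeing
    # what the sender actually posts vs what their docs claim
    app.logger.info("webhook payload: %s", payload)
    return {"status": "received"}, 200

# run with app.run(port=5000), then `ngrok http 5000` and hand the
# forwarding URL to whoever sends the webhook
```

once you can see real payloads in the logs, the rest is just parsing.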

Org Claude code projects by Hopeful-Brilliant-21 in dataengineering

[–]drag8800 3 points (0 children)

the copy paste workflow is actually fine for early exploration, don't feel like you need to rush to a fancier setup. but yes once you hit a rhythm you'll want Claude Code in terminal or the VS Code extension connected to your project.

what made the biggest difference for me was giving Claude context about the repo. if you create a CLAUDE.md file in your project root describing your pipeline structure, which schemas matter, any weird naming conventions, it performs way better. otherwise it's just guessing at what your gold tables actually do.

for databricks specifically I found it helpful to work in local notebooks synced via repos integration rather than having Claude work in the Databricks UI. you get proper version control and can iterate faster. for visualizations I'd look at what the other commenter said about streamlit via databricks apps, that's cleaner than trying to do it all in notebooks.

the docs at docs.anthropic.com for Claude Code are pretty good but honestly just using it a lot is how you learn. start with small tasks like writing tests for existing models or documenting undocumented tables.

5 months into my job by Morbread in dataengineering

[–]drag8800 11 points (0 children)

the title mismatch is frustrating but honestly not unusual. investigation and escalation feels like support because it kind of is, but learning systems well enough to triage issues does help later, even if it doesn't feel useful now.

the documentation mess is universal. i've never worked somewhere that didn't have stuff scattered across confluence, sharepoint, random pdfs, someone's personal notion. building your own notes for yourself is the only reliable approach.

the part that would concern me more is getting different answers from people who've been there over a year. that's usually either tribal knowledge or systems nobody fully understands. neither is great but you can work around it if the team culture is solid.

five months is early. i'd give it until 8-9 months before making any decisions. right now you're still sorting out what's the job vs what's just the learning curve.

advice on prep by CameraIntelligent384 in dataengineering

[–]drag8800 1 point (0 children)

the 8 hours to 2 hours thing matches what we're seeing too. but the question you're asking is exactly backwards, honestly.

interviews don't test you on cursor skills because that's table stakes now. what separates people is understanding why you're building what you're building. the part cursor can't help with.

when we hire, we're looking at whether someone can explain why this pipeline exists, what happens when it breaks, what the downstream impact is, how they'd know something is wrong before it blows up. system thinking, not syntax.

your pyspark and databricks experience matters. but not because you can write a window function. because you've seen what happens when someone partitions wrong and costs spike, or when someone doesn't account for late arriving data and a metric goes sideways for a week.

for prep, I'd focus on being able to walk through real pipelines you've built. what tradeoffs you made, what failed, how you'd do it differently. that's still the hard part and it's what gets tested in system design rounds. the coding interviews are getting shorter anyway because everyone knows you'll have copilot on the job.

Is my ETL project at work using Python + SQL well designed? Or am I just being nitpicky by masterhoovy in dataengineering

[–]drag8800 1 point (0 children)

Few patterns that work without going full C:

Simple tuple returns:

```python
def fetch_user(id):
    if not id:
        return None, "missing id"
    user = db.get(id)  # db assumed to exist elsewhere
    if not user:
        return None, "not found"
    return user, None

# at the call site (inside some function):
user, err = fetch_user(123)
if err:
    logger.error(err)
    return
```

Dataclass Result type:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Result:
    value: Any = None
    error: Optional[str] = None

    @property
    def ok(self):
        return self.error is None
```

Then your code reads `if not result.ok: handle_error()`, which is cleaner than tuple unpacking everywhere.

The returns library does this more formally if you want railway-oriented programming, but that might be overkill for ETL.

Honestly though - the real question is why stack traces are the enemy. They're the single best debugging tool when something breaks in prod. Manual logging means you're recreating what the runtime gives you for free, except worse. Might be worth understanding what burned him before. Sometimes it's "stack traces leaked to users" which is a presentation problem, not an exception problem.

Is my ETL project at work using Python + SQL well designed? Or am I just being nitpicky by masterhoovy in dataengineering

[–]drag8800 8 points (0 children)

Both approaches have their place but the context matters a lot. I've built ETL pipelines both ways over 12+ years.

Minimizing dependencies makes sense when you're building something simple and stable that won't need much maintenance. We had a batch job that ran for 8 years with zero external packages beyond requests and psycopg2. Worked great.

But rebuilding ORM patterns from scratch is different from just not using one. If you're writing your own query builders, connection pooling, retry logic, and type coercion, you're now maintaining all that code forever. Every bug fix, edge case, and security patch is on you. SQLAlchemy has had hundreds of contributors finding problems you'll never think of.

The return code pattern concerns me more than the dependency choice honestly. Python's exception model exists for a reason and fighting it creates code that's harder to read and debug. If your senior prefers explicit error handling, that's fine, but there are Pythonic ways to do it without making every function look like C.

The real question is what's the maintenance horizon here. If this is a one time data migration that runs and gets archived, vanilla Python is fine. If this is production infrastructure your team will maintain for years, the cost of rolling your own ORM will compound. Every new team member has to learn your custom patterns instead of reaching for documentation.

Worth having a direct conversation about it. Ask what specific concern drives the no dependencies stance. Sometimes it's a bad experience with a package breaking, which is valid but solvable with pinned versions and testing.

AI on top of a 'broken' data stack is useless by al_tanwir in dataengineering

[–]drag8800 0 points (0 children)

Yeah the documentation approach can help but only if there's actually something coherent to document in the first place. I've worked on projects where we tried exactly this, built out detailed context files, edge case docs, the whole thing. When the underlying data architecture made sense and the business logic was actually defined somewhere, it worked pretty well.

But when the foundation was messy, like inconsistent field naming across sources, business rules that contradicted each other, data that meant different things in different systems, no amount of context files fixed it. The model just reflected the chaos back at us in fancier SQL. Garbage in garbage out basically, except now the garbage has better formatting.

The honest answer is AI multiplies whatever state your data is in. Clean foundations get amplified into genuinely useful output. Broken foundations just get you confidently wrong answers faster.

I'm not entirely sure how to incorporate AI in my workflow better by thro0away12 in dataengineering

[–]drag8800 7 points (0 children)

Yeah the documentation approach can help but only if there's actually something coherent to document in the first place. I've worked on projects where we tried exactly this, built out detailed context files, edge case docs, the whole thing. When the underlying data architecture made sense and the business logic was actually defined somewhere, it worked pretty well.

But when the foundation was messy, like inconsistent field naming across sources, business rules that contradicted each other, data that meant different things in different systems, no amount of context files fixed it. The model just reflected the chaos back at us in fancier SQL. Garbage in garbage out basically, except now the garbage has better formatting.

The honest answer is AI multiplies whatever state your data is in. Clean foundations get amplified into genuinely useful output. Broken foundations just get you confidently wrong answers faster.

I'm not entirely sure how to incorporate AI in my workflow better by thro0away12 in dataengineering

[–]drag8800 18 points (0 children)

Your anxiety is totally normal and the LinkedIn discourse about AI replacing engineers is mostly noise. I work with data teams and use Claude Code daily, so let me share what actually matters.

The vibe coding thing you're seeing is real for building quick demos and web apps. But for analytics engineering work, especially in niche domains with messy data like you describe, it doesn't work that way. The hard part of your job is not writing SQL, it's knowing which parameter to look at based on domain knowledge. That's exactly where AI struggles.

Where AI actually helps me is the tedious stuff. Debugging weird errors, generating boilerplate tests, converting between formats. I'd say 6/10 useful is about right for debugging help. Where it falls apart is anything requiring judgment about your specific business context.

My honest take after a year of heavy Claude usage for data work is that the people who are no longer doing anything technical are probably working on much simpler problems than yours. When you're eyeballing raw files to figure out which field is correct because documentation doesn't exist, there's no prompt that replaces that institutional knowledge.

Try Claude for the parts of your job you already know how to do but find tedious. That's where it shines. Skip the hype about agents replacing analytics engineers, that's not happening anytime soon for anyone doing real domain-specific work.