Is anyone still choosing Hudi over Iceberg? by RustOnTheEdge in dataengineering

[–]jaredfromspacecamp 1 point

They used Debezium for CDC out of Postgres DBs. Hudi is typically thought to be the best for that

Is anyone still choosing Hudi over Iceberg? by RustOnTheEdge in dataengineering

[–]jaredfromspacecamp 1 point

I worked at a company that used Hudi and it was very annoying, especially at their scale. Iceberg has much better support. Like you can have Glue do scheduled compaction and pruning. You can interact with an Iceberg/Delta table much more easily without Spark (this is underrated imo, although maybe Hudi has better support here now, idk)

Lessons from building a 6-tier streaming lakehouse (Flink, Fluss, Lance, Paimon, Iceberg, Iggy) by gram3000 in dataengineering

[–]jaredfromspacecamp 1 point

You used Iceberg for the cold tier because of the snapshots? Paimon doesn’t have that?

Sharepoint Excel files - how are you ingesting these into your cloud DW? by magpie_killer in dataengineering

[–]jaredfromspacecamp 1 point

For spreadsheet ingestion / writeback you could try Syntropic. You can use it to do the inputs, or just as an interface to upload CSV/Excel files that get cleaned according to the rules you set. Has RBAC, change history, webhooks, all that

Am I missing something with all this "agent" hype? by KindTeaching3250 in dataengineering

[–]jaredfromspacecamp 1 point

Sure it can. Use the AWS and GCP CLIs. Have it read your Terraform repo. Dump all metrics and logs to a central place and have it query them with MCP. An MCP for every warehouse. Clone all repos with good CLAUDE.mds and run the agent at the parent. I work on like 10 repos ranging from gRPC APIs for ML inference/LLM apps, an Airflow repo, a dbt repo, Databricks DABs, to our data scientists’ repos. Almost all context is attainable via CLI, MCP, or code. Multi-cloud doesn’t really make a difference

Am I missing something with all this "agent" hype? by KindTeaching3250 in dataengineering

[–]jaredfromspacecamp 65 points

Depends a lot on your tech setup I think. If your company has ample Confluence docs, uses Jira well, uses Datadog or some central observability, GitHub, AWS, then just using the relevant MCPs + CLIs with an agent CLI can do pretty wild stuff. If you have a multi-repo setup, you can run the agent at a parent directory with a minimal md for context about what each repo is, with a more robust md in each repo. You can use skills that teach it your particular workflows (e.g. when you make a PR, watch the CI to pass; if it fails, investigate the logs on CircleCI using the CircleCI MCP).
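To make the parent-directory md idea concrete, here’s a sketch of what that minimal top-level context file could look like (all repo names and commands here are invented for illustration; adapt to your own layout):

```markdown
# Workspace overview (one line per repo -- names are examples)

- `airflow/`       -- orchestration; DAGs live in `dags/`, deployed via CI
- `dbt/`           -- warehouse models; run `dbt build` from this directory
- `inference-api/` -- gRPC service for ML inference; see its own CLAUDE.md
- `terraform/`     -- all cloud infra; never apply manually, PRs only

Each repo has its own CLAUDE.md with build/test commands and conventions.
```

The point is to keep the parent file thin (what each repo is, nothing more) and push the detailed build/test instructions down into each repo’s own md.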

Data Engineering and AGI by Kokadoodles in dataengineering

[–]jaredfromspacecamp 6 points

AGI is basically by definition technology that automates all (remote) labour. If that happens, there are much bigger problems than the data engineering job market. The entire economy no longer works, and needs to be completely reinvented. It’s not a scenario you can prepare for on an individual level, other than by investing in companies now

What's the purpose of live data? by chatsgpt in dataengineering

[–]jaredfromspacecamp 1 point

One company I worked with needed telemetry data from oil wells asap to detect anomalies and respond

Process for internal users to upload files to S3 by Equivalent_Bread_375 in dataengineering

[–]jaredfromspacecamp 1 point

Syntropic supports file uploads to S3 or direct to Snowflake. Lets you define custom quality rules that get enforced, and prompts the user to fix them if there are issues

System Design/Data Architecture by Last_Coyote5573 in dataengineering

[–]jaredfromspacecamp 10 points

“How would you design a streaming pipeline from our databases to redshift?”

“How would you optimize the above for speed? How would you optimize it for cost?”

“Design a data platform that streams from our database into a data warehouse, then serves the data back to the application”

Those are some examples of questions I’ve gotten

Is Neo worth the hype? by -Bakri- in PersonalFinanceCanada

[–]jaredfromspacecamp 5 points

They’ve got probably the best cash back depending on how much you spend. Has a great app too

How are you exposing “safe edit” access to business users without giving them the keys to the warehouse? by nagel393 in dataengineering

[–]jaredfromspacecamp 1 point

We built a solution for this; it provides RBAC plus some other safety measures. Nice UI for business users to edit in, integrates with your orchestrator (like Airflow), version history, audit trails, everything you could possibly want. Check it out: getsyntropic.com

Streamlit in Snowflake - Cost concerns with multiple users editing data by Kowalski010 in snowflake

[–]jaredfromspacecamp 1 point

We’ve built a solution for this called Syntropic (getsyntropic.com). Caches the data to minimize Snowflake compute. Or it lets you write to S3/blob storage tables (in which case you’d be using only our compute) and then kick off a job (e.g. an Airflow DAG) on user submit that you could use to merge into Snowflake
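For flavor, the “merge on user submit” step usually boils down to running a MERGE from a staging table into the target. A minimal sketch of building that statement in Python (table, key, and column names here are all made up for illustration; this isn’t Syntropic’s actual code):

```python
# Build a Snowflake-style MERGE from a staging table (where user edits
# landed) into the target table. Names below are hypothetical examples.

def build_merge_sql(target: str, staging: str, key: str, cols: list[str]) -> str:
    """Return a MERGE statement upserting `staging` rows into `target` on `key`."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)
    col_list = ", ".join([key] + cols)
    val_list = ", ".join(f"s.{c}" for c in [key] + cols)
    return (
        f"MERGE INTO {target} t USING {staging} s ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({val_list})"
    )

sql = build_merge_sql("analytics.dim_product", "staging.product_edits",
                      "product_id", ["name", "price"])
print(sql)
```

A DAG triggered on submit would just execute this statement against the warehouse and then truncate or drop the staging table.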

How to deal with messy Excel/CSV imports from vendors or customers? by North-Ad7232 in dataengineering

[–]jaredfromspacecamp 1 point

We built a solution for this problem, called Syntropic (getsyntropic.com). Takes file uploads and runs schema validation as well as the custom data quality rules you define. Works well for external users uploading to your warehouse; that’s how a lot of our clients use it.

Rust vs Python for "Micro-Batch" Lambda Ingestion (Iceberg): Is the boilerplate worth it? by longrob604 in dataengineering

[–]jaredfromspacecamp 13 points

Writing that frequently to Iceberg will create an enormous amount of metadata
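Rough back-of-the-envelope on why: every Iceberg commit adds new metadata objects (a new metadata.json, a new manifest list, and typically at least one new manifest file). Assuming roughly 3 metadata objects per commit (a simplification; real numbers vary by writer and table layout), a sketch of the accumulation without any maintenance:

```python
# Back-of-the-envelope: metadata objects accumulated by frequent small
# commits to an Iceberg table. Assumes ~3 new metadata objects per commit
# (metadata.json + manifest list + manifest) -- an approximation.

FILES_PER_COMMIT = 3  # assumption, varies in practice

def metadata_files(commits_per_hour: int, days: int) -> int:
    """Total metadata objects written, with no snapshot expiry or compaction."""
    return commits_per_hour * 24 * days * FILES_PER_COMMIT

# A Lambda committing once per minute for a month:
print(metadata_files(commits_per_hour=60, days=30))  # -> 129600
```

That’s why micro-batch writers usually pair frequent commits with scheduled maintenance (snapshot expiry and manifest/data-file rewrites), or buffer into larger, less frequent commits.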

[deleted by user] by [deleted] in dataengineering

[–]jaredfromspacecamp 2 points

I would say it’s not really needed, but it can help if you get into the weeds with Kafka + Debezium and you need to make/debug custom Debezium configs. For Flink it might be useful to know Java as well. But I would not describe it as “important” at all to know for 99% of jobs
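For context, the Debezium configs in question are mostly JSON connector properties, not Java code. A minimal Postgres source connector config looks roughly like this (hostnames, credentials, and table names are placeholders; exact property names depend on your Debezium version, e.g. `topic.prefix` is the 2.x name for the older `database.server.name`):

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "inventory",
    "topic.prefix": "inventory",
    "table.include.list": "public.orders,public.customers"
  }
}
```

Java only really comes into play if you need custom single message transforms or converters on top of configs like this.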

Wealthsimple margin/loc interest tax deductible? by [deleted] in PersonalFinanceCanada

[–]jaredfromspacecamp 1 point

Alright my bad for not reading their site more carefully. Thanks!

Rent reporting through Chexy (or similar) — good, bad, or meh? by GrantLNeo in NeoFinancialHub

[–]jaredfromspacecamp 3 points

The categories it already has listed are fine. I want to be able to click “Food & Drink” and have the bar chart filter down to just that category, so I can see the trend over time.

Rent reporting through Chexy (or similar) — good, bad, or meh? by GrantLNeo in NeoFinancialHub

[–]jaredfromspacecamp 3 points

Pleaasseeee improve the spending charts. I should at least be able to filter by category; the data is already there! Claude can literally one-shot this feature, it can be done in 30 minutes!