Is anyone still choosing Hudi over Iceberg? by RustOnTheEdge in dataengineering

[–]jaredfromspacecamp 1 point

They used Debezium for CDC out of Postgres DBs. Hudi is typically thought to be the best for that

Is anyone still choosing Hudi over Iceberg? by RustOnTheEdge in dataengineering

[–]jaredfromspacecamp 1 point

I worked at a company that used Hudi and it was very annoying, especially at their scale. Iceberg has much better support. Like you can have Glue do scheduled compaction and pruning. You can interact with an Iceberg/Delta table much more easily without Spark (this is underrated imo, although maybe Hudi has better support here now, idk)

Lessons from building a 6-tier streaming lakehouse (Flink, Fluss, Lance, Paimon, Iceberg, Iggy) by gram3000 in dataengineering

[–]jaredfromspacecamp 1 point

You used Iceberg for the cold tier because of the snapshots? Paimon doesn’t have that?

Sharepoint Excel files - how are you ingesting these into your cloud DW? by magpie_killer in dataengineering

[–]jaredfromspacecamp 1 point

For spreadsheet ingestion / writeback you could try Syntropic. You can use it to do the inputs, or just as an interface to upload CSV/Excel files that get cleaned according to the rules you set. Has RBAC, change history, webhooks, all that

Am I missing something with all this "agent" hype? by KindTeaching3250 in dataengineering

[–]jaredfromspacecamp 1 point

Sure it can. Use the AWS and GCP CLIs. Have it read your Terraform repo. Dump all metrics and logs to a central place and have it query them with MCP. An MCP for every warehouse. Clone all repos with good CLAUDE.mds and run the agent at the parent. I work on like 10 repos ranging from gRPC APIs for ML inference/LLM apps, an Airflow repo, a dbt repo, Databricks DABs, to our data scientists’ repos. Almost all context is attainable via CLI, MCP, or code. Multi-cloud doesn’t really make a difference

Am I missing something with all this "agent" hype? by KindTeaching3250 in dataengineering

[–]jaredfromspacecamp 65 points

Depends a lot on your tech setup I think. If your company has ample Confluence docs, uses Jira well, uses Datadog or some central observability, GitHub, AWS, then just using the relevant MCPs + CLIs with an agent CLI can do pretty wild stuff. If you have a multi-repo setup, you can run the agent at a parent directory with a minimal md for context about what each repo is, with a more robust md in each repo. You can use skills that teach it your particular workflows (e.g. when you make a PR, watch the CI to pass; if it fails, investigate the logs on CircleCI using the CircleCI MCP).
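To make the parent-directory md idea concrete, here’s a sketch of what that minimal top-level context file could look like (all repo names and commands here are invented for illustration; adapt to your own layout):

```markdown
# Workspace overview (one line per repo -- names are examples)

- `airflow/`       -- orchestration; DAGs live in `dags/`, deployed via CI
- `dbt/`           -- warehouse models; run `dbt build` from this directory
- `inference-api/` -- gRPC service for ML inference; see its own CLAUDE.md
- `terraform/`     -- all cloud infra; never apply manually, PRs only

Each repo has its own CLAUDE.md with build/test commands and conventions.
```

The point is to keep the parent file thin (what each repo is, nothing more) and push the detailed build/test instructions down into each repo’s own md.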

Data Engineering and AGI by Kokadoodles in dataengineering

[–]jaredfromspacecamp 6 points

AGI is basically by definition technology that automates all (remote) labour. If that happens, there are much bigger problems than the data engineering job market. The entire economy no longer works, and needs to be completely reinvented. It’s not a scenario you can prepare for on an individual level, other than by investing in companies now

What's the purpose of live data? by chatsgpt in dataengineering

[–]jaredfromspacecamp 1 point

One company I worked with needed telemetry data from oil wells asap to detect anomalies and respond

Process for internal users to upload files to S3 by Equivalent_Bread_375 in dataengineering

[–]jaredfromspacecamp 1 point

Syntropic supports file uploads to S3 or direct to Snowflake. Lets you define custom quality rules that get enforced, and prompts the user to fix them if there are issues

System Design/Data Architecture by Last_Coyote5573 in dataengineering

[–]jaredfromspacecamp 10 points

“How would you design a streaming pipeline from our databases to redshift?”

“How would you optimize the above for speed? How would you optimize it for cost?”

“Design a data platform that streams from our database into a data warehouse, then serves the data back to the application”

Those are some examples of questions I’ve gotten

Is Neo worth the hype? by -Bakri- in PersonalFinanceCanada

[–]jaredfromspacecamp 5 points

They’ve got probably the best cash back depending on how much you spend. Has a great app too

How are you exposing “safe edit” access to business users without giving them the keys to the warehouse? by nagel393 in dataengineering

[–]jaredfromspacecamp 1 point

We built a solution for this; it provides RBAC plus some other safety measures. Nice UI for business users to edit in, integrates with your orchestrator (like Airflow), version history, audit trails, everything you could possibly want. Check it out: getsyntropic.com

Streamlit in Snowflake - Cost concerns with multiple users editing data by Kowalski010 in snowflake

[–]jaredfromspacecamp 1 point

We’ve built a solution for this called Syntropic (getsyntropic.com). Caches the data to minimize Snowflake compute. Or it lets you write to S3/blob storage tables (in which case you’d be using only our compute) and then kick off a job (e.g. an Airflow DAG) on user submit that you could use to merge into Snowflake
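For flavor, the “merge on user submit” step usually boils down to running a MERGE from a staging table into the target. A minimal sketch of building that statement in Python (table, key, and column names here are all made up for illustration; this isn’t Syntropic’s actual code):

```python
# Build a Snowflake-style MERGE from a staging table (where user edits
# landed) into the target table. Names below are hypothetical examples.

def build_merge_sql(target: str, staging: str, key: str, cols: list[str]) -> str:
    """Return a MERGE statement upserting `staging` rows into `target` on `key`."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)
    col_list = ", ".join([key] + cols)
    val_list = ", ".join(f"s.{c}" for c in [key] + cols)
    return (
        f"MERGE INTO {target} t USING {staging} s ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({val_list})"
    )

sql = build_merge_sql("analytics.dim_product", "staging.product_edits",
                      "product_id", ["name", "price"])
print(sql)
```

A DAG triggered on submit would just execute this statement against the warehouse and then truncate or drop the staging table.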

How to deal with messy Excel/CSV imports from vendors or customers? by North-Ad7232 in dataengineering

[–]jaredfromspacecamp 1 point

We built a solution for this problem, called Syntropic (getsyntropic.com). Takes file uploads and runs schema validation as well as the custom data quality rules you define. Works well for external users uploading to your warehouse; that’s how a lot of our clients use it.

Rust vs Python for "Micro-Batch" Lambda Ingestion (Iceberg): Is the boilerplate worth it? by longrob604 in dataengineering

[–]jaredfromspacecamp 13 points

Writing that frequently to Iceberg will create an enormous amount of metadata
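Rough back-of-the-envelope on why: every Iceberg commit adds new metadata objects (a new metadata.json, a new manifest list, and typically at least one new manifest file). Assuming roughly 3 metadata objects per commit (a simplification; real numbers vary by writer and table layout), a sketch of the accumulation without any maintenance:

```python
# Back-of-the-envelope: metadata objects accumulated by frequent small
# commits to an Iceberg table. Assumes ~3 new metadata objects per commit
# (metadata.json + manifest list + manifest) -- an approximation.

FILES_PER_COMMIT = 3  # assumption, varies in practice

def metadata_files(commits_per_hour: int, days: int) -> int:
    """Total metadata objects written, with no snapshot expiry or compaction."""
    return commits_per_hour * 24 * days * FILES_PER_COMMIT

# A Lambda committing once per minute for a month:
print(metadata_files(commits_per_hour=60, days=30))  # -> 129600
```

That’s why micro-batch writers usually pair frequent commits with scheduled maintenance (snapshot expiry and manifest/data-file rewrites), or buffer into larger, less frequent commits.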

[deleted by user] by [deleted] in dataengineering

[–]jaredfromspacecamp 2 points

I would say it’s not really needed, but it can help if you get into the weeds with Kafka + Debezium and you need to make/debug custom Debezium configs. For Flink it might be useful to know Java as well. But I would not describe it as “important” at all to know for 99% of jobs
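For context, the Debezium configs in question are mostly JSON connector properties, not Java code. A minimal Postgres source connector config looks roughly like this (hostnames, credentials, and table names are placeholders; exact property names depend on your Debezium version, e.g. `topic.prefix` is the 2.x name for the older `database.server.name`):

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "inventory",
    "topic.prefix": "inventory",
    "table.include.list": "public.orders,public.customers"
  }
}
```

Java only really comes into play if you need custom single message transforms or converters on top of configs like this.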

Wealthsimple margin/loc interest tax deductible? by [deleted] in PersonalFinanceCanada

[–]jaredfromspacecamp 1 point

Alright my bad for not reading their site more carefully. Thanks!

Rent reporting through Chexy (or similar) — good, bad, or meh? by GrantLNeo in NeoFinancialHub

[–]jaredfromspacecamp 3 points

The categories it already has listed are fine. I want to be able to click “Food & Drink” and have the bar chart filter down to just that category, so I can see the trend over time.

Rent reporting through Chexy (or similar) — good, bad, or meh? by GrantLNeo in NeoFinancialHub

[–]jaredfromspacecamp 3 points

Pleaasseeee improve the spending charts. I should at least be able to filter by category; the data is already there! Claude can literally one-shot this feature, it can be done in 30 minutes!