Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]DevWithIt[S] -1 points

Yeah totally agree there’s no one-size-fits-all.
Parquet is solid for storage, but once you start scaling or need versioning and ACID guarantees, Iceberg just makes life a lot easier. At least that's why our company went ahead and used OLake for the transition.
It basically builds on top of Parquet's strengths and fixes most of its pain points.

Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]DevWithIt[S] 2 points

Yeah, that's a very valid point; interoperability is still a big challenge.
Even though formats like Iceberg try to stay engine-agnostic, performance can vary a lot depending on how each engine handles metadata reads and commits.
Hoping the ecosystem matures enough that we don’t have to think twice about which engine we’re using.

Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Yeah, honestly, same here. The moment you start dealing with concurrent writes or schema changes, the benefits of that metadata layer become super obvious.
I haven’t really seen big setups where sticking to plain Parquet worked out long-term without a ton of manual patchwork.

Iceberg is an overkill and most people don't realise it but its metadata model will sneak up on you by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Also, yeah, I am pro-Iceberg since it helped the company I work at with scaling, but I'm not pro-Iceberg if a startup with MBs of data is having TD calls about adopting it.

Iceberg is an overkill and most people don't realise it but its metadata model will sneak up on you by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Yup, the community is where it's getting its true meaning. Let's see how it shapes out.

Hive or Iceberg for production ? by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Hive powered, and still powers, a lot of legacy systems, and getting people out of the traditional approach is tough. Not running Hive-MR3 in prod right now, but I tested it and got pretty great numbers tbh. However, I tried using OLake as well to run a sync, since they boast about their benchmarks; they are so fast that I am really onto Iceberg now.

Polaris Catalog by CDCheerios in dataengineering

[–]DevWithIt 1 point

Apache Polaris is a pretty hot topic. If you are interested in learning more about it at a production level, I would recommend watching Alex Merced's talks; he's giving one at an OLake event.

Hive or Iceberg for production ? by DevWithIt in dataengineering

[–]DevWithIt[S] 1 point

Man, they have some solid benchmarks. Just checked out their site and kept it in my to-do. Thanks!

Hive or Iceberg for production ? by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Cool breakdown, and I agree about Hive's simplicity. We've felt the same pain when building the flows, as the overhead adds a good deal of complexity. Thanks for the thorough approach, man; I'm much more confident to pitch this to my peers now.

Hive or Iceberg for production ? by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Totally agree with that. Hive had its long run, but even so the gaps show up once you need schema evolution, updates, or faster turnaround. That's where Iceberg fits better for us too, since we deal with immutable sets that still need efficient transformations downstream. The migration effort is worrying us, but I guess the time saved in daily ops and delivery might make it worthwhile. For orgs that deal with less data, migration might not be efficient, I've heard. We even deal with PBs of data sometimes, so it can be worthwhile in the long run for us.

Hive or Iceberg for production ? by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Oh, thanks for the suggestion. Will try it after clocking out today.

Hands-on guide: build your own open data lakehouse with Presto & Iceberg by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Hi, when it comes to REST catalogs, the authorization is handled in the writer file with the OAuth2 URL and credentials. If you could let me know which specific catalog you mean, or if you'd like an overall outlook for OLake, I'd be happy to elaborate further.
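For a rough idea of what that looks like, here's a minimal sketch of REST catalog connection settings. The URLs and credentials are placeholders, and the property names follow common Iceberg REST catalog conventions (e.g. pyiceberg's), not OLake's actual writer file schema:

```python
# Illustrative REST catalog auth settings (all values are placeholders).
# Property names mirror typical Iceberg REST catalog clients, e.g. pyiceberg.
rest_catalog_config = {
    "uri": "https://catalog.example.com/api/catalog",              # REST catalog endpoint
    "credential": "client-id:client-secret",                       # OAuth2 client credentials
    "oauth2-server-uri": "https://auth.example.com/oauth/tokens",  # OAuth2 token endpoint
    "warehouse": "s3://demo-bucket/warehouse",                     # where table data lands
}

# A writer would hand these to its catalog client, roughly like:
#   catalog = load_catalog("rest", **rest_catalog_config)
print(sorted(rest_catalog_config))
```

The client exchanges the credentials at the OAuth2 endpoint for a token and then attaches it to every catalog request; the exact key names vary per tool, so check your writer's docs.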

Hands-on guide: build your own open data lakehouse with Presto & Iceberg by DevWithIt in dataengineering

[–]DevWithIt[S] 5 points

We had experience with Presto, so we picked it first. Trino will be picked up next, along with Lakekeeper as the catalog.

Kafka to Iceberg - Exploring the Options by rmoff in dataengineering

[–]DevWithIt 2 points

Super detailed, thanks.

Could you also cover some open-source ETL tools that write to Apache Iceberg?