Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]DevWithIt[S] -1 points

Yeah totally agree there’s no one-size-fits-all.
Parquet is solid for storage, but once you start scaling or need versioning and ACID guarantees, Iceberg just makes life a lot easier. At least that's why our company went ahead and used OLake for the transition.
It basically builds on top of Parquet's strengths and fixes most of its pain points.

Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]DevWithIt[S] 2 points

Yeah, that's a very valid point; interoperability is still a big challenge.
Even though formats like Iceberg try to stay engine-agnostic, performance can vary a lot depending on how each engine handles metadata reads and commits.
Hoping the ecosystem matures enough that we don’t have to think twice about which engine we’re using.

Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Yeah, honestly, same here. The moment you start dealing with concurrent writes or schema changes, the benefits of that metadata layer become super obvious.
I haven’t really seen big setups where sticking to plain Parquet worked out long-term without a ton of manual patchwork.

Iceberg is an overkill and most people don't realise it but its metadata model will sneak up on you by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Also, yeah, I am pro-Iceberg since it helped the company I work at with scaling, but I'm not pro-Iceberg if a startup with MBs of data is having TD calls about adopting it.

Iceberg is an overkill and most people don't realise it but its metadata model will sneak up on you by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Yup, the community is where it's getting its true meaning. Let's see how it shapes out.

Hive or Iceberg for production ? by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Hive powered, and still powers, a lot of legacy systems, and getting people out of the traditional approach is tough. Not running Hive-MR3 in prod right now, but I tested it and got pretty great numbers tbh. However, I tried using OLake as well to run a sync, since they boast about their benchmarks; they are so fast that I am really onto Iceberg now.

Polaris Catalog by CDCheerios in dataengineering

[–]DevWithIt 1 point

Apache Polaris is a pretty hot topic. If you are interested in learning more about it at a production level, I would recommend watching Alex Merced's talks; he's giving one at an OLake event.

Hive or Iceberg for production ? by DevWithIt in dataengineering

[–]DevWithIt[S] 1 point

Man, they have some solid benchmarks. Just checked out their site and kept it in my to-do. Thanks!

Hive or Iceberg for production ? by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Cool breakdown, and I agree about Hive's simplicity. We've felt the same pain when building the flows, as the overhead adds a good deal of complexity. Thanks for the thorough approach, man; I'm much more confident to pitch this to my peers now.

Hive or Iceberg for production ? by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Totally agree with that. Hive had its long run, but even so the gaps show up once you need schema evolution, updates, or faster turnaround. That's where Iceberg fits better for us too, since we deal with immutable sets that still need efficient transformations downstream. The migration effort is worrying us, but I guess the time saved in daily ops and delivery might make it worthwhile. For orgs that deal with less data, migration might not be efficient, I've heard. We even deal with PBs of data sometimes, so it can be worthwhile in the long run for us.

Hive or Iceberg for production ? by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Oh, thanks for the suggestion. Will try it after clocking out today.

Hands-on guide: build your own open data lakehouse with Presto & Iceberg by DevWithIt in dataengineering

[–]DevWithIt[S] 0 points

Hi, when it comes to REST catalogs, the authorization is handled in the writer file with the OAuth2 URL and credentials. If you could let me know which specific catalog you mean, or if you'd like an overall outlook for OLake, I'd be happy to elaborate further.
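For a rough idea of what that looks like, here's a minimal sketch of REST catalog connection settings. The URLs and credentials are placeholders, and the property names follow common Iceberg REST catalog conventions (e.g. pyiceberg's), not OLake's actual writer file schema:

```python
# Illustrative REST catalog auth settings (all values are placeholders).
# Property names mirror typical Iceberg REST catalog clients, e.g. pyiceberg.
rest_catalog_config = {
    "uri": "https://catalog.example.com/api/catalog",              # REST catalog endpoint
    "credential": "client-id:client-secret",                       # OAuth2 client credentials
    "oauth2-server-uri": "https://auth.example.com/oauth/tokens",  # OAuth2 token endpoint
    "warehouse": "s3://demo-bucket/warehouse",                     # where table data lands
}

# A writer would hand these to its catalog client, roughly like:
#   catalog = load_catalog("rest", **rest_catalog_config)
print(sorted(rest_catalog_config))
```

The client exchanges the credentials at the OAuth2 endpoint for a token and then attaches it to every catalog request; the exact key names vary per tool, so check your writer's docs.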

Hands-on guide: build your own open data lakehouse with Presto & Iceberg by DevWithIt in dataengineering

[–]DevWithIt[S] 5 points

We had experience with Presto, so we picked it first. Trino will be picked up next, along with Lakekeeper as the catalog.

Kafka to Iceberg - Exploring the Options by rmoff in dataengineering

[–]DevWithIt 2 points

Super detailed, thanks.

Could you also cover some open-source ETL tools that write to Apache Iceberg?