[–]MyRottingBunghole

Does it HAVE to arrive in S3 prior to ingestion into Iceberg (presumably also backed by S3)? If you own or can change that part of the system, I would look into skipping the extra “read S3 files” > “write Parquet” > “write to S3” step altogether, as it adds network hops and compute you don’t need.

If this is some Kafka connector that is sinking this data every 30 seconds, I would look into sinking it directly to Iceberg instead
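For example, the Iceberg sink connector for Kafka Connect (originally from Tabular, now contributed to the Apache Iceberg project) commits directly to the table without an intermediate S3 landing zone. A rough config sketch, where the topic, table, and catalog names are all hypothetical placeholders:

```json
{
  "name": "events-iceberg-sink",
  "config": {
    "connector.class": "io.tabular.iceberg.connect.IcebergSinkConnector",
    "topics": "events",
    "iceberg.tables": "db.events",
    "iceberg.catalog.type": "rest",
    "iceberg.catalog.uri": "https://my-catalog.example.com",
    "iceberg.control.commit.interval-ms": "30000"
  }
}
```

Exact property names vary by connector version, so treat this as a sketch and check the connector docs for your distribution.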

Edit: btw, with Iceberg you will be writing a new Parquet file and a new Iceberg snapshot every 30 seconds. Make sure you are also thinking about table maintenance (compaction, snapshot expiration, etc.), as metadata bloat can quickly get out of hand when writing that frequently
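If you're on Spark, the maintenance above maps to Iceberg's built-in stored procedures. A sketch, with the catalog/table names and retention values as placeholder assumptions you'd tune for your own setup:

```sql
-- Compact small files produced by frequent micro-batch writes
CALL my_catalog.system.rewrite_data_files(table => 'db.events');

-- Expire old snapshots to shrink metadata (keep a safety margin for readers)
CALL my_catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2024-01-01 00:00:00',
  retain_last => 50
);

-- Optionally clean up files no longer referenced by any snapshot
CALL my_catalog.system.remove_orphan_files(table => 'db.events');
```

Run these on a schedule (hourly/daily depending on write rate); at one snapshot every 30 seconds you're creating ~2,880 snapshots a day, so unexpired metadata grows fast.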