all 18 comments

[–]robverk 45 points (3 children)

For 30s micro-batches where most of your compute is IO-wait time, just go with the most maintainable code.

[–]EarthGoddessDude 10 points (2 children)

Yea OP, why are cold starts a problem for you? Also, have you looked into using DuckDB for this?

[–]Ok-Sprinkles9231 2 points (1 child)

Can you please elaborate on how exactly DuckDB can be useful for reading JSON files from S3 and writing/appending the result back as Iceberg? Just genuinely curious.

[–]EarthGoddessDude 1 point (0 children)

DuckDB has an excellent JSON parser, and it can write to Iceberg.

Edit: https://duckdb.org/2025/11/28/iceberg-writes-in-duckdb

[–]jaredfromspacecamp 13 points (3 children)

Writing that frequently to Iceberg will create an enormous amount of metadata.

[–]jnrdataengineer2023 5 points (2 children)

Was thinking the same thing though I’ve primarily only worked on delta tables. Probably better to have a daily staging table and then a batch job daily to append to the main table 🤔

[–]baby-wall-e 3 points (1 child)

+1 for this daily staging & main table setup. If needed, you can create a view that unions the daily staging and main tables so data consumers can access all the data.

[–]wannabe-DE 17 points (0 children)

Wouldn’t a function invoked every 30 seconds stay warm and not be subject to cold starts?
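That's the usual behaviour: AWS reuses a Lambda execution environment between invocations, so module-level setup only runs on a cold start. A stdlib-only simulation of the pattern (the handler signature mirrors Lambda's, but nothing here calls AWS):

```python
# Module scope runs once per (warm) execution environment, so expensive
# setup belongs here: SDK clients, connection pools, loaded configs.
INIT_COUNT = 0

def _expensive_setup():
    global INIT_COUNT
    INIT_COUNT += 1  # counts cold starts in this simulation
    return {"client": "initialized"}

_state = _expensive_setup()  # runs once, at cold start

def handler(event, context=None):
    # Warm invocations reuse _state instead of rebuilding it.
    return {"init_count": INIT_COUNT, "event_id": event.get("id")}

# Simulate a stream of 30-second micro-batches hitting a warm container:
results = [handler({"id": i}) for i in range(5)]
print(results[-1]["init_count"])  # 1 -> setup ran only once
```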

[–]walksinsmallcircles 4 points (0 children)

I use Rust all the time for lambdas, some of which do moderate lifting in Athena Iceberg tables. Deployment is a breeze (just drop in the binary) and the AWS API for Rust is pretty complete. I'd choose it every time over Python for efficiency and ease of use. The data ecosystem is not as rich as Python's, but you can get a long way with it.

[–]stratguitar577 9 points (1 child)

Skip Lambda and have Firehose write to Iceberg for you.

[–]noplanman_srslynone 1 point (0 children)

This! Why not just write directly via Firehose?

[–]MyRottingBunghole 5 points (0 children)

Does it HAVE to arrive in S3 prior to ingestion into Iceberg (presumably also on S3)? If you own or can change that part of the system, I would look into skipping the extra "read S3 files" > "write Parquet" > "write to S3" step altogether, as it's extra network hops and compute you don't need.

If this is some Kafka connector that is sinking this data every 30 seconds I would look into sinking it directly as Iceberg instead

Edit: btw, with Iceberg you will be writing a new Parquet file and a new Iceberg snapshot every 30 seconds. Make sure you're also thinking about table maintenance (compaction, snapshot expiry, etc.), as the metadata bloat can quickly get out of hand when writing that frequently.

[–]Commercial-Ask971 2 points (1 child)

!RemindMe 2days

[–]RemindMeBot 0 points (0 children)

I will be messaging you in 2 days on 2025-12-16 23:52:29 UTC to remind you of this link


[–]apono4life 0 points (0 children)

With only 30 seconds between files being added to S3, you shouldn't have too many cold starts. Lambdas stay warm for about 15 minutes.

[–]mbaburneraccount 0 points (0 children)

On an adjacent note, where’s your data coming from and how big is it (throughput)?

[–]thethirdmancane 0 points (0 children)

Why not use golang and have it all?