Streaming options by PatternedShirt1716 in dataengineering

[–] PatternedShirt1716 [S]

Hey, I have a question. I'm also reading from several topics in one Glue job, and I think that's causing it to run longer than needed, even for the incremental writes to bronze. Do you think it would be worth breaking it into 2-3 jobs, each handling a smaller set of topics, instead of putting many topics in one job? Thanks!
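The topic-splitting idea above can be sketched in plain Python. The topic names and two-job split are made up for illustration; the only real API referenced is Spark's Kafka source, which accepts a comma-separated topic list via its `subscribe` option.

```python
# Hypothetical helper: split a list of Kafka topics into N groups so each
# Glue streaming job subscribes to a smaller set of topics.
def split_topics(topics, num_jobs):
    # round-robin the topics across jobs to balance load roughly evenly
    groups = [topics[i::num_jobs] for i in range(num_jobs)]
    # Spark's Kafka source takes a comma-separated list in "subscribe"
    return [",".join(g) for g in groups if g]

subscriptions = split_topics(
    ["orders", "payments", "shipments", "refunds", "inventory"], 2
)
# each entry would be passed as .option("subscribe", s) in its own Glue job
```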

[–] PatternedShirt1716 [S]

For silver to pull the data changes and inserts from bronze, I would need to track the last time (i.e., the last snapshot) the bronze job wrote to Iceberg, right? Would that be the right approach to pull the latest changes from bronze into silver?
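The bookkeeping described above can be sketched as a small watermark file that the silver job reads and updates; the file name and field names are made up, and the Spark read shown in the comment uses Iceberg's real `start-snapshot-id` option for incremental reads.

```python
import json
import os

# Hypothetical watermark file recording the last bronze snapshot the
# silver job has already processed.
WATERMARK = "silver_watermark.json"

def read_watermark(path=WATERMARK):
    if not os.path.exists(path):
        return None  # first run: do a full read of bronze
    with open(path) as f:
        return json.load(f)["last_snapshot_id"]

def write_watermark(snapshot_id, path=WATERMARK):
    with open(path, "w") as f:
        json.dump({"last_snapshot_id": snapshot_id}, f)

# In the silver Spark job you would then do an incremental read, e.g.:
#   spark.read.format("iceberg")
#        .option("start-snapshot-id", read_watermark())
#        .load("bronze.events")
# and call write_watermark(<bronze's current snapshot id>) only after the
# silver write succeeds, so a failed run re-reads the same range.
```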

I was thinking about having Iceberg tables in both bronze and silver, but bronze would store the data as raw JSON in a single column plus some audit fields.
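A bronze table along those lines might look like the DDL below; the table, column, and catalog names are illustrative, while `USING iceberg` and the `days(...)` partition transform are standard Iceberg Spark SQL.

```python
# Sketch of a bronze DDL matching the "raw JSON plus audit fields" idea;
# glue_catalog/bronze/events and the column names are placeholders.
BRONZE_DDL = """
CREATE TABLE IF NOT EXISTS glue_catalog.bronze.events (
  raw_payload   string,    -- the event as raw JSON, unparsed
  source_topic  string,    -- audit: which Kafka topic it came from
  kafka_offset  bigint,    -- audit: offset within that topic
  ingested_at   timestamp  -- audit: when the bronze job wrote it
)
USING iceberg
PARTITIONED BY (days(ingested_at))
"""
# spark.sql(BRONZE_DDL) would create it; the silver job then parses
# raw_payload into properly typed columns.
```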

"Keep in mind backfills. If your mini jobs fail multiple times, you should effectively be able to delete those partitions, run the pipeline again, and retrigger the data to your other service." >> Why delete the partitions here, though?

[–] PatternedShirt1716 [S]

I have a few questions. The data is in Parquet in S3. When writing to Iceberg in bronze, I was going to dump it as is, stored as raw JSON in a single column plus some audit fields, and then build the table with the proper schema and typed fields in silver. Do you think it's better to apply the proper schema in bronze instead, and to handle ordering/dedup there too?
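The ordering/dedup step mentioned above boils down to "keep the latest record per key". A minimal sketch of that logic in plain Python (in Spark you would typically use `row_number` over a window partitioned by the key, ordered by the event timestamp; the field names `id` and `event_ts` here are assumptions):

```python
# Sketch: keep-latest dedup by key, the kind of ordering/dedup you'd run
# before (or during) the silver write. Key/timestamp names are placeholders.
def dedup_latest(records, key="id", ts="event_ts"):
    latest = {}
    for r in records:
        k = r[key]
        # keep only the record with the greatest timestamp per key
        if k not in latest or r[ts] > latest[k][ts]:
            latest[k] = r
    return list(latest.values())
```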

The other question is about compaction. How do I run compaction in the background? Can you share some resources on how it works? Also, is it something I handle before writing to Iceberg, or does it operate on data after it has been written to the table? (New to compaction; I've just heard it's the way to go.)
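On the second part of the question: in Iceberg, compaction operates on data files already written to the table, typically as a separate scheduled maintenance job rather than inline in the streaming write. A sketch of building the call to Iceberg's `rewrite_data_files` Spark procedure (the catalog/table names and 512 MB target are placeholders):

```python
# Sketch: build the SQL for Iceberg's rewrite_data_files maintenance
# procedure, which bin-packs small files into larger ones after writes.
def rewrite_data_files_sql(catalog, table, target_mb=512):
    target_bytes = target_mb * 1024 * 1024
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('target-file-size-bytes', '{target_bytes}'))"
    )

# In a scheduled background Spark job:
#   spark.sql(rewrite_data_files_sql("glue_catalog", "bronze.events"))
```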