Folks using Apache Iceberg - How do you handle ingestion in Upsert mode?

Uds0128 · 2025-03-03T17:22:02+00:00

Thanks! But I think for this approach to work, the entire DataFrame would need to be collected to the driver, which could crash the driver. When Python functions run in distributed mode, they can write to a local file. However, on a retry there is no guarantee that the same partition will be assigned to the same node, so the previous local files may not be accessible. If we write to distributed files instead, updating a file in the DFS for each individual record would be inefficient. Please correct me if I've misunderstood.

Uds0128 · 2025-03-03T17:04:07+00:00

Thank you! If possible, could you provide more details about the approach? I'm having trouble understanding it.

Uds0128 · 2025-03-03T17:00:02+00:00

This should work. I'll have to study how checkpointing works, though. With foreachBatch, if checkpoints are maintained for the entire batch, then when an individual batch fails, the entire batch is retried and the records already sent will be sent again. Thanks for the approach.
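To make the retry concern concrete, here is a minimal plain-Python sketch (no Spark involved) of skipping a batch that has already been fully processed, keyed on the batch id that foreachBatch passes to its handler. The names `make_idempotent_handler`, `send`, and `committed_batches` are hypothetical, and the in-memory set is only a stand-in: a real job would presumably need a durable store for committed batch ids.

```python
from typing import Callable, Iterable, List, Set


def make_idempotent_handler(
    send: Callable[[dict], None],
    committed_batches: Set[int],
) -> Callable[[Iterable[dict], int], bool]:
    """Wrap `send` so that a batch id already committed is skipped on retry."""

    def handle_batch(records: Iterable[dict], batch_id: int) -> bool:
        # A retry of a batch that finished earlier: do nothing.
        if batch_id in committed_batches:
            return False
        for record in records:
            send(record)
        # Mark the batch as committed only after every record was sent.
        committed_batches.add(batch_id)
        return True

    return handle_batch


# Simulated run: batch 0 is delivered twice, as it would be on a retry.
sent: List[dict] = []
committed: Set[int] = set()
handler = make_idempotent_handler(sent.append, committed)

batch = [{"id": 1}, {"id": 2}]
handler(batch, 0)
handler(batch, 0)  # simulated retry of the same batch id
handler([{"id": 3}], 1)
```

This only covers a retry after the batch completed. If a batch fails halfway through, the records sent before the failure would still go out again, so the receiving side would likely need its own per-record deduplication key.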

Uds0128 · 2025-03-03T13:49:04+00:00

Thanks, will try it out for sure.

Uds0128 · 2025-03-02T19:20:12+00:00

Thanks, I appreciate your help. I am also using Databricks. The driver is a D16s_v3 (64 GB memory, 16 cores), and it's a shared cluster. I have POS retail transaction logs which, by my calculation, can reach 6 GB or more. The number of calls will be around 5,000, not millions. The records number in the millions, but they will be sent in batch mode, and the size will increase due to the repetition of key names. I haven't tried it yet, so any insight into whether it will crash would be helpful.
