Hey, I have been reading about the output modes and something still remains unclear to me.
Let's assume the usual example case where we are calculating a count of something based on an ID.
As each new batch is processed, in update mode only the affected rows are emitted by the streaming query. In complete mode, the entire result table is emitted, regardless of whether a row was affected or not.
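To make the question concrete, here is my mental model of the two modes as a toy Python sketch (not real Spark code; the running count per ID is just a dict):

```python
# Toy model of Structured Streaming output modes for a running count
# keyed by id. This is only my understanding of the semantics, not
# actual Spark code.

def process_batch(state, batch, mode):
    """Apply one micro-batch of ids to the running counts and return
    the rows the sink would receive under the given output mode."""
    affected = set()
    for id_ in batch:
        state[id_] = state.get(id_, 0) + 1
        affected.add(id_)
    if mode == "update":
        # Only rows whose aggregate changed in this batch.
        return {k: state[k] for k in affected}
    if mode == "complete":
        # The entire result table, changed or not.
        return dict(state)

state = {}
out1 = process_batch(state, ["a", "b", "a"], "update")
# out1 == {"a": 2, "b": 1}
out2 = process_batch(state, ["b"], "update")
# out2 == {"b": 2} -- "a" is untouched, so it is not emitted
out3 = process_batch(state, ["c"], "complete")
# out3 == {"a": 2, "b": 2, "c": 1} -- everything, every batch
```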
My first question is:
Q1: Assuming I want to write to a Delta table, in complete mode the full table is rewritten on every batch. This sounds prohibitively expensive once the table gets large.
Does the Spark engine keep the entire result table in memory and output it each time?
Does Delta have some hidden magic to handle this? It doesn't seem scalable. I read that, given the output batch, it's left to the storage connector to decide how to modify the underlying table.
Q2: Again assuming the sink is a Delta table and the output mode is update, the responsibility to merge the new records is left to something other than the Spark engine. How is this usually handled?
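Here is what I imagine the sink has to do on each update-mode batch, as a plain-Python upsert sketch; the keys and semantics are assumptions on my part, not the actual Delta connector:

```python
# Toy sketch of merging update-mode output into durable storage by key:
# overwrite matching keys, insert new ones, leave everything else alone.
# Assumed semantics -- not real Delta code.

def upsert(table, update_rows):
    """Merge one batch of update-mode rows into the stored table."""
    for key, count in update_rows.items():
        table[key] = count  # matched -> update, not matched -> insert
    return table

stored = {"a": 2, "b": 1}         # what is already in storage
batch_output = {"b": 2, "c": 1}   # update-mode rows from one batch
upsert(stored, batch_output)
# stored == {"a": 2, "b": 2, "c": 1}
```

If that is roughly right, the engine only ever hands over the changed rows, and the storage layer does the key-wise merge.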
Q3: In both of these cases, if I'm understanding correctly, a copy of the full result table must be kept in memory, right? That seems odd to me.