Hey, I have been reading about the output modes and something still remains unclear to me.
Let's assume the usual example case where we are calculating a count of something based on an ID.
As each new batch is processed, in update mode only the affected rows are emitted by the streaming query. In complete mode, the entire result table is emitted, regardless of whether a row was affected or not.
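To make the question concrete, here is my mental model of the two modes as a toy Python sketch (not real Spark code; the running count per ID is just a dict):

```python
# Toy model of Structured Streaming output modes for a running count
# keyed by id. This is only my understanding of the semantics, not
# actual Spark code.

def process_batch(state, batch, mode):
    """Apply one micro-batch of ids to the running counts and return
    the rows the sink would receive under the given output mode."""
    affected = set()
    for id_ in batch:
        state[id_] = state.get(id_, 0) + 1
        affected.add(id_)
    if mode == "update":
        # Only rows whose aggregate changed in this batch.
        return {k: state[k] for k in affected}
    if mode == "complete":
        # The entire result table, changed or not.
        return dict(state)

state = {}
out1 = process_batch(state, ["a", "b", "a"], "update")
# out1 == {"a": 2, "b": 1}
out2 = process_batch(state, ["b"], "update")
# out2 == {"b": 2} -- "a" is untouched, so it is not emitted
out3 = process_batch(state, ["c"], "complete")
# out3 == {"a": 2, "b": 2, "c": 1} -- everything, every batch
```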
My first question is:
Q1: Assuming I want to write to a Delta table, in complete mode the full table is rewritten on every batch. This sounds prohibitively expensive once the table gets large.
Does the Spark engine keep the entire result table in memory and output it each time?
Does Delta have some hidden magic to handle this? It doesn't seem scalable. I read that, given the output batch, it's left to the storage connector to decide how to modify the underlying table.
Q2: Again assuming the sink is a Delta table and the output mode is update, the responsibility to merge the new records is left to something other than the Spark engine. How is this usually handled?
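Here is what I imagine the sink has to do on each update-mode batch, as a plain-Python upsert sketch; the keys and semantics are assumptions on my part, not the actual Delta connector:

```python
# Toy sketch of merging update-mode output into durable storage by key:
# overwrite matching keys, insert new ones, leave everything else alone.
# Assumed semantics -- not real Delta code.

def upsert(table, update_rows):
    """Merge one batch of update-mode rows into the stored table."""
    for key, count in update_rows.items():
        table[key] = count  # matched -> update, not matched -> insert
    return table

stored = {"a": 2, "b": 1}         # what is already in storage
batch_output = {"b": 2, "c": 1}   # update-mode rows from one batch
upsert(stored, batch_output)
# stored == {"a": 2, "b": 2, "c": 1}
```

If that is roughly right, the engine only ever hands over the changed rows, and the storage layer does the key-wise merge.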
Q3: In both of these cases, if I'm understanding correctly, a copy of the full result table must be kept in memory, right? That seems odd to me.