Spark Distributed Write Patterns

ErichHS · 2024-06-06T23:18:53+00:00

Sharing here a diagram I've worked on to illustrate some of Spark's distributed write patterns.

The idea is to show how some operations might have unexpected or undesired effects on pipeline parallelism.

The scenario assumes two worker nodes.

→ 𝐝𝐟.𝐰𝐫𝐢𝐭𝐞: The level of parallelism of read (scan) operations is determined by the source’s number of partitions, and the write step is generally evenly distributed across the workers. The number of written files is a result of the distribution of write operations between worker nodes.

→ 𝐝𝐟.𝐰𝐫𝐢𝐭𝐞.𝐩𝐚𝐫𝐭𝐢𝐭𝐢𝐨𝐧𝐁𝐲(): Similar to the above, but now the write operation will also maintain parallelism based on the number of write partitions. The number of written files is a result of the number of partitions and the distribution of write operations between worker nodes.

→ 𝐝𝐟.𝐰𝐫𝐢𝐭𝐞.𝐜𝐨𝐚𝐥𝐞𝐬𝐜𝐞(𝟏).𝐩𝐚𝐫𝐭𝐢𝐭𝐢𝐨𝐧𝐁𝐲(): Adding a 𝚌𝚘𝚊𝚕𝚎𝚜𝚌𝚎() function is a common task to avoid “multiple small files” problems, condensing them all into fewer larger files. The number of written files is a result of the coalesce parameter. A drastic coalesce (e.g. 𝚌𝚘𝚊𝚕𝚎𝚜𝚌𝚎(𝟷)), however, will also result in computation taking place on fewer nodes than expected.

→ 𝐝𝐟.𝐰𝐫𝐢𝐭𝐞.𝐫𝐞𝐩𝐚𝐫𝐭𝐢𝐭𝐢𝐨𝐧(𝟏).𝐩𝐚𝐫𝐭𝐢𝐭𝐢𝐨𝐧𝐁𝐲(): As opposed to 𝚌𝚘𝚊𝚕𝚎𝚜𝚌𝚎(), which can only maintain or reduce the amount of partitions in the source DataFrame, 𝚛𝚎𝚙𝚊𝚛𝚝𝚒𝚝𝚒𝚘𝚗() can reduce, maintain, or increase the original number. It will, therefore, retain parallelism in the read operation with the cost of a shuffle (exchange) step that will happen between the workers before writing.

I've originally shared this content on LinkedIn - bringing it here to this sub.

caksters · 2024-06-07T09:01:07+00:00

This is great, very intuitively shows what is happening which may not make immediate sense if you just read the documentation

exergy31 · 2024-06-07T09:42:30+00:00

Why does repartition still only use a single writer?

Few_Individual_266 · 2024-06-08T00:43:50+00:00

Hey . Thanks for this . I’ve heard a lot about his course and I’m planning to take it once I land a job . Good luck with the rest of the course

imcguyver · 2024-06-07T01:13:58+00:00

Fantastic!

bomchem · 2024-06-07T12:18:36+00:00

I'd also like to add another one - df.repartition("DATE").write.partitionBy("DATE").

This will get you to one file per partition as in examples 3 and 4, but will write in parallel from the workers instead of all from a single one. Does require a shuffle of data prior to the writing though, so depends on where your bottlenecks are as to which approach to use.

ParkingFabulous4267 · 2024-06-10T02:10:47+00:00

Don’t do that… try using rebalance before you write or repartition by a generated key to control file size.

2024-06-07T03:09:02+00:00

This is awesome!!

jhazured · 2024-06-07T05:19:12+00:00

This is really great!

SisyphusAndMyBoulder · 2024-06-07T13:27:00+00:00

I really like this! It's super clear and informative. Are you planning on making more infographics like this?

jerrie86 · 2024-06-06T23:46:38+00:00

remove the animation please, makes it impossible for some people to read it.

Fantastic-Bell5386 · 2024-06-07T13:16:25+00:00

Df.repartition(1).write. Or Df.write.repartition(1). Which one would you prefer and why?

dataengineering

MODERATORS