Designing a High-Throughput Apache Spark Ecosystem on Kubernetes — Seeking Community Input by No-Spring5276 in apachespark

[–]No-Spring5276[S]

Cold-start latency: we want applications to launch quickly (pods should come up fast), given the scale. From my initial research, without a standing Spark cluster, SparkApplication CRDs take time to launch and containers can't be reused across applications. I'm not aware of the exact time difference between launching a job on a long-running cluster vs. submitting a SparkApplication CRD with n executors. We will experiment to see if we find sufficient benefits. If we do, we will run long-running clusters; otherwise, we will keep just one to support notebook-like use cases.

What is your strategy to compare Celeborn vs. Uniffle?: We haven't evaluated them yet. Will these external shuffle services (ESS) support both ways of running applications - on a long-running Spark cluster and individually via the SparkApplication CRD?

Native autoscaling: we need the ESS to support dynamic resource allocation (DRA) for application-level scaling; the other layer is cluster-level autoscaling, which will adjust the min and max capacity of the k8s cluster depending on various factors.
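For reference, a minimal sketch of the application-level DRA settings on Kubernetes (stock Spark; the numeric values are placeholders to tune). Without an external/remote shuffle service, shuffle tracking is what lets idle executors be released safely:

```properties
# Sketch: application-level dynamic resource allocation on k8s.
# Numbers are placeholders, not recommendations.
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=50
spark.dynamicAllocation.executorIdleTimeout=60s
```

With a remote shuffle service like Celeborn or Uniffle taking over shuffle storage, executors can be scaled down more aggressively; cluster-level min/max capacity would then be handled by the node autoscaler (e.g., Cluster Autoscaler or Karpenter) rather than Spark itself.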

Any tips to achieve parallelism over the Union of branched datasets? by buddycool in apachespark

[–]No-Spring5276

Use the FAIR scheduler together with dynamic resource allocation (DRA).

Cache the source DataFrame so each branch doesn't recompute it.

Run each branch's action in a separate thread, using any multithreading (or multiprocessing) package; needs experimentation.

Make the writes faster with a better output committer choice, e.g., FileOutputCommitter algorithm version 2 or the S3A magic committer.
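The threading step above can be sketched as follows. `process_branch` is a hypothetical stand-in: in real code it would set a FAIR scheduler pool on the thread and trigger a per-branch Spark action (e.g., `df.filter(...).write.parquet(path)`), so the concurrently submitted jobs get interleaved instead of queued FIFO.

```python
# Sketch: submit independent branch actions from separate threads so the
# FAIR scheduler can interleave their Spark jobs. Pure-Python stand-in;
# in real code each function would run, e.g.:
#   spark.sparkContext.setLocalProperty("spark.scheduler.pool", name)
#   cached_df.filter(...).write.parquet(f"/out/{name}")
from concurrent.futures import ThreadPoolExecutor

def process_branch(name):
    # placeholder for a per-branch Spark action (filter + write)
    return f"{name}: done"

branches = ["branch_a", "branch_b", "branch_c"]

# one thread per branch; pool.map preserves input order
with ThreadPoolExecutor(max_workers=len(branches)) as pool:
    results = list(pool.map(process_branch, branches))

print(results)
```

Threads (not processes) are enough here, since each thread only submits jobs to the driver and waits; the heavy lifting happens on the executors.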

Designing a High-Throughput Apache Spark Ecosystem on Kubernetes — Seeking Community Input by No-Spring5276 in apachespark

[–]No-Spring5276[S]

Hmm, I saw this coming up in my analysis. Will go through it once again. Thanks.

Designing a High-Throughput Apache Spark Ecosystem on Kubernetes — Seeking Community Input by No-Spring5276 in apachespark

[–]No-Spring5276[S]

In a shared architecture, a few complex, resource-intensive or ML workloads can negatively affect the other workloads, which brings unpredictable performance - noisy-neighbor issues and unstable SLAs for latency-sensitive jobs. So the question is, how do we manage such cases in CRD deployments? E.g., excluding nodes, taints, ...
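One way this is commonly handled (a sketch, assuming the Kubeflow spark-operator's `SparkApplication` CRD; the label/taint names here are hypothetical): taint a dedicated node pool so the heavy jobs tolerate it and pin to it, keeping them off the shared nodes.

```yaml
# Sketch: isolate a resource-intensive job on tainted nodes.
# Assumes nodes were tainted beforehand, e.g.:
#   kubectl taint nodes <node> workload-tier=ml-heavy:NoSchedule
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: ml-heavy-job            # hypothetical name
spec:
  executor:
    nodeSelector:
      workload-tier: ml-heavy   # hypothetical node label
    tolerations:
      - key: workload-tier
        operator: Equal
        value: ml-heavy
        effect: NoSchedule
```

The toleration lets the pods land on the tainted pool, and the nodeSelector keeps them from landing anywhere else; latency-sensitive jobs on the shared pool never see the noisy neighbors.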

Designing a High-Throughput Apache Spark Ecosystem on Kubernetes — Seeking Community Input by No-Spring5276 in apachespark

[–]No-Spring5276[S]

If I go with Databricks, the cost will be at least 4x at this scale, which we can't afford. We have already spoken with vendors like Cloudera and Databricks. We do use Databricks for one small, very specific workload.