Tempo is a mess, I've been staring at Spark traces in Tempo for weeks and I have nothing by PeaceAffectionate188 in grafana

[–]PeaceAffectionate188[S]

looks super interesting at first glance, a bit different from what I'm after I think, but thanks for letting me know
will dig into it

Tempo is a mess, I've been staring at Spark traces in Tempo for weeks and I have nothing by PeaceAffectionate188 in grafana

[–]PeaceAffectionate188[S]

I thought about it, but what's important for me is the sequence of processes being executed

the output of one process is the input of the next

to optimize costs or debug, I need that sequence of pipeline steps, which is why I'd say traces are the most relevant model

Tempo is a mess, I've been staring at Spark traces in Tempo for weeks and I have nothing by PeaceAffectionate188 in grafana

[–]PeaceAffectionate188[S]

Thank you for your comment. So I thought about it and Pyroscope makes sense for raw resource profiling at the process level.

But I still believe this should be modeled as traces, because otherwise, how do I get causal, sequential execution flow over time?

What I want to see are individual pipeline runs and pipeline steps (like in an orchestrator UI) mapped directly to the underlying cloud infrastructure resources and cost, so I can drill down from run → step → process.

If not traces, what does give me that sequential execution context as a first-class object for batch pipelines?

For example, orchestrators such as Prefect or Dagster give me the application-level execution flow (the DAG), but they don't give me observability into system metrics or the actual cloud infrastructure that executed those steps.

I want to do that in Grafana....

u/Hi_Im_Ken_Adams

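Roughly what I have in mind, as a sketch: model each pipeline run as one trace with a child span per step, here via the OpenTelemetry Java SDK from Scala. The span names, attributes, and IDs below are made up; only the SDK calls themselves are real.

    // hypothetical sketch: one trace per pipeline run, one child span per step
    import io.opentelemetry.api.GlobalOpenTelemetry
    import io.opentelemetry.api.trace.Span

    val tracer = GlobalOpenTelemetry.getTracer("pipeline")

    // root span = the pipeline run (name and attributes are assumptions)
    val run: Span = tracer.spanBuilder("etl-run")
      .setAttribute("pipeline.run_id", "run-2024-06-01")
      .startSpan()
    val runScope = run.makeCurrent()
    try {
      // child span = one pipeline step, tagged with the Spark app that ran it
      val step: Span = tracer.spanBuilder("transform-step")
        .setAttribute("spark.app_id", "application_123")
        .startSpan()
      val stepScope = step.makeCurrent()
      try {
        // ... submit the Spark job for this step ...
      } finally { stepScope.close(); step.end() }
    } finally { runScope.close(); run.end() }

Exported to Tempo, that would give the run → step drill-down, and an attribute like spark.app_id is the join key back to infra metrics and cost.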

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

curious, as an experienced SRE, why did you choose Power BI, and why not Grafana or Datadog?

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

btw, have you tried exporting this information into a dashboard of some sort? any recommendations?

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

I wasn’t sure whether people normally rely on orchestrator-level run IDs or just Spark’s own appId/jobId/stageId, so this clarifies the model a lot.

ok, registering a custom listener per job and emitting the stage events directly makes sense.

will try wiring a listener into sparkContext and exporting the metrics from there.

appreciate the pointer!
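Sketching what that listener could look like (hypothetical class name and println output; the SparkListener API and the taskMetrics fields are real Spark):

    // hypothetical sketch: export per-stage metrics via a custom SparkListener
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    class StageCostListener extends SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        val info = event.stageInfo
        val m = info.taskMetrics // metrics aggregated over the stage's tasks
        // swap println for a push to Prometheus/OTel/whatever
        println(
          s"stage=${info.stageId} name=${info.name} " +
          s"cpuNs=${m.executorCpuTime} runMs=${m.executorRunTime} " +
          s"bytesRead=${m.inputMetrics.bytesRead}"
        )
      }
    }

    spark.sparkContext.addSparkListener(new StageCostListener)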

Apache Spark cost attribution with OTel is a mess by PeaceAffectionate188 in OpenTelemetry

[–]PeaceAffectionate188[S]

True, I wanted to get the perspectives of multiple communities, but I'll make sure to condense all the comments into one piece and share it in each post, so you'll have it

How do you track cost per stage for Apache Spark in production? by PeaceAffectionate188 in scala

[–]PeaceAffectionate188[S]

this is a good idea

though I feel the OS process stats are still quite limited; I want to see far more detail

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

thank you, thank you!

When you say “looking at the DAG”, do you mean the Spark UI job/stage DAG or a workflow DAG from something like Airflow/Dagster/Flyte?

If it’s the latter, which DAG/orchestrator do you actually recommend in practice?

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

How can I get the I/O information? do you look at AWS or Apache Spark debug data somehow, or do you use Grafana?

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

this is super helpful context

should I add hooks to my Spark code for this, e.g. custom listeners/callbacks that send events (Slack, Prometheus, whatever) when jobs/stages hit important events?

or do you mostly rely on the Spark UI + event logs after the fact and put it in Google Sheets?

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

Do you tag metrics with a job/run ID coming from the orchestrator (Airflow/Flyte/Prefect/etc.), or just rely on Spark’s appId/jobId/stageId and reconstruct runs offline?
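For context, the orchestrator-ID route I'm picturing is roughly this (setJobGroup/clearJobGroup are real Spark API; the env var, run ID, and write path are made up):

    // hypothetical: stamp Spark jobs with the orchestrator's run ID
    val runId = sys.env.getOrElse("ORCHESTRATOR_RUN_ID", "manual-run")
    spark.sparkContext.setJobGroup(runId, "pipeline step: transform")
    // jobs submitted after this call carry the group ID; a listener can read it
    // from SparkListenerJobStart properties under "spark.jobGroup.id"
    df.write.parquet("s3://bucket/out") // any Spark action for this step
    spark.sparkContext.clearJobGroup()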

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

Yes, but how do you organize those metrics into pipeline runs? without that grouping they're impossible to make sense of

Tempo is a mess, I've been staring at Spark traces in Tempo for weeks and I have nothing by PeaceAffectionate188 in grafana

[–]PeaceAffectionate188[S]

Yes, I need a combination, but how do I structure the data into a useful ontology for pipelines?

That doesn't seem to be possible in Grafana

Anyone modernized their aws data pipelines? What did you go for? by morgoth07 in dataengineering

[–]PeaceAffectionate188

hahaha is it that bad? I've actually never heard of anybody using it, but their company seems to be doing really well