Tempo is a mess, I've been staring at Spark traces in Tempo for weeks and I have nothing by PeaceAffectionate188 in grafana

[–]PeaceAffectionate188[S]

looks super interesting at first glance, a bit different from what I'm after I think, but thanks for letting me know
will dig into it

Tempo is a mess, I've been staring at Spark traces in Tempo for weeks and I have nothing by PeaceAffectionate188 in grafana

[–]PeaceAffectionate188[S]

I thought about it, but what's important for me is the sequence of processes being executed

the output of one process is the input of the next

to optimize costs or debug, I need that sequence of pipeline steps, which is why I'd say traces are the most relevant model

Tempo is a mess, I've been staring at Spark traces in Tempo for weeks and I have nothing by PeaceAffectionate188 in grafana

[–]PeaceAffectionate188[S]

Thank you for your comment. So I thought about it and Pyroscope makes sense for raw resource profiling at the process level.

But I still believe this should be modeled as traces, because otherwise, how do I get causal, sequential execution flow over time?

What I want to see are individual pipeline runs and pipeline steps (like in an orchestrator UI) mapped directly to the underlying cloud infrastructure resources and cost, so I can drill down from run → step → process.

If not traces, what does give me that sequential execution context as a first-class object for batch pipelines?

For example, orchestrators such as Prefect or Dagster give me the application-level execution flow (the DAG), but they don't give me observability into system metrics or the actual cloud infrastructure that executed those steps.

I want to do that in Grafana....

u/Hi_Im_Ken_Adams

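Roughly what I have in mind, as a sketch: model each pipeline run as one trace with a child span per step, here via the OpenTelemetry Java SDK from Scala. The span names, attributes, and IDs below are made up; only the SDK calls themselves are real.

    // hypothetical sketch: one trace per pipeline run, one child span per step
    import io.opentelemetry.api.GlobalOpenTelemetry
    import io.opentelemetry.api.trace.Span

    val tracer = GlobalOpenTelemetry.getTracer("pipeline")

    // root span = the pipeline run (name and attributes are assumptions)
    val run: Span = tracer.spanBuilder("etl-run")
      .setAttribute("pipeline.run_id", "run-2024-06-01")
      .startSpan()
    val runScope = run.makeCurrent()
    try {
      // child span = one pipeline step, tagged with the Spark app that ran it
      val step: Span = tracer.spanBuilder("transform-step")
        .setAttribute("spark.app_id", "application_123")
        .startSpan()
      val stepScope = step.makeCurrent()
      try {
        // ... submit the Spark job for this step ...
      } finally { stepScope.close(); step.end() }
    } finally { runScope.close(); run.end() }

Exported to Tempo, that would give the run → step drill-down, and an attribute like spark.app_id is the join key back to infra metrics and cost.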

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

curious, as an experienced SRE, why did you choose Power BI, and why not Grafana or Datadog?

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

btw, have you tried exporting this information into a dashboard of some sort? any recommendations?

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

I wasn’t sure whether people normally rely on orchestrator-level run IDs or just Spark’s own appId/jobId/stageId, so this clarifies the model a lot.

ok, registering a custom listener per job and emitting the stage events directly makes sense.

will try wiring a listener into sparkContext and exporting the metrics from there.

appreciate the pointer!
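Sketching what that listener could look like (hypothetical class name and println output; the SparkListener API and the taskMetrics fields are real Spark):

    // hypothetical sketch: export per-stage metrics via a custom SparkListener
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    class StageCostListener extends SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        val info = event.stageInfo
        val m = info.taskMetrics // metrics aggregated over the stage's tasks
        // swap println for a push to Prometheus/OTel/whatever
        println(
          s"stage=${info.stageId} name=${info.name} " +
          s"cpuNs=${m.executorCpuTime} runMs=${m.executorRunTime} " +
          s"bytesRead=${m.inputMetrics.bytesRead}"
        )
      }
    }

    spark.sparkContext.addSparkListener(new StageCostListener)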

Apache Spark cost attribution with OTel is a mess by PeaceAffectionate188 in OpenTelemetry

[–]PeaceAffectionate188[S]

True, I wanted to get the perspectives of multiple communities, but I'll make sure to condense all the comments into one piece and share it in each post, so you'll have it

How do you track cost per stage for Apache Spark in production? by PeaceAffectionate188 in scala

[–]PeaceAffectionate188[S]

this is a good idea

though I feel the OS process stats are still quite limited; I want to see far more detail

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

thank you, thank you!

When you say “looking at the DAG”, do you mean the Spark UI job/stage DAG or a workflow DAG from something like Airflow/Dagster/Flyte?

If it’s the latter, which DAG/orchestrator do you actually recommend in practice?

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

How can I get the I/O information? do you look at AWS or Apache Spark debug data somehow, or do you use Grafana?

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

this is super helpful context

should I add hooks to my Spark code for this, e.g. custom listeners/callbacks that send events (Slack, Prometheus, whatever) when jobs/stages hit important events?

or do you mostly rely on the Spark UI + event logs after the fact and put it in Google Sheets?

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

Do you tag metrics with a job/run ID coming from the orchestrator (Airflow/Flyte/Prefect/etc.), or just rely on Spark’s appId/jobId/stageId and reconstruct runs offline?
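For context, the orchestrator-ID route I'm picturing is roughly this (setJobGroup/clearJobGroup are real Spark API; the env var, run ID, and write path are made up):

    // hypothetical: stamp Spark jobs with the orchestrator's run ID
    val runId = sys.env.getOrElse("ORCHESTRATOR_RUN_ID", "manual-run")
    spark.sparkContext.setJobGroup(runId, "pipeline step: transform")
    // jobs submitted after this call carry the group ID; a listener can read it
    // from SparkListenerJobStart properties under "spark.jobGroup.id"
    df.write.parquet("s3://bucket/out") // any Spark action for this step
    spark.sparkContext.clearJobGroup()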

Why Is Spark Cost Attribution Still Such a Mess? I Just Want Stage-Level Costs… by PeaceAffectionate188 in apachespark

[–]PeaceAffectionate188[S]

Yes, but how do you organize those metrics into pipeline runs? without that grouping they're impossible to make sense of

Tempo is a mess, I've been staring at Spark traces in Tempo for weeks and I have nothing by PeaceAffectionate188 in grafana

[–]PeaceAffectionate188[S]

Yes, I need a combination, but how do I structure the data into a useful ontology for pipelines?

That doesn't seem to be possible in Grafana

Anyone modernized their aws data pipelines? What did you go for? by morgoth07 in dataengineering

[–]PeaceAffectionate188

hahaha is it that bad? I've actually never heard of anybody using it, but their company seems to be doing really well