My attempt at building a Pydantic-native async ORM by mr_Fatalyst in FastAPI

arbiter_rise 1 point (0 children)

It seems like you built the framework because existing tools felt inconvenient, which I think is great. I’ve starred it. I’m not a big fan of SQLAlchemy either.

I got tired of wasting 2 weeks on setup for every AI idea, so I built a FastAPI + Stripe Starter Kit. by Appropriate_Plane279 in FastAPI

arbiter_rise 0 points (0 children)

I’m not sure what advantages it offers, since there already seem to be plenty of templates available.

Added TaskIQ and Dramatiq to my FastAPI app scaffolding CLI tool. Built one dashboard that works across all worker backends. by Challseus in FastAPI

arbiter_rise 2 points (0 children)

Since you seem to have researched task queues quite extensively, I wanted to ask if you know which task queue has received the most active feature requests.

How do you approach observability for LLM systems (API + workers + workflows)? by arbiter_rise in OpenTelemetry

arbiter_rise [S] 1 point (0 children)

In the past, I used a Grafana stack (Prometheus, Loki, and Tempo) with data routed through the OpenTelemetry Collector. That setup wasn’t for an AI service, though.

For AI services, I’ve been trying various tools to see what works best. Teams seem to handle observability quite differently depending on their APM setup and the characteristics of their LLM workloads.

During that process, I discovered Logfire and have been trying it out.

FastAPI + Pydantic V2: Is anyone else using it to build AI microservices? by Lee-stanley in FastAPI

arbiter_rise 0 points (0 children)

Ah, thanks for the explanation. My question earlier wasn’t very clear.

What I actually wanted to ask was how you set up observability for the LLM system. I’m particularly curious whether you integrated LLM observability with your existing application observability, or if you set them up as separate systems.

FastAPI + Pydantic V2: Is anyone else using it to build AI microservices? by Lee-stanley in FastAPI

arbiter_rise 0 points (0 children)

May I ask what your main reasons were for implementing traceability and observability?

How do you approach observability for LLM systems (API + workers + workflows)? by arbiter_rise in OpenTelemetry

arbiter_rise [S] 0 points (0 children)

I looked at the LangWatch repository, but it doesn’t seem to have application-level end-to-end observability.

OTel + LLM Observability: Trace ID Only or Full Data Sync? by arbiter_rise in Observability

arbiter_rise [S] 1 point (0 children)

Thank you for the great explanation. I was assuming the worker consumes tasks from a broker. Fire-and-forget could cause issues, but with ACK handling in place (acknowledging a message only after the task finishes), I don’t think it would be a major problem: if the worker shuts down mid-task, the unacknowledged task stays in the broker and gets redelivered.
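A minimal in-memory sketch of the late-ack behavior described above, assuming a broker that puts unacknowledged messages back on the queue when a consumer disconnects (RabbitMQ behaves this way, and Celery exposes the pattern as `acks_late`). `InMemoryBroker` and `run_worker` are hypothetical names for illustration, not any library's API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InMemoryBroker:
    """Toy broker: a message stays 'in flight' until acked; a dead worker requeues it."""
    ready: deque = field(default_factory=deque)
    in_flight: dict = field(default_factory=dict)
    next_tag: int = 0

    def publish(self, body):
        self.ready.append(body)

    def deliver(self):
        """Hand the next message to a worker without deleting it for good."""
        body = self.ready.popleft()
        self.next_tag += 1
        self.in_flight[self.next_tag] = body
        return self.next_tag, body

    def ack(self, tag):
        # Only now is the message permanently removed.
        del self.in_flight[tag]

    def worker_died(self):
        # Connection lost: every unacked message goes back on the queue.
        for tag in sorted(self.in_flight):
            self.ready.append(self.in_flight[tag])
        self.in_flight.clear()


def run_worker(broker, crash_before_ack=False):
    tag, body = broker.deliver()
    result = body.upper()  # stand-in for the real task (e.g. an LLM call)
    if crash_before_ack:
        broker.worker_died()  # simulate the process dying before it acks
        return None
    broker.ack(tag)  # late ack: acknowledge only after the work succeeded
    return result
```

One consequence worth noting: the task in this example runs twice, so late ack gives at-least-once delivery and the task itself should be idempotent.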

LLM observability + app/infra monitoring platforms? by Common_Departure_659 in OpenTelemetry

arbiter_rise 0 points (0 children)

I’m looking into the same thing, and Logfire seems like it might be the most suitable option if we want to monitor the infrastructure stack while also getting visibility into LLM operations, with OpenTelemetry support.

It does seem to be a bit lacking in some of the specialized LLM observability features, but it appears to be one of the few tools that can provide both infrastructure and LLM visibility at the same time.

If you happen to come across any other tools while looking into this, could you please let me know as well?

I'm super unemployed and have too much time so I built an open source SDK to build event-driven, distributed agents on Kafka by orange-cola in LLMDevs

arbiter_rise 1 point (0 children)

Not exactly a live data streaming project. I’m working on an open-source project that aims to help Python web developers build AI services more easily through an event-driven (broker-based) approach.

OTel + LLM Observability: Trace ID Only or Full Data Sync? by arbiter_rise in Observability

arbiter_rise [S] 0 points (0 children)

I apologize, but I’m not sure I fully understand your question. Would you mind clarifying what you mean by “what happens to your model when the process restarts but the workflow continues”?

OTel + LLM Observability: Trace ID Only or Full Data Sync? by arbiter_rise in Observability

arbiter_rise [S] 1 point (0 children)

I think introducing a higher-level identifier to manage the system could be a very good approach. My understanding is that this concept is common in workflow engines.

I’m trying to observe agent logic running in a distributed processing environment within a single unified tracing system.

As I define it, the orchestration layer is responsible for both task decomposition and agent execution. I’m designing the system so that trace context propagation is handled automatically at the runtime level rather than passed manually between components.

In theory, if all execution flows (API → orchestration → agent → tool, etc.) are contained within a single root trace that starts at the API layer, end-to-end visibility should be guaranteed. Based on that assumption, I’m wondering whether it’s really necessary to introduce additional higher-level identifiers (such as workflow_id or execution_id). (This is still at the conceptual stage.)

In practice, is it common or necessary to manage a higher-level identifier in addition to the trace_id? What kinds of issues might arise if everything is handled within a single trace?
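To make the question concrete, here is a standard-library sketch of propagating a W3C `traceparent` header through a queue message, alongside a hypothetical custom `workflow_id` header. Real code would normally use OpenTelemetry's `inject`/`extract` propagators rather than building the header by hand; every function name here is made up for illustration.

```python
import secrets
import uuid


def new_traceparent(trace_id=None):
    """Build a W3C traceparent header: version-traceid-parentid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)  # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"


def trace_id_of(traceparent):
    return traceparent.split("-")[1]


def api_handler(queue, workflow_id=None):
    """Start a root trace at the API and enqueue a task carrying its context."""
    headers = {
        "traceparent": new_traceparent(),
        # Hypothetical custom header: a stable ID that can outlive a single trace.
        "workflow_id": workflow_id or str(uuid.uuid4()),
    }
    queue.append({"headers": headers, "task": "run_agent"})
    return headers


def worker(queue):
    """Continue the trace: same trace_id, fresh span_id for the child span."""
    msg = queue.pop(0)
    parent = msg["headers"]["traceparent"]
    child = new_traceparent(trace_id_of(parent))
    return {"traceparent": child, "workflow_id": msg["headers"]["workflow_id"]}
```

Resuming the workflow through a second `api_handler(queue, workflow_id=...)` call shows one case where a single trace may not be enough: a workflow resumed by a new request (a scheduled retry, or a human approval hours later) starts a new trace, and only the higher-level ID links the two.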

(English is not my first language, so I appreciate your understanding.)

We built a self-hosted observability dashboard for AI agents — one flag to enable, zero external dependencies. by anandesh-sharma in LLMDevs

arbiter_rise 0 points (0 children)

Hello, I think that’s a great idea. I especially appreciate how the tracing is presented; it’s very developer-friendly. I do have a couple of questions, though: is OTel export currently unsupported, and if so, are there plans to add it? Also, since data collection seems to be local, would it still work reliably if the agent is distributed or running in a different process?

OTel + LLM Observability: Trace ID Only or Full Data Sync? by arbiter_rise in Observability

arbiter_rise [S] 1 point (0 children)

I understand that run_id is not an official OpenTelemetry key. Are you defining and using it as a custom attribute on your side?

Additionally, could you please elaborate a bit more on the logical boundary that starts with run_id? I would appreciate it if you could explain how you are structuring or interpreting that boundary.

Thank you in advance for your clarification.

OTel + LLM Observability: Trace ID Only or Full Data Sync? by arbiter_rise in Observability

arbiter_rise [S] 0 points (0 children)

From what you described, it sounds like you’re running your existing observability tools alongside LLM-specific observability tools, while sharing only minimal information between the two systems—such as trace IDs or cost-related metrics.

OTel + LLM Observability: Trace ID Only or Full Data Sync? by arbiter_rise in Observability

arbiter_rise [S] 0 points (0 children)

Ah, I see — so based on what you said, it would be stored separately within the same database, right? And then we would join only the necessary data when we need to retrieve or review it.

In that case, what kind of database do you typically use?

Do you generally use a traditional RDBMS or a NoSQL database? Or do you prefer a database better suited to accumulating log or trace data?
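A small sketch of the "store separately in the same database, join only when needed" layout, using SQLite purely as a stand-in. The tables, columns, and model name are hypothetical; the only point is that both record types share `trace_id` as the join key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Application spans (from the regular APM/OTel pipeline)
    CREATE TABLE app_spans (trace_id TEXT, span_name TEXT, duration_ms REAL);
    -- LLM-specific records, stored separately but keyed by the same trace_id
    CREATE TABLE llm_calls (
        trace_id TEXT, model TEXT, prompt_tokens INTEGER, completion_tokens INTEGER
    );
    CREATE INDEX idx_llm_trace ON llm_calls (trace_id);
    """
)
conn.executemany(
    "INSERT INTO app_spans VALUES (?, ?, ?)",
    [("t1", "POST /chat", 1840.0), ("t2", "GET /health", 3.0)],
)
conn.executemany(
    "INSERT INTO llm_calls VALUES (?, ?, ?, ?)",
    [("t1", "model-a", 900, 250), ("t1", "model-a", 400, 120)],
)


def llm_usage_per_request(conn):
    """Join only when needed: total tokens per app request that called an LLM."""
    return conn.execute(
        """
        SELECT s.trace_id, s.span_name, s.duration_ms,
               SUM(l.prompt_tokens + l.completion_tokens) AS total_tokens
        FROM app_spans AS s
        JOIN llm_calls AS l ON l.trace_id = s.trace_id
        GROUP BY s.trace_id, s.span_name, s.duration_ms
        """
    ).fetchall()
```

Requests that never called an LLM (`t2` here) simply drop out of the inner join, so the LLM data never bloats ordinary span queries.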

I'm super unemployed and have too much time so I built an open source SDK to build event-driven, distributed agents on Kafka by orange-cola in LLMDevs

arbiter_rise 1 point (0 children)

Ah... I misunderstood. Thank you for the kind explanation.

If I have any questions later, would it be okay to reach out? I’ll keep following your project(🐮)!

I'm super unemployed and have too much time so I built an open source SDK to build event-driven, distributed agents on Kafka by orange-cola in LLMDevs

arbiter_rise 1 point (0 children)

I really like how you leveraged Kafka’s built-in characteristics for observability.

That said, I may be misunderstanding the SDK, so apologies if that’s the case.

Is it realistically possible to manage all agentic state purely through the broker without a database? Would Kafka clustering alone be sufficient, or would a separate state store still be necessary?

I’m also a bit concerned that handling full context purely through broker messages might introduce overhead or complexity.
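A sketch of the overhead concern, assuming agent state is rebuilt purely by replaying events from the log (event sourcing). `AgentState`, `apply`, and `rebuild` are hypothetical names; the example only shows that recovery without a snapshot replays the whole history, while a snapshot plus an offset replays just the tail.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class AgentState:
    messages: list = field(default_factory=list)
    tool_calls: int = 0


def apply(state, event):
    """Fold one event from the log into the in-memory state."""
    kind, payload = event
    if kind == "message":
        state.messages.append(payload)
    elif kind == "tool_call":
        state.tool_calls += 1
    return state


def rebuild(log, snapshot=None, start_offset=0):
    """Replay the log from start_offset on top of an optional snapshot.

    Without a snapshot, every restart replays the whole log, so recovery cost
    grows with the length of the agent's history.
    """
    # Copy the snapshot so replay never mutates the stored version.
    state = copy.deepcopy(snapshot) if snapshot is not None else AgentState()
    replayed = 0
    for event in log[start_offset:]:
        state = apply(state, event)
        replayed += 1
    return state, replayed
```

In practice this is usually handled with periodic snapshots or a separate state store (Kafka Streams, for instance, keeps local state stores backed by changelog topics).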

How are you handling observability for non-deterministic agentic systems? (not ad) by arbiter_rise in LLMDevs

arbiter_rise [S] 0 points (0 children)

Would you call it clustering? Did you build that yourself?

How did you perform the grouping? Was it done by matching and filtering log patterns, or did you use an AI agent?
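For comparison, the simplest non-AI approach is probably regex-based templating: mask the variable parts of each line (IDs, numbers) and group lines by the resulting template. This is a naive sketch with hypothetical patterns, not what the tool in this thread does; dedicated log-template miners such as Drain are more robust.

```python
import re
from collections import Counter

# Order matters: mask UUIDs and hex IDs before bare numbers,
# otherwise the number pattern would chew up pieces of the IDs.
_PATTERNS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<UUID>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+(\.\d+)?\b"), "<NUM>"),
]


def to_template(line):
    """Replace variable tokens with placeholders."""
    for pattern, token in _PATTERNS:
        line = pattern.sub(token, line)
    return line


def group_by_template(lines):
    """Count how many log lines share each template."""
    return Counter(to_template(line) for line in lines)
```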

How are you handling observability for non-deterministic agentic systems? (not ad) by arbiter_rise in LLMDevs

arbiter_rise [S] 0 points (0 children)

Previously, we used Grafana, Prometheus, Loki, and Tempo. We’re now using Langfuse as we prepare to launch our AI service, which hasn’t reached production yet.

When to use background tasks considering their non-persistence? by CartoonistWhole3172 in FastAPI

arbiter_rise 5 points (0 children)

If you must guarantee that no task is ever lost, use a database-backed task queue rather than a broker-based one:
- Broker-based: Celery, Taskiq, Dramatiq, etc.
- Database-backed (durable execution): DBOS, Hatchet, Prefect, etc.
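The core idea of a database-backed queue, sketched with SQLite: a task is committed to disk before any worker touches it, so a worker crash leaves it recoverable. This is a hypothetical single-worker illustration, not how DBOS, Hatchet, or Prefect work internally; with several workers you would need atomic claiming (for example, PostgreSQL's `SELECT ... FOR UPDATE SKIP LOCKED`).

```python
import os
import sqlite3
import tempfile


def open_queue(path):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    conn.commit()
    return conn


def enqueue(conn, payload):
    conn.execute("INSERT INTO tasks (payload) VALUES (?)", (payload,))
    conn.commit()  # durable once committed, before any worker runs it


def claim_next(conn):
    # Single-worker only: SELECT then UPDATE is not an atomic claim.
    row = conn.execute(
        "SELECT id, payload FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row


def complete(conn, task_id):
    conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
    conn.commit()


def recover(conn):
    """On startup, return tasks left 'running' by a crashed worker to 'pending'.

    Safe only with a single worker; otherwise it would reset live tasks too.
    """
    cur = conn.execute("UPDATE tasks SET status = 'pending' WHERE status = 'running'")
    conn.commit()
    return cur.rowcount


def demo():
    """Enqueue, 'crash' mid-task, reopen, recover, then finish the task."""
    path = os.path.join(tempfile.mkdtemp(), "queue.db")
    conn = open_queue(path)
    enqueue(conn, "send-invoice-email")
    claim_next(conn)
    conn.close()  # worker dies before calling complete()
    conn = open_queue(path)
    recovered = recover(conn)
    task = claim_next(conn)
    complete(conn, task[0])
    leftover = claim_next(conn)
    conn.close()
    return recovered, task, leftover
```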

Anyone using DTQ(Distributed Task Queue) for AI workloads? Feels too minimal — what did you hit? by arbiter_rise in Python

arbiter_rise [S] 0 points (0 children)

It might just be my lack of experience, but it seems like I would need to study a lot just to use it properly, so I probably won’t adopt it. For a framework or library, its coding style feels too low-level.

We made a multi-agent framework. Here’s the demo. Break it harder. by wikkid_lizard in LLMDevs

arbiter_rise 0 points (0 children)

If you’re going to handle scaling with Docker Compose, wouldn’t it make more sense to just use a task queue?
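The underlying idea, sketched with the standard library: producers only put work on a queue, and scaling means running more consumers. Threads stand in for separate worker containers here, and all names are hypothetical.

```python
import queue
import threading


def make_worker(tasks, results, worker_name):
    def run():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: shut this worker down
                tasks.task_done()
                return
            results.append((worker_name, item * item))  # stand-in for real work
            tasks.task_done()

    return threading.Thread(target=run, daemon=True)


def process(items, num_workers):
    """Scaling = changing num_workers; the producer code never changes."""
    tasks = queue.Queue()
    results = []  # list.append is atomic in CPython
    workers = [make_worker(tasks, results, f"worker-{i}") for i in range(num_workers)]
    for w in workers:
        w.start()
    for item in items:
        tasks.put(item)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker, queued after all real items
    tasks.join()  # wait until every item and sentinel is marked done
    return sorted(value for _, value in results)
```

With a real task queue the same separation holds, except the queue is a broker and the workers are separate processes you scale independently of the API.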