Monitoring¶
Observability for AI agents extends traditional application monitoring to capture AI-specific telemetry: prompts, responses, token counts, tool executions, and multi-step reasoning workflows.
Agent Monitoring¶
Traditional application monitoring answers "is it working?" AI agent monitoring must also answer "why did it do that?"
Debugging¶
When an agent produces incorrect output, the cause could be:
- A poorly constructed prompt
- Retrieved context that confused the model
- A tool that returned unexpected data
- Accumulated errors across a multi-step chain
With traces, you can inspect exactly what the agent saw and what it decided.
Cost Visibility¶
LLM APIs charge per token. A single agent session might make dozens of model calls, each with different context sizes. Traces show which operations consume the most tokens, where context windows fill unnecessarily, and what each task or user costs.
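The per-operation aggregation described above can be sketched in a few lines. This is an illustrative example, not a real backend: the span records, operation names, and flat per-token rate are all hypothetical.

```python
# Illustrative sketch: aggregate token counts from trace spans to find
# the most expensive operations. Span data and pricing are hypothetical.
from collections import defaultdict

# Each span carries its operation name and total token count.
spans = [
    {"name": "planning", "kind": "LLM", "tokens": 1_200},
    {"name": "analysis", "kind": "LLM", "tokens": 6_500},
    {"name": "summary",  "kind": "LLM", "tokens": 900},
]

COST_PER_1K_TOKENS = 0.01  # hypothetical flat rate, $ per 1,000 tokens

def cost_by_operation(spans):
    """Sum token counts per operation name and convert to dollars."""
    totals = defaultdict(int)
    for span in spans:
        totals[span["name"]] += span["tokens"]
    return {name: round(t * COST_PER_1K_TOKENS / 1000, 4)
            for name, t in totals.items()}

print(cost_by_operation(spans))
```

A real backend performs the same grouping over exported span attributes rather than in-process dictionaries, but the roll-up logic is the same.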
Latency Analysis¶
Agent latency includes model inference startup, generation time (proportional to output length), tool execution (external API calls, file operations), and orchestration overhead between steps.
Traces decompose end-to-end latency into components.
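As a minimal sketch of that decomposition, the following computes per-span durations from recorded timestamps and treats any uncovered time as orchestration overhead. The timestamps and span names are hypothetical.

```python
# Illustrative sketch: decompose end-to-end latency into per-span components.
# Timestamps (in seconds) are hypothetical; a real trace records them per span.
spans = [
    {"name": "llm planning",   "start": 0.00, "end": 1.40},
    {"name": "tool file read", "start": 1.45, "end": 1.60},
    {"name": "llm analysis",   "start": 1.65, "end": 4.10},
]

def latency_breakdown(spans):
    """Return per-span durations plus time not covered by any span."""
    total = spans[-1]["end"] - spans[0]["start"]
    components = {s["name"]: round(s["end"] - s["start"], 2) for s in spans}
    # Gaps between spans are orchestration overhead between steps.
    overhead = round(total - sum(components.values()), 2)
    return components, overhead

components, overhead = latency_breakdown(spans)
```

With these numbers, model calls dominate and only a small residue is orchestration, which is the typical shape such a breakdown reveals.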
Stack Overview¶
LLM observability builds on distributed tracing concepts but extends them for AI workloads:
┌─────────────────────────────────────┐
│ Agent Application │
└───────────────┬─────────────────────┘
│
┌───────┴───────┐
│ OpenTelemetry │ ← Instrumentation standard
└───────┬───────┘
│
┌───────┴───────┐
│ OpenInference │ ← AI/ML semantic layer
└───────┬───────┘
│
┌───────┴───────┐
│ Arize Phoenix │ ← Visualization platform
└───────────────┘
| Layer | Role |
|---|---|
| OpenTelemetry | Vendor-neutral standard for creating and exporting traces |
| OpenInference | Defines AI-specific attributes to capture |
| Arize Phoenix | Consumes these traces for AI-focused analysis |
The same instrumentation works with any OTLP backend.
Span Types¶
OpenInference defines span kinds that correspond to AI operations:
| Span Kind | Represents |
|---|---|
| AGENT | A reasoning loop that orchestrates other operations |
| LLM | A single model inference call |
| TOOL | Execution of an external function |
| RETRIEVER | Document or data lookup |
| EMBEDDING | Vector generation |
| CHAIN | A pipeline or workflow container |
Trace Structure¶
A trace through an agent session forms a hierarchy:
AGENT (session)
├── LLM (planning)
├── TOOL (file read)
├── LLM (analysis)
├── TOOL (code edit)
└── LLM (summary)
Each span captures its inputs, outputs, timing, and relevant metadata. The hierarchy is causal: it records which LLM call triggered which tool execution.
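A trace viewer reconstructs exactly this tree from flat span records using parent references. The sketch below shows the idea with hypothetical integer span IDs; real traces use OTel trace and span IDs.

```python
# Illustrative sketch: rebuild a span hierarchy from flat records,
# the way a trace viewer does. IDs and names are hypothetical.
spans = [
    {"id": 1, "parent": None, "name": "AGENT (session)"},
    {"id": 2, "parent": 1,    "name": "LLM (planning)"},
    {"id": 3, "parent": 1,    "name": "TOOL (file read)"},
    {"id": 4, "parent": 1,    "name": "LLM (analysis)"},
]

def render(spans, parent=None, depth=0):
    """Depth-first walk: indent each span under the span that caused it."""
    lines = []
    for s in spans:
        if s["parent"] == parent:
            lines.append("  " * depth + s["name"])
            lines.extend(render(spans, s["id"], depth + 1))
    return lines

tree = render(spans)
```

The parent pointer is what makes the hierarchy causal rather than merely chronological: a child span exists because its parent initiated it.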
Topics¶
- OpenTelemetry - The instrumentation standard underlying LLM observability
- GenAI Conventions - Semantic conventions for AI operations
- Arize Phoenix - Open-source observability platform for LLM traces