
Monitoring

Observability for AI agents extends traditional application monitoring to capture AI-specific telemetry: prompts, responses, token counts, tool executions, and multi-step reasoning workflows.

Agent Monitoring

Traditional application monitoring answers "is it working?" AI agent monitoring must also answer "why did it do that?"

Debugging

When an agent produces incorrect output, the cause could be:

  • A poorly constructed prompt
  • Retrieved context that confused the model
  • A tool that returned unexpected data
  • Errors that accumulated across a multi-step chain

With traces, you can inspect exactly what the agent saw and what it decided.
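To make that concrete, here is a minimal sketch in plain Python (no tracing library; the span records and field names are hypothetical) of scanning a trace to find where a suspect value first entered the chain:

```python
# Hypothetical span records, in execution order, as a tracer might export them.
spans = [
    {"name": "retriever",    "input": "refund policy",          "output": "doc: shipping policy"},
    {"name": "llm.analysis", "input": "doc: shipping policy",   "output": "30-day shipping window"},
    {"name": "llm.answer",   "input": "30-day shipping window", "output": "Refunds allowed within 30 days"},
]

def first_mention(spans, needle):
    """Return the name of the earliest span whose output contains the suspect text."""
    for span in spans:
        if needle in span["output"]:
            return span["name"]
    return None

# The wrong claim traces back to retrieval, not to the model itself.
print(first_mention(spans, "shipping"))  # → retriever
```

Walking span outputs in order distinguishes "the model hallucinated" from "the model faithfully summarized bad context".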

Cost Visibility

LLM APIs charge per token. A single agent session might make dozens of model calls, each with different context sizes. Traces show which operations consume the most tokens, where context windows fill unnecessarily, and cost per task or user.
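A hedged sketch of how per-span token counts roll up into session cost, assuming illustrative prices (real per-token rates vary by model and provider):

```python
# Illustrative prices in USD per 1M tokens; substitute your provider's rates.
PRICE_IN, PRICE_OUT = 3.00, 15.00

def span_cost(prompt_tokens, completion_tokens):
    """Cost of one LLM call from the token counts recorded on its span."""
    return (prompt_tokens * PRICE_IN + completion_tokens * PRICE_OUT) / 1_000_000

# Hypothetical token counts from the LLM spans of one agent session.
llm_spans = [
    {"name": "planning", "prompt_tokens": 1_200, "completion_tokens": 300},
    {"name": "analysis", "prompt_tokens": 8_500, "completion_tokens": 900},
    {"name": "summary",  "prompt_tokens": 9_800, "completion_tokens": 400},
]

total = sum(span_cost(s["prompt_tokens"], s["completion_tokens"]) for s in llm_spans)
biggest = max(llm_spans, key=lambda s: s["prompt_tokens"])
print(f"session cost ~${total:.4f}; largest context: {biggest['name']}")
```

Note how the summary step carries nearly the full accumulated context: a common place where context windows fill unnecessarily.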

Latency Analysis

Agent latency includes model inference startup, generation time (proportional to output length), tool execution (external API calls, file operations), and orchestration overhead between steps.

Traces decompose end-to-end latency into components.
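For instance, given start/end timestamps on each child span (fields illustrative), orchestration overhead is simply what remains of the session after subtracting model and tool time:

```python
# Hypothetical span timings (seconds) from one traced agent session.
spans = [
    {"name": "llm.planning",  "start": 0.0, "end": 2.1},
    {"name": "tool.file_read", "start": 2.1, "end": 2.4},
    {"name": "llm.analysis",  "start": 2.6, "end": 6.0},
]
session = {"start": 0.0, "end": 6.3}

child_time = sum(s["end"] - s["start"] for s in spans)
total = session["end"] - session["start"]
overhead = total - child_time  # orchestration time between steps
print(f"model+tool time: {child_time:.1f}s, orchestration overhead: {overhead:.1f}s")
```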

Stack Overview

LLM observability builds on distributed tracing concepts but extends them for AI workloads:

┌─────────────────────────────────────┐
│         Agent Application           │
└───────────────┬─────────────────────┘
        ┌───────┴───────┐
        │ OpenTelemetry │  ← Instrumentation standard
        └───────┬───────┘
        ┌───────┴───────┐
        │ OpenInference │  ← AI/ML semantic layer
        └───────┬───────┘
        ┌───────┴───────┐
        │ Arize Phoenix │  ← Visualization platform
        └───────────────┘
Layer          Role
OpenTelemetry  Vendor-neutral standard for creating and exporting traces
OpenInference  Defines AI-specific attributes to capture
Arize Phoenix  Consumes these traces for AI-focused analysis

The same instrumentation works with any OTLP backend.
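As a schematic illustration of that vendor neutrality, an exported trace is just structured data. The shape below loosely follows the OTLP/JSON encoding (heavily abridged, with made-up IDs), which any OTLP backend can consume:

```python
import json

# Schematic, abridged OTLP/JSON trace payload; IDs are placeholders.
payload = {
    "resourceSpans": [{
        "scopeSpans": [{
            "spans": [{
                "name": "llm.analysis",
                "traceId": "5b8aa5a2d2c872e8321cf37308d69df2",
                "spanId": "051581bf3cb55c13",
                "attributes": [
                    {"key": "openinference.span.kind",
                     "value": {"stringValue": "LLM"}},
                ],
            }]
        }]
    }]
}

# Any OTLP-compatible backend can deserialize this shape.
print(len(json.dumps(payload)))
```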

Span Types

OpenInference defines span kinds that correspond to AI operations:

Span Kind   Represents
AGENT       A reasoning loop that orchestrates other operations
LLM         A single model inference call
TOOL        Execution of an external function
RETRIEVER   A document or data lookup
EMBEDDING   Vector generation
CHAIN       A pipeline or workflow container
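A minimal sketch of how an instrumentor might tag spans with their kind: `openinference.span.kind` is the OpenInference attribute key, while the helper function and the extra retrieval attribute are hypothetical:

```python
# OpenInference marks each span with an attribute naming its kind.
SPAN_KIND_KEY = "openinference.span.kind"
SPAN_KINDS = {"AGENT", "LLM", "TOOL", "RETRIEVER", "EMBEDDING", "CHAIN"}

def make_span(name, kind, **attrs):
    """Build a span record the way an instrumentor might (hypothetical helper)."""
    if kind not in SPAN_KINDS:
        raise ValueError(f"unknown span kind: {kind}")
    return {"name": name, "attributes": {SPAN_KIND_KEY: kind, **attrs}}

# "document.count" here is an illustrative attribute, not a spec name.
span = make_span("vector_search", "RETRIEVER", **{"document.count": 5})
print(span["attributes"][SPAN_KIND_KEY])  # → RETRIEVER
```

Backends like Phoenix key their AI-specific views (token tables, retrieval panels) off this kind attribute.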

Trace Structure

A trace through an agent session forms a hierarchy:

AGENT (session)
├── LLM (planning)
├── TOOL (file read)
├── LLM (analysis)
├── TOOL (code edit)
└── LLM (summary)

Each span captures its inputs, outputs, timing, and relevant metadata. The hierarchy is causal: it records which LLM call triggered which tool execution.
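Because each span records its parent, the tree above can be rebuilt from a flat span export. A minimal sketch (span records hypothetical):

```python
# Flat export of the session above; each span references its parent span.
spans = [
    {"id": "a", "parent": None, "name": "AGENT (session)"},
    {"id": "b", "parent": "a",  "name": "LLM (planning)"},
    {"id": "c", "parent": "a",  "name": "TOOL (file read)"},
    {"id": "d", "parent": "a",  "name": "LLM (analysis)"},
]

def children(spans, parent_id):
    """Names of spans directly caused by the given parent span."""
    return [s["name"] for s in spans if s["parent"] == parent_id]

print(children(spans, "a"))
```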
