
Monitoring

Observability for AI agents extends traditional application monitoring to capture AI-specific telemetry: prompts, responses, token counts, tool executions, and multi-step reasoning workflows.

Agent Monitoring

Traditional application monitoring answers "is it working?" AI agent monitoring must also answer "why did it do that?"

Debugging

When an agent produces incorrect output, the cause could be:

  • A poorly constructed prompt
  • Retrieved context that confused the model
  • A tool that returned unexpected data
  • Errors that accumulated across a multi-step chain

With traces, you can inspect exactly what the agent saw and what it decided.
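To make that concrete, here is a minimal sketch in plain Python (no tracing library; the span records and field names are hypothetical) of scanning a trace to find where a suspect value first entered the chain:

```python
# Hypothetical span records, in execution order, as a tracer might export them.
spans = [
    {"name": "retriever",    "input": "refund policy",          "output": "doc: shipping policy"},
    {"name": "llm.analysis", "input": "doc: shipping policy",   "output": "30-day shipping window"},
    {"name": "llm.answer",   "input": "30-day shipping window", "output": "Refunds allowed within 30 days"},
]

def first_mention(spans, needle):
    """Return the name of the earliest span whose output contains the suspect text."""
    for span in spans:
        if needle in span["output"]:
            return span["name"]
    return None

# The wrong claim traces back to retrieval, not to the model itself.
print(first_mention(spans, "shipping"))  # → retriever
```

Walking span outputs in order distinguishes "the model hallucinated" from "the model faithfully summarized bad context".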

Cost Visibility

LLM APIs charge per token. A single agent session might make dozens of model calls, each with different context sizes. Traces show which operations consume the most tokens, where context windows fill unnecessarily, and cost per task or user.
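A hedged sketch of how per-span token counts roll up into session cost, assuming illustrative prices (real per-token rates vary by model and provider):

```python
# Illustrative prices in USD per 1M tokens; substitute your provider's rates.
PRICE_IN, PRICE_OUT = 3.00, 15.00

def span_cost(prompt_tokens, completion_tokens):
    """Cost of one LLM call from the token counts recorded on its span."""
    return (prompt_tokens * PRICE_IN + completion_tokens * PRICE_OUT) / 1_000_000

# Hypothetical token counts from the LLM spans of one agent session.
llm_spans = [
    {"name": "planning", "prompt_tokens": 1_200, "completion_tokens": 300},
    {"name": "analysis", "prompt_tokens": 8_500, "completion_tokens": 900},
    {"name": "summary",  "prompt_tokens": 9_800, "completion_tokens": 400},
]

total = sum(span_cost(s["prompt_tokens"], s["completion_tokens"]) for s in llm_spans)
biggest = max(llm_spans, key=lambda s: s["prompt_tokens"])
print(f"session cost ~${total:.4f}; largest context: {biggest['name']}")
```

Note how the summary step carries nearly the full accumulated context: a common place where context windows fill unnecessarily.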

Latency Analysis

Agent latency includes model inference startup, generation time (proportional to output length), tool execution (external API calls, file operations), and orchestration overhead between steps.

Traces decompose end-to-end latency into components.
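For instance, given start/end timestamps on each child span (fields illustrative), orchestration overhead is simply what remains of the session after subtracting model and tool time:

```python
# Hypothetical span timings (seconds) from one traced agent session.
spans = [
    {"name": "llm.planning",  "start": 0.0, "end": 2.1},
    {"name": "tool.file_read", "start": 2.1, "end": 2.4},
    {"name": "llm.analysis",  "start": 2.6, "end": 6.0},
]
session = {"start": 0.0, "end": 6.3}

child_time = sum(s["end"] - s["start"] for s in spans)
total = session["end"] - session["start"]
overhead = total - child_time  # orchestration time between steps
print(f"model+tool time: {child_time:.1f}s, orchestration overhead: {overhead:.1f}s")
```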

Stack Overview

LLM observability builds on distributed tracing concepts but extends them for AI workloads:

┌─────────────────────────────────────┐
│         Agent Application           │
└───────────────┬─────────────────────┘
        ┌───────┴───────┐
        │ OpenTelemetry │  ← Instrumentation standard
        └───────┬───────┘
        ┌───────┴───────┐
        │ OpenInference │  ← AI/ML semantic layer
        └───────┬───────┘
        ┌───────┴───────┐
        │ Arize Phoenix │  ← Visualization platform
        └───────────────┘
Layer          Role
OpenTelemetry  Vendor-neutral standard for creating and exporting traces
OpenInference  Defines AI-specific attributes to capture
Arize Phoenix  Consumes these traces for AI-focused analysis

The same instrumentation works with any OTLP backend.
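As a schematic illustration of that vendor neutrality, an exported trace is just structured data. The shape below loosely follows the OTLP/JSON encoding (heavily abridged, with made-up IDs), which any OTLP backend can consume:

```python
import json

# Schematic, abridged OTLP/JSON trace payload; IDs are placeholders.
payload = {
    "resourceSpans": [{
        "scopeSpans": [{
            "spans": [{
                "name": "llm.analysis",
                "traceId": "5b8aa5a2d2c872e8321cf37308d69df2",
                "spanId": "051581bf3cb55c13",
                "attributes": [
                    {"key": "openinference.span.kind",
                     "value": {"stringValue": "LLM"}},
                ],
            }]
        }]
    }]
}

# Any OTLP-compatible backend can deserialize this shape.
print(len(json.dumps(payload)))
```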

Span Types

OpenInference defines span kinds that correspond to AI operations:

Span Kind   Represents
AGENT       A reasoning loop that orchestrates other operations
LLM         A single model inference call
TOOL        Execution of an external function
RETRIEVER   A document or data lookup
EMBEDDING   Vector generation
CHAIN       A pipeline or workflow container
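A minimal sketch of how an instrumentor might tag spans with their kind: `openinference.span.kind` is the OpenInference attribute key, while the helper function and the extra retrieval attribute are hypothetical:

```python
# OpenInference marks each span with an attribute naming its kind.
SPAN_KIND_KEY = "openinference.span.kind"
SPAN_KINDS = {"AGENT", "LLM", "TOOL", "RETRIEVER", "EMBEDDING", "CHAIN"}

def make_span(name, kind, **attrs):
    """Build a span record the way an instrumentor might (hypothetical helper)."""
    if kind not in SPAN_KINDS:
        raise ValueError(f"unknown span kind: {kind}")
    return {"name": name, "attributes": {SPAN_KIND_KEY: kind, **attrs}}

# "document.count" here is an illustrative attribute, not a spec name.
span = make_span("vector_search", "RETRIEVER", **{"document.count": 5})
print(span["attributes"][SPAN_KIND_KEY])  # → RETRIEVER
```

Backends like Phoenix key their AI-specific views (token tables, retrieval panels) off this kind attribute.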

Trace Structure

A trace through an agent session forms a hierarchy:

AGENT (session)
├── LLM (planning)
├── TOOL (file read)
├── LLM (analysis)
├── TOOL (code edit)
└── LLM (summary)

Each span captures its inputs, outputs, timing, and relevant metadata. The hierarchy is causal: it records which LLM call triggered which tool execution.
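Because each span records its parent, the tree above can be rebuilt from a flat span export. A minimal sketch (span records hypothetical):

```python
# Flat export of the session above; each span references its parent span.
spans = [
    {"id": "a", "parent": None, "name": "AGENT (session)"},
    {"id": "b", "parent": "a",  "name": "LLM (planning)"},
    {"id": "c", "parent": "a",  "name": "TOOL (file read)"},
    {"id": "d", "parent": "a",  "name": "LLM (analysis)"},
]

def children(spans, parent_id):
    """Names of spans directly caused by the given parent span."""
    return [s["name"] for s in spans if s["parent"] == parent_id]

print(children(spans, "a"))
```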
