Zespan is an AI agent observability and engineering platform. It traces every agent decision, tool call, handoff, and delegation in production. It also provides prompt versioning, built-in LLM-as-judge evaluations, guardrails, cost optimization, and an AI ops assistant called ZespanPilot.

How do I instrument my AI agent with Zespan?

Instrumenting an agent takes two lines of code: import zespan and call zespan.init({ apiKey: process.env.LT_KEY }). This auto-patches OpenAI, Anthropic, Gemini, Bedrock, and Mistral. For framework-level tracing, add one handler: ZespanCallbackHandler for LangChain, ZespanCrewAIListener for CrewAI, or ZespanADKHandler for Google ADK.
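
As a rough sketch of the setup step, the snippet below validates the API key before handing it to zespan.init. The helper name resolveInitOptions and the import style in the trailing comment are assumptions made for illustration; only the zespan.init({ apiKey }) call and the LT_KEY variable come from the text above.

```typescript
// Sketch only: `resolveInitOptions` is a hypothetical helper, not part of the Zespan SDK.
interface InitOptions {
  apiKey: string;
}

// Reads the API key from an environment-like object and fails fast if it is
// missing, so a misconfigured deploy does not silently run without tracing.
function resolveInitOptions(env: Record<string, string | undefined>): InitOptions {
  const apiKey = env["LT_KEY"]?.trim();
  if (!apiKey) {
    throw new Error("LT_KEY is not set; Zespan tracing would not be initialized");
  }
  return { apiKey };
}

// In an application entry point this could be used as follows
// (the import form is an assumption, check the SDK docs):
//   import zespan from "zespan";
//   zespan.init(resolveInitOptions(process.env));
```

Failing at startup rather than at the first traced call makes a missing key visible immediately.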

Does Zespan support prompt versioning?

Yes. Zespan includes prompt management with versioning, a playground for iteration, and A/B testing to compare prompt versions against each other in production.

What evaluations does Zespan support?

Zespan ships 12 built-in LLM-as-judge evaluation templates, including faithfulness, relevance, toxicity, and groundedness. Evaluations run automatically on every trace with no custom scoring functions required.

How does Zespan compare to Langfuse?

Zespan is agent-native: every span carries agent identity, delegations are first-class trace events, and an agent map is built automatically. Langfuse was built for LLM pipelines and extended to agents later. Zespan also ships 12 built-in eval templates (Langfuse has none), an AI cost optimizer, and ZespanPilot for AI ops. Langfuse offers open-source self-hosting; Zespan does not.

What is the free tier for Zespan?

The free tier includes 10,000 traces per month, 14-day retention, 2 projects, and 1 seat. No credit card required.

Feature — Playground

Find prompt failures in the sandbox, not in production.

Test prompts across 4 providers with real streaming, tool calls, structured output, and your actual guardrails — before any code ships.

OpenAI, Anthropic, Google, OpenRouter. Tool calls. Structured output. Guardrail integration.

4 providers, 100+ models, live streaming

4 Providers, 100+ Models

OpenAI (GPT-4o, GPT-4o-mini, o1, o3), Anthropic (Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku), Google (Gemini 1.5 Pro, Gemini 1.5 Flash), and OpenRouter (100+ models). Available models are fetched dynamically, so the list stays current.

OpenAI: GPT-4o, GPT-4o-mini, o1, o3-mini, GPT-4-turbo
Anthropic: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku
Google: Gemini 1.5 Pro, Gemini 1.5 Flash and all Google GenAI models
OpenRouter: 100+ models (Llama 3, Mixtral, Yi, DeepSeek, and more)

Tool Calls & Structured Output

Pass tool/function schemas to test tool-calling before wiring up real integrations. Pass a JSON schema to enforce structured output and validate compliance immediately. Catch schema mismatches and tool argument errors before they reach production.

Tool definitions: pass function schemas, see tool call arguments and results
Structured output: pass a JSON schema — see if the model complies
Multiple tool calls: test models that invoke multiple tools in sequence
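
To make the tool-schema and structured-output checks concrete, here is a minimal sketch of a function schema and a compliance check on the arguments a model returns. The get_weather tool, the flat schema shape, and checkAgainstSchema are all hypothetical; the Playground's actual input format and validation logic are not described above.

```typescript
// Illustrative shapes only: a flat object schema, not full JSON Schema.
type JsonType = "string" | "number" | "boolean";

interface ObjectSchema {
  type: "object";
  properties: Record<string, { type: JsonType; description?: string }>;
  required: string[];
}

interface ToolDefinition {
  name: string;
  description: string;
  parameters: ObjectSchema;
}

// Hypothetical example tool.
const getWeather: ToolDefinition = {
  name: "get_weather",
  description: "Look up the current weather for a city",
  parameters: {
    type: "object",
    properties: {
      city: { type: "string" },
      units: { type: "string" },
    },
    required: ["city"],
  },
};

// Compares parsed tool arguments (or structured output) with the schema and
// returns a list of problems; an empty list means the value complies.
function checkAgainstSchema(schema: ObjectSchema, value: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const key of schema.required) {
    if (!(key in value)) problems.push(`missing required property "${key}"`);
  }
  for (const [key, v] of Object.entries(value)) {
    const spec = schema.properties[key];
    if (!spec) {
      problems.push(`unexpected property "${key}"`);
      continue;
    }
    if (typeof v !== spec.type) {
      problems.push(`property "${key}" should be ${spec.type}, got ${typeof v}`);
    }
  }
  return problems;
}
```

Running this kind of check on model output is the sort of schema mismatch the section describes catching before production.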

Streaming & Config Overrides

Stream completions token by token — identical to production streaming behavior. Override temperature, max_tokens, top_p, and any provider-specific parameter to fine-tune behavior in the sandbox.

Real-time token streaming — same experience as production
Config overrides: temperature, max_tokens, top_p, and provider params
Text mode and Chat mode (system/user/assistant/tool message array)
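
The sketch below shows one way to picture a Chat mode run: a message array plus sandbox overrides layered on default parameters. The type names, the default values, and applyOverrides are assumptions for illustration; only the role names and the temperature, max_tokens, and top_p parameters come from the text above.

```typescript
type Role = "system" | "user" | "assistant" | "tool";

interface ChatMessage {
  role: Role;
  content: string;
}

interface GenerationConfig {
  temperature: number;
  max_tokens: number;
  top_p: number;
}

// Hypothetical defaults; real defaults depend on the provider and model.
const DEFAULT_CONFIG: GenerationConfig = { temperature: 1.0, max_tokens: 1024, top_p: 1.0 };

// Returns a new config in which explicitly set overrides win over the base
// values. The base object is left unmodified.
function applyOverrides(base: GenerationConfig, overrides: Partial<GenerationConfig>): GenerationConfig {
  const merged: GenerationConfig = { ...base };
  for (const key of Object.keys(overrides) as (keyof GenerationConfig)[]) {
    const value = overrides[key];
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

// Chat mode input: an ordered array of role-tagged messages.
const messages: ChatMessage[] = [
  { role: "system", content: "You are a concise support assistant." },
  { role: "user", content: "How do I reset my password?" },
];

// A sandbox run lowers the temperature and keeps every other default.
const run = { messages, config: applyOverrides(DEFAULT_CONFIG, { temperature: 0.2 }) };
```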

Guardrail Integration

Apply your project's guardrails to Playground runs. The same PII, toxicity, topic boundary, and custom rules that run in production run in the sandbox. Test prompt safety interactively before deploying.

applyGuardrails: true — applies all project guardrails to playground runs
See block/warn/redact behavior before any prompt reaches production
Test guardrail rules against new prompts without a live request
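
To illustrate the block/warn/redact distinction, here is a toy rule evaluator. It is not Zespan's guardrail engine: the rule shape, the regular expressions, and applyRules are invented for this sketch, and real PII or toxicity detection is considerably more involved.

```typescript
type Action = "block" | "warn" | "redact";

interface Rule {
  name: string;
  pattern: RegExp;
  action: Action;
}

interface GuardrailResult {
  allowed: boolean;   // false when a block rule matched
  text: string;       // text after redaction (empty when blocked)
  warnings: string[]; // messages from warn and block rules
}

// Hypothetical rules, one per action.
const rules: Rule[] = [
  { name: "email-pii", pattern: /[\w.+-]+@[\w-]+\.[\w.-]+/, action: "redact" },
  { name: "medical-topic", pattern: /\bdiagnos(e|is)\b/i, action: "warn" },
  { name: "secret-leak", pattern: /\bsk-[A-Za-z0-9]{8,}\b/, action: "block" },
];

// Applies rules in order: redact rewrites the text, warn records a message,
// and block stops processing and rejects the input.
function applyRules(text: string, ruleSet: Rule[]): GuardrailResult {
  let current = text;
  const warnings: string[] = [];
  for (const rule of ruleSet) {
    // Build a global copy so that replace() covers every match.
    const flags = rule.pattern.flags.includes("g") ? rule.pattern.flags : rule.pattern.flags + "g";
    const re = new RegExp(rule.pattern.source, flags);
    if (current.search(re) === -1) continue;
    if (rule.action === "block") {
      return { allowed: false, text: "", warnings: [...warnings, `blocked by ${rule.name}`] };
    }
    if (rule.action === "warn") {
      warnings.push(`warning from ${rule.name}`);
    } else {
      current = current.replace(re, "[REDACTED]");
    }
  }
  return { allowed: true, text: current, warnings };
}
```

Trying a new prompt against rules like these in a sandbox shows which action fires before any live request is involved.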

Get started

Set up in under 5 minutes

```typescript
// Playground is in-product; no SDK setup is required.
// Access it from the sidebar: Playground.

// What you can test:
// - Text mode: single string prompt
// - Chat mode: multi-turn message array (system / user / assistant / tool)
// - Tool definitions: pass function schemas to test tool-calling
// - JSON schema output: validate structured output compliance
// - Guardrails: apply project guardrails to sandbox runs
```


Frequently asked

Do Playground runs appear in my trace data?

Yes. Playground runs are traced like any other LLM call. You can find them in the Trace Explorer filtered by environment=playground or operation=playground-run.
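
If you work with trace data outside the Trace Explorer, the same filter can be applied in code. The TraceRecord shape below is an assumption; only the environment=playground and operation=playground-run values come from the answer above.

```typescript
// Hypothetical minimal shape of a trace record.
interface TraceRecord {
  id: string;
  environment: string;
  operation: string;
}

// A trace counts as a Playground run if either documented filter matches.
function isPlaygroundRun(trace: TraceRecord): boolean {
  return trace.environment === "playground" || trace.operation === "playground-run";
}

// Splits traces so that sandbox experiments can be excluded from
// production metrics or inspected on their own.
function splitTraces(traces: TraceRecord[]): { playground: TraceRecord[]; other: TraceRecord[] } {
  return {
    playground: traces.filter(isPlaygroundRun),
    other: traces.filter((t) => !isPlaygroundRun(t)),
  };
}
```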

Do I need API keys for each provider?

Yes. Each provider (OpenAI, Anthropic, Google) requires its own API key, which you configure in Project Settings → Providers. Zespan does not route requests through provider keys of its own.

What's the difference between Chat mode and Text mode?

Text mode is a single string prompt — equivalent to a completion or a system prompt. Chat mode is a multi-turn message array with system, user, assistant, and tool roles — equivalent to the chat completions API. Use chat mode to test multi-turn conversations and system prompt behavior.

Can I test a prompt in the Playground before promoting it to production?

Yes, and this is the intended workflow. Load the prompt version from Prompt Management into the Playground, test it with guardrails enabled, and if it passes, promote it to the production label. The Playground is your manual safety check; Simulations are your automated check.
