Zespan is an AI agent observability and engineering platform. It traces every agent decision, tool call, handoff, and delegation in production. It also provides prompt versioning, built-in LLM-as-judge evaluations, guardrails, cost optimization, and an AI ops assistant called ZespanPilot.

How do I instrument my AI agent with Zespan?

Zespan requires 2 lines of code. Import zespan and call zespan.init({ apiKey: process.env.LT_KEY }). This auto-patches OpenAI, Anthropic, Gemini, Bedrock, and Mistral. For framework-level tracing, add one handler: ZespanCallbackHandler for LangChain, ZespanCrewAIListener for CrewAI, or ZespanADKHandler for Google ADK.

Does Zespan support prompt versioning?

Yes. Zespan includes prompt management with versioning, a playground for iteration, and A/B testing to compare prompt versions against each other in production.

What evaluations does Zespan support?

Zespan ships 12 built-in LLM-as-judge evaluation templates, including faithfulness, relevance, toxicity, and groundedness. Once auto-evaluators are enabled, evaluations run automatically on new traces with no custom scoring functions required.

How does Zespan compare to Langfuse?

Zespan is agent-native: every span carries agent identity, delegations are first-class trace events, and an agent map is built automatically. Langfuse was built for LLM pipelines and extended to agents later. Zespan also ships 12 built-in eval templates (Langfuse has none), an AI cost optimizer, and ZespanPilot for AI ops. Langfuse offers open-source self-hosting; Zespan does not.

What is the free tier for Zespan?

The free tier includes 10,000 traces per month, 14-day retention, 2 projects, and 1 seat. No credit card required.

Feature — Evaluations

Measure output quality on every trace — automatically.

12 built-in LLM-as-judge templates run on every new trace once auto-evaluators are enabled in project settings. Track quality trends, catch regressions, and run manual eval campaigns.

Auto-evaluators, LLM judge, manual runs, metric timelines — on every plan.

Works with: GPT-4o judge, Claude judge, Gemini judge, custom LLM, manual runs, datasets

12 built-in templates · 200 metric keys · 0–1 sample rate

Auto-Evaluators

Enable auto-evaluators in project settings and every new trace gets scored automatically — no manual trigger. Configure sample rate (0–1) for high-volume projects and filter to specific models, operations, or statuses.

Runs on every new trace — zero manual work after initial config
Sample rate: evaluate 100% or a fraction of traffic
filterModels, filterOps, filterStatuses: scope to where quality matters most
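
The sampling and filtering rules above can be pictured as a small predicate. This is a minimal sketch, not Zespan's implementation: only sampleRate, filterModels, filterOps, and filterStatuses come from the text, while the TraceInfo shape, the AutoEvalConfig type, and the shouldEvaluate function are hypothetical names chosen for illustration. It also assumes that an empty or missing filter means "no restriction".

```typescript
// Hypothetical shapes: only sampleRate, filterModels, filterOps and
// filterStatuses are named in the text; everything else is assumed.
interface TraceInfo {
  model: string;
  operation: string;
  status: string;
}

interface AutoEvalConfig {
  sampleRate: number; // 0 = evaluate nothing, 1 = evaluate everything
  filterModels?: string[]; // assumed: empty or missing means no restriction
  filterOps?: string[];
  filterStatuses?: string[];
}

function passesFilter(value: string, allowed?: string[]): boolean {
  return !allowed || allowed.length === 0 || allowed.includes(value);
}

function shouldEvaluate(
  trace: TraceInfo,
  config: AutoEvalConfig,
  random: () => number = Math.random,
): boolean {
  if (config.sampleRate < 0 || config.sampleRate > 1) {
    throw new RangeError("sampleRate must be between 0 and 1");
  }
  const inScope =
    passesFilter(trace.model, config.filterModels) &&
    passesFilter(trace.operation, config.filterOps) &&
    passesFilter(trace.status, config.filterStatuses);
  // random() returns a value in [0, 1), so a rate of 1 always samples
  // an in-scope trace and a rate of 0 never does.
  return inScope && random() < config.sampleRate;
}
```

Passing the random source as a parameter keeps the decision deterministic in tests; in normal use the default Math.random applies.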

[Screenshot: Zespan auto-evaluator configuration with template selection and sample rate]

12 Built-In Templates

Pre-built LLM-as-judge templates for the most common quality dimensions: correctness, faithfulness, relevance, toxicity, conciseness, coherence, and more. Each template is configurable per project with custom thresholds.

Quality: correctness, coherence, completeness, conciseness
Safety: toxicity, PII leakage detection, harmful content
RAG-specific: faithfulness (grounded in context?), relevance (answers the question?)
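
To illustrate what a per-project threshold might mean in practice, here is a rough sketch that compares a trace's scores against minimum thresholds. The failingMetrics function and its types are hypothetical, and the sketch assumes every metric is scored so that higher is better; a real toxicity metric may well be scored the other way around.

```typescript
// Hypothetical helper: metric names are from the template list, but the
// threshold semantics (a score below the minimum fails) are an assumption.
type MetricScores = Record<string, number>;

function failingMetrics(
  scores: MetricScores,
  minimums: MetricScores,
): string[] {
  const failing: string[] = [];
  for (const [metric, minimum] of Object.entries(minimums)) {
    const score = scores[metric];
    // Metrics with no recorded score are skipped rather than failed.
    if (score !== undefined && score < minimum) {
      failing.push(metric);
    }
  }
  return failing.sort();
}

// Example: a trace that is relevant but poorly grounded.
const flagged = failingMetrics(
  { faithfulness: 0.62, relevance: 0.91 },
  { faithfulness: 0.8, relevance: 0.7, coherence: 0.7 },
);
```

Here flagged contains only "faithfulness": relevance clears its minimum, and coherence has no score to compare.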

Evaluation Detail & Trends

The evaluation detail view shows per-trace scores with input, output, and the judge's reasoning. Scores trend over time — bucketed by configurable intervals — so you see quality drift as it starts, not weeks later.

Per-trace view: score, judge reasoning, input, output
Timeline view: metric scores bucketed by hour, day, or week
Up to 200 metric keys tracked simultaneously
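
The timeline view can be thought of as grouping scores into fixed-width time buckets and averaging each one. The sketch below shows one way to do that. The ScorePoint and Bucket types and the bucketScores function are illustrative names, not part of any Zespan SDK, and the buckets are simple fixed-width UTC windows counted from the Unix epoch, so "week" buckets do not align to calendar weeks.

```typescript
type Interval = "hour" | "day" | "week";

interface ScorePoint {
  timestamp: number; // milliseconds since the Unix epoch
  score: number;
}

interface Bucket {
  start: number; // bucket start, milliseconds since the Unix epoch
  mean: number;
  count: number;
}

const INTERVAL_MS: Record<Interval, number> = {
  hour: 60 * 60 * 1000,
  day: 24 * 60 * 60 * 1000,
  week: 7 * 24 * 60 * 60 * 1000,
};

function bucketScores(points: ScorePoint[], interval: Interval): Bucket[] {
  const size = INTERVAL_MS[interval];
  const sums = new Map<number, { total: number; count: number }>();
  for (const point of points) {
    // Round the timestamp down to the start of its bucket.
    const start = Math.floor(point.timestamp / size) * size;
    const entry = sums.get(start) ?? { total: 0, count: 0 };
    entry.total += point.score;
    entry.count += 1;
    sums.set(start, entry);
  }
  // Return buckets in chronological order with their mean score.
  return Array.from(sums.entries())
    .sort(([a], [b]) => a - b)
    .map(([start, { total, count }]) => ({
      start,
      mean: total / count,
      count,
    }));
}
```

Buckets with no scores are simply absent from the result; a charting layer would need to decide whether to draw those gaps as breaks or interpolate across them.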

[Screenshot: Zespan evaluation detail showing per-trace LLM judge score with reasoning]

Manual Eval Runs

Trigger evaluation runs on demand against a dataset or trace set. Runs execute asynchronously via a background worker. Browse full run history with status, aggregate scores, and timing.

Run against any dataset or trace selection
Async execution — large runs don't block the UI
Run history: full list with status (queued, running, completed, failed)
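
The four run statuses map naturally onto a union type. The following is a small sketch of how client code might model run history and tell finished runs from in-progress ones. The EvalRun shape and both helper functions are hypothetical; only the status names come from the list above.

```typescript
type RunStatus = "queued" | "running" | "completed" | "failed";

// Hypothetical record shape for one entry in the run history.
interface EvalRun {
  id: string;
  status: RunStatus;
  aggregateScore?: number; // assumed to be present only for completed runs
}

// A run is finished once it can no longer change state.
function isFinished(status: RunStatus): boolean {
  return status === "completed" || status === "failed";
}

// Count how many runs are in each status.
function summarizeRuns(history: EvalRun[]): Record<RunStatus, number> {
  const counts: Record<RunStatus, number> = {
    queued: 0,
    running: 0,
    completed: 0,
    failed: 0,
  };
  for (const run of history) {
    counts[run.status] += 1;
  }
  return counts;
}
```

Because runs are asynchronous, a caller would typically re-read a run's status until isFinished returns true rather than waiting on the trigger call itself.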

Get started

Set up in under 5 minutes

TypeScript: Evaluations

// Enable auto-evaluators in Project Settings → Evaluations
// Then scores appear on every trace automatically.

// Or attach an eval score manually from your code:
import { Zespan } from '@zespan/sdk';
const lt = new Zespan({ apiKey: process.env.ZESPAN_API_KEY });

// traceId must hold the ID of an existing trace; it is not defined in this snippet.
await lt.traces.addEvalScore(traceId, {
  metric: 'faithfulness',
  score: 0.92,
  reason: 'Answer matches retrieved context',
});

Frequently asked

Which LLM does Zespan use as the eval judge?

The judge model is configurable per evaluator — you can use GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, or any supported provider. You pay for the judge's tokens at standard provider rates; Zespan doesn't mark up model costs.

Does auto-evaluation add latency to my production traces?

No. Evaluation runs asynchronously after the trace is ingested. It never touches your request path. Your users see no latency from evaluations.

Can I use my own evaluation logic instead of the built-in templates?

Yes. Create custom evaluators with your own metric key, description, and judge prompt. You can also attach eval scores directly from your own code using lt.traces.addEvalScore() — Zespan will store and display them alongside auto-eval scores.

What is faithfulness and why does it matter for RAG?

Faithfulness measures whether the model's answer is grounded in the retrieved context or generated from memory (hallucination). For RAG pipelines, a faithfulness score below your threshold is a signal that retrieved context isn't reaching the model properly or the model is ignoring it.

Can I trigger regression detection when deploying a new prompt?

Yes, and it is automatic. When you promote a prompt version to the production label, Zespan runs a background regression check that compares the new version's eval scores against those from the previous 14 days. If any evaluator drops by more than 10 percentage points, you get a ZespanPilot notification.
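
As a rough illustration of the comparison described here, the sketch below flags evaluators whose mean score fell by more than a given number of percentage points. It is not Zespan's actual check: findRegressions and its types are hypothetical, and it assumes mean scores on a 0 to 1 scale, converted to percentage points by multiplying by 100.

```typescript
// Evaluator name -> mean score, assumed to be on a 0 to 1 scale.
type MeanScores = Record<string, number>;

interface Regression {
  evaluator: string;
  baseline: number;
  current: number;
  dropPoints: number; // drop in percentage points
}

function findRegressions(
  baseline: MeanScores, // e.g. means over the previous 14 days
  current: MeanScores, // means for the newly promoted prompt version
  maxDropPoints = 10,
): Regression[] {
  const regressions: Regression[] = [];
  for (const [evaluator, before] of Object.entries(baseline)) {
    const after = current[evaluator];
    // Evaluators with no current score cannot be compared.
    if (after === undefined) continue;
    const dropPoints = (before - after) * 100;
    if (dropPoints > maxDropPoints) {
      regressions.push({ evaluator, baseline: before, current: after, dropPoints });
    }
  }
  return regressions;
}

// Example: faithfulness falls by about 20 points, relevance by about 5.
const regressions = findRegressions(
  { faithfulness: 0.9, relevance: 0.8 },
  { faithfulness: 0.7, relevance: 0.75 },
);
```

In this example only faithfulness is flagged. Comparing means is the simplest option; a production check would probably also want a minimum sample size so that a handful of traces cannot trigger a notification.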

Setup takes under 5 minutes. Works with OpenAI, Anthropic, LangChain, and more.
