Zespan is an AI agent observability and engineering platform. It traces every agent decision, tool call, handoff, and delegation in production. It also provides prompt versioning, built-in LLM-as-judge evaluations, guardrails, cost optimization, and an AI ops assistant called ZespanPilot.

How do I instrument my AI agent with Zespan?

Instrumenting an agent takes 2 lines of code: import zespan and call zespan.init({ apiKey: process.env.ZESPAN_API_KEY }). This auto-patches OpenAI, Anthropic, Gemini, Bedrock, and Mistral. For framework-level tracing, add one handler: ZespanCallbackHandler for LangChain, ZespanCrewAIListener for CrewAI, or ZespanADKHandler for Google ADK.

Does Zespan support prompt versioning?

Yes. Zespan includes prompt management with versioning, a playground for iteration, and A/B testing to compare prompt versions against each other in production.

What evaluations does Zespan support?

Zespan ships 12 built-in LLM-as-judge evaluation templates, including faithfulness, relevance, toxicity, and groundedness. Evaluations run automatically on every trace, with no custom scoring functions required.

How does Zespan compare to Langfuse?

Zespan is agent-native: every span carries agent identity, delegations are first-class trace events, and an agent map is built automatically. Langfuse was built for LLM pipelines and extended to agents later. Zespan also ships 12 built-in eval templates (Langfuse has none), includes an AI cost optimizer, and provides ZespanPilot for AI ops. Langfuse offers open-source self-hosting; Zespan does not.

What is the free tier for Zespan?

The free tier includes 10,000 traces per month, 14-day retention, 2 projects, and 1 seat. No credit card required.

Feature — Simulations

Test your AI app against real data before deploying.

Run up to 100 scenarios per batch against named datasets. Turn production failures into regression tests in one click.

Prompt, HTTP, and conversation scenarios. Batch runs. Trace-to-dataset conversion.

Works with: Prompt scenarios, HTTP scenarios, Conversation scenarios, LLM evaluator, Batch runs, Trace-to-dataset

100 scenarios per batch
500 items per dataset
3 scenario types

3 Scenario Types

Prompt scenarios run a template against input items. HTTP scenarios call an external endpoint and evaluate the response. Conversation scenarios simulate multi-turn exchanges end to end. Attach a default dataset and custom LLM evaluator per scenario.

Prompt: run a template against each dataset item
HTTP: call an endpoint and assert on the response
Conversation: multi-turn simulation with assertions per turn
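
One way to picture the three scenario types is as a discriminated union. The sketch below is illustrative only: the field names (`template`, `url`, `turns`, `datasetName`) and the `describeScenario` helper are assumptions made for this example, not Zespan's actual schema or SDK.

```typescript
// Hypothetical shapes for illustration; not the Zespan SDK's real types.
type PromptScenario = { type: "prompt"; name: string; template: string; datasetName?: string };
type HttpScenario = { type: "http"; name: string; url: string; datasetName?: string };
type ConversationScenario = { type: "conversation"; name: string; turns: string[]; datasetName?: string };

type Scenario = PromptScenario | HttpScenario | ConversationScenario;

// Summarize what each scenario type does, narrowing on the `type` tag.
function describeScenario(s: Scenario): string {
  switch (s.type) {
    case "prompt":
      return `${s.name}: run the template against each dataset item`;
    case "http":
      return `${s.name}: send each dataset item to ${s.url}`;
    case "conversation":
      return `${s.name}: simulate ${s.turns.length} turns end to end`;
  }
}
```

Tagging each scenario with a `type` field lets the compiler check that every scenario type is handled wherever scenarios are processed.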

[Screenshot: Zespan simulations view showing scenario list with types, statuses, and evaluator config]

Datasets

Named datasets hold your test inputs with optional expected outputs and metadata. Add up to 500 items per call. Datasets persist per project and can be reused across multiple simulation scenarios.

Named datasets: per-project, browsable, deletable
Items: input, expectedOutput (optional), metadata (optional)
500 items per call — bulk import supported
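
Because each call accepts at most 500 items, a larger import has to be split first. The following is a minimal sketch of that split; the `DatasetItem` shape mirrors the fields listed above, while `chunkItems` and `MAX_ITEMS_PER_CALL` are hypothetical names introduced for this example.

```typescript
// Item shape taken from the field list above.
interface DatasetItem {
  input: string;
  expectedOutput?: string;
  metadata?: Record<string, unknown>;
}

const MAX_ITEMS_PER_CALL = 500; // per-call limit stated above

// Split a large import into chunks that each fit in a single call.
function chunkItems(items: DatasetItem[], size: number = MAX_ITEMS_PER_CALL): DatasetItem[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: DatasetItem[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}
```

Each chunk could then be passed to a separate add-items call, one after another.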

Trace-to-Dataset

Select any production trace and add it to a dataset in one click. Turn real failures, edge cases, and high-cost outliers into regression tests without copy-pasting. Build test coverage from incidents as they happen.

From Trace Explorer: select traces → 'Add to dataset'
Mark expected output: annotate what the correct answer should have been
Instant regression suite: production failures become test cases automatically
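
Conceptually, trace-to-dataset conversion maps a recorded trace onto a dataset item. The sketch below shows one plausible mapping; the `TraceRecord` fields and the `traceToDatasetItem` helper are assumptions for illustration, and the real conversion happens in the Zespan UI.

```typescript
// Hypothetical trace shape; real traces carry many more fields.
interface TraceRecord {
  id: string;
  input: string;
  output: string;
  latencyMs: number;
  costUsd: number;
}

interface DatasetItem {
  input: string;
  expectedOutput?: string;
  metadata?: Record<string, unknown>;
}

// Turn a trace into a regression test case. The observed output is kept only
// as metadata, because a failing trace's output is not the correct answer;
// the annotated correct answer becomes expectedOutput when one is supplied.
function traceToDatasetItem(trace: TraceRecord, correctedOutput?: string): DatasetItem {
  const item: DatasetItem = {
    input: trace.input,
    metadata: {
      sourceTraceId: trace.id,
      observedOutput: trace.output,
      latencyMs: trace.latencyMs,
      costUsd: trace.costUsd,
    },
  };
  if (correctedOutput !== undefined) {
    item.expectedOutput = correctedOutput;
  }
  return item;
}
```

Keeping the source trace ID in metadata makes it possible to trace a failing regression test back to the incident that produced it.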

Batch Runs & Progress

Run up to 100 scenarios in a single batch with a full dataset as input. Runs execute asynchronously with real-time progress tracking. Tag batches with experiment labels for comparison across changes.

Up to 100 scenarios per batch, dataset-fed
Real-time progress: refreshBatchRunProgress for live status
Experiment labels: tag batches for grouping and head-to-head comparison
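
To make the batch limit and label-based comparison concrete, here is a small sketch. The `RunResult` shape and both helper functions are hypothetical and only model the behavior described above; they are not part of the Zespan SDK.

```typescript
// Hypothetical record for one scenario run inside a batch.
interface RunResult {
  scenario: string;
  experimentLabel: string;
  passed: boolean;
}

const MAX_SCENARIOS_PER_BATCH = 100; // batch limit stated above

// Reject empty or oversized batches before submitting them.
function validateBatch(scenarioNames: string[]): void {
  if (scenarioNames.length === 0 || scenarioNames.length > MAX_SCENARIOS_PER_BATCH) {
    throw new RangeError(
      `a batch needs 1 to ${MAX_SCENARIOS_PER_BATCH} scenarios, got ${scenarioNames.length}`
    );
  }
}

// Compute the pass rate per experiment label for head-to-head comparison.
function passRateByLabel(results: RunResult[]): Record<string, number> {
  const counts: Record<string, { passed: number; total: number }> = {};
  for (const r of results) {
    const c = counts[r.experimentLabel] ?? { passed: 0, total: 0 };
    c.total += 1;
    if (r.passed) c.passed += 1;
    counts[r.experimentLabel] = c;
  }
  const rates: Record<string, number> = {};
  for (const [label, c] of Object.entries(counts)) {
    rates[label] = c.passed / c.total;
  }
  return rates;
}
```

Comparing the pass rates of two labels, such as an old and a new prompt version, gives a quick read on whether a change helped or hurt.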

Get started

Set up in under 5 minutes

```typescript
// Create a dataset via API
import { Zespan } from '@zespan/sdk';

// Read the API key from the environment rather than hard-coding it.
const lt = new Zespan({ apiKey: process.env.ZESPAN_API_KEY });

// Add items to the named dataset; expectedOutput is optional per item.
await lt.datasets.addItems('my-regression-suite', [
  { input: 'How do I reset my password?', expectedOutput: '...' },
  { input: 'Cancel my subscription', expectedOutput: '...' },
]);

// Or create from production traces in one click in the UI:
// Trace Explorer → select traces → "Add to dataset"
```

Frequently asked

What assertions can I configure per scenario?

Each scenario supports: contains (output must include a string), not_contains (output must not include a string), regex (output matches a pattern), and max_latency_ms (response must arrive within N milliseconds). You can also attach a custom LLM evaluator for scoring beyond simple assertions.
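
The four assertion kinds can be modeled as a small evaluator. This is a sketch under stated assumptions: the assertion names come from the answer above, but the config shape (`kind`, `value`, `pattern`, `limit`) and the helper functions are hypothetical.

```typescript
// The four assertion kinds named above; the config shape is an assumption.
type Assertion =
  | { kind: "contains"; value: string }
  | { kind: "not_contains"; value: string }
  | { kind: "regex"; pattern: string }
  | { kind: "max_latency_ms"; limit: number };

interface ScenarioOutput {
  text: string;
  latencyMs: number;
}

// Evaluate a single assertion against one scenario output.
function checkAssertion(a: Assertion, out: ScenarioOutput): boolean {
  switch (a.kind) {
    case "contains":
      return out.text.includes(a.value);
    case "not_contains":
      return !out.text.includes(a.value);
    case "regex":
      return new RegExp(a.pattern).test(out.text);
    case "max_latency_ms":
      return out.latencyMs <= a.limit;
  }
}

// Return the assertions that failed; an empty array means the output passed.
function failedAssertions(assertions: Assertion[], out: ScenarioOutput): Assertion[] {
  return assertions.filter((a) => !checkAssertion(a, out));
}
```

Returning the failed assertions, rather than a single boolean, makes it clear which rule a response broke.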

How do I run a simulation before deploying a prompt change?

Create a dataset from production traces (or manually). Set up a prompt scenario with the new prompt version and your regression dataset. Run the batch — if any assertions fail or eval scores drop, you see it before any code ships. This is your LLM CI/CD pipeline.
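
The gating step can be sketched as a comparison between a baseline batch and a candidate batch. Everything here is an assumption for illustration: the `BatchSummary` shape, the `regressionReasons` helper, and the 0.02 score tolerance are not Zespan defaults.

```typescript
// Hypothetical summary of one batch run; field names are assumptions.
interface BatchSummary {
  failedAssertions: number;
  evalScores: Record<string, number>; // evaluator name -> mean score
}

// List the reasons a candidate prompt version should not ship.
// An empty list means no assertion failed and no eval score dropped
// by more than the tolerance.
function regressionReasons(
  baseline: BatchSummary,
  candidate: BatchSummary,
  tolerance: number = 0.02
): string[] {
  const reasons: string[] = [];
  if (candidate.failedAssertions > 0) {
    reasons.push(`${candidate.failedAssertions} assertion(s) failed`);
  }
  for (const [name, before] of Object.entries(baseline.evalScores)) {
    const after = candidate.evalScores[name];
    if (after === undefined) {
      reasons.push(`${name}: score missing in candidate run`);
    } else if (before - after > tolerance) {
      reasons.push(`${name} dropped from ${before} to ${after}`);
    }
  }
  return reasons;
}
```

In a CI job, a non-empty result would typically fail the build so the prompt change is blocked before it ships.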

What's the difference between a simulation run and an evaluation run?

Simulation runs test a specific scenario against your own application endpoint or prompt template — they're end-to-end tests you control. Evaluation runs score existing traces using an LLM judge — they measure quality after the fact. They're complementary: simulations for pre-deploy testing, evaluations for ongoing production monitoring.

Can I run simulations against a live HTTP endpoint?

Yes. HTTP scenarios call any URL you configure, send the dataset item as input, receive the response, and evaluate it with your assertion config or custom LLM evaluator. Useful for testing a staging environment before promoting to production.
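
The flow of an HTTP scenario for a single dataset item can be sketched as follows. The transport is injected so the example runs without a network; the `Send` type, the `runHttpItem` helper, and the `{ input }` request body are assumptions, not the request format Zespan actually sends.

```typescript
// Injected transport: takes a URL and a request body, resolves to the response text.
type Send = (url: string, body: string) => Promise<string>;

interface HttpResult {
  text: string;
  latencyMs: number;
}

// Send one dataset item to the endpoint and record the response and latency.
async function runHttpItem(url: string, input: string, send: Send): Promise<HttpResult> {
  const started = Date.now();
  const text = await send(url, JSON.stringify({ input }));
  return { text, latencyMs: Date.now() - started };
}
```

In practice `send` would wrap a real HTTP client pointed at a staging URL, and the returned text and latency would then be checked against the scenario's assertions or LLM evaluator.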
