Feature — Simulations
Test your AI app against real data before deploying.
Run up to 100 scenarios per batch against named datasets. Turn production failures into regression tests in one click.
Prompt, HTTP, and conversation scenarios. Batch runs. Trace-to-dataset conversion.

100 scenarios per batch
500 items per dataset
3 scenario types
3 Scenario Types
Prompt scenarios run a template against input items. HTTP scenarios call an external endpoint and evaluate the response. Conversation scenarios simulate multi-turn exchanges end to end. Each scenario can have a default dataset and a custom LLM evaluator attached.
- Prompt: run a template against each dataset item
- HTTP: call an endpoint and assert on the response
- Conversation: multi-turn simulation with assertions per turn
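One way to picture the three scenario types is as a tagged union. The sketch below is illustrative only: the type names and fields (template, url, method, turns) are assumptions made for this example, not the Zespan schema.

```typescript
// Hypothetical shapes: field names are illustrative, not the Zespan schema.
type PromptScenario = { type: 'prompt'; name: string; template: string };
type HttpScenario = { type: 'http'; name: string; url: string; method: 'GET' | 'POST' };
type ConversationScenario = { type: 'conversation'; name: string; turns: string[] };

type Scenario = PromptScenario | HttpScenario | ConversationScenario;

// Summarize what a scenario of each type does when it runs.
function describeScenario(s: Scenario): string {
  switch (s.type) {
    case 'prompt':
      return `${s.name}: run the template against each dataset item`;
    case 'http':
      return `${s.name}: ${s.method} ${s.url} and assert on the response`;
    case 'conversation':
      return `${s.name}: simulate ${s.turns.length} turns with assertions per turn`;
  }
}
```

Modeling the type as a discriminant lets the compiler flag any code path that forgets one of the three cases.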

Datasets
Named datasets hold your test inputs with optional expected outputs and metadata. Add up to 500 items per call. Datasets persist per project and can be reused across multiple simulation scenarios.
- Named datasets: per-project, browsable, deletable
- Items: input, expectedOutput (optional), metadata (optional)
- 500 items per call — bulk import supported
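If an import is larger than the 500-item call limit, it has to be split before upload. A minimal sketch of that split follows; the DatasetItem type mirrors the item fields listed above, while chunkItems and MAX_ITEMS_PER_CALL are hypothetical helper names, not part of the SDK.

```typescript
// Item fields as listed above: input, optional expectedOutput, optional metadata.
type DatasetItem = {
  input: string;
  expectedOutput?: string;
  metadata?: Record<string, unknown>;
};

// The per-call limit stated on this page.
const MAX_ITEMS_PER_CALL = 500;

// Hypothetical helper: split a list into chunks no larger than `size`.
function chunkItems<T>(items: T[], size: number = MAX_ITEMS_PER_CALL): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}
```

Each chunk could then be passed to a separate addItems call. This assumes repeated calls append to the same named dataset, which the page does not state.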

Trace-to-Dataset
Select any production trace and add it to a dataset in one click. Turn real failures, edge cases, and high-cost outliers into regression tests without copy-pasting. Build test coverage from incidents as they happen.
- From Trace Explorer: select traces → 'Add to dataset'
- Mark expected output: annotate what the correct answer should have been
- Instant regression suite: production failures become test cases automatically
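Conceptually, adding a trace to a dataset maps the trace's input to a dataset item and attaches the annotated expected output. The sketch below shows that mapping under assumptions: the Trace shape and the metadata keys (sourceTraceId, observedOutput) are hypothetical, chosen only for illustration.

```typescript
// Hypothetical trace shape; real traces likely carry more fields.
type Trace = { id: string; input: string; output: string };

type DatasetItem = {
  input: string;
  expectedOutput?: string;
  metadata?: Record<string, unknown>;
};

// Build a dataset item from a trace. The expected output is the annotation
// of what the correct answer should have been, not the observed output.
function traceToItem(trace: Trace, expectedOutput?: string): DatasetItem {
  return {
    input: trace.input,
    ...(expectedOutput !== undefined ? { expectedOutput } : {}),
    metadata: { sourceTraceId: trace.id, observedOutput: trace.output },
  };
}
```

Keeping the observed output in metadata preserves the original failure next to the corrected answer, so the regression test documents what went wrong.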

Batch Runs & Progress
Run up to 100 scenarios in a single batch with a full dataset as input. Runs execute asynchronously with real-time progress tracking. Tag batches with experiment labels for comparison across changes.
- Up to 100 scenarios per batch, dataset-fed
- Real-time progress: refreshBatchRunProgress for live status
- Experiment labels: tag batches for grouping and head-to-head comparison
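Because runs are asynchronous, a client typically polls for progress until the batch finishes. The following is a rough sketch of such a loop. The BatchProgress shape is an assumption, and the refresh callback stands in for refreshBatchRunProgress, whose actual signature is not documented on this page.

```typescript
// Hypothetical progress shape; the real payload may differ.
type BatchProgress = { total: number; completed: number; failed: number };

// A batch is finished once every scenario has either completed or failed.
function isFinished(p: BatchProgress): boolean {
  return p.completed + p.failed >= p.total;
}

// Poll an injected refresh function until the batch finishes or the
// polling budget runs out. `refresh` stands in for the real progress call.
async function waitForBatch(
  refresh: () => Promise<BatchProgress>,
  intervalMs: number = 2000,
  maxPolls: number = 300,
): Promise<BatchProgress> {
  for (let i = 0; i < maxPolls; i++) {
    const progress = await refresh();
    if (isFinished(progress)) return progress;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('batch did not finish within the polling budget');
}
```

Bounding the number of polls keeps a stuck batch from hanging a CI job indefinitely.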
Get started
Set up in under 5 minutes
// Create a dataset via API
import { Zespan } from '@zespan/sdk';

const lt = new Zespan({ apiKey: process.env.ZESPAN_API_KEY });

await lt.datasets.addItems('my-regression-suite', [
  { input: 'How do I reset my password?', expectedOutput: '...' },
  { input: 'Cancel my subscription', expectedOutput: '...' },
]);

// Or create from production traces in one click in the UI:
// Trace Explorer → select traces → "Add to dataset"

Frequently asked
What assertions can I configure per scenario?
Each scenario supports: contains (output must include a string), not_contains (output must not include a string), regex (output matches a pattern), and max_latency_ms (response must arrive within N milliseconds). You can also attach a custom LLM evaluator for scoring beyond simple assertions.
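To make the four assertion types concrete, here is a minimal local sketch of how they could be evaluated. The assertion names come from the answer above; the config shape (value, pattern), the RunResult type, and the inclusive latency comparison are assumptions, and the platform's own evaluation may differ.

```typescript
// Assertion names are from this page; the field names are assumed.
type Assertion =
  | { type: 'contains'; value: string }
  | { type: 'not_contains'; value: string }
  | { type: 'regex'; pattern: string }
  | { type: 'max_latency_ms'; value: number };

// Hypothetical result of running one dataset item through a scenario.
type RunResult = { output: string; latencyMs: number };

function checkAssertion(a: Assertion, r: RunResult): boolean {
  switch (a.type) {
    case 'contains':
      return r.output.includes(a.value);
    case 'not_contains':
      return !r.output.includes(a.value);
    case 'regex':
      return new RegExp(a.pattern).test(r.output);
    case 'max_latency_ms':
      // Assumption: "within N milliseconds" is treated as inclusive.
      return r.latencyMs <= a.value;
  }
}

// Return the assertions that did not hold for this result.
function failedAssertions(assertions: Assertion[], r: RunResult): Assertion[] {
  return assertions.filter((a) => !checkAssertion(a, r));
}
```

For example, an output of 'Click "Forgot password" to reset it.' that took 850 ms would pass contains: 'reset' but fail max_latency_ms: 500.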
How do I run a simulation before deploying a prompt change?
Create a dataset from production traces (or manually). Set up a prompt scenario with the new prompt version and your regression dataset. Run the batch — if any assertions fail or eval scores drop, you see it before any code ships. This is your LLM CI/CD pipeline.
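The pre-deploy check described above can be pictured as a small gate function that a CI job runs on the batch results. This is a sketch under assumptions: the ItemResult shape, the baseline score, and the tolerance parameter are all hypothetical and not taken from the product.

```typescript
// Hypothetical per-item result from a finished batch.
type ItemResult = { assertionsPassed: boolean; evalScore: number };

type GateDecision = { ok: boolean; reasons: string[] };

// Block the deploy if any assertion failed or the mean eval score fell
// more than `tolerance` below the baseline from the previous prompt version.
function gateBatch(
  results: ItemResult[],
  baselineMeanScore: number,
  tolerance: number = 0.02,
): GateDecision {
  if (results.length === 0) {
    return { ok: false, reasons: ['batch produced no results'] };
  }
  const reasons: string[] = [];
  const failed = results.filter((r) => !r.assertionsPassed).length;
  if (failed > 0) {
    reasons.push(`${failed} item(s) failed assertions`);
  }
  const mean = results.reduce((sum, r) => sum + r.evalScore, 0) / results.length;
  if (mean < baselineMeanScore - tolerance) {
    reasons.push(`mean eval score ${mean.toFixed(2)} is below baseline ${baselineMeanScore.toFixed(2)}`);
  }
  return { ok: reasons.length === 0, reasons };
}
```

A CI step could exit non-zero when ok is false, which stops the pipeline before the prompt change ships.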
What's the difference between a simulation run and an evaluation run?
Simulation runs test a specific scenario against your own application endpoint or prompt template — they're end-to-end tests you control. Evaluation runs score existing traces using an LLM judge — they measure quality after the fact. They're complementary: simulations for pre-deploy testing, evaluations for ongoing production monitoring.
Can I run simulations against a live HTTP endpoint?
Yes. HTTP scenarios call any URL you configure, send the dataset item as input, receive the response, and evaluate it with your assertion config or custom LLM evaluator. Useful for testing a staging environment before promoting to production.
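The flow described above (send the item, receive the response, measure it) can be approximated locally. The sketch below is not how the platform makes its calls: the JSON body shape ({ input }) and the POST method are assumptions, and the fetch function is injected so the example stays self-contained.

```typescript
// Minimal fetch-like contract so the sketch does not depend on a runtime.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ status: number; text(): Promise<string> }>;

type HttpResult = { status: number; output: string; latencyMs: number };

// Assumed request format: the dataset item's input sent as a JSON body.
function buildRequest(input: string) {
  return {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ input }),
  };
}

// Call the endpoint once and record the status, body, and wall-clock latency.
async function runHttpScenario(
  url: string,
  input: string,
  fetchImpl: FetchLike,
): Promise<HttpResult> {
  const started = Date.now();
  const res = await fetchImpl(url, buildRequest(input));
  const output = await res.text();
  return { status: res.status, output, latencyMs: Date.now() - started };
}
```

The resulting output and latencyMs are the values that assertions such as contains and max_latency_ms would be checked against.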
Start free — 10K traces/month, no card needed
Setup takes under 5 minutes. Works with OpenAI, Anthropic, LangChain, and more.