Zespan is an AI agent observability and engineering platform. It traces every agent decision, tool call, handoff, and delegation in production. It also provides prompt versioning, built-in LLM-as-judge evaluations, guardrails, cost optimization, and an AI ops assistant called ZespanPilot.

How do I instrument my AI agent with Zespan?

Instrumentation takes two lines of code: import zespan, then call zespan.init({ apiKey: process.env.LT_KEY }). This auto-patches OpenAI, Anthropic, Gemini, Bedrock, and Mistral. For framework-level tracing, add one handler: ZespanCallbackHandler for LangChain, ZespanCrewAIListener for CrewAI, or ZespanADKHandler for Google ADK.
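
As a minimal sketch of the setup step, assuming the API key lives in the LT_KEY environment variable as in the call above: the helper below fails fast when the key is missing, so a misconfigured deployment does not silently run untraced. readApiKey is a hypothetical name and not part of Zespan, and the import form in the trailing comment is an assumption; only zespan.init({ apiKey }) comes from the text.

```typescript
// Hypothetical helper, not part of the Zespan SDK: validate the key before init.
type Env = Record<string, string | undefined>;

function readApiKey(env: Env, name: string = "LT_KEY"): string {
  // Treat an unset, empty, or whitespace-only value as missing.
  const key = env[name]?.trim();
  if (!key) {
    throw new Error(`Missing ${name}: set it before initializing tracing`);
  }
  return key;
}

// Usage (assumed import form; check the SDK docs for the exact one):
//   import zespan from "zespan";
//   zespan.init({ apiKey: readApiKey(process.env) });
```

Passing the environment in as a parameter keeps the check easy to unit test without touching the real process environment.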

Does Zespan support prompt versioning?

Yes. Zespan includes prompt management with versioning, a playground for iteration, and A/B testing to compare prompt versions against each other in production.

What evaluations does Zespan support?

Zespan ships 12 built-in LLM-as-judge evaluation templates, including faithfulness, relevance, toxicity, and groundedness. Evaluations run automatically on every trace with no custom scoring functions required.

How does Zespan compare to Langfuse?

Zespan is agent-native: every span carries agent identity, delegations are first-class trace events, and an agent map is built automatically. Langfuse was built for LLM pipelines and extended to agents later. Zespan also ships 12 built-in eval templates (Langfuse has none), an AI cost optimizer, and ZespanPilot for AI ops. Langfuse has open-source self-hosting; Zespan does not.

What is the free tier for Zespan?

The free tier includes 10,000 traces per month, 14-day retention, 2 projects, and 1 seat. No credit card required.

Use Case — RAG

Find where your RAG pipeline breaks — retrieval, reranking, or synthesis.

Hallucinations don't appear from nowhere. They come from bad retrievals, ignored context, or prompts that drift. Zespan shows you exactly which step is failing.

For teams building retrieval-augmented generation systems

The problem

Hallucinations with no trail

You can see a bad output but not the retrieved context that caused it. Was it a poor retrieval, a bad prompt, or the model ignoring context? You can't tell.

Retrieval quality is a black box

You don't know your average retrieval score, how many chunks are retrieved, or whether reranking actually improves results at scale.

Quality degrades silently

Faithfulness scores slip as your document base grows or prompts change. You find out from user reports, not metrics.

How to use Zespan for this

Open Trace Explorer — see every RAG pipeline run

In Trace Explorer, filter by operation=rag or by the model your pipeline uses. Every pipeline run is a trace. Click any one to open the span waterfall — query encoding, vector retrieval, reranking, and LLM synthesis each appear as a separate timed span.

zespan.com

Zespan trace explorer filtered to RAG operations showing latency, cost, and status

Inspect the span waterfall — see retrieval vs synthesis split

In the span detail view, the waterfall breaks your pipeline into steps. You can see if retrieval is taking 800ms while synthesis takes 200ms, or vice versa. Check retrieved chunk counts and scores per retrieval span to identify when low-score chunks enter the context window.
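
The same split can be computed offline once span data is exported. The sketch below assumes a hypothetical Span shape (name, durationMs, chunkScores); the real export format may differ. It reports each step's share of total latency and lists retrieved chunks whose score falls below a cutoff.

```typescript
// Hypothetical span shape for illustration; real exported fields may differ.
interface Span {
  name: string;           // e.g. "retrieval", "rerank", "synthesis"
  durationMs: number;     // wall-clock duration of the step
  chunkScores?: number[]; // similarity scores, present on retrieval spans
}

// Fraction of total pipeline time spent in each step, keyed by span name.
function latencyShare(spans: Span[]): Record<string, number> {
  const total = spans.reduce((sum, s) => sum + s.durationMs, 0);
  const share: Record<string, number> = {};
  for (const s of spans) {
    const fraction = total === 0 ? 0 : s.durationMs / total;
    share[s.name] = (share[s.name] ?? 0) + fraction;
  }
  return share;
}

// Scores of retrieved chunks that fall below minScore.
function lowScoreChunks(spans: Span[], minScore: number): number[] {
  const low: number[] = [];
  for (const s of spans) {
    for (const score of s.chunkScores ?? []) {
      if (score < minScore) low.push(score);
    }
  }
  return low;
}

// Illustrative trace using the 800ms / 200ms split mentioned above.
const exampleTrace: Span[] = [
  { name: "retrieval", durationMs: 800, chunkScores: [0.91, 0.42, 0.77] },
  { name: "synthesis", durationMs: 200 },
];

console.log(latencyShare(exampleTrace));        // retrieval dominates at about 80%
console.log(lowScoreChunks(exampleTrace, 0.5)); // the single 0.42 chunk
```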

Zespan span waterfall showing RAG pipeline with retrieval, reranking, and synthesis spans

Check Evaluations — faithfulness on every response

Open Evaluations. If you've enabled the faithfulness auto-evaluator, every RAG response has a score from 0.0 to 1.0: 1.0 means the answer is fully grounded in retrieved context; 0.0 means none of it is, so the answer came from model memory. Filter to faithfulness < 0.7 to find the responses most likely to contain hallucinations.
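
The threshold filter is simple enough to reproduce on exported scores. A minimal sketch, assuming a hypothetical EvalResult record (traceId and faithfulness are illustrative field names, not a documented schema):

```typescript
// Hypothetical evaluation record; field names are assumptions for illustration.
interface EvalResult {
  traceId: string;
  faithfulness: number; // 0.0 to 1.0, as described above
}

// Return results scoring strictly below the threshold, worst first.
function belowThreshold(results: EvalResult[], threshold: number = 0.7): EvalResult[] {
  return results
    .filter((r) => r.faithfulness < threshold)        // filter returns a new array
    .sort((a, b) => a.faithfulness - b.faithfulness); // so sorting leaves the input untouched
}

// Made-up sample scores.
const sampleResults: EvalResult[] = [
  { traceId: "t1", faithfulness: 0.95 },
  { traceId: "t2", faithfulness: 0.4 },
  { traceId: "t3", faithfulness: 0.7 },
  { traceId: "t4", faithfulness: 0.1 },
];

const flagged = belowThreshold(sampleResults).map((r) => r.traceId);
console.log(flagged); // t4 then t2; t3 sits exactly at 0.7 and is not flagged
```

Sorting worst-first puts the traces most worth inspecting at the top of the review queue.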

Zespan evaluations showing faithfulness and relevance scores trending over time

Click into failing evaluations — trace the bad retrieval

In the evaluation detail view, click any low-faithfulness trace. You'll see the user question, the retrieved chunks, the synthesized answer, and the judge's reasoning. The gap between the retrieved context and the answer tells you exactly what broke — bad retrieval, low-quality chunks, or the model ignoring context.

Zespan evaluation detail view showing faithfulness score with retrieved context and judge reasoning

Add failing traces to a dataset — regression tests from real failures

Found a trace where retrieval went wrong? Click 'Add to dataset'. Build a dataset of real failures. Next time you change your retrieval config or prompt, run a batch simulation against this dataset — assertions catch regressions before any user sees them.
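
To make the regression idea concrete, here is a sketch of a dataset of failing questions replayed against a pipeline, with a simple assertion per case. RegressionCase, Pipeline, and runRegression are hypothetical names for illustration and do not describe Zespan's simulation API. The pipeline is modeled as a synchronous function to keep the example short; a real one would usually be asynchronous.

```typescript
// Hypothetical dataset case built from a failing trace.
interface RegressionCase {
  question: string;
  mustMention: string[]; // phrases a grounded answer should contain
}

type Pipeline = (question: string) => string;

// Run every case through the pipeline and collect the questions that fail.
function runRegression(cases: RegressionCase[], pipeline: Pipeline): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const answer = pipeline(c.question).toLowerCase();
    const ok = c.mustMention.every(
      (phrase) => answer.indexOf(phrase.toLowerCase()) !== -1
    );
    if (!ok) failures.push(c.question);
  }
  return failures;
}

// Made-up cases standing in for traces added to a dataset.
const dataset: RegressionCase[] = [
  { question: "What is the refund window?", mustMention: ["30 days"] },
  { question: "Which plan includes SSO?", mustMention: ["enterprise"] },
];

// Stub standing in for the real RAG pipeline.
const stubPipeline: Pipeline = (q) =>
  q.indexOf("refund") !== -1 ? "Refunds are accepted within 30 days." : "I am not sure.";

const failures = runRegression(dataset, stubPipeline);
console.log(failures); // only the SSO question fails with this stub
```

Substring assertions are deliberately crude; they catch an answer that loses a key fact after a retrieval or prompt change, while judgment-based checks are left to the evaluators described earlier.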

Zespan trace-to-dataset showing a failing RAG trace added as a regression test case

Zespan features used

Tracing & Observability, Evaluations, Simulations, Prompt Management

Start free — 10K traces/month, no card needed

See every agent decision, tool call, and handoff in production. Setup takes under 5 minutes.
