


How It Works

The evaluation system runs predefined prompts through the agent and scores the output using both deterministic checks and optional LLM-based judging. Each evaluation case specifies:
  • An input question (what the user asks)
  • Expected behavior (which tools should be used, what keywords should appear in the answer, whether citations are expected)
  • Optionally, a reference answer for LLM-based quality scoring
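A minimal sketch of how such a case could be expressed in Python. The field names (question, expected_tools, required_keywords, and so on) are illustrative assumptions, not the actual schema used by the evaluation system:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalCase:
    # Illustrative field names; the real case schema may differ.
    question: str                                              # what the user asks
    expected_tools: list[str] = field(default_factory=list)    # tools that should be called
    forbidden_tools: list[str] = field(default_factory=list)   # tools that should be avoided
    required_keywords: list[str] = field(default_factory=list) # must appear in the answer
    forbidden_keywords: list[str] = field(default_factory=list)# must not appear in the answer
    expect_citations: bool = False
    reference_answer: Optional[str] = None  # enables LLM-judge scoring when set

case = EvalCase(
    question="What does the retention policy say about archived documents?",
    expected_tools=["search_knowledge"],
    required_keywords=["retention", "archive"],
    expect_citations=True,
)
```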

Evaluation Criteria

Deterministic Checks

These checks produce pass/fail results. Each check passes when its condition holds:
  • No error: the agent completed without errors
  • Tools used: the expected tools were called (e.g., search_knowledge for a search question)
  • Tools not used: certain tools were correctly avoided
  • Answer keywords: required keywords appear in the response (case-insensitive)
  • Answer exclusions: forbidden terms do not appear in the response
  • Citations present: the response includes citations when expected
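The pass/fail logic for a few of these checks could look roughly like the following. AgentResult and its fields are assumptions made for illustration, not the system's actual result type:

```python
from dataclasses import dataclass, field

@dataclass
class AgentResult:
    # Assumed shape of an agent run's output, for illustration only.
    answer: str
    tools_called: list[str] = field(default_factory=list)
    error: str | None = None

def check_no_error(result: AgentResult) -> bool:
    return result.error is None

def check_tools_used(result: AgentResult, expected: list[str]) -> bool:
    return all(tool in result.tools_called for tool in expected)

def check_keywords(result: AgentResult, required: list[str]) -> bool:
    # Keyword matching is case-insensitive, per the list above.
    answer = result.answer.lower()
    return all(kw.lower() in answer for kw in required)

def check_exclusions(result: AgentResult, forbidden: list[str]) -> bool:
    answer = result.answer.lower()
    return all(kw.lower() not in answer for kw in forbidden)
```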

LLM Judge (Optional)

When a reference answer is provided and a judge model is configured, an LLM evaluates the response for correctness and completeness on a 0.0-1.0 scale. Judge failures are logged but do not fail the evaluation.
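A hedged sketch of the judging step, assuming an OpenAI-compatible chat client; the prompt wording, model name, and score parsing are illustrative, not the system's actual judge configuration:

```python
import logging
from openai import OpenAI

log = logging.getLogger("evals")
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_response(question: str, answer: str, reference: str,
                   model: str = "gpt-4o-mini") -> float | None:
    """Return a 0.0-1.0 quality score, or None if judging fails."""
    prompt = (
        "Score the answer against the reference for correctness and "
        "completeness. Reply with a single number between 0.0 and 1.0.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}"
    )
    try:
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return float(resp.choices[0].message.content.strip())
    except Exception as exc:
        # Judge failures are logged but never fail the evaluation itself.
        log.warning("LLM judge failed: %s", exc)
        return None
```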

Evaluation Categories

Evaluations cover a range of agent behaviors:
  • Navigation: can the agent browse folders and find content by name?
  • Search: does the agent use the right search tool (semantic vs. keyword) for different queries?
  • Reading: can the agent read documents and sections, handling both small and large content?
  • Context expansion: does the agent expand context around search results when needed?
  • Citation accuracy: are citations present and pointing to the correct source content?
  • Multi-step reasoning: can the agent chain multiple tool calls to answer complex questions?
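If each case carries a category label (an assumption layered on the EvalCase sketch above), a per-category pass rate is one natural way to summarize a run:

```python
from collections import defaultdict

def pass_rate_by_category(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (category, passed) pairs; returns the pass rate per category."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: passed / total for cat, (passed, total) in totals.items()}

print(pass_rate_by_category([
    ("navigation", True), ("search", True), ("search", False),
]))  # {'navigation': 1.0, 'search': 0.5}
```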

Observability

When an observability platform (Langfuse) is configured, evaluation results are synced as dataset experiments with per-item traces and scores. This provides:
  • Historical tracking of agent quality over time
  • Per-query traces showing exactly what the agent did
  • Score trends across model changes or prompt updates
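A minimal sketch of how results might be synced, assuming the Langfuse Python SDK's v2-style dataset API (get_dataset, item.observe, score). The dataset name, run name, score name, and the run_agent entry point are placeholders, not the system's actual wiring:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

dataset = langfuse.get_dataset("agent-evals")            # placeholder dataset name
for item in dataset.items:
    # observe() opens a trace for this item and links it to the experiment run
    with item.observe(run_name="prompt-v2-baseline") as trace_id:
        result = run_agent(item.input)                   # hypothetical agent entry point
        langfuse.score(
            trace_id=trace_id,
            name="no_error",                             # placeholder score name
            value=1.0 if result.error is None else 0.0,
        )
langfuse.flush()  # ensure queued events are sent before the process exits
```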