How it works

Evaluating the responses

17 calibrated LLM evaluators that judge non-deterministic responses, with optional deterministic asserts as a complement for exact checks.

How evaluation works

Responses from AI agents (chatbots, assistants, copilots) are non-deterministic: the same question can produce several valid answers with different quality, tone, or accuracy. ArtificialQA's evaluation is built specifically for that — judging language with language, using calibrated LLM evaluators.

When you also need exact checks (a specific code, a phrase, a response time, a JSON shape), you can add deterministic asserts as a complement.

🧠
LLM evaluators — the primary layer
A model scores each response on a specific dimension (tone, accuracy, hallucination, etc.). Returns a 0–1 decimal score + textual justification. 17 calibrated evaluators.
🔧
Deterministic asserts — optional complement
Programmatic checks: contains, regex, JSON Schema, response_time, etc. No AI involved. Cheap, instant, 100% reproducible. Useful when something must match exactly.

How to think about it: the LLM evaluators are the primary layer, judging the quality dimensions of the language itself, while deterministic asserts are an optional complement for anything that must match exactly.

The 17 LLM evaluators

Each evaluator is a tuned prompt plus a model. They internally return a decimal score between 0 and 1, which the platform displays on screen as a percentage (0–100%), along with a textual explanation.

| Slug | What it measures |
| --- | --- |
| comparison | Comparison between the actual response and the expected response (semantic similarity). |
| completeness | Whether the response covers everything the question asks. |
| conciseness | Whether the response is brief enough without losing the essential. |
| formality | Appropriateness of the formal/informal register to the context. |
| bias | Presence of bias (gender, racial, age, etc.). |
| tone | Tone appropriate to the channel and user. |
| empathy | Empathy level when the user expresses frustration or vulnerability. |
| security | Security risks (data leakage, prompt injection, etc.). |
| inappropriate_content | Inappropriate, offensive, or out-of-scope content. |
| error_handling | How it handles errors, user ambiguity, or unexpected inputs. |
| ambiguity | Whether the response resolves or introduces ambiguity. |
| fluency | Naturalness and fluency of the language. |
| data_accuracy | Accuracy of specific data mentioned (prices, dates, numbers). |
| hallucination | Made-up information not backed by context. |
| escalation | Whether it correctly escalates to a human when appropriate. |
| language | That the response is in the expected language. |
| consistency | Internal coherence within a response and across the conversation. |

How a run is evaluated

After running a Test Plan, completed Runs appear in the Ready to Evaluate tab on the Evaluations page. To evaluate:

  1. Go to Execution → Evaluations.
  2. Select the Run.
  3. Click Evaluate.
  4. Pick which evaluators to activate (not necessarily all 17 — use the ones that make sense for your domain).
  5. Click Start Evaluation.

Each Run response gets passed to each selected evaluator, in parallel. When it finishes, the Run moves to Evaluated state and the report becomes available.
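The fan-out described above can be sketched like this (a hypothetical illustration; the real platform calls its hosted LLM evaluators, stubbed here as `score_with_llm`):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub for a real LLM evaluator call.
def score_with_llm(evaluator: str, response: str) -> dict:
    return {"evaluator": evaluator, "score": 0.9, "justification": "stub"}

def evaluate_run(responses: list[str], evaluators: list[str]) -> list[dict]:
    # Every response × every selected evaluator, scored in parallel.
    pairs = [(e, r) for r in responses for e in evaluators]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda p: score_with_llm(*p), pairs))

results = evaluate_run(["Hi!", "Your order shipped."], ["tone", "hallucination"])
print(len(results))  # 4 — 2 responses × 2 evaluators
```

The cross product of responses and evaluators is also why token consumption grows with both dimensions.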

💰 Tokens. The evaluation phase consumes tokens from your plan's pool. Consumption scales with the number of active evaluators multiplied by the number of responses to evaluate.

Score, weight, and how pass/fail is decided

Each response × each evaluator pair produces a decimal score between 0 and 1 plus a textual justification.

📊 Reading the scores: higher is always better. Every evaluator normalizes its output so that a score closer to 1 (100%) always means "good", regardless of what the dimension measures. This applies even to evaluators whose name might suggest the opposite:
  • Bias: 100% means no bias detected (not "lots of bias").
  • Security: 100% means safe, no security issues.
  • Inappropriate content: 100% means no inappropriate content.
  • Hallucination: 100% means no hallucination detected (the response stuck to the facts).
This way the weighted average always tends toward 1 (100%) when the agent is doing well — making the pass/fail logic uniform across all 17 evaluators.
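The inversion convention can be sketched in a few lines (an illustrative sketch, not the platform's internals — the slug set and the raw-severity assumption are mine):

```python
# Dimensions that measure something bad get inverted so 1.0 always
# reads as "good". Illustrative only — not ArtificialQA's internals.
NEGATIVE_DIMENSIONS = {"bias", "security", "inappropriate_content", "hallucination"}

def normalize(slug: str, raw: float) -> float:
    """Assume raw is severity (0 = none, 1 = severe) for negative
    dimensions, and quality (higher = better) for the rest."""
    return round(1.0 - raw, 2) if slug in NEGATIVE_DIMENSIONS else raw

print(normalize("hallucination", 0.84))  # 0.16 — heavy hallucination scores low
print(normalize("fluency", 0.95))        # 0.95 — already "higher is better"
```

A hallucination score of 16%, as in the report screenshot below, therefore means the evaluator found the response mostly made up.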
Figure: Evaluation view — pass rate, threshold, per-test-case performance, and a circle for each of the 17 evaluators (Consistency 100%, Inappropriate Content 100%... Hallucination 16%).

Pass/fail is computed at two levels, both controlled by organization-level thresholds (defaults: 0.70 for case, 0.80 for plan):

  1. Per test case: weighted average of all activated evaluator scores. If avg ≥ caseThreshold → the test case passes. Each evaluator has a configurable weight (set at Configuration → Evaluators/Judges) that determines how much it influences the case score.
  2. Per plan: number of passed test cases divided by total. If pass rate ≥ planThreshold → the plan passes.
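The two-level logic above can be sketched as follows (an illustrative sketch; defaults mirror the documented 0.70 / 0.80 thresholds, and the weights shown are hypothetical):

```python
# Sketch of the two-level pass/fail verdict — not the platform's code.
def case_passes(scores: dict[str, float], weights: dict[str, float],
                case_threshold: float = 0.70) -> bool:
    # Weighted average of all activated evaluator scores.
    total_weight = sum(weights[s] for s in scores)
    avg = sum(scores[s] * weights[s] for s in scores) / total_weight
    return avg >= case_threshold

def plan_passes(case_results: list[bool], plan_threshold: float = 0.80) -> bool:
    # Pass rate: passed test cases / total test cases.
    return sum(case_results) / len(case_results) >= plan_threshold

scores  = {"tone": 0.9, "hallucination": 0.5}
weights = {"tone": 1.0, "hallucination": 2.0}
print(case_passes(scores, weights))  # False — weighted avg ≈ 0.63 < 0.70
```

Note how the heavier weight on hallucination drags the case below the threshold even though tone scored well; that is the lever the per-evaluator weights give you.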

Both thresholds are adjusted as two sliders in the same panel: Configuration → Evaluators/Judges, under "Evaluation Thresholds".

There is no individual pass/fail per evaluator. Scores are shown in color bands purely as a visual aid — green when the score is at or above the case threshold, amber for borderline values, red below — but only the weighted average drives the actual verdict.

The evaluators come pre-calibrated

An LLM evaluator isn't trustworthy by default — different models, prompts, and temperatures yield different scores. That's why the 17 evaluators you use in ArtificialQA come pre-calibrated by our team: we validate each one against reference datasets to make sure their scores are reliable and consistent. You don't have to calibrate anything — it's ready to use.

More detail on the internal calibration process in Security & compliance.

Automatic PII detection

Independent of the LLM evaluators, ArtificialQA automatically scans every test case for PII (personally identifiable information).

When PII is detected, the test case is marked with a small shield icon in the test case list, and you can filter the list by "PII detected" to review them quickly. Today this works as an advisory signal — it doesn't affect pass/fail and isn't included in evaluation reports.
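Conceptually, such a scan boils down to pattern matching over the test case text. This is a generic, minimal sketch — not ArtificialQA's actual detector, whose patterns and method are not documented here:

```python
import re

# Generic PII-scan illustration. NOT the platform's detector — just two
# common patterns to show the shape of an advisory (non-blocking) check.
PII_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "phone": r"\+?\d[\d\s\-]{7,}\d",
}

def detect_pii(text: str) -> list[str]:
    return [kind for kind, pattern in PII_PATTERNS.items()
            if re.search(pattern, text)]

print(detect_pii("Contact me at jane@example.com or +34 600 123 456"))
# → ['email', 'phone']
```

Because the result is advisory, a hit would only flag the test case (the shield icon) rather than failing it.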

When to use which evaluator

For reference and as a guideline, there are typical combos by domain. They are not mandatory or exclusive: you choose which evaluators to activate based on the dimensions that matter for your agent.

You can combine in any way, and tune as you gain experience from your first runs.

Re-evaluating and Score Overrides

You can run multiple evaluations over time on the same Run — with different evaluators activated, or simply to re-score. Each evaluation is stored as an independent snapshot. Because there's an LLM behind each evaluator, two evaluations of the same run can yield different scores.

If you disagree with a specific score, you can edit it manually (Score Override). The platform flags the score as "modified", keeps the original in history, and logs who, when, and why. Auditability stays intact. The full list of overrides lives at Execution → Score Overrides.
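A plausible shape for such an override record, based only on what the docs say is kept (original score, author, timestamp, reason), might look like this — the field names are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical override record — field names are illustrative, not the
# platform's schema. The docs only state that the original score, who,
# when, and why are preserved for audit.
@dataclass
class ScoreOverride:
    evaluator: str
    original_score: float
    new_score: float
    overridden_by: str
    reason: str
    overridden_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

ov = ScoreOverride(
    evaluator="hallucination",
    original_score=0.16,
    new_score=0.60,
    overridden_by="qa-lead@acme.test",
    reason="Response cited the provided KB article; not a hallucination.",
)
print(ov.original_score)  # 0.16 — kept in history; 0.60 is what reports show
```

Keeping both scores plus the reason is what makes the override auditable rather than a silent edit.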

Next step

Once the Run is evaluated, the next step is interpreting the results. That's covered in the Reports & dashboard section.