How it works

Reports & dashboard

How to read evaluation results, where to track trends, and how to export reports for auditing.

General dashboard

The Dashboard is the entry view after login. It summarizes what's happening in your organization at a glance.

Plan usage (tokens consumed, AI generations, etc.) and plan management live in Billing & Plan — see Plans & pricing → Billing & Plan.

Evaluation Report (per Run)

Each evaluated Run produces an Evaluation Report. Go to Execution → Evaluations and click on the run.

[Figure] Evaluation Report structure: pass rate against threshold, one bar per test case, and the grid with scores for each activated evaluator.

What it shows

Export to PDF

The PDF Report button at the top generates a PDF version of the report.

The PDF is meant to be circulated within the team or delivered to stakeholders.

AI-powered enhanced report (Enterprise)

Enterprise plan only. On top of the Evaluation Report, the platform adds automatic AI analysis.

The enhanced report attaches to the standard PDF.

Trends and run comparison

The dashboard shows the evolution of overall score across runs. This lets you see whether quality is improving, holding, or degrading between releases.

Each individual run can be inspected and exported.
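
To make this concrete, here is a minimal sketch in plain Python (the run scores are made up; the platform computes and plots this for you):

```python
# Illustrative only: made-up overall scores (internal 0-1 values) for four runs.
run_scores = {
    "run-001": 0.72,
    "run-002": 0.75,
    "run-003": 0.71,
    "run-004": 0.80,
}

scores = list(run_scores.values())
latest, previous = scores[-1], scores[-2]
average = sum(scores) / len(scores)

delta = latest - previous
trend = "improving" if delta > 0 else "degrading" if delta < 0 else "holding"
print(f"Latest run: {latest:.0%} ({trend}, {delta:+.0%} vs previous run)")
print(f"Average across {len(scores)} runs: {average:.0%}")
```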

Immutable snapshots and reproducibility

The system keeps two levels of immutable "snapshots": the Run itself (the agent's responses captured at execution time) and each Evaluation performed on that Run (the scores captured at evaluation time).

This means a report generated 6 months ago looks identical today, even if you changed Test Cases, the AI agent, or the evaluators. It's the foundation of auditability.
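
One way to picture the guarantee (a conceptual sketch in plain Python, not the platform's actual storage mechanism): a snapshot is a frozen copy of the report data plus a fingerprint, so any later change would be detectable.

```python
import hashlib
import json

def snapshot(payload: dict) -> dict:
    """Freeze a payload and fingerprint it (conceptual sketch only)."""
    frozen = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(frozen.encode()).hexdigest()
    return {"data": frozen, "sha256": digest}

# A report snapshot taken at evaluation time...
snap = snapshot({"run": "run-001", "overall_score": 0.82})

# ...can be verified months later: if the stored data still matches its
# fingerprint, the report is exactly what was generated back then.
assert hashlib.sha256(snap["data"].encode()).hexdigest() == snap["sha256"]
```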

Reproducibility: don't expect identical results

The snapshot is immutable, but if you re-run the same Test Plan, the AI agent's responses may be different because there's an LLM behind the AI agent, and LLMs aren't deterministic. Same applies to evaluation: if you re-evaluate the same Run, scores may change because there's also an LLM behind each evaluator.

You can run N different evaluations on the same Run over time (with different evaluators activated, or simply to re-score). They're all stored as independent snapshots.
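
Because LLM-based scores can drift between re-evaluations, one pragmatic tactic is to aggregate several evaluations of the same Run rather than trusting a single number. A rough sketch with made-up scores:

```python
from statistics import mean, stdev

# Hypothetical example: three re-evaluations of the same Run with the same
# evaluator produce slightly different scores, because the judge is
# LLM-based and therefore non-deterministic.
reevaluations = [0.81, 0.78, 0.84]

print(f"mean score:  {mean(reevaluations):.0%}")
print(f"spread (SD): {stdev(reevaluations):.1%}")
# A small spread suggests the score is stable; a large one means a single
# evaluation shouldn't be treated as the definitive number.
```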

Score Overrides: manual correction with audit trail

If you disagree with the score an evaluator gave for a specific case, you can edit it manually. The platform keeps the original score and records the override, so the change is fully traceable.

The full list of overrides applied in your organization is at Execution → Score Overrides. This preserves full auditability: modifications are possible, but every one of them is traced.
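
Conceptually, an override record carries both scores plus the audit metadata. A sketch of what such a record might contain (plain Python; field names are illustrative, not the platform's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ScoreOverride:
    """Illustrative override record; field names are not the real schema."""
    run_id: str
    test_case_id: str
    evaluator: str
    original_score: float    # what the evaluator produced (0..1)
    override_score: float    # what the reviewer set instead (0..1)
    reason: str              # why the score was corrected
    overridden_by: str       # who made the change
    overridden_at: datetime  # when, for the audit trail

override = ScoreOverride(
    run_id="run-001",
    test_case_id="tc-042",
    evaluator="answer-relevance",
    original_score=0.35,
    override_score=0.80,
    reason="Evaluator penalized a correct answer for formatting.",
    overridden_by="qa-lead@example.com",
    overridden_at=datetime.now(timezone.utc),
)
print(override)
```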

How to read a score

Evaluators return values between 0 and 1 internally, but the platform shows them on screen as percentages (0–100%) for easier reading.

The pass/fail threshold is configurable per evaluator at Configuration → Evaluators/Judges. Each evaluator ships with a sensible default threshold that you can tune to your needs.
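
For example, the pass/fail verdict is just a comparison of the internal 0–1 score against the evaluator's threshold. A sketch with made-up scores and thresholds (evaluator names are illustrative):

```python
# Made-up scores and thresholds, expressed as the internal 0..1 values.
results = {
    "answer-relevance": (0.82, 0.70),  # (score, threshold)
    "toxicity":         (0.65, 0.90),
}

for evaluator, (score, threshold) in results.items():
    verdict = "PASS" if score >= threshold else "FAIL"
    # The UI shows the same numbers as percentages for readability.
    print(f"{evaluator}: {score:.0%} (threshold {threshold:.0%}) -> {verdict}")
```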

💡 Best practice. Look at the overall score first. Then open the worst-performing evaluators. Then read the 3–5 worst-scored cases of those evaluators. That gives you 80% of the insight for 20% of the effort.
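
That triage is easy to express as a procedure. A sketch over made-up per-case scores (evaluator and test-case names are invented):

```python
# Made-up per-evaluator, per-test-case scores (internal 0..1 values).
scores = {
    "answer-relevance": {"tc-01": 0.90, "tc-02": 0.40, "tc-03": 0.80},
    "faithfulness":     {"tc-01": 0.30, "tc-02": 0.50, "tc-03": 0.20},
    "fluency":          {"tc-01": 0.95, "tc-02": 0.90, "tc-03": 0.85},
}

# 1. Rank evaluators by average score, worst first.
by_avg = sorted(scores, key=lambda e: sum(scores[e].values()) / len(scores[e]))

# 2. For the worst evaluator, list its lowest-scored cases.
worst = by_avg[0]
worst_cases = sorted(scores[worst].items(), key=lambda kv: kv[1])[:3]

print(f"Worst evaluator: {worst}")
for case, score in worst_cases:
    print(f"  {case}: {score:.0%}")
```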

Sharing a report

Two ways:

Export the PDF Report and send it (see Export to PDF above).

Copy the run's URL from your browser's address bar and send it. Members of your organization can open any run directly by URL, but the recipient needs to be logged in and a member of the same organization to see it. There's no dedicated "Share" button today.

Next step

If you want to see how to connect ArtificialQA with other tools (CI, ticket trackers, notifications), check the Integrations section.