AI Safety Eval Triage Assistant

Review queues, risk clusters, and eval-health reporting for synthetic, redacted AI safety cases.

A public-safe triage workflow for synthetic eval-style cases. The workflow scores review priority with transparent reason codes, groups related failures, tracks evaluation-health issues, and exports artifacts for human analysis.

The project turns summarized eval cases into a review workflow with deterministic scoring, tiered queues, cluster recurrence tracking, risk-register drafting, and error analysis that sanity-checks the fixture.

The public repo uses synthetic, redacted cases only. It does not include live user data, proprietary eval prompts, provider-specific model access, or operational deployment claims.

Data: Synthetic eval cases with policy family, severity, labels, attack style, and reliability.
Scoring: Deterministic 0-100 review priority with reason codes and tiered review posture.
Outputs: Summary, triage queue, risk clusters, risk register, heartbeat, casepack, and error analysis.
Posture: Human review workflow, not automated enforcement or provider benchmarking.

18 Synthetic cases

12 Risk clusters

7 Risk-register entries

8 Eval-health flags

9 Critical queue cases

25 Automated tests

Workflow

Load Cases

Ingest redacted eval-style records with policy, severity, evaluator, attack-style, and reliability fields.
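As a rough illustration of what a loaded record might look like, here is a minimal sketch. The class name, field names, example values, and the 0.0 to 1.0 reliability range are all assumptions made for this example, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalCase:
    """One redacted, summarized eval case. All field names are hypothetical."""
    case_id: str
    policy_family: str            # e.g. "fraud_scams"
    severity: str                 # e.g. "low", "medium", "high", "critical"
    evaluator_labels: tuple       # zero or more labels, one per evaluator
    attack_style: Optional[str]   # e.g. "jailbreak", or None if not adversarial
    reliability: float            # assumed range: 0.0 (weak) to 1.0 (strong)


def load_case(row: dict) -> EvalCase:
    """Build a case from a plain dict, rejecting out-of-range reliability."""
    reliability = float(row["reliability"])
    if not 0.0 <= reliability <= 1.0:
        raise ValueError(f"reliability out of range for {row['case_id']}")
    return EvalCase(
        case_id=row["case_id"],
        policy_family=row["policy_family"],
        severity=row["severity"],
        evaluator_labels=tuple(row.get("evaluator_labels", ())),
        attack_style=row.get("attack_style"),
        reliability=reliability,
    )
```

Keeping the record immutable is one way to make sure later stages (scoring, clustering, health checks) all see the same input.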

Score Review Priority

Assign deterministic scores and reason codes from severity, policy family, evasion, labels, and recurrence.

Cluster Failures

Group related cases into risk families so recurring issues can be reviewed together.

Track Eval Health

Flag missing labels, evaluator disagreement, stale cases, low reliability, and policy blind spots.
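The checks listed here could be sketched roughly as follows. The flag names, dictionary keys, the 90-day staleness window, and the 0.5 reliability cutoff are illustrative assumptions, not values taken from the project.

```python
from datetime import date


def case_health_flags(case, today, stale_after_days=90, min_reliability=0.5):
    """Return per-case health flags. Thresholds and flag names are hypothetical."""
    flags = []
    labels = case.get("evaluator_labels", [])
    if not labels:
        flags.append("MISSING_LABEL")
    elif len(set(labels)) > 1:
        # More than one distinct label means the evaluators disagree.
        flags.append("EVALUATOR_DISAGREEMENT")
    if (today - case["last_reviewed"]).days > stale_after_days:
        flags.append("STALE_CASE")
    if case["reliability"] < min_reliability:
        flags.append("LOW_RELIABILITY")
    return flags


def policy_blind_spots(cases, expected_families):
    """Return expected policy families that have no case at all in this run."""
    covered = {case["policy_family"] for case in cases}
    return sorted(set(expected_families) - covered)
```

Blind spots are computed at run level because they describe what is absent from the fixture rather than a defect in any single case.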

Export Artifacts

Write queues, clusters, registers, reports, casepacks, and error-analysis notes for human review.

Scoring Logic

severity + policy family

Base Risk

Severity and policy family establish the starting review posture before case-specific signals are added.

evaluator label

Outcome Signal

Violation, refusal, safe, ambiguous, and unlabeled outcomes influence whether a case needs escalation or calibration.

attack style + evasion

Adversarial Signals

Jailbreak, roleplay, obfuscation, social engineering, dual-use ambiguity, and targeting indicators add context.

signal reliability

Quality Controls

Low-reliability or incomplete records can still surface, but the reason codes prevent over-reading weak evidence.
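One way the signals described so far could combine is an additive score in which every contribution also emits a named reason code. The sketch below assumes that design. The point tables, the reason-code names, and the 0.5 reliability threshold are invented for illustration and are not the project's actual weights.

```python
# All point values and reason-code names below are hypothetical.
SEVERITY_POINTS = {"low": 10, "medium": 25, "high": 40, "critical": 55}
POLICY_POINTS = {"self_harm": 10, "violence": 10}
LABEL_POINTS = {"violation": 20, "ambiguous": 10, "unlabeled": 5}
ATTACK_POINTS = {"jailbreak": 15, "obfuscation": 10, "roleplay": 10}
LOW_RELIABILITY_THRESHOLD = 0.5


def score_case(severity, policy_family, label, attack_style=None, reliability=1.0):
    """Return (score, reason_codes); the same inputs always give the same result."""
    score = float(SEVERITY_POINTS[severity])
    reasons = [f"SEVERITY_{severity.upper()}"]
    for prefix, table, value in (
        ("POLICY", POLICY_POINTS, policy_family),
        ("LABEL", LABEL_POINTS, label),
        ("ATTACK", ATTACK_POINTS, attack_style),
    ):
        points = table.get(value, 0)
        if points:
            score += points
            reasons.append(f"{prefix}_{value.upper()}")
    # Weak evidence still surfaces: the score is unchanged, but the reason
    # code tells the reviewer not to over-read it.
    if reliability < LOW_RELIABILITY_THRESHOLD:
        reasons.append("LOW_RELIABILITY")
    return min(score, 100.0), reasons
```

For example, under these made-up weights, `score_case("high", "cyber_safety", "violation", "jailbreak", reliability=0.4)` yields 75.0 with the codes `SEVERITY_HIGH`, `LABEL_VIOLATION`, `ATTACK_JAILBREAK`, and `LOW_RELIABILITY`.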

cluster recurrence

Pattern Review

Related cases are grouped so a recurring family gets reviewed together instead of as isolated one-offs.
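A simple way to picture this grouping is to key each case on a few shared fields and count members per key. The sketch below assumes clusters are keyed on policy family and attack style; that key, like the dictionary field names, is a stand-in for illustration, and the project's own clustering rule may differ.

```python
from collections import defaultdict


def cluster_cases(cases):
    """Group case IDs by a (policy_family, attack_style) key.

    The grouping key is a hypothetical stand-in for the real clustering rule.
    """
    clusters = defaultdict(list)
    for case in cases:
        key = (case["policy_family"], case.get("attack_style") or "none")
        clusters[key].append(case["case_id"])
    # Largest clusters first; ties are broken by key so output is deterministic.
    return sorted(clusters.items(), key=lambda item: (-len(item[1]), item[0]))
```

The member count of each cluster can then serve as the recurrence signal that feeds back into review priority.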

0-100 score

Review Tier

Critical, elevated, watch, and low tiers route records to immediate review, near-term review, watchlist or calibration review, and control or low-priority review, respectively.

CRITICAL: 75-100  -> immediate analyst review
ELEVATED: 55-74.9 -> near-term review
WATCH:    35-54.9 -> watchlist or calibration review
LOW:       0-34.9 -> control or low-priority review
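The tier table above maps directly to a small lookup function. This sketch compares against each tier's lower bound, which is an assumption about how boundaries are handled: it means a score such as 74.95 falls into ELEVATED instead of landing between two ranges. The function name is hypothetical.

```python
def review_tier(score: float) -> str:
    """Map a 0-100 review-priority score to its tier using lower bounds."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score}")
    if score >= 75:
        return "CRITICAL"   # immediate analyst review
    if score >= 55:
        return "ELEVATED"   # near-term review
    if score >= 35:
        return "WATCH"      # watchlist or calibration review
    return "LOW"            # control or low-priority review
```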

Implemented Components

outputs/triage_queue.csv

Review Queue

Ranked case list with escalation score, review tier, lane, cluster, reason codes, and summaries.
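A ranked queue like this could be written with the standard library's `csv` module, as in the sketch below. The column names and the semicolon-joined reason codes are assumptions for illustration; the actual layout of `outputs/triage_queue.csv` may differ.

```python
import csv
import io

# Hypothetical column names, based on the fields listed above.
QUEUE_COLUMNS = [
    "case_id", "escalation_score", "review_tier",
    "lane", "cluster", "reason_codes", "summary",
]


def write_triage_queue(rows, handle):
    """Write rows to a CSV handle, highest escalation score first."""
    # Ties are broken by case_id so repeated runs produce identical files.
    ranked = sorted(rows, key=lambda r: (-r["escalation_score"], r["case_id"]))
    writer = csv.DictWriter(handle, fieldnames=QUEUE_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in ranked:
        out = dict(row)
        out["reason_codes"] = ";".join(row["reason_codes"])
        writer.writerow(out)
```

Passing an `io.StringIO()` as `handle` is a convenient way to inspect the output in a test before writing to disk.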

outputs/risk_clusters.csv

Risk Clusters

Cluster table for related cases, shared signals, member records, and analyst-facing rationale.

outputs/risk_register.csv

Risk Register

Risk areas with severity, exposure, trajectory, confidence, monitoring signals, and mitigations.

docs/eval_health_heartbeat.md

Eval Health

Run-level review of stale cases, missing labels, disagreement, reliability, and coverage gaps.

docs/demo_casepack.md

Casepack

Representative cluster writeup showing how related eval cases become a reviewable artifact.

docs/error_analysis.md

Error Analysis

Documents fixture false positives, false negatives, and clustering errors so the workflow stays inspectable.

Outputs

Critical Queue

Nine synthetic cases are routed to immediate analyst review in the generated public run.

Cluster Recurrence

Top clusters include fraud and scams, cyber safety, self-harm, violence, and policy-boundary cases.

Reason Codes

Scores remain explainable through named factors rather than opaque model judgments.

Eval Health Flags

Generated flags include missing labels, evaluator disagreements, low reliability, stale cases, and blind spots.

Risk Register

Clusters above the low tier are translated into monitoring signals and mitigation options.

Serialized Run

The complete triage run is exported as JSON for inspection or downstream review scripts.
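A deterministic JSON export can be sketched with the standard `json` module. The top-level keys and the function name here are hypothetical; the point is that sorted keys keep the serialized run stable across executions, which makes diffs between runs meaningful.

```python
import json


def serialize_run(queue, clusters, health_flags):
    """Serialize one triage run to a JSON string with stable key order.

    The top-level keys are hypothetical, not the project's actual schema.
    """
    run = {
        "queue": queue,
        "clusters": clusters,
        "health_flags": health_flags,
    }
    return json.dumps(run, indent=2, sort_keys=True)
```

A downstream review script can read the result back with `json.loads` and work on plain dictionaries and lists.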

Scope And Limitations

Synthetic Fixture

The default data is synthetic and summarized. It demonstrates workflow design, not provider performance or production incident handling.

Redacted Surface

Public artifacts do not include full prompts, live user data, proprietary evals, external APIs, or model-provider access.

Human Review

Scores prioritize review. They do not determine policy action, model behavior, account enforcement, or safety conclusions.

Fixture Metrics

Metrics are sanity checks against hand-authored demonstration labels, not calibrated real-world safety measurements.
