AI Safety Eval Triage Assistant
A public-safe triage workflow for synthetic eval-style cases. The system scores review priority with transparent reason codes, groups related failures, tracks evaluation-health issues, and exports artifacts for human analysis.
The project turns summarized eval cases into a review workflow: deterministic scoring, tiered queues, cluster recurrence tracking, risk-register drafting, and error analysis that sanity-checks results against hand-labeled fixtures.
The public repo uses synthetic, redacted cases only. It does not include live user data, proprietary eval prompts, provider-specific model access, or operational deployment claims.
Ingest redacted eval-style records with policy, severity, evaluator, attack-style, and reliability fields.
Assign deterministic scores and reason codes from severity, policy family, evasion, labels, and recurrence.
Group related cases into risk families so recurring issues can be reviewed together.
Flag missing labels, evaluator disagreement, stale cases, low reliability, and policy blind spots.
Write queues, clusters, registers, reports, casepacks, and error-analysis notes for human review.
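The grouping step above can be sketched in a few lines. This is only an illustration, not the project's clustering logic: the record fields (`case_id`, `policy_family`, `attack_style`) and the choice of grouping key are assumptions made for the example.

```python
from collections import defaultdict


def cluster_cases(cases: list[dict]) -> list[dict]:
    """Group cases that share a policy family and attack style.

    The grouping key is a hypothetical choice for this sketch; a real
    workflow could cluster on other shared signals.
    """
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for case in cases:
        key = (case["policy_family"], case["attack_style"])
        groups[key].append(case["case_id"])

    clusters = [
        {
            "policy_family": family,
            "attack_style": style,
            "members": sorted(ids),
            "size": len(ids),
        }
        for (family, style), ids in groups.items()
    ]
    # Largest clusters first, so recurring families surface at the top.
    clusters.sort(key=lambda c: (-c["size"], c["policy_family"], c["attack_style"]))
    return clusters
```

Sorting by size is one simple way to make recurrence visible; ties are broken alphabetically so the output stays deterministic.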
severity + policy family
Severity and policy family establish the starting review posture before case-specific signals are added.
evaluator label
Violation, refusal, safe, ambiguous, and unlabeled outcomes influence whether a case needs escalation or calibration.
attack style + evasion
Jailbreak, roleplay, obfuscation, social engineering, dual-use ambiguity, and targeting indicators add context.
signal reliability
Low-reliability or incomplete records can still surface, but the reason codes prevent over-reading weak evidence.
cluster recurrence
Related cases are grouped so a recurring family gets reviewed together instead of as isolated one-offs.
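As a minimal sketch of how the factors above could combine into a deterministic score with reason codes: every weight, threshold, field name, and reason-code string below is a hypothetical placeholder, since this document does not state the project's actual values.

```python
# All weights, cutoffs, and names here are assumptions for illustration only.
SEVERITY_POINTS = {"low": 10, "medium": 25, "high": 40, "critical": 55}
POLICY_FAMILY_POINTS = {"self_harm": 10, "cyber_safety": 8, "fraud_scams": 6}
LABEL_POINTS = {"violation": 20, "ambiguous": 10, "unlabeled": 8, "refusal": 0, "safe": 0}
EVASION_POINTS = 10
RECURRENCE_POINTS = 3  # per additional case in the same cluster
RECURRENCE_CAP = 15
LOW_RELIABILITY_CUTOFF = 0.5


def score_case(case: dict) -> tuple[float, list[str]]:
    """Return a 0-100 score plus the reason codes that produced it."""
    score = 0.0
    reasons: list[str] = []

    # Severity and policy family set the starting posture.
    severity = case["severity"]
    score += SEVERITY_POINTS[severity]
    reasons.append(f"SEVERITY_{severity.upper()}")

    family = case["policy_family"]
    family_points = POLICY_FAMILY_POINTS.get(family, 0)
    if family_points:
        score += family_points
        reasons.append(f"POLICY_{family.upper()}")

    # A missing label is treated as "unlabeled" rather than dropped.
    label = case.get("evaluator_label") or "unlabeled"
    if LABEL_POINTS[label]:
        score += LABEL_POINTS[label]
        reasons.append(f"LABEL_{label.upper()}")

    if case.get("evasion"):
        score += EVASION_POINTS
        reasons.append("EVASION_SIGNAL")

    extra_members = max(case.get("cluster_size", 1) - 1, 0)
    if extra_members:
        score += min(extra_members * RECURRENCE_POINTS, RECURRENCE_CAP)
        reasons.append("CLUSTER_RECURRENCE")

    # Weak evidence is annotated rather than hidden, so it is not over-read.
    if case.get("reliability", 1.0) < LOW_RELIABILITY_CUTOFF:
        reasons.append("LOW_RELIABILITY")

    return min(score, 100.0), reasons
```

Because the function is a pure sum of named terms, the same record always yields the same score, and each reason code points at the term that contributed it.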
0-100 score
Critical, elevated, watch, and low tiers route records to immediate review, near-term review, calibration, or control.
CRITICAL: 75-100 -> immediate analyst review
ELEVATED: 55-74.9 -> near-term review
WATCH: 35-54.9 -> watchlist or calibration review
LOW: 0-34.9 -> control or low-priority review
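The tier bands above translate directly into a threshold lookup. The sketch below assumes the bands are half-open (for example, anything from 55 up to but not including 75 is ELEVATED), which matches the listed ranges when scores are rounded to one decimal place. The function name `review_tier` is hypothetical.

```python
def review_tier(score: float) -> str:
    """Map a 0-100 escalation score to its review tier."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score}")
    if score >= 75:
        return "CRITICAL"  # immediate analyst review
    if score >= 55:
        return "ELEVATED"  # near-term review
    if score >= 35:
        return "WATCH"     # watchlist or calibration review
    return "LOW"           # control or low-priority review
```

Checking thresholds from highest to lowest keeps each boundary in exactly one place, so a score such as 74.9 lands in ELEVATED and 75 in CRITICAL.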
outputs/triage_queue.csv
Ranked case list with escalation score, review tier, lane, cluster, reason codes, and summaries.
outputs/risk_clusters.csv
Cluster table for related cases, shared signals, member records, and analyst-facing rationale.
outputs/risk_register.csv
Risk areas with severity, exposure, trajectory, confidence, monitoring signals, and mitigations.
docs/eval_health_heartbeat.md
Run-level review of stale cases, missing labels, disagreement, reliability, and coverage gaps.
docs/demo_casepack.md
Representative cluster writeup showing how related eval cases become a reviewable artifact.
docs/error_analysis.md
Fixture false positives, false negatives, and cluster errors to keep the workflow inspectable.
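To make the shape of the ranked queue concrete, here is a rough sketch of writing a file like outputs/triage_queue.csv with Python's standard `csv` module. The column names are assumptions derived from the description of that file above, and `write_triage_queue` is a hypothetical helper, not the project's exporter.

```python
import csv
import io

# Hypothetical column names mirroring the queue description above.
QUEUE_COLUMNS = [
    "case_id",
    "escalation_score",
    "review_tier",
    "lane",
    "cluster",
    "reason_codes",
    "summary",
]


def write_triage_queue(rows: list[dict], handle) -> None:
    """Write cases to an open text handle, highest escalation score first."""
    # Ties are broken by case_id so repeated runs produce identical files.
    ranked = sorted(rows, key=lambda r: (-r["escalation_score"], r["case_id"]))
    writer = csv.DictWriter(handle, fieldnames=QUEUE_COLUMNS)
    writer.writeheader()
    for row in ranked:
        out = dict(row)
        # Flatten the list of reason codes into one cell.
        out["reason_codes"] = ";".join(row["reason_codes"])
        writer.writerow(out)
```

For a quick check, pass an `io.StringIO()` buffer as `handle`; to write to disk, open the file with `newline=""` as the `csv` module documentation recommends.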
Nine synthetic cases are routed to immediate analyst review in the generated public run.
Top clusters include fraud and scams, cyber safety, self-harm, violence, and policy-boundary cases.
Scores remain explainable through named factors rather than opaque model judgments.
Generated flags include missing labels, evaluator disagreements, low reliability, stale cases, and blind spots.
Clusters rated above the low tier are translated into monitoring signals and mitigation options.
The complete triage run is exported as JSON for inspection or downstream review tooling.
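The evaluation-health flags mentioned above (missing labels, evaluator disagreement, stale cases, low reliability, and blind spots) could be derived along these lines. This is a sketch under stated assumptions: the field names, the 90-day staleness window, the 0.5 reliability cutoff, and the flag strings are all hypothetical.

```python
from datetime import date

# Hypothetical thresholds, chosen only for illustration.
STALE_AFTER_DAYS = 90
LOW_RELIABILITY_CUTOFF = 0.5


def case_health_flags(case: dict, today: date) -> list[str]:
    """Return per-case health flags for one record."""
    flags: list[str] = []

    labels = [label for label in case.get("evaluator_labels", []) if label]
    if not labels:
        flags.append("MISSING_LABEL")
    elif len(set(labels)) > 1:
        flags.append("EVALUATOR_DISAGREEMENT")

    if (today - case["last_reviewed"]).days > STALE_AFTER_DAYS:
        flags.append("STALE_CASE")

    if case.get("reliability", 1.0) < LOW_RELIABILITY_CUTOFF:
        flags.append("LOW_RELIABILITY")

    return flags


def policy_blind_spots(cases: list[dict], expected_families: set[str]) -> list[str]:
    """Return expected policy families that no case in the run covers."""
    seen = {case["policy_family"] for case in cases}
    return sorted(expected_families - seen)
```

Per-case flags and the run-level blind-spot check are kept separate here because a coverage gap is a property of the whole run, not of any single record.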
The default data is synthetic and summarized. It demonstrates workflow design, not provider performance or production incident handling.
Public artifacts do not include full prompts, live user data, proprietary evals, external APIs, or model-provider access.
Scores prioritize review. They do not determine policy action, model behavior, account enforcement, or safety conclusions.
Metrics are sanity checks against hand-authored demonstration labels, not calibrated real-world safety measurements.