$ mdcat nexste/model-eval.md
Building an eval harness for AI workflows
Nexste is a job-search product I'm building. It has not launched yet (open beta: July 2026). It's a closed-loop workspace for discovering roles from direct sources, ranking them against a candidate profile, generating and tailoring artifacts, assisting with the application workflow, tracking application and interview communications over email, and turning those communications into clear next steps on a canonical job record.
01 / Context
Making model changes easier to trust
Nexste uses LLMs in a few different places: finding roles, scoring fit, tailoring artifacts, reading email threads, and pulling out next steps. Once those pieces started depending on each other, model changes needed clearer feedback than "this seems better." The system needed a way to show when a prompt or model change made one part better and another part worse.
Each workflow has its own version of "good." Discovery cares about source freshness. Ranking needs grounded evidence, not just a confident-sounding score. Resume tailoring has to stay truthful to the candidate. Email parsing needs to handle messy threads without inventing context. Structured outputs need to stay structured even when the input is thin or weird.
The harness is a way to compare those tradeoffs directly. It checks groundedness, schema shape, reasoning quality, cost, latency, source freshness, and failure behavior. Sometimes the best model for one lane is overkill for another. Sometimes the cheaper model is fine until the evidence gets ambiguous.
That makes model changes easier to reason about. The evals show what improved, what regressed, and where to look before the change reaches production.
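As a rough illustration of what a schema-shape and groundedness check might look like for a ranking output, here is a minimal sketch. The field names (`record_id`, `score`, `evidence_quotes`), the failure codes, and the verbatim-quote rule are all assumptions made for the example, not the harness's actual schema. Requiring quotes to appear verbatim in the posting is only a crude proxy for groundedness.

```python
def check_ranking_output(output: dict, known_record_ids: set, posting_text: str) -> list:
    """Return a list of failure codes; an empty list means the output passed.

    All field names and failure codes are hypothetical.
    """
    failures = []

    # Schema shape: the record ID must refer to a job we actually know about.
    if output.get("record_id") not in known_record_ids:
        failures.append("unknown_record_id")

    # Schema shape: the score must be a real number in [0, 1] (bools are rejected).
    score = output.get("score")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0.0 <= score <= 1.0:
        failures.append("score_out_of_bounds")

    # Groundedness proxy: every evidence quote must appear verbatim in the posting.
    quotes = output.get("evidence_quotes")
    if not isinstance(quotes, list) or not quotes:
        failures.append("missing_evidence")
    elif any(not isinstance(q, str) or q not in posting_text for q in quotes):
        failures.append("ungrounded_evidence")

    return failures


posting = "We need a backend engineer with Python and PostgreSQL experience."
known_ids = {"job-001"}

good = {"record_id": "job-001", "score": 0.8,
        "evidence_quotes": ["Python and PostgreSQL experience"]}
bad = {"record_id": "job-999", "score": 1.7,
       "evidence_quotes": ["Kubernetes expertise"]}

good_failures = check_ranking_output(good, known_ids, posting)
bad_failures = check_ranking_output(bad, known_ids, posting)
```

Returning failure codes instead of a single boolean keeps the result comparable across models: two candidates can both fail a fixture for different reasons.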
02 / System
Keeping the workflow tied to one job record
Most of the workflow is easier to reason about when every signal about a role has one place to land. Search results, scorecards, resume variants, application status, email threads, interview signals, and follow-up actions all attach to the same canonical job record.
That record lets the pieces feed each other without losing context. Ranking can use discovery evidence. Artifact generation can use the ranked record. Communication tracking can update the same job instead of creating a parallel state that has to be reconciled later.
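One way to picture the canonical job record is as a single structure that every lane reads from and writes to. The sketch below is illustrative only: the class name, field names, and status strings are hypothetical and are not taken from Nexste's data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobRecord:
    """Hypothetical canonical record; every field name here is illustrative."""
    job_id: str
    canonical_url: str
    discovery_evidence: List[str] = field(default_factory=list)
    scorecard: Optional[dict] = None
    resume_variants: List[str] = field(default_factory=list)
    application_status: str = "not_applied"
    email_thread_ids: List[str] = field(default_factory=list)
    next_actions: List[str] = field(default_factory=list)


def attach_email_update(record: JobRecord, thread_id: str,
                        status: str, next_action: str) -> JobRecord:
    """Fold a parsed email signal into the existing record
    instead of creating a parallel state to reconcile later."""
    if thread_id not in record.email_thread_ids:
        record.email_thread_ids.append(thread_id)
    record.application_status = status
    record.next_actions.append(next_action)
    return record


# Each lane attaches its output to the same record.
record = JobRecord(job_id="job-001", canonical_url="https://example.com/jobs/1")
record.discovery_evidence.append("posting fetched from employer careers page")
record.scorecard = {"fit": 0.82, "rationale": "matches required skills"}
attach_email_update(record, "thread-abc",
                    "recruiter_screen_scheduled", "confirm interview time")
```

Because communication tracking mutates the same object that ranking scored, there is no second copy of the job's state to keep in sync.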
03 / Lanes
Choosing models by workflow lane
The harness compares candidate models by lane instead of trying to pick one global winner. Source disambiguation, job-list narrowing, email labeling, and next-action decisions have different risk profiles. A model can be worth its cost when it is resolving ambiguous sources, but unnecessary for simpler filtering. Another model may be fine for labeling an email and too risky when the next action affects a user's schedule or follow-up.
Each lane uses a configurable rubric that weighs accuracy, cost, latency, schema reliability, and failure behavior. The useful answer is not "best model overall." It is which model is reliable enough for this workflow at a cost that makes sense.
| Lane | Primary risk |
|---|---|
| Discovery | Source freshness, canonical URL selection (which of seven Amazon postings is the live one?), grounded evidence. |
| Ranking | Valid record IDs, bounded scores, dealbreaker handling, rationale grounded in the actual posting. |
| Comms tracking | Matching threads to the right job, classifying stage and next action, handling duplicate-thread ambiguity, telling “recruiter screen scheduled” apart from “rejection wrapped in pleasantries”. |
| Artifact import | Correction handling, hallucination control, signaling when a human should look. |
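The per-lane rubric described above could be sketched as a weighted sum. Everything concrete here is an assumption for illustration: the weights, the lane keys, the metric values, and the convention that each metric is pre-normalized to [0, 1] with higher meaning better (so a cheap model has a high `cost` value).

```python
# Hypothetical per-lane weights; each lane's weights sum to 1.0.
RUBRICS = {
    "source_disambiguation": {
        "accuracy": 0.40, "cost": 0.10, "latency": 0.10,
        "schema_reliability": 0.20, "failure_behavior": 0.20,
    },
    "email_labeling": {
        "accuracy": 0.30, "cost": 0.25, "latency": 0.20,
        "schema_reliability": 0.15, "failure_behavior": 0.10,
    },
}


def lane_score(lane: str, metrics: dict) -> float:
    """Weighted sum of normalized metrics (each in [0, 1], higher is better)."""
    weights = RUBRICS[lane]
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError(f"weights for lane {lane!r} must sum to 1.0")
    return sum(weight * metrics[name] for name, weight in weights.items())


# Two made-up candidates: one accurate but expensive, one cheap but weaker.
strong = {"accuracy": 0.95, "cost": 0.30, "latency": 0.50,
          "schema_reliability": 0.98, "failure_behavior": 0.90}
cheap = {"accuracy": 0.70, "cost": 0.95, "latency": 0.90,
         "schema_reliability": 0.90, "failure_behavior": 0.60}

for lane in RUBRICS:
    print(lane, round(lane_score(lane, strong), 4), round(lane_score(lane, cheap), 4))
```

With these invented numbers the stronger model wins the ambiguity-heavy lane and the cheaper model wins email labeling, which is the kind of split a single global ranking would hide.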
04 / Operation
Running cheap checks before live model spend
Evaluation runs in three modes. Static mode checks config, environment wiring, schema shape, and call-site readiness without making any provider calls. Dry-run mode builds fixture payloads and estimates cost. Live mode only runs with explicit confirmation and a budget cap.
That separation keeps routine regressions cheap to catch. Schema drift, bad config, and broken artifacts show up before money is spent. Live runs are saved for the cases where provider behavior is the thing being measured.
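The gating between the three modes might look something like the sketch below. The names (`Mode`, `plan_run`, `LiveRunRefused`) and the rule that a live run is refused when the estimate exceeds the cap are assumptions for illustration; the source only states that live mode needs explicit confirmation and a budget cap.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    STATIC = "static"
    DRY_RUN = "dry_run"
    LIVE = "live"


class LiveRunRefused(Exception):
    """Raised when a live run lacks confirmation or a usable budget cap."""


def plan_run(mode: Mode, estimated_cost_usd: float, *,
             confirmed: bool = False,
             budget_cap_usd: Optional[float] = None) -> dict:
    """Describe what a run is allowed to do, refusing unsafe live runs."""
    if mode is Mode.STATIC:
        # Config, wiring, and schema checks only: nothing is sent to a provider.
        return {"mode": mode.value, "provider_calls": False, "estimated_cost_usd": 0.0}
    if mode is Mode.DRY_RUN:
        # Payloads are built and priced, but still never sent.
        return {"mode": mode.value, "provider_calls": False,
                "estimated_cost_usd": estimated_cost_usd}
    # Live mode: require explicit confirmation and a budget cap.
    if not confirmed:
        raise LiveRunRefused("live mode requires explicit confirmation")
    if budget_cap_usd is None:
        raise LiveRunRefused("live mode requires a budget cap")
    if estimated_cost_usd > budget_cap_usd:
        raise LiveRunRefused("estimated cost exceeds the budget cap")
    return {"mode": mode.value, "provider_calls": True,
            "estimated_cost_usd": estimated_cost_usd,
            "budget_cap_usd": budget_cap_usd}


static_plan = plan_run(Mode.STATIC, estimated_cost_usd=3.50)
dry_plan = plan_run(Mode.DRY_RUN, estimated_cost_usd=3.50)
live_plan = plan_run(Mode.LIVE, estimated_cost_usd=3.50,
                     confirmed=True, budget_cap_usd=5.00)
```

Making confirmation and the cap keyword-only arguments with safe defaults means a caller who forgets them gets a refusal, not an unplanned bill.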
05 / Artifacts
Leaving useful evidence behind
A pass/fail result is too thin to debug. Each run leaves behind an interactive dashboard for lane outcomes and record drilldowns, a fixture-review UI for expected behavior and reviewer annotations, a markdown summary with a recommendation, and metrics for score, cost, latency, status, and schema failures.
Per-record JSON evidence and CSV score sheets are generated for deeper review and spreadsheet work. A Promptfoo config is also generated for broader workflow integration. The point is to make each run easy to inspect, compare, and repeat as prompts, models, providers, or fixtures change.
| Role | Artifact | Contents |
|---|---|---|
| Review surface | Interactive dashboard | Lane outcomes, score/cost/latency comparisons, and per-record drilldowns. |
| Human review | Fixture review UI | Expected behavior, reviewer annotations, ambiguous cases, and source conflicts. |
| Decision log | Markdown recommendation | Promote, continue evaluation, or hold, with the evidence behind that decision. |
| Forensics | CSV and JSON evidence | Record-level inputs, outputs, scores, failures, and spreadsheet-ready score sheets. |
| Operations | Prometheus-safe metrics | Bounded labels for lane, provider family, model label, status, and evaluator. |
| Interop | Promptfoo config | Generated evaluation config for teams or workflows that want external runner coverage. |
06 / Discipline
Keeping metrics bounded
Metrics are kept intentionally small. They include bounded dimensions such as lane, provider family, model label, status, and evaluator. Fixture IDs, input text, URLs, thread IDs, artifact paths, and failure reasons stay out of metric labels.
The metrics answer questions that need a quick operational read: which lane is regressing, which candidate is failing schema validation, which route got slower, and which decision state changed. The higher-cardinality evidence stays in JSON artifacts, CSV rows, dashboards, and review tools.
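A simple way to enforce bounded labels is to validate every label set against a fixed allowlist before anything is counted. The sketch below uses only the standard library and does not call any metrics client; the allowed values are invented for illustration, while the five label names come from the list above.

```python
from collections import Counter

# Label names come from the text; the allowed values are hypothetical.
ALLOWED_LABELS = {
    "lane": {"discovery", "ranking", "comms_tracking", "artifact_import"},
    "provider_family": {"provider_a", "provider_b"},
    "model_label": {"baseline", "candidate"},
    "status": {"pass", "fail", "error"},
    "evaluator": {"schema", "groundedness", "rubric"},
}

run_counts: Counter = Counter()


def validate_labels(labels: dict) -> dict:
    """Reject label sets that would introduce unbounded cardinality."""
    unknown = set(labels) - set(ALLOWED_LABELS)
    if unknown:
        raise ValueError(f"unbounded label(s) not allowed: {sorted(unknown)}")
    missing = set(ALLOWED_LABELS) - set(labels)
    if missing:
        raise ValueError(f"missing label(s): {sorted(missing)}")
    for name, value in labels.items():
        if value not in ALLOWED_LABELS[name]:
            raise ValueError(f"value {value!r} is not allowed for label {name!r}")
    return labels


def record_result(labels: dict) -> None:
    """Count one evaluation result under a validated, bounded label set."""
    checked = validate_labels(labels)
    run_counts[tuple(sorted(checked.items()))] += 1


ok = {"lane": "ranking", "provider_family": "provider_a",
      "model_label": "candidate", "status": "fail", "evaluator": "schema"}
record_result(ok)
record_result(ok)
```

A fixture ID or URL passed as a label raises immediately, so high-cardinality detail is pushed toward the JSON and CSV artifacts by construction.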
07 / Promotion
Deciding when a model is ready
Promotion is conservative, but it is not just a hard-coded pass/fail gate. A tiny score margin is treated as a reason to keep evaluating, not a reason to promote. Near-tied quality falls back to cost and latency. Schema or safety failures keep a candidate on hold. Thin static or dry-run evidence means the harness keeps collecting signal instead of guessing.
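One possible reading of these rules as code is sketched below. The thresholds, field names, and the exact ordering of checks are assumptions, as is the choice to hold a candidate that is clearly worse than the baseline. The three outcomes mirror the decision states named in the recommendation artifact: promote, continue evaluation, or hold.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for the example.
MIN_MARGIN = 0.02        # score margins smaller than this count as a near tie
MIN_LIVE_RECORDS = 30    # fewer live records than this counts as thin evidence


@dataclass
class CandidateEvidence:
    score_margin: float    # candidate score minus baseline score
    cost_ratio: float      # candidate cost / baseline cost
    latency_ratio: float   # candidate latency / baseline latency
    schema_failures: int
    safety_failures: int
    live_records: int      # records evaluated against a live provider


def decide(e: CandidateEvidence) -> str:
    # Schema or safety failures keep a candidate on hold.
    if e.schema_failures > 0 or e.safety_failures > 0:
        return "hold"
    # Thin evidence: keep collecting signal instead of guessing.
    if e.live_records < MIN_LIVE_RECORDS:
        return "continue_evaluation"
    # A clear quality win is enough to promote.
    if e.score_margin >= MIN_MARGIN:
        return "promote"
    # Near tie: quality alone does not justify promotion,
    # so fall back to cost and latency.
    if abs(e.score_margin) < MIN_MARGIN:
        if e.cost_ratio < 1.0 and e.latency_ratio <= 1.0:
            return "promote"
        return "continue_evaluation"
    # Clearly worse than baseline (an assumption; the text does not cover this case).
    return "hold"


decisions = [
    decide(CandidateEvidence(0.10, 1.2, 1.1, 1, 0, 100)),   # schema failure
    decide(CandidateEvidence(0.10, 1.0, 1.0, 0, 0, 5)),     # thin evidence
    decide(CandidateEvidence(0.10, 1.2, 1.1, 0, 0, 100)),   # clear quality win
    decide(CandidateEvidence(0.005, 0.5, 0.9, 0, 0, 100)),  # near tie, cheaper and faster
    decide(CandidateEvidence(0.005, 1.5, 1.0, 0, 0, 100)),  # near tie, more expensive
]
```

Checking failures and evidence volume before looking at scores keeps a high score from masking a candidate that is unsafe or under-tested.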
Valid JSON is only part of the story. A model can return a perfectly shaped score and still be confidently wrong about which job to surface. Human review still matters for ambiguous fixtures, source conflicts, and cases where the output is technically valid but product behavior would be wrong.
Future enhancements focus on widening the evidence base and tightening the feedback loop: more edge-case and negative fixtures, regression tracking across prompt/model/provider versions, trace replay from production evidence, a tighter reviewer-annotation loop, and automated promote/rollback gates for lower-risk lanes.