mike@mikeharty.com:~ — zsh

~ $ imgcat mikeharty.png

Mike's portrait

~ $ ./mikeharty.sh

  _____        _ __       __            __      
 / ___ \__ _  (_) /_____ / /  ___  ____/ /___ __
/ / _ `/  ' \/ /  '_/ -_) _ \/ _ `/ __/ __/ // /
\ \_,_/_/_/_/_/_/\_\\__/_//_/\_,_/_/  \__/\_, / 
 \___/ rigor ≫ ritual                    /___/

~ $ mdcat about.md

Welcome to my humble website. I've been building things online since 1993 — that's all but 7 of my 39 years. I'm driven by a restless need to create and endless curiosity, which has taken me on many adventures. Take a look around, and feel free to reach out if you'd like to work together.

~ $ mdcat hobbies.md

cycling · hiking · 3d printing · cooking · snowboarding

~ $ mdcat links.md

github linkedin email
mike@mikeharty.com:~/work — zsh

~/work $ mdcat summary.md

real-time AI/ML inference, streaming media/data infrastructure, distributed systems at scale.

~/work $ mdcat 2025.md

2025 → Founder, principal engineer

Launching June '26

~/work $ mdcat 2024.md

2024–25 Staff engineer — streaming

Video encoding pipeline — quality improvements, cost reduction.

~/work $ mdcat 2023.md

2023–24 Staff engineer — live media

Backend services for live linear video delivery.

~/work $ mdcat 2022.md

2022–23 Software architect

Analytics platforms and payments infrastructure.

~/work $ mdcat 2021.md

2021–22 Principal engineer — video platform

VOD and live video at global scale.

~/work $ mdcat 2020.md

2019–21 Staff engineer — platform & auth

Identity, authorization, and event-driven platform services.

~/work $ mdcat 2018.md

2018–19 Staff engineer — video infra

Resilient live video ingest and delivery systems. US 11,196,785 B2

~/work $ mdcat before.md

<2018 The earlier years

Distributed systems, media infrastructure, and developer platforms.

~/work $ echo $TOOLBOX

go · rust · typescript · python · c++ · aws · gcp · k8s · kafka · postgres

mike@mikeharty.com:~/projects — zsh

~/projects $ mdcat nexste/README.md

2026 nexste case study

Soon: AI-driven job search cockpit. See case study for details.

~/projects $ mdcat mikeharty.com/README.md

2026 mikeharty.com

The rain is real.

~/projects $ mdcat refinery/README.md

2026 refinery

Iterative K-of-N ref selection with human feedback for voice cloning and custom TTS training.

~/projects $ mdcat esphome-pinout/README.md

2025 esphome pinout

Render ESP32/ESP8266/RP2040 pinouts in a VSCode pane, with mapping and pin info for ESPHome.

~/projects $ mdcat cronify/README.md

2025 cronify

Rust CLI that turns natural language into cron expressions.

~/projects $ mdcat translated.mov/README.md

WIP translated.mov soon

Babel for video.

mike@mikeharty.com:~/work/nexste/model-evaluation-harness.md — mdcat

$ mdcat nexste/model-eval.md

Building an eval harness for AI workflows

Nexste is a job-search product I'm building. It has not launched yet (open beta July 2026). It's a closed-loop workspace for discovering roles from direct sources, ranking them against a candidate profile, generating and tailoring artifacts, assisting with the application workflow, tracking application and interview communications over email, and turning those communications into clear next steps on a canonical job record.

01 / Context

Making model changes easier to trust

Nexste uses LLMs in a few different places: finding roles, scoring fit, tailoring artifacts, reading email threads, and pulling out next steps. Once those pieces started depending on each other, model changes needed clearer feedback than "this seems better." The system needed a way to show when a prompt or model change made one part better and another part worse.

Each workflow has its own version of "good." Discovery cares about source freshness. Ranking needs grounded evidence, not just a confident-sounding score. Resume tailoring has to stay truthful to the candidate. Email parsing needs to handle messy threads without inventing context. Structured outputs need to stay structured even when the input is thin or weird.

The harness is a way to compare those tradeoffs directly. It checks groundedness, schema shape, reasoning quality, cost, latency, source freshness, and failure behavior. Sometimes the best model for one lane is overkill for another. Sometimes the cheaper model is fine until the evidence gets ambiguous.

That makes model changes easier to reason about. The evals show what improved, what regressed, and where to look before the change reaches production.

02 / System

Keeping the workflow tied to one job record

Most of the workflow is easier to reason about when every signal about a role has one place to land. Search results, scorecards, resume variants, application status, email threads, interview signals, and follow-up actions all attach to the same canonical job record.

That record lets the pieces feed each other without losing context. Ranking can use discovery evidence. Artifact generation can use the ranked record. Communication tracking can update the same job instead of creating a parallel state that has to be reconciled later.
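
A minimal sketch of that idea, assuming an in-memory store: every stage looks up the same record by ID and appends to it instead of keeping its own copy. `JobRecord`, `JobStore`, and all field names are hypothetical and only illustrate the shape, not Nexste's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    """Hypothetical canonical record; field names are illustrative."""
    job_id: str
    canonical_url: str
    discovery_evidence: list[str] = field(default_factory=list)
    scorecards: list[dict] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)
    threads: list[str] = field(default_factory=list)
    next_actions: list[str] = field(default_factory=list)


class JobStore:
    """In-memory store keyed by job_id so every signal lands on one record."""

    def __init__(self) -> None:
        self._records: dict[str, JobRecord] = {}

    def upsert(self, job_id: str, canonical_url: str) -> JobRecord:
        # Reuse the existing record instead of creating a parallel one.
        if job_id not in self._records:
            self._records[job_id] = JobRecord(job_id, canonical_url)
        return self._records[job_id]

    def attach_thread(self, job_id: str, thread_id: str, next_action: str) -> JobRecord:
        # Raises KeyError if the job was never discovered.
        record = self._records[job_id]
        if thread_id not in record.threads:
            record.threads.append(thread_id)
        record.next_actions.append(next_action)
        return record


store = JobStore()
url = "https://example.com/jobs/1"
store.upsert("job-1", url).discovery_evidence.append("posting fetched from careers page")
store.upsert("job-1", url).scorecards.append({"fit": 0.8})
record = store.attach_thread("job-1", "thread-9", "reply to recruiter")
```

Because discovery, ranking, and comms tracking all write through the same key, there is no second copy of the job to reconcile later.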

[Figure: Nexste closed-loop workflow. Discovery (official/direct sources) → Ranking (profile + scorecards) → Artifacts (resume + apply assist) → Comms (threads + next steps) → Canonical job record (durable loop state), which feeds back into ranking.]
The canonical job record keeps search, ranking, artifacts, communications, and follow-up decisions attached to the same loop.
Model evaluation dashboard overview.

03 / Lanes

Choosing models by workflow lane

The harness compares candidate models by lane instead of trying to pick one global winner. Source disambiguation, job-list narrowing, email labeling, and next-action decisions have different risk profiles. A model can be worth its cost when it is resolving ambiguous sources, but unnecessary for simpler filtering. Another model may be fine for labeling an email and too risky when the next action affects a user's schedule or follow-up.

Each lane uses a configurable rubric that weighs accuracy, cost, latency, schema reliability, and failure behavior. The useful answer is not "best model overall." It is which model is reliable enough for this workflow at a cost that makes sense.
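
One way to picture a per-lane rubric is a weighted sum over normalized metrics. The sketch below assumes every metric is already scaled to [0, 1] with higher meaning better (so a cheap model has a high `cost` value); the weights, lane names, and `lane_score` function are illustrative assumptions, not the harness's real configuration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical per-lane weights; they are expected to sum to 1.0."""
    accuracy: float
    cost: float
    latency: float
    schema_reliability: float
    failure_behavior: float


def lane_score(rubric: Rubric, metrics: dict[str, float]) -> float:
    """Weighted sum of metrics normalized to [0, 1], higher is better."""
    weights = asdict(rubric)
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("rubric weights must sum to 1.0")
    return sum(weight * metrics[name] for name, weight in weights.items())


# Illustrative weights: discovery leans on accuracy, email labeling on cost.
RUBRICS = {
    "discovery": Rubric(0.5, 0.1, 0.1, 0.2, 0.1),
    "email_labeling": Rubric(0.3, 0.3, 0.2, 0.1, 0.1),
}

candidate = {
    "accuracy": 0.9,
    "cost": 0.4,
    "latency": 0.5,
    "schema_reliability": 1.0,
    "failure_behavior": 0.8,
}

scores = {lane: round(lane_score(r, candidate), 3) for lane, r in RUBRICS.items()}
```

The same candidate scores higher in the accuracy-heavy lane than in the cost-heavy one, which is the point of scoring by lane instead of picking one global winner.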

| Lane | Primary risk |
| --- | --- |
| Discovery | Source freshness, canonical URL selection (which of seven Amazon postings is the live one?), grounded evidence. |
| Ranking | Valid record IDs, bounded scores, dealbreaker handling, rationale grounded in the actual posting. |
| Comms tracking | Matching threads to the right job, classifying stage and next action, handling duplicate-thread ambiguity, telling “recruiter screen scheduled” apart from “rejection wrapped in pleasantries”. |
| Artifact import | Correction handling, hallucination control, signaling when a human should look. |

Lane-specific routing makes cost, latency, and failure behavior visible.

04 / Operation

Running cheap checks before live model spend

Evaluation runs in three modes. Static mode checks config, environment wiring, schema shape, and call-site readiness without making any provider calls. Dry-run mode builds fixture payloads and estimates cost. Live mode runs only with explicit confirmation and a budget cap.

That separation keeps routine regressions cheap to catch. Schema drift, bad config, and broken artifacts show up before money is spent. Live runs are saved for the cases where provider behavior is the thing being measured.
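
The gating between modes can be sketched as follows. This is a simplified illustration that assumes a flat per-call cost estimate; `Mode`, `plan_run`, and the parameter names are hypothetical, and a real dry run would estimate cost from the actual fixture payloads.

```python
from enum import Enum


class Mode(Enum):
    STATIC = "static"
    DRY_RUN = "dry-run"
    LIVE = "live"


def plan_run(
    mode: Mode,
    fixtures: list[dict],
    est_cost_per_call: float,
    budget_cap: float = 0.0,
    confirmed: bool = False,
) -> dict:
    """Decide how many provider calls a run may make, if any."""
    if mode is Mode.STATIC:
        # Config and schema checks only: never touches a provider.
        return {"mode": mode.value, "provider_calls": 0, "estimated_cost": 0.0}

    # Assumed flat estimate, purely for illustration.
    estimated = len(fixtures) * est_cost_per_call
    if mode is Mode.DRY_RUN:
        # Payloads are built and priced, but nothing is sent.
        return {"mode": mode.value, "provider_calls": 0, "estimated_cost": estimated}

    if not confirmed:
        raise PermissionError("live mode requires explicit confirmation")
    if estimated > budget_cap:
        raise ValueError(f"estimated cost {estimated:.2f} exceeds cap {budget_cap:.2f}")
    return {"mode": mode.value, "provider_calls": len(fixtures), "estimated_cost": estimated}


fixtures = [{"id": f"fx-{i}"} for i in range(4)]
static_plan = plan_run(Mode.STATIC, fixtures, est_cost_per_call=0.25)
dry_plan = plan_run(Mode.DRY_RUN, fixtures, est_cost_per_call=0.25)
live_plan = plan_run(Mode.LIVE, fixtures, 0.25, budget_cap=2.0, confirmed=True)
```

Only the last call is allowed to spend money, and it refuses to run without both the confirmation flag and headroom under the cap.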

Score, cost, and latency are considered together.
Record evidence is inspectable when a result needs review.

05 / Artifacts

Leaving useful evidence behind

A pass/fail result is too thin to debug. Each run leaves behind an interactive dashboard for lane outcomes and record drilldowns, a fixture-review UI for expected behavior and reviewer annotations, a markdown summary recommendation, and metrics for score, cost, latency, status, and schema failures.

Per-record JSON evidence and CSV score sheets are generated for deeper review and spreadsheet work. A Promptfoo config is also generated for broader workflow integration. The point is to make each run easy to inspect, compare, and repeat as prompts, models, providers, or fixtures change.

  • Review surface: Interactive dashboard. Lane outcomes, score/cost/latency comparisons, and per-record drilldowns.
  • Human review: Fixture review UI. Expected behavior, reviewer annotations, ambiguous cases, and source conflicts.
  • Decision log: Markdown recommendation. Promote, continue evaluation, or hold with the evidence behind that decision.
  • Forensics: CSV and JSON evidence. Record-level inputs, outputs, scores, failures, and spreadsheet-ready score sheets.
  • Operations: Prometheus-safe metrics. Bounded labels for lane, provider family, model label, status, and evaluator.
  • Interop: Promptfoo config. Generated evaluation config for teams or workflows that want external runner coverage.
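
As a rough illustration of the JSON-plus-CSV split, the sketch below renders the same run twice: full records as JSON for forensics, and a flat score sheet as CSV for spreadsheets. The column names in `FIELDS` and the record shape are assumptions made for the example, not the harness's actual output format.

```python
import csv
import io
import json

# Hypothetical flat columns for the spreadsheet-ready score sheet.
FIELDS = ["fixture_id", "lane", "model_label", "score", "cost_usd", "latency_ms", "status"]


def render_evidence(records: list[dict]) -> tuple[str, str]:
    """Return (json_text, csv_text) for one evaluation run."""
    # JSON keeps everything, including raw inputs and outputs.
    json_text = json.dumps(records, indent=2, sort_keys=True)

    # CSV keeps only the flat score columns; extra keys are dropped.
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return json_text, buffer.getvalue()


records = [
    {"fixture_id": "fx-001", "lane": "ranking", "model_label": "model-a",
     "score": 0.91, "cost_usd": 0.004, "latency_ms": 840, "status": "pass",
     "output": {"job_id": "job-1", "fit": 0.91}},
    {"fixture_id": "fx-002", "lane": "ranking", "model_label": "model-a",
     "score": 0.40, "cost_usd": 0.004, "latency_ms": 910, "status": "schema_fail",
     "output": None},
]

json_text, csv_text = render_evidence(records)
```

In a real run the two strings would be written to files next to the dashboard; returning them keeps the sketch free of filesystem assumptions.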

06 / Discipline

Keeping metrics bounded

Metrics are kept intentionally small. They include bounded dimensions such as lane, provider family, model label, status, and evaluator. Fixture IDs, input text, URLs, thread IDs, artifact paths, and failure reasons stay out of metric labels.

The metrics answer questions that need a quick operational read: which lane is regressing, which candidate is failing schema validation, which route got slower, and which decision state changed. The higher-cardinality evidence stays in JSON artifacts, CSV rows, dashboards, and review tools.
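
A small sketch of how that bound could be enforced in code, assuming a plain in-process counter in place of a real metrics client. The label names come from the list above; `BoundedCounter` and the `MAX_VALUES_PER_LABEL` cap are hypothetical.

```python
from collections import Counter

ALLOWED_LABELS = ("lane", "provider_family", "model_label", "status", "evaluator")
MAX_VALUES_PER_LABEL = 20  # assumed cap on distinct values per label


class BoundedCounter:
    """Counter that rejects unknown labels and runaway label cardinality."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()
        self._seen: dict[str, set[str]] = {name: set() for name in ALLOWED_LABELS}

    def inc(self, **labels: str) -> None:
        # Fixture IDs, URLs, thread IDs, and similar fields fail here.
        if set(labels) != set(ALLOWED_LABELS):
            raise ValueError(f"labels must be exactly {ALLOWED_LABELS}")
        for name, value in labels.items():
            seen = self._seen[name]
            if value not in seen and len(seen) >= MAX_VALUES_PER_LABEL:
                raise ValueError(f"too many distinct values for label {name!r}")
        # Record values only after the whole label set has been validated.
        for name, value in labels.items():
            self._seen[name].add(value)
        self._counts[tuple(labels[name] for name in ALLOWED_LABELS)] += 1

    def value(self, **labels: str) -> int:
        return self._counts[tuple(labels[name] for name in ALLOWED_LABELS)]


runs = BoundedCounter()
labels = {
    "lane": "ranking",
    "provider_family": "provider-x",
    "model_label": "model-a",
    "status": "schema_fail",
    "evaluator": "rubric-v1",
}
runs.inc(**labels)
runs.inc(**labels)
```

Anything that would push cardinality up, such as a per-fixture label, is rejected at the call site and belongs in the JSON or CSV evidence instead.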

Reviewers inspect expected behavior and attach annotations per fixture.

07 / Promotion

Deciding when a model is ready

Promotion is conservative, but it is not just a hard-coded pass/fail gate. A tiny score margin is treated as a reason to keep evaluating, not a reason to promote. Near-tied quality falls back to cost and latency. Schema or safety failures keep a candidate on hold. Thin static or dry-run evidence means the harness keeps collecting signal instead of guessing.
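
Those rules can be read as a small decision function. The sketch below is one interpretation, not the harness's actual gate: the thresholds `MIN_MARGIN` and `MIN_LIVE_RECORDS`, the `LaneResult` fields, and the choice to promote a near-tie only when the candidate is both cheaper and no slower are all assumptions.

```python
from dataclasses import dataclass

MIN_MARGIN = 0.02       # assumed: smaller score gaps count as a near-tie
MIN_LIVE_RECORDS = 30   # assumed: fewer live records counts as thin evidence


@dataclass
class LaneResult:
    score: float
    cost_usd: float
    latency_ms: float
    schema_failures: int
    safety_failures: int
    live_records: int


def decide(candidate: LaneResult, baseline: LaneResult) -> str:
    """Return 'promote', 'continue', or 'hold' for one lane."""
    # Schema or safety failures keep the candidate on hold.
    if candidate.schema_failures or candidate.safety_failures:
        return "hold"
    # Thin evidence means keep collecting signal.
    if candidate.live_records < MIN_LIVE_RECORDS:
        return "continue"

    margin = candidate.score - baseline.score
    if abs(margin) < MIN_MARGIN:
        # Near-tied quality falls back to cost and latency.
        cheaper = candidate.cost_usd < baseline.cost_usd
        no_slower = candidate.latency_ms <= baseline.latency_ms
        return "promote" if cheaper and no_slower else "continue"
    return "promote" if margin > 0 else "hold"


baseline = LaneResult(0.80, 0.010, 900, 0, 0, 100)
clear_win = decide(LaneResult(0.85, 0.012, 950, 0, 0, 100), baseline)
near_tie_cheaper = decide(LaneResult(0.805, 0.004, 700, 0, 0, 100), baseline)
near_tie_pricier = decide(LaneResult(0.805, 0.020, 700, 0, 0, 100), baseline)
schema_broken = decide(LaneResult(0.95, 0.004, 700, 3, 0, 100), baseline)
thin_evidence = decide(LaneResult(0.95, 0.004, 700, 0, 0, 5), baseline)
```

Note that the schema-broken candidate is held even though its score is the highest of the group, which matches the idea that a well-shaped score is only part of the story.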

Valid JSON is only part of the story. A model can return a perfectly shaped score and still be confidently wrong about which job to surface. Human review still matters for ambiguous fixtures, source conflicts, and cases where the output is technically valid but product behavior would be wrong.

Planned enhancements widen the evidence base and tighten the loop around it: more edge-case and negative fixtures, regression tracking across prompt/model/provider versions, trace replay from production evidence, a tighter reviewer-annotation loop, and automated promote/rollback gates for lower-risk lanes.
