2026-07 · Essay
Approval surfaces: designing the human loop.
Where the human actually sits in an HITL system, what they see, and how to design the queue so a single reviewer can clear hundreds of decisions an hour without losing the plot. The interaction-design rules we keep coming back to.
You can sell HITL all day in the abstract. The argument is clean: the AI does 95%, the human approves the 5% that matters, accountability stays intact. Everyone nods. Then the system ships, and three weeks in someone notices that the controller's “approval queue” has 4,000 unread items in it and they've been auto-clicking approve for a week.
That isn't a model failure. That's an interaction-design failure. HITL is only as good as the surface where the human meets the AI, and most teams treat that surface as an afterthought, a Slack channel or a generic queue UI bolted on at the end. The result is reviewers who burn out, queues that compound, and approvals that become rubber-stamps. The audit trail is technically intact. The judgement isn't.
We've shipped enough of these to know which surfaces hold up. Here are the rules we keep coming back to.
One human, one decision, one screen
The reviewer needs to see, on a single screen, without scrolling, without context-switching, everything required to say yes or no with confidence. The proposed action. The reasoning. The context the AI used. The explicit risk if it's wrong. A clear primary action and a clear escape hatch.
The trap is to dump “everything we know” on the screen. The discipline is to dump everything that bears on this decision. A controller approving a flagged transaction needs the AI's rationale, the rule it tripped, and a one-click view of the last 3 transactions from this vendor. They don't need the full chart of accounts, the company-wide MRR chart, or a Slack feed. Cognitive load is the throughput limit.
Asymmetric reversibility
Approve and reject should not be the same shape, the same color, or the same pixel distance from the cursor. The action that's harder to undo should be slightly harder to take.
In a finance approval queue, “reject and route to manual review” is cheap. “Approve” commits a transaction that may move money. The reject button gets a single keystroke; the approve button asks for an enter-key confirm or a brief comment. That friction costs the reviewer two seconds. Skipping it costs the operation a $40k mistake when the reviewer is tired.
The other direction matters too. In an outreach queue where rejection means “this lead never gets contacted,” the asymmetry flips: approval is the safe action, rejection deserves the friction. Design the friction toward whichever decision is harder to undo at scale.
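A small sketch of how we tend to encode that choice, with illustrative names (QueueConfig, needsConfirm) rather than any real framework: the confirm step is a property of the queue, not of the button.

```typescript
// Sketch: point the confirm-step friction at whichever action is harder to undo.
// QueueConfig and needsConfirm are illustrative names, not a real API.
type Action = "approve" | "reject";

interface QueueConfig {
  name: string;
  hardToUndo: Action; // the decision that commits something irreversible at scale
}

// The costlier action gets the extra keystroke; the cheap one stays a single key.
function needsConfirm(queue: QueueConfig, action: Action): boolean {
  return action === queue.hardToUndo;
}

// Finance: approving moves money, so approve carries the friction.
const flaggedTransactions: QueueConfig = { name: "flagged-transactions", hardToUndo: "approve" };
// Outreach: rejecting silently drops the lead, so the asymmetry flips.
const outboundLeads: QueueConfig = { name: "outbound-leads", hardToUndo: "reject" };
```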
Throughput by batching, not by removing review
The temptation when a queue gets long is to remove human review on the “easy” cases. Sometimes that's right. Usually it's the wrong solution to the wrong problem.
The right solution is batching: present similar decisions together so the human can group-act. A controller reviewing 80 flagged transactions should not see 80 modal dialogs. They should see a sortable table grouped by vendor, by category, by deviation magnitude, with the ability to select all rows in a group and approve them together when the pattern is the same. We've seen reviewers go from 12 decisions an hour to 200 with no change in accuracy, just by changing the surface from one-at-a-time to group-by-similarity.
Group-action does not mean rubber-stamp. It means the human reviewed the group, confirmed the pattern, and acted in one motion. The audit trail captures the group decision exactly the same way it would capture 80 individual ones.
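A minimal sketch of how group-action keeps the per-item trail, assuming illustrative types (FlaggedTransaction, AuditRow, approveGroup) rather than any real schema:

```typescript
// Group-by-similarity with a per-item audit trail; names are illustrative.
interface FlaggedTransaction {
  id: string;
  vendor: string;
  category: string;
  deviationPct: number; // how far this item deviates from the expected pattern
}

interface AuditRow {
  itemId: string;
  reviewerId: string;
  action: "approve" | "reject";
  groupId: string | null; // ties the row back to the one-motion group decision
  decidedAt: string;      // ISO timestamp
}

// Group similar items so the reviewer confirms the pattern once.
function groupBy<T>(items: T[], key: (item: T) => string): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const item of items) {
    const k = key(item);
    const bucket = groups.get(k) ?? [];
    bucket.push(item);
    groups.set(k, bucket);
  }
  return groups;
}

// One reviewer action over a group still writes one audit row per item,
// so the trail reads the same as 80 individual decisions would.
function approveGroup(group: FlaggedTransaction[], reviewerId: string, groupId: string): AuditRow[] {
  const decidedAt = new Date().toISOString();
  return group.map((tx): AuditRow => ({
    itemId: tx.id,
    reviewerId,
    action: "approve",
    groupId,
    decidedAt,
  }));
}

// Usage, roughly:
// const byVendor = groupBy(queue, (tx) => tx.vendor);
// const rows = approveGroup(byVendor.get("Acme Supplies") ?? [], "reviewer-42", "grp-001");
```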
Keyboard, always
Power users on review surfaces work with the keyboard or they don't work at high volume. J/K to navigate. A for approve, R for reject (with the asymmetric-reversibility friction noted above). ? for the shortcut overlay. Esc to step back to the list.
This sounds like a power-user nice-to-have. It's actually the difference between “I can clear the queue at end-of-day” and “the queue manages me.” A reviewer who has to mouse to a button on every decision is going to burn out, slow down, or shortcut-by-clicking-fast. Keyboard shortcuts are an accessibility feature for the operating team.
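The whole shortcut layer is small. A browser sketch, with placeholder handler names (selectNext, rejectAndRoute, and so on) standing in for the real queue actions:

```typescript
// Sketch of the shortcut layer; handler names are placeholders, not a real API.
let confirmArmed = false; // approve is armed first, then committed: the asymmetric friction

document.addEventListener("keydown", (event) => {
  // Ignore keystrokes while the reviewer is typing a note.
  const target = event.target as HTMLElement | null;
  if (target && (target.tagName === "INPUT" || target.tagName === "TEXTAREA")) return;

  switch (event.key) {
    case "j": selectNext(); break;              // next item in the queue
    case "k": selectPrevious(); break;          // previous item
    case "r": rejectAndRoute(); break;          // cheap, reversible: one keystroke
    case "a": confirmArmed = true; break;       // arm the costlier action...
    case "Enter":
      if (confirmArmed) { approveCurrent(); confirmArmed = false; } // ...then commit it
      break;
    case "?": toggleShortcutOverlay(); break;   // the shortcut cheat sheet
    case "Escape": confirmArmed = false; backToList(); break;
    default: return;                            // let everything else through untouched
  }
  event.preventDefault();
});

// Placeholder implementations so the sketch stands alone.
function selectNext() { /* focus the next row */ }
function selectPrevious() { /* focus the previous row */ }
function rejectAndRoute() { /* mark rejected, route to manual review */ }
function approveCurrent() { /* commit the approval, write the audit row */ }
function toggleShortcutOverlay() { /* show or hide the shortcut overlay */ }
function backToList() { /* return to the queue view */ }
```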
SLAs and ownership, named
Every approval surface needs an owner and a clock. Not "the team" but an actual name. Not "ASAP" but an actual SLA, ideally enforced by the system (escalation after N hours, an alert to a manager after M).
When something goes wrong in HITL, the question is always “who was supposed to look at this and didn't?” If the answer is fuzzy, the audit trail is fuzzy, which means the accountability is fuzzy, which means you don't actually have HITL. You have an unowned queue with optimistic SLAs.
We design every approval queue with: primary owner, backup owner, SLA, escalation path, and what happens when nobody acts in time. The most common “what happens” we land on is: revert to the safe default and notify the owner's manager. Almost never “auto-approve.”
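Written down as a queue policy, the contract looks something like the sketch below; the field names are ours for illustration, not a known schema.

```typescript
// An illustrative queue policy: owner, clock, escalation, and the timeout behaviour.
interface QueuePolicy {
  primaryOwner: string;            // a named person, not "the team"
  backupOwner: string;
  slaHours: number;                // the clock starts when the item enters the queue
  escalateAfterHours: number;      // ping the backup owner
  alertManagerAfterHours: number;  // then the owner's manager
  onTimeout: "revert_to_safe_default" | "hold"; // almost never "auto_approve"
  notifyOnTimeout: string[];
}

const flaggedTransactionsPolicy: QueuePolicy = {
  primaryOwner: "controller@example.com",
  backupOwner: "assistant-controller@example.com",
  slaHours: 8,
  escalateAfterHours: 4,
  alertManagerAfterHours: 12,
  onTimeout: "revert_to_safe_default",
  notifyOnTimeout: ["controllers-manager@example.com"],
};
```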
Show the AI's reasoning, and its uncertainty
Reviewers approve more accurately when the AI's reasoning is visible. They approve much more accurately when the AI's confidence is visible too. “The model says yes, 92% confidence” lands differently with a reviewer than “the model says yes” alone, even when the recommendation underneath is the same.
Show confidence honestly. Don't hide it because it's “confusing.” Don't round it to make it look more decisive than it is. The reviewer is the calibration layer; uncertainty is exactly what they need to do their job. We typically show: a recommended action, a one-line rationale, two or three citations to the underlying data, and a confidence band. That's it. No score-out-of-10 vanity. No emoji.
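Sketched as the shape of the card, that is all of it; the type names below are illustrative, not a real API.

```typescript
// The decision card as a type: action, rationale, citations, confidence, risk.
interface Citation {
  label: string; // e.g. "Invoice #8841", "Vendor history"
  url: string;
}

interface DecisionCard {
  proposedAction: string;                     // "Approve payment of $4,120 to Acme Supplies"
  rationale: string;                          // one line, not an essay
  citations: Citation[];                      // two or three links to the underlying data
  confidence: { low: number; high: number };  // a band, shown honestly, not rounded up
  riskIfWrong: string;                        // the explicit downside, stated plainly
}
```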
The audit log is a design surface, not a side effect
Every decision that comes out of an approval surface writes a row to an audit log: who decided, what they saw, what the AI proposed, what they did, when, and why (if they left a note). If the system is well designed, that log is also a surface in its own right: reviewers can see their own history, managers can audit their team's, and a year-later replay is one query away.
We treat the audit log as a first-class artifact, queryable, joinable to outcome data, reviewable by compliance. If you can't reconstruct “why was this approved last March?” in two minutes, your HITL is missing a leg.
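A sketch of the row and the year-later replay, with assumed field names and a hypothetical findApproval helper rather than a real schema:

```typescript
// The audit row: who decided, what they saw, what the AI proposed, what they did, when, why.
interface AuditLogRow {
  decisionId: string;
  itemId: string;
  reviewerId: string;                                  // who decided
  presentedCard: unknown;                              // a snapshot of exactly what they saw
  aiProposal: string;                                  // what the AI proposed
  reviewerAction: "approve" | "reject" | "escalate";   // what they did
  decidedAt: string;                                   // when (ISO timestamp)
  note?: string;                                       // why, if they left one
}

// "Why was this approved last March?" should be one lookup, not an archaeology dig.
function findApproval(log: AuditLogRow[], itemId: string): AuditLogRow | undefined {
  return log.find((row) => row.itemId === itemId && row.reviewerAction === "approve");
}
```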
Numbers we watch
A few metrics tell you whether your approval surface is healthy:
- Decision rate per reviewer-hour. Should be high and stable. A drop is the early signal of fatigue or an interface problem.
- Override rate. The fraction of AI-recommended actions the reviewer reverses. Too low (under ~5%) means the reviewer isn't actually reading; too high (over ~30%) means the model is undertrained or the surface is showing them too many low-confidence cases. The healthy band is 5–20% in most operations we've worked on.
- Time-in-queue. Median time a decision sits before a human gets to it. Tracks against SLA and is the single best signal for whether you need more reviewers, faster batching, or a different threshold for what hits the queue.
- Post-approval correction rate. Decisions that were approved and then had to be reversed downstream. The bottom-line accuracy signal. If this is creeping up while override rate stays flat, the reviewer is rubber-stamping.
We instrument every approval surface to emit these. They're what we look at in the weekly Operate forum.
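As one example of that instrumentation, a minimal sketch of the override-rate check using the bands above; the Decision type and the function names are illustrative.

```typescript
// Override rate: the fraction of AI recommendations the reviewer reverses.
interface Decision {
  aiRecommendation: "approve" | "reject";
  reviewerAction: "approve" | "reject";
}

function overrideRate(decisions: Decision[]): number {
  if (decisions.length === 0) return 0;
  const overrides = decisions.filter((d) => d.reviewerAction !== d.aiRecommendation).length;
  return overrides / decisions.length;
}

// Bands from the list above: under ~5% suggests rubber-stamping, over ~30% suggests
// the model or the routing threshold needs work, 5-20% is the usual healthy range.
function overrideRateSignal(rate: number): "too_low" | "healthy" | "watch" | "too_high" {
  if (rate < 0.05) return "too_low";   // the reviewer may not actually be reading
  if (rate <= 0.2) return "healthy";   // the band we usually see in well-run queues
  if (rate <= 0.3) return "watch";     // drifting: check the model and what hits the queue
  return "too_high";                   // undertrained model or a mis-scoped queue
}
```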
The whole point
An approval surface is the only piece of an AI system that the operating team actually uses. The model can be perfect and the integrations can be airtight, and if the surface is a 4,000-row queue with a single approve button, the system fails on contact with reality.
We design the surface as carefully as we design anything else. It's where AI stops being a demo and starts being an operating tool.
If you have a queue people aren't reading
We can help you redesign the surface.
If your team has an HITL system that's technically working but functionally rubber-stamping, that's the problem we like best. A 30-minute call usually surfaces what to change first.