Tokens per useful wakeup

We have been building a small benchmark around proactive email watchers. The early result is simple: useful wakeups are the right unit of measurement. On one 50-event slice, Watchline plus downstream OpenClaw used 68.2% fewer source calls, 50.0% fewer downstream agent calls, and 91.0% fewer downstream agent tokens than a polling baseline, while staying close on F1. The benchmark repo will be published soon.

By Harpinder Singh | 8 min read | Published 2026-05-23 | Updated 2026-05-23
[Figure: token efficiency scene where many token blocks are filtered down to a few useful wakeup cards.]

Why we needed another metric

Reactive assistants have a natural accounting boundary: the user asks, the model responds. Proactive agents do not. They may inspect state many times before producing one useful interruption.

That changes the cost model. If an agent wakes 100 times and only 3 wakeups are useful, the important number is not the price of one model call. It is the total cost required to create those 3 useful interventions.
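The accounting described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the `Wakeup` class, the `tokens_per_useful_wakeup` function, and the 1,000-token figure are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Wakeup:
    tokens: int   # tokens spent on this wakeup (the check plus any agent work)
    useful: bool  # True if the wakeup produced an intervention the user valued


def tokens_per_useful_wakeup(wakeups: list[Wakeup]) -> float:
    """Total tokens across all wakeups divided by the number of useful ones."""
    total_tokens = sum(w.tokens for w in wakeups)
    useful = sum(1 for w in wakeups if w.useful)
    if useful == 0:
        return float("inf")  # all cost, no useful interventions
    return total_tokens / useful


# Illustrative numbers only: 100 wakeups at a made-up 1,000 tokens each, 3 useful.
wakeups = [Wakeup(tokens=1_000, useful=(i < 3)) for i in range(100)]
cost = tokens_per_useful_wakeup(wakeups)
```

With these placeholder values, each useful intervention carries roughly 33 times the cost of a single call, because the 97 empty wakeups are charged to the 3 that mattered.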

Recent proactive-agent benchmarks are moving in this direction. ProAgentBench studies whether agents intervene at the right time in real computer-use sessions: https://arxiv.org/abs/2602.04482. PROEVENT focuses on event-centric proactive assistance and reports false detection and missed-need behavior: https://openreview.net/forum?id=wypdOy0HrM. KnowU-Bench includes proactive mobile-assistant tasks in personalized settings: https://arxiv.org/abs/2604.08455.

Those benchmarks are valuable because they treat timing as a first-class problem. For Watchline, we wanted a narrower infrastructure question: how much agent work can be avoided if future conditions are matched before the expensive assistant wakes?

WatchBench Email v0

The current WatchBench Email v0 artifact uses a synthetic but realistic inbox stream for Priya Nair, an engineering manager at a platform company. The full dataset contains:

The watch intents are the kind of things users actually ask assistants to monitor: production incidents, customer escalations, recruiting updates, vendor threads, CI/CD alerts, billing issues, and internal coordination.

A sample watch looks like this: match internal loopstack.io messages about production incidents, outages, degradation, or customer-impacting reliability issues. A sample event might be a GitHub Actions build-passed email that should not trigger the watch. The benchmark is mostly about saying no cheaply and preserving evidence when the answer is yes.
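To make "saying no cheaply" concrete, here is a hedged sketch of a local check for the sample incident watch. It is not Watchline's matcher: the `Email` class, the keyword pattern, and the example addresses are hypothetical, and a real match layer would need richer semantics than a regular expression.

```python
import re
from dataclasses import dataclass


@dataclass
class Email:
    sender: str
    subject: str
    body: str


# Hypothetical keyword pattern for the incident watch (an assumption, not the real rule).
INCIDENT_PATTERN = re.compile(
    r"\b(incident|outage|degradation|degraded|customer[- ]impacting)\b",
    re.IGNORECASE,
)


def matches_incident_watch(email: Email) -> bool:
    """Cheap local check: internal sender AND incident-like wording."""
    is_internal = email.sender.lower().endswith("@loopstack.io")
    text = f"{email.subject}\n{email.body}"
    return is_internal and INCIDENT_PATTERN.search(text) is not None


# Made-up events: a CI notification that should not wake the agent, and one that should.
build_passed = Email("notifications@github.com", "Build passed: main", "All checks passed.")
outage = Email("oncall@loopstack.io", "Production outage in checkout", "Investigating now.")

build_result = matches_incident_watch(build_passed)
outage_result = matches_incident_watch(outage)
```

The point of the design is that the first event is rejected without any model call, while the second is passed downstream together with the fields that justified the match.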

The 50-event slice

For the first reportable slice, we evaluated the first 50 events against the first 5 watches: 250 watch-event pairs with 19 positives. The comparison used 60-minute simulated polling or pull ticks.

The Watchline path used a local match layer with downstream OpenClaw for matched deliveries. In that slice:

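The baseline scored better on pure label quality. That matters, and we should say it plainly. But the cost side moved in the other direction: Watchline used 68.2% fewer source calls, 50.0% fewer downstream agent calls, and 91.0% fewer downstream agent tokens than the polling baseline.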

That is the shape we care about. The product question is not "can we avoid every model call?" It is "can we spend expensive agent work only when the evidence suggests the user would value the interruption?"
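A "percent fewer" figure is easier to reason about as the fraction of baseline work that remains. The sketch below converts the reductions reported for the 50-event slice; the `remaining_fraction` helper and the dictionary keys are names invented for this example.

```python
def remaining_fraction(percent_fewer: float) -> float:
    """Convert 'X% fewer than baseline' into the fraction of baseline still used."""
    if not 0.0 <= percent_fewer <= 100.0:
        raise ValueError("percent_fewer must be between 0 and 100")
    return 1.0 - percent_fewer / 100.0


# Reductions reported for the 50-event slice in this post.
reported = {
    "source_calls": 68.2,
    "downstream_agent_calls": 50.0,
    "downstream_agent_tokens": 91.0,
}

remaining = {name: remaining_fraction(pct) for name, pct in reported.items()}
# Roughly 0.318 of baseline source calls, 0.5 of agent calls, 0.09 of agent tokens.
```

Read this way, the token result means the Watchline path spent about one eleventh of the baseline's downstream agent tokens on that slice.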

What the mini run taught us

We also ran a smaller high-positive-density slice: 100 events, 5 watches, 500 labels, 77 positives. On a 5-event downstream run, both paths achieved an F1 of 1.0. Watchline used fewer source calls and fewer downstream calls than the baseline, but more downstream tokens.

That is useful, not embarrassing. If nearly everything is relevant, prefiltering has less room to help. Benchmarks need positive-rate regimes because proactive systems behave differently in sparse inboxes, noisy inboxes, and incident-heavy periods.
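The positive rate of each slice follows directly from the counts given in this post. A small sketch, with `positive_rate` as a helper name chosen for the example:

```python
def positive_rate(positives: int, labels: int) -> float:
    """Fraction of watch-event pairs labeled positive."""
    if labels <= 0:
        raise ValueError("labels must be positive")
    return positives / labels


# Counts taken from the two slices described in this post.
sparse_slice = positive_rate(positives=19, labels=50 * 5)   # 19 of 250 pairs
dense_slice = positive_rate(positives=77, labels=100 * 5)   # 77 of 500 pairs

print(f"50-event slice:  {sparse_slice:.1%}")
print(f"100-event slice: {dense_slice:.1%}")
```

The second slice has about twice the positive rate of the first (15.4% versus 7.6%), which is why results from one regime should not be read as results for the other.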

One 10-event probe exposed a concrete false positive: a "to" versus "cc" precision miss around email_0010, where the matcher did not distinguish a direct recipient from a copied one. That is the kind of failure a useful benchmark should surface. It tells us where the matcher needs sharper field semantics before downstream delivery.
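One way to express sharper field semantics is to treat To and Cc as separate signals instead of one recipient list. The sketch below uses Python's standard `email` package; the `addressed_directly` function and the example addresses are hypothetical, and this is not a description of how Watchline parses messages.

```python
from email.message import EmailMessage
from email.utils import getaddresses


def addressed_directly(msg: EmailMessage, user_address: str) -> bool:
    """True only if the user appears in the To header, not merely in Cc."""
    to_headers = msg.get_all("To", [])
    to_addresses = {addr.lower() for _, addr in getaddresses(to_headers)}
    return user_address.lower() in to_addresses


# Made-up message in which the user is only copied.
msg = EmailMessage()
msg["From"] = "alerts@loopstack.io"
msg["To"] = "platform-team@loopstack.io"
msg["Cc"] = "Priya Nair <priya@loopstack.io>"
msg["Subject"] = "Weekly vendor thread"

direct = addressed_directly(msg, "priya@loopstack.io")
```

A watch phrased as "messages sent to me" would reject this event, because the user's address appears only in Cc. A matcher that merges both headers would accept it, which is one way a false positive of this kind could arise.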

What we will publish next

The benchmark repo is not public yet. We are still cleaning the dataset card, evaluation scripts, and reporting format so the methodology is reproducible instead of merely interesting.

When it is published, the repo should include the stream, watches, events, labels, runner scripts, and reports. It should also make caveats visible:

FAQ

What is a useful wakeup?

A useful wakeup is a true positive interruption: the system observed a future event, matched it to a durable user intent, and delivered something the downstream agent should actually handle.

Why not benchmark only precision and recall?

Precision and recall are necessary, but incomplete. A proactive system can have strong F1 while spending too much on empty checks. Cost, latency, and false wakeups are part of the product behavior.
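As a rough illustration of why label quality alone is incomplete, the sketch below reports F1 next to false wakeups and tokens per useful wakeup. The `RunCounts` class and every number in the example are placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class RunCounts:
    true_positives: int   # useful wakeups
    false_positives: int  # false wakeups: deliveries the user did not need
    false_negatives: int  # missed needs
    agent_tokens: int     # downstream agent tokens for the whole run


def summarize(c: RunCounts) -> dict[str, float]:
    """Label-quality metrics plus cost per useful wakeup for one run."""
    predicted = c.true_positives + c.false_positives
    actual = c.true_positives + c.false_negatives
    precision = c.true_positives / predicted if predicted else 0.0
    recall = c.true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    per_useful = c.agent_tokens / c.true_positives if c.true_positives else float("inf")
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_wakeups": float(c.false_positives),
        "tokens_per_useful_wakeup": per_useful,
    }


# Placeholder runs: identical confusion counts, very different token spend.
run_a = summarize(RunCounts(8, 2, 2, agent_tokens=40_000))
run_b = summarize(RunCounts(8, 2, 2, agent_tokens=4_000))
```

Both placeholder runs have the same precision, recall, and F1, yet one spends ten times as many tokens per useful wakeup. A report that stops at F1 cannot tell them apart.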

Is WatchBench Email v0 a final leaderboard?

No. It is a v0 research artifact. The early numbers are useful for direction, but the public repo should make reproduction, caveats, and comparison rules explicit before anyone treats it as a leaderboard.

Measure the cost of deciding when to think, not just the answer after thinking.