# Tokens per useful wakeup

A first look at WatchBench Email v0 and why proactive-agent benchmarks should measure useful interruptions, not only task success.

Published: 2026-05-23
Updated: 2026-05-23
Canonical: https://watch.qordinate.ai/blog/tokens-per-useful-wakeup
Markdown: https://watch.qordinate.ai/blog/tokens-per-useful-wakeup.md
Author: Harpinder Singh
Author URL: https://www.linkedin.com/in/singhcoder/
Image: https://watch.qordinate.ai/images/blog/tokens-per-useful-wakeup.jpg

Tags:
- WatchBench
- benchmarks
- proactive agents
- token cost

## Short answer

We have been building a small benchmark around proactive email watchers. The early result is simple: useful wakeups are the right unit of measurement. On one 50-event slice, Watchline plus downstream OpenClaw used 68.2% fewer source calls, 50.0% fewer downstream agent calls, and 91.0% fewer downstream agent tokens than a polling baseline, while staying close on F1 (0.895 versus 0.973). The benchmark repo will be published soon.

## Key takeaways

- Proactive-agent cost should be measured per useful wakeup, not per chat turn.
- A polling baseline can score well while spending too many calls on non-events.
- WatchBench Email v0 evaluates future conditions across inbox events and watch intents.
- The current artifact is a research slice, not a final leaderboard.
- The most important caveat: matching cost, false positives, and latency all need to be reported together.

## Why we needed another metric

Reactive assistants have a natural accounting boundary: the user asks, the model responds. Proactive agents do not. They may inspect state many times before producing one useful interruption.

That changes the cost model. If an agent wakes 100 times and only 3 wakeups are useful, the important number is not the price of one model call. It is the total cost required to create those 3 useful interventions.
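The arithmetic behind that claim is small enough to sketch. The function name and the per-wakeup price below are illustrative assumptions, not measured figures; only the 100-wakeup, 3-useful example comes from the text above.

```python
def cost_per_useful_wakeup(total_wakeups: int, useful_wakeups: int,
                           cost_per_wakeup: float) -> float:
    """Total spend across all wakeups, divided by the useful ones."""
    if useful_wakeups <= 0:
        raise ValueError("no useful wakeups: the metric is undefined")
    return total_wakeups * cost_per_wakeup / useful_wakeups


# Hypothetical price of 0.01 (any currency unit) per wakeup.
per_call = 0.01
per_useful = cost_per_useful_wakeup(total_wakeups=100, useful_wakeups=3,
                                    cost_per_wakeup=per_call)

# The cost per useful wakeup is roughly 33x the cost of a single call.
print(f"per call: {per_call:.3f}, per useful wakeup: {per_useful:.3f}")
```

The ratio between the two printed numbers is what a per-call price hides: it grows as the share of useful wakeups shrinks.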

Recent proactive-agent benchmarks are moving in this direction. ProAgentBench studies whether agents intervene at the right time in real computer-use sessions: https://arxiv.org/abs/2602.04482. PROEVENT focuses on event-centric proactive assistance and reports false detection and missed-need behavior: https://openreview.net/forum?id=wypdOy0HrM. KnowU-Bench includes proactive mobile-assistant tasks in personalized settings: https://arxiv.org/abs/2604.08455.

Those benchmarks are valuable because they treat timing as a first-class problem. For Watchline, we wanted a narrower infrastructure question: how much agent work can be avoided if future conditions are matched before the expensive assistant wakes?

## WatchBench Email v0

The current WatchBench Email v0 artifact uses a synthetic but realistic inbox stream for Priya Nair, an engineering manager at a platform company. The full dataset contains:

- 500 chronological email events in one inbox stream;
- 20 fully resolved watch intents;
- 10,000 watch-event labels;
- 412 positive labels;
- a 4.1% positive label rate.

The watch intents are the kind of things users actually ask assistants to monitor: production incidents, customer escalations, recruiting updates, vendor threads, CI/CD alerts, billing issues, and internal coordination.

A sample watch looks like this: match internal `loopstack.io` messages about production incidents, outages, degradation, or customer-impacting reliability issues. A sample event might be a GitHub Actions build-passed email that should not trigger the watch. The benchmark is mostly about saying no cheaply and preserving evidence when the answer is yes.
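To make the sample watch concrete, here is a minimal sketch of a cheap "say no first" check. It is not Watchline's matcher: the `EmailEvent` fields, the keyword list, the event ids, and the sender addresses are all hypothetical, and plain keyword matching stands in for whatever the real match layer does.

```python
from dataclasses import dataclass


@dataclass
class EmailEvent:
    event_id: str
    sender: str
    subject: str
    body: str


# Hypothetical keyword list derived from the wording of the sample watch.
INCIDENT_TERMS = ("incident", "outage", "degradation", "degraded",
                  "customer-impacting")


def matches_incident_watch(event: EmailEvent) -> bool:
    """Reject on sender domain first (cheap), then look for incident terms."""
    domain = event.sender.rsplit("@", 1)[-1].lower()
    if domain != "loopstack.io":
        return False
    text = f"{event.subject} {event.body}".lower()
    return any(term in text for term in INCIDENT_TERMS)


# Illustrative events, not taken from the dataset.
build_passed = EmailEvent("example_a", "notifications@github.com",
                          "Build passed: main", "All checks succeeded.")
incident = EmailEvent("example_b", "oncall@loopstack.io",
                      "Production outage in checkout",
                      "Customers are seeing errors.")

print(matches_incident_watch(build_passed), matches_incident_watch(incident))
```

The build-passed email is rejected on the domain check alone, before any text is inspected, which is the kind of cheap "no" the benchmark is meant to reward.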

## The 50-event slice

For the first reportable slice, we evaluated the first 50 events against the first 5 watches: 250 watch-event pairs with 19 positives. The comparison used 60-minute simulated polling or pull ticks.

The Watchline path used a local match layer with downstream OpenClaw for matched deliveries. In that slice:

- Watchline plus downstream OpenClaw: 17 true positives, 2 false positives, 2 false negatives, precision 0.895, recall 0.895, F1 0.895.
- OpenClaw polling baseline: 18 true positives, 0 false positives, 1 false negative, precision 1.000, recall 0.947, F1 0.973.
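
The scores above follow directly from the confusion counts. A short sketch to reproduce them, where the helper name `prf1` is ours and not part of any benchmark tooling:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported for the 50-event slice.
watchline = prf1(tp=17, fp=2, fn=2)
baseline = prf1(tp=18, fp=0, fn=1)

for name, scores in (("watchline", watchline), ("baseline", baseline)):
    p, r, f = scores
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Both paths account for all 19 positives in the slice (true positives plus false negatives), so the two rows are scored against the same label set.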

The baseline scored better on pure label quality. That matters, and we should say it plainly. But the cost side moved in the other direction:

- source calls fell from 157 to 50, a 68.2% reduction;
- downstream agent calls fell from 42 to 21, a 50.0% reduction;
- downstream agent tokens fell from 508,583 to 45,904, a 91.0% reduction.

That is the shape we care about. The product question is not "can we avoid every model call?" It is "can we spend expensive agent work only when the evidence suggests the user would value the interruption?"
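
The reductions can be recomputed from the raw counts, and the same counts give a rough tokens-per-useful-wakeup figure. The per-wakeup numbers below are derived here for illustration, not reported results. They treat a useful wakeup as a true positive and, like the slice itself, do not include any cost for the Watchline match layer.

```python
def reduction_pct(baseline: float, candidate: float) -> float:
    """Percentage reduction of candidate relative to baseline."""
    return 100.0 * (baseline - candidate) / baseline


# Raw counts from the 50-event slice: (polling baseline, Watchline path).
source_calls = (157, 50)
agent_calls = (42, 21)
agent_tokens = (508_583, 45_904)

for label, (base, cand) in (("source calls", source_calls),
                            ("agent calls", agent_calls),
                            ("agent tokens", agent_tokens)):
    print(f"{label}: {reduction_pct(base, cand):.1f}% reduction")

# Downstream tokens divided by true positives (18 baseline, 17 Watchline).
baseline_tokens_per_useful = agent_tokens[0] / 18
watchline_tokens_per_useful = agent_tokens[1] / 17
print(f"baseline: {baseline_tokens_per_useful:.0f} tokens per useful wakeup")
print(f"watchline: {watchline_tokens_per_useful:.0f} tokens per useful wakeup")
```

Dividing by true positives rather than total calls means a path that misses positives is penalized in the denominator, which keeps the cost figure tied to label quality.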

## What the mini run taught us

We also ran a separate high-positive-density slice: 100 events, 5 watches, 500 labels, 77 positives (a 15.4% positive rate, versus 4.1% for the full dataset). On a 5-event downstream run, both paths achieved F1 1.0. Watchline used fewer source and downstream calls, but more downstream tokens than the baseline.

That is useful, not embarrassing. If nearly everything is relevant, prefiltering has less room to help. Benchmarks need to cover several positive-rate regimes because proactive systems behave differently in sparse inboxes, noisy inboxes, and incident-heavy periods.

One 10-event probe exposed a concrete false positive: a `to` versus `cc` precision miss around `email_0010`. That is the kind of failure a useful benchmark should surface. It tells us where the matcher needs sharper field semantics before downstream delivery.
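
The details of that miss are not spelled out here, so the following is only a sketch of the general failure class. It assumes a watch that should fire only when the user is a direct `to` recipient; the `Recipients` type and the addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Recipients:
    to: list[str] = field(default_factory=list)
    cc: list[str] = field(default_factory=list)


def loosely_addressed(recipients: Recipients, user: str) -> bool:
    """Lumps `to` and `cc` together: fires when the user is merely copied."""
    everyone = recipients.to + recipients.cc
    return user.lower() in (addr.lower() for addr in everyone)


def directly_addressed(recipients: Recipients, user: str) -> bool:
    """Stricter field semantics: only the `to` line counts."""
    return user.lower() in (addr.lower() for addr in recipients.to)


# Hypothetical message where the user is only on the cc line.
user = "priya@loopstack.io"
copied_only = Recipients(to=["oncall@loopstack.io"], cc=[user])

print(loosely_addressed(copied_only, user))   # the loose check fires
print(directly_addressed(copied_only, user))  # the strict check does not
```

A benchmark label that distinguishes the two cases is what lets this kind of miss show up as a counted false positive instead of going unnoticed.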

## What we will publish next

The benchmark repo is not public yet. We are still cleaning the dataset card, evaluation scripts, and reporting format so the methodology is reproducible instead of merely interesting.

When it is published, the repo should include the stream, watches, events, labels, runner scripts, and reports. It should also make caveats visible:

- the Watchline matching cost is not yet modeled as a paid model call;
- synthetic email data is safer for publishing but cannot fully represent real inbox messiness;
- latency depends on polling/pull cadence and should be reported beside accuracy;
- token savings are only meaningful when precision remains high enough to preserve trust.
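
The latency caveat can be illustrated with a small sketch. It assumes ticks aligned at minute 0 and every `interval_minutes` after that, and an event that is detected at the first tick at or after its arrival. The benchmark's actual tick alignment is not specified here, so treat this as one possible model.

```python
import math


def polling_latency_minutes(arrival_minute: float,
                            interval_minutes: float = 60.0) -> float:
    """Delay between an event arriving and the next polling tick."""
    next_tick = math.ceil(arrival_minute / interval_minutes) * interval_minutes
    return next_tick - arrival_minute


# An event just after a tick waits almost a full interval;
# an event just before a tick is picked up almost immediately.
just_after = polling_latency_minutes(61)    # next tick at minute 120
just_before = polling_latency_minutes(119)  # next tick at minute 120
print(just_after, just_before)
```

Under this model, two systems with identical F1 can differ by nearly a full polling interval in how stale a delivered wakeup is, which is why cadence belongs in the same report as accuracy.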

## FAQ

### What is a useful wakeup?

A useful wakeup is a true positive interruption: the system observed a future event, matched it to a durable user intent, and delivered something the downstream agent should actually handle.

### Why not benchmark only precision and recall?

Precision and recall are necessary, but incomplete. A proactive system can have strong F1 while spending too much on empty checks. Cost, latency, and false wakeups are part of the product behavior.

### Is WatchBench Email v0 a final leaderboard?

No. It is a v0 research artifact. The early numbers are useful for direction, but the public repo should make reproduction, caveats, and comparison rules explicit before anyone treats it as a leaderboard.

## Further reading

- https://arxiv.org/abs/2602.04482
- https://openreview.net/forum?id=wypdOy0HrM
- https://arxiv.org/abs/2604.08455
- https://docs.litellm.ai/docs/completion/token_usage
- https://svix.com/resources/faq/webhooks-vs-api-polling
