WatchBench Email v0 is a synthetic email benchmark for measuring when an agent should wake up for a future user intent. On the canonical 50-event, 5-watch slice, Watchline plus downstream OpenClaw used 68.2% fewer source calls, 50.0% fewer downstream agent calls, and 91.0% fewer downstream agent tokens than a polling baseline.

Short answer

The benchmark artifact is available on GitHub at qordinate-ai/watchbench. The dataset is also available on Hugging Face as watchline/watchbench-email-v0.

Key takeaways

Proactive-agent cost should be measured per useful wakeup, not per chat turn.
A polling baseline can score well on precision and recall while spending too many calls on non-events.
WatchBench Email v0 evaluates future conditions across inbox events and watch intents.
The public artifact reports a reproducible slice with exact comparison rules.
The dataset is available as JSONL files in the repo and as loadable splits on Hugging Face.

Why we needed another metric

Reactive assistants have a natural accounting boundary: the user asks, the model responds. Proactive agents do not. They may inspect state many times before producing one useful interruption.

That changes the cost model. If an agent wakes 100 times and only 3 wakeups are useful, the important number is not the price of one model call. It is the total cost required to create those 3 useful interventions.
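A minimal sketch of that accounting, using the 100-wakeup example above. The `WakeupLedger` type, its field names, and the flat cost of 1 unit per wakeup are assumptions made for illustration; they are not part of the benchmark artifact.

```python
from dataclasses import dataclass


@dataclass
class WakeupLedger:
    """Totals for one evaluation run (hypothetical structure)."""
    total_wakeups: int    # every time the agent was woken
    useful_wakeups: int   # wakeups that produced a useful interruption
    total_cost: float     # total spend across all wakeups, in any unit


def cost_per_useful_wakeup(ledger: WakeupLedger) -> float:
    """Total cost divided by the number of useful wakeups."""
    if ledger.useful_wakeups == 0:
        return float("inf")  # cost was spent, nothing useful was delivered
    return ledger.total_cost / ledger.useful_wakeups


# Assumed numbers: 100 wakeups at 1 cost unit each, 3 of them useful.
ledger = WakeupLedger(total_wakeups=100, useful_wakeups=3, total_cost=100.0)
per_call = ledger.total_cost / ledger.total_wakeups
per_useful = cost_per_useful_wakeup(ledger)

print(f"cost per call:          {per_call:.2f}")
print(f"cost per useful wakeup: {per_useful:.2f}")
```

Under these assumptions the per-call price looks small, while each useful intervention carries the cost of roughly 33 calls.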

Recent proactive-agent benchmarks are moving in this direction. ProAgentBench (https://arxiv.org/abs/2602.04482) studies whether agents intervene at the right time in real computer-use sessions. PROEVENT (https://openreview.net/forum?id=wypdOy0HrM) focuses on event-centric proactive assistance and reports false detection and missed-need behavior. KnowU-Bench (https://arxiv.org/abs/2604.08455) includes proactive mobile-assistant tasks in personalized settings.

Those benchmarks are valuable because they treat timing as a first-class problem. WatchBench Email v0 focuses on a narrower infrastructure question: how much agent work can be avoided if future conditions are matched before the expensive assistant wakes?

WatchBench Email v0

WatchBench Email v0 uses a synthetic but realistic inbox stream for Priya Nair, an engineering manager at a platform company. The full dataset contains:

500 chronological email events in one inbox stream;
20 fully resolved watch intents;
10,000 watch-event labels;
412 positive labels;
a 4.1% positive label rate.
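The counts above are internally consistent: 500 events times 20 watches gives 10,000 labels, and 412 positives out of 10,000 is 4.12%. The sketch below shows one way to recompute a positive rate from JSONL label records. The field names (`watch_id`, `event_id`, `label`) and the sample records are assumptions for illustration; check the repo for the actual schema.

```python
import json
from typing import Iterable


def count_positive_labels(
    jsonl_lines: Iterable[str], label_key: str = "label"
) -> tuple[int, int, float]:
    """Return (positives, total, positive_rate) for JSONL label records."""
    total = 0
    positives = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        total += 1
        if record[label_key]:
            positives += 1
    rate = positives / total if total else 0.0
    return positives, total, rate


# Hypothetical records; the real files may use different keys.
sample = [
    '{"watch_id": "watch_01", "event_id": "email_0001", "label": 0}',
    '{"watch_id": "watch_01", "event_id": "email_0002", "label": 1}',
    '{"watch_id": "watch_02", "event_id": "email_0001", "label": 0}',
    '{"watch_id": "watch_02", "event_id": "email_0002", "label": 0}',
]
positives, total, rate = count_positive_labels(sample)

# Sanity check on the figures reported in the text.
expected_labels = 500 * 20
reported_rate = 412 / expected_labels
```

To run it against a real file, pass an open file handle instead of the `sample` list.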

The watch intents are the kinds of things users actually ask assistants to monitor: production incidents, customer escalations, recruiting updates, vendor threads, CI/CD alerts, billing issues, and internal coordination.

A sample watch looks like this: match internal loopstack.io messages about production incidents, outages, degradation, or customer-impacting reliability issues. A sample event might be a GitHub Actions build-passed email that should not trigger the watch. The benchmark is mostly about saying no cheaply and preserving evidence when the answer is yes.
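To make the sample watch concrete, here is a deliberately simple sketch of a predicate that rejects cheaply and keeps evidence on a match. It is not the Watchline matcher: the `Email` and `MatchResult` types, the keyword list, and both sample messages are assumptions for illustration, and a real matcher would likely need more than substring checks.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    sender: str
    subject: str
    body: str


@dataclass
class MatchResult:
    matched: bool
    evidence: list[str] = field(default_factory=list)  # terms that triggered the match


# Assumed keyword list derived from the sample watch description.
INCIDENT_TERMS = ("incident", "outage", "degradation", "customer-impacting")


def match_incident_watch(email: Email, domain: str = "loopstack.io") -> MatchResult:
    """Match internal messages that mention an incident-related term."""
    # Cheap rejection first: the sender must be on the internal domain.
    if not email.sender.lower().endswith("@" + domain):
        return MatchResult(matched=False)
    text = f"{email.subject} {email.body}".lower()
    hits = [term for term in INCIDENT_TERMS if term in text]
    return MatchResult(matched=bool(hits), evidence=hits)


# Hypothetical events: a CI notification and an internal incident report.
build_passed = Email(
    "notifications@github.com", "Build passed: main", "All checks have passed."
)
incident = Email(
    "oncall@loopstack.io",
    "Partial outage in checkout",
    "Investigating degradation in the EU region.",
)

no_match = match_incident_watch(build_passed)
yes_match = match_incident_watch(incident)
```

The build-passed email is rejected on the sender check alone, before any text is scanned, and the incident email returns the matched terms as evidence for the downstream agent.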

The 50-event slice

For the first reportable slice, we evaluated the first 50 events against the first 5 watches: 250 watch-event pairs with 19 positives. The comparison used 60-minute simulated polling or pull ticks.

The Watchline path used a local match layer with downstream OpenClaw for matched deliveries. In that slice:

Watchline plus downstream OpenClaw: 17 true positives, 2 false positives, 2 false negatives, precision 0.895, recall 0.895, F1 0.895.
OpenClaw polling baseline: 18 true positives, 0 false positives, 1 false negative, precision 1.000, recall 0.947, F1 0.973.
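The precision, recall, and F1 figures follow directly from the confusion counts. A short sketch using the standard definitions and the counts reported above (the function name is ours, not the evaluator's):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts as reported for the 50-event, 5-watch slice.
watchline = precision_recall_f1(tp=17, fp=2, fn=2)
baseline = precision_recall_f1(tp=18, fp=0, fn=1)

for name, scores in (("watchline", watchline), ("baseline", baseline)):
    p, r, f = scores
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In both rows the true positives and false negatives sum to the 19 positives in the slice.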

The cost side moved sharply:

source calls fell from 157 to 50, a 68.2% reduction;
downstream agent calls fell from 42 to 21, a 50.0% reduction;
downstream agent tokens fell from 508,583 to 45,904, a 91.0% reduction.
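The reductions can be rechecked from the raw counts. The sketch below also divides downstream tokens by true positives to get tokens per useful wakeup; that ratio is derived here from the reported numbers and is not a figure quoted from the benchmark report.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Reported counts: (polling baseline, Watchline plus downstream OpenClaw).
source_calls = percent_reduction(157, 50)
agent_calls = percent_reduction(42, 21)
agent_tokens = percent_reduction(508_583, 45_904)

# Derived metric: downstream tokens per true positive (useful wakeup).
baseline_tokens_per_tp = 508_583 / 18
watchline_tokens_per_tp = 45_904 / 17

print(f"source calls:  {source_calls:.1f}% fewer")
print(f"agent calls:   {agent_calls:.1f}% fewer")
print(f"agent tokens:  {agent_tokens:.1f}% fewer")
print(f"tokens per useful wakeup: {baseline_tokens_per_tp:,.0f} vs {watchline_tokens_per_tp:,.0f}")
```

On these counts the baseline spends roughly ten times as many downstream tokens per useful wakeup, even though it finds one more true positive.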

The product question is not "can every model call be avoided?" It is "can expensive agent work be reserved for moments where the evidence suggests the user would value the interruption?"

Mini slice behavior

We also ran a smaller high-positive-density slice: 100 events, 5 watches, 500 labels, 77 positives. On a 5-event downstream run, both paths achieved an F1 of 1.0. Watchline used fewer source and downstream calls, but more downstream tokens than the baseline.

If nearly everything is relevant, prefiltering has less room to help. Benchmarks need to cover several positive-rate regimes because proactive systems behave differently in sparse inboxes, noisy inboxes, and incident-heavy periods.
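A rough way to see the regimes is to compare positive rates across the slices described in this post. The sketch below treats one minus the positive rate as a ceiling on the share of watch-event pairs a perfect prefilter could skip. That ceiling is a simplifying assumption about label pairs, not a measured call or token saving.

```python
# (events, watches, positives) as reported in the text.
slices = {
    "full": (500, 20, 412),
    "canonical_50": (50, 5, 19),
    "mini_100": (100, 5, 77),
}


def slice_positive_rate(events: int, watches: int, positives: int) -> float:
    """Positive labels divided by all watch-event pairs."""
    return positives / (events * watches)


rates = {name: slice_positive_rate(*counts) for name, counts in slices.items()}

# Assumed ceiling: pairs a perfect prefilter would never forward downstream.
skippable = {name: 1.0 - rate for name, rate in rates.items()}

for name in slices:
    print(f"{name}: positive rate {rates[name]:.1%}, skippable ceiling {skippable[name]:.1%}")
```

The mini slice has close to four times the positive density of the full dataset, which is consistent with prefiltering having less headroom there.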

One 10-event probe exposed a concrete false positive: a "to" versus "cc" precision miss around email_0010. That is the kind of failure a useful benchmark should surface. It tells us where the matcher needs sharper field semantics before downstream delivery.
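The post does not spell out the exact failure, so the following is only a hypothetical illustration of how this class of miss can arise: a matcher that treats any recipient field as "addressed to" will fire on messages where the person is merely copied. The `Envelope` type, both functions, and the addresses are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Envelope:
    to: list[str] = field(default_factory=list)
    cc: list[str] = field(default_factory=list)


def loosely_addressed_to(env: Envelope, address: str) -> bool:
    """Loose semantics: any recipient field counts."""
    return address in env.to or address in env.cc


def strictly_addressed_to(env: Envelope, address: str) -> bool:
    """Strict semantics: only the To field counts."""
    return address in env.to


# Hypothetical message where the user is only copied.
env = Envelope(to=["team@loopstack.io"], cc=["priya@loopstack.io"])

loose = loosely_addressed_to(env, "priya@loopstack.io")
strict = strictly_addressed_to(env, "priya@loopstack.io")
```

If a watch means "sent directly to me", the loose check produces a false positive on this message while the strict check does not.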

Artifact links and scope

The public benchmark lives in two places:

GitHub repo: qordinate-ai/watchbench
Hugging Face dataset: watchline/watchbench-email-v0

The GitHub repo contains the dataset files, evaluator, canonical result JSONs, and benchmark report. The Hugging Face dataset exposes the same data in loadable splits for people who want to inspect or run the labels without cloning the repo.

The measurement boundary is intentionally narrow. WatchBench Email v0 is synthetic email data, designed to make intent matching, false wakeups, and downstream cost visible. The reported Watchline path measures source-app access and downstream agent work; Watchline's hosted matching layer is part of the product path, not a downstream agent call.

FAQ

What is a useful wakeup?

A useful wakeup is a true positive interruption: the system observed a future event, matched it to a durable user intent, and delivered something the downstream agent should actually handle.

Why not benchmark only precision and recall?

Precision and recall are necessary, but incomplete. A proactive system can have strong F1 while spending too much on empty checks. Cost, latency, and false wakeups are part of the product behavior.

Is WatchBench Email v0 a reusable benchmark?

Yes. It is a public benchmark artifact with the dataset, evaluator, canonical result JSONs, and comparison report. The most useful way to read it is as a reproducible cost and wakeup-quality slice, with the measurement boundary stated in the repo.

You can inspect the repo at qordinate-ai/watchbench or load the dataset from watchline/watchbench-email-v0.

[Figure: Tokens per useful wakeup]