# Your agent is not expensive. Your wakeup strategy is.

Token spikes in personal agents often come from broad wakeups, boot context, chat history, cron loops, and model depth applied to every branch.

Published: 2026-05-27
Updated: 2026-05-27
Canonical: https://watch.qordinate.ai/blog/your-agent-is-not-expensive-your-wakeup-strategy-is
Markdown: https://watch.qordinate.ai/blog/your-agent-is-not-expensive-your-wakeup-strategy-is.md
Author: Harpinder Singh
Author URL: https://www.linkedin.com/in/singhcoder/
Image: https://watch.qordinate.ai/images/blog/your-agent-is-not-expensive-your-wakeup-strategy-is.jpg

Tags:
- token cost
- agent economics
- wakeup strategy
- proactive agents

## Short answer

Most agent cost problems start before the model answers. They begin when the system wakes too often, reloads too much boot context, carries long chat history, runs cron loops over broad state, or uses deep reasoning on every branch. The question is not only "which model costs less?" It is "why did this model need to think right now?"

## Key takeaways

- Token cost is often a wakeup-design problem, not just a model-pricing problem.
- Chat channels, cron jobs, and long sessions can cause every small action to incur a large context tax.
- Reasoning depth should be saved for challenging branches, not applied evenly to every event.
- A useful metric is tokens per useful wakeup because proactive agents spend before the user sees value.
- Better event filtering makes both local and hosted agents cheaper without reducing their capability.

## The model is not the only bill

Agent users frequently compare model prices, which makes sense. The model cost is clear, and the differences between a small local model, a flash model, and a frontier model can be significant.

Recent community discussions point to a sharper question. People still ask which model is cheapest, but they also ask why routine agent use consumes so much context in the first place.

A Hermes user shared a Telegram-heavy usage report and asked whether the token profile seemed normal: https://www.reddit.com/r/hermesagent/comments/1tnrq5z/telegram_excessive_use_of_tokens/. Another user wanted advice on which model to use for cron jobs and lightweight automations: https://www.reddit.com/r/hermesagent/comments/1tmfz9b/agent_recommendation/. A Claude Code user described a 9-hour autonomous session and the token mechanics behind it: https://www.reddit.com/r/ClaudeCode/comments/1tmm4sd/how_i_ran_a_9hour_autonomous_goal_session_with/.

The common theme is not "models cost money." That is already known. The common theme is that the runtime determines how often the model wakes, how much context it retains, and how much reasoning it expends.

## Boot context is a hidden tax

Many agents start each session with a large boot packet that includes system instructions, profile data, tool descriptions, memory summaries, channel context, safety rules, and workspace state. This packet may be helpful, but it can also load for small interactions that do not warrant it.

If a Telegram message triggers a new session, the cost is larger than the user's message. It includes the boot context, the message, and any history that the runtime adds. If the user frequently types `/new`, the agent pays that setup cost again each time.
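To make the tax concrete, here is a minimal sketch of the arithmetic. The `SessionCost` class and the token counts are assumptions chosen for illustration, not measurements from any runtime.

```python
from dataclasses import dataclass


@dataclass
class SessionCost:
    """Input tokens for one model call (hypothetical breakdown)."""
    boot_tokens: int      # system prompt, tool descriptions, memory summaries
    history_tokens: int   # prior messages the runtime re-attaches
    message_tokens: int   # the user's actual message

    @property
    def total(self) -> int:
        return self.boot_tokens + self.history_tokens + self.message_tokens

    @property
    def overhead_share(self) -> float:
        """Fraction of input tokens that are not the user's message."""
        return 1 - self.message_tokens / self.total


# Illustrative numbers only: a short message that opens a fresh session.
fresh = SessionCost(boot_tokens=6000, history_tokens=0, message_tokens=40)
print(fresh.total, round(fresh.overhead_share, 3))
```

With these assumed numbers, more than 99% of the input is boot context. The exact ratio will differ per runtime, but the shape of the problem is the same.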

The same issue arises in scheduled jobs. A daily digest can be inexpensive if the runtime knows precisely what changed and which context matters. It becomes pricey if each run starts by asking the agent to rediscover the situation.

The right question is: what is the smallest packet of context necessary to decide this event?

## Cron can turn absence into spend

Cron has a distinctive economic effect on agents: it triggers model calls even when nothing has happened.

A scheduled prompt like "check my inbox and tell me if anything matters" pays for the check, not only for useful results. If the inbox is quiet 95% of the time, much of the expense is confirming absence.

That is why cost should be evaluated per useful wakeup. A proactive system can look cheap per call yet be costly per useful interruption. WatchBench Email v0 was built around this idea: measure the total cost needed to generate useful wakeups, not only the model's performance after a wakeup occurs. The public artifact is here: https://github.com/qordinate-ai/watchbench.
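As a rough illustration of the metric, the sketch below divides total spend by the number of wakeups that produced something useful. The function name and all numbers are assumptions for this example; this is not WatchBench's scoring code.

```python
def tokens_per_useful_wakeup(total_tokens: int, useful_wakeups: int) -> float:
    """Total tokens spent divided by wakeups that produced value."""
    if useful_wakeups == 0:
        # All spend, no value: treat the cost per useful wakeup as unbounded.
        return float("inf")
    return total_tokens / useful_wakeups


# Assumed scenario: 100 scheduled inbox checks at 2,000 tokens each,
# with the inbox quiet 95% of the time (5 useful checks).
checks = 100
tokens_per_check = 2000
useful = 5

per_call = tokens_per_check
per_useful = tokens_per_useful_wakeup(checks * tokens_per_check, useful)
print(per_call, per_useful)
```

Under these assumptions each call costs 2,000 tokens, but each useful interruption costs 40,000, because the 95 quiet checks are paid for too.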

The basic structure is straightforward:

1. Use inexpensive source filters first.
2. Apply smaller classifiers for uncertain events.
3. Activate the main agent only when the event is likely worth handling.
4. Keep evidence so the next model in the chain does not reread everything.

This approach turns cost into a routing problem.
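The four steps above can be sketched as a small router. Everything here is hypothetical: the `Event` and `Route` types, the sender sets, and `keyword_classifier`, which is only a stand-in for a small model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Route(Enum):
    DROP = "drop"
    WAKE_AGENT = "wake_agent"


@dataclass
class Event:
    sender: str
    subject: str
    evidence: list[str] = field(default_factory=list)  # step 4: travels with the event


def keyword_classifier(event: Event) -> float:
    """Stand-in for a small model: returns a made-up relevance score."""
    return 0.9 if "urgent" in event.subject.lower() else 0.1


def route(
    event: Event,
    muted: set[str],
    priority: set[str],
    classifier: Callable[[Event], float] = keyword_classifier,
    threshold: float = 0.5,
) -> Route:
    # Step 1: deterministic source filters, no model call.
    if event.sender in muted:
        event.evidence.append("filter: muted sender")
        return Route.DROP
    if event.sender in priority:
        event.evidence.append("filter: priority sender")
        return Route.WAKE_AGENT
    # Step 2: a small classifier only for events the filters could not settle.
    score = classifier(event)
    event.evidence.append(f"classifier: score={score:.2f}")
    # Step 3: wake the main agent only when the event is likely worth handling.
    return Route.WAKE_AGENT if score >= threshold else Route.DROP


muted = {"news@example.com"}
priority = {"oncall@example.com"}
digest = Event("news@example.com", "Weekly digest")
outage = Event("customer@example.com", "URGENT: checkout is down")
print(route(digest, muted, priority), digest.evidence)
print(route(outage, muted, priority), outage.evidence)
```

The evidence list is the part that is easy to skip. If the main agent receives the event along with why it was escalated, it does not need to rediscover that from scratch.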

## Reasoning depth should be conditional

Another recent Reddit question asked which model failure is worse: not enough depth on the hard branch, or too much depth on every branch: https://www.reddit.com/r/AI_Agents/comments/1tnbtz8/which_agentmodel_failure_bothers_you_more_not/.

This is a useful way to frame agent economics.

Not thinking enough on a challenging branch leads to wrong answers. Thinking too much on every branch causes delays and expenses. A proactive agent faces the same tradeoff at the event level. Some events need deep reasoning. Most do not.

A newsletter email should not receive the same reasoning budget as an enterprise customer escalation. A minor calendar metadata edit does not require the same reasoning budget as a meeting rescheduled in a prep window. A passing CI email does not need the same reasoning budget as a repeated release-blocking failure.

Depth should match relevance.
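One simple way to express this is a lookup from event kind to reasoning depth. The event kinds, the three depth levels, and the default below are assumptions for illustration; a real runtime would map depth to whatever reasoning controls its model exposes.

```python
from enum import Enum


class Depth(Enum):
    NONE = 0      # handle with rules, no model reasoning
    SHALLOW = 1   # quick triage pass
    DEEP = 2      # full reasoning budget


# Hypothetical event kinds, taken from the examples above.
DEPTH_BY_KIND = {
    "newsletter": Depth.NONE,
    "calendar_metadata_edit": Depth.NONE,
    "ci_passed": Depth.NONE,
    "meeting_rescheduled_in_prep_window": Depth.SHALLOW,
    "ci_release_blocker_repeated": Depth.DEEP,
    "enterprise_escalation": Depth.DEEP,
}


def reasoning_depth(kind: str) -> Depth:
    # Unknown kinds get a shallow pass: enough to triage, not full depth.
    return DEPTH_BY_KIND.get(kind, Depth.SHALLOW)


print(reasoning_depth("newsletter"), reasoning_depth("enterprise_escalation"))
```

Defaulting unknown kinds to a shallow pass is a judgment call. Defaulting to deep is safer on the hard branch but brings back the cost of thinking too much on every branch.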

## Local models do not eliminate waste

Local agents change the bill, but they do not erase it.

A discussion about running a fully local agent for "$0" quickly brought up the obvious counterpoint: electricity, hardware, latency, maintenance, and model quality still matter: https://www.reddit.com/r/better_claw/comments/1tn9err/run_a_fully_local_ai_agent_for_0_no_bs/. Local inference can be the right choice for privacy, control, and low API costs. However, it can still waste cycles if the wakeup design is poor.

This is especially important for always-on local agents. A local model that checks broad state every minute may be cheaper than a hosted model call, but it remains noisy. It consumes power, generates logs, triggers retries, introduces failure modes, and demands user attention.

The best local agent is not the one that continuously thinks for free. It is the one that stays quiet until a relevant event justifies local reasoning.

## Cost is a product signal

High token usage is not merely an accounting issue. It indicates where the product is unclear.

If costs rise because the agent loads every chat message, the channel boundary is fuzzy. If costs rise because every cron job uses the same large model, the runtime lacks routing. If costs rise because memory keeps expanding, retrieval has become prompt debt. If costs rise because the agent checks every app repeatedly, there is no event layer.

The system should use cost as feedback:

- Which wakeups were useful?
- Which model calls found nothing?
- Which watches created false positives?
- Which context blocks were never used?
- Which repeated decisions can be converted into cheaper rules?

This is how agent systems become more efficient over time.
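A minimal sketch of that feedback loop, assuming the runtime logs one record per wakeup with the watch that triggered it, the tokens spent, and whether the result was useful. The `WakeupRecord` shape and the sample log are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WakeupRecord:
    watch: str     # which watch or cron job triggered the wakeup
    tokens: int    # tokens spent on this wakeup
    useful: bool   # whether the result was worth surfacing


def summarize(records: list[WakeupRecord]) -> dict[str, dict[str, int]]:
    """Aggregate spend per watch, separating out tokens that found nothing."""
    stats = defaultdict(
        lambda: {"wakeups": 0, "useful": 0, "tokens": 0, "wasted_tokens": 0}
    )
    for r in records:
        s = stats[r.watch]
        s["wakeups"] += 1
        s["tokens"] += r.tokens
        if r.useful:
            s["useful"] += 1
        else:
            s["wasted_tokens"] += r.tokens
    return dict(stats)


# Made-up log for illustration.
log = [
    WakeupRecord("inbox", 2000, False),
    WakeupRecord("inbox", 2000, False),
    WakeupRecord("inbox", 2500, True),
    WakeupRecord("ci", 800, True),
]
summary = summarize(log)
print(summary)
```

A watch with a high `wasted_tokens` share is a candidate for a tighter filter or a cheaper rule in front of it.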

## What to build instead

Start with a narrow wakeup contract.

For each background workflow, specify the source, the future condition to watch for, the evidence, the delivery target, and the allowed action. Then select the least expensive runtime path to make the decision.
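As a sketch, the five fields named above can be written down as a small immutable record. The `WakeupContract` name, the field types, and the example values are assumptions; the point is only that the contract is explicit and narrow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WakeupContract:
    """Hypothetical contract for one background workflow."""
    source: str                # where events come from
    condition: str             # the future condition that justifies a wakeup
    evidence: tuple[str, ...]  # what to carry forward to the next step
    delivery_target: str       # where the result should land
    allowed_action: str        # what the agent may do without asking


# Example values are illustrative only.
ci_watch = WakeupContract(
    source="email:ci-notifications",
    condition="the same release-blocking job fails twice in a row",
    evidence=("job name", "failing commit", "links to the last two runs"),
    delivery_target="telegram:dm",
    allowed_action="notify_only",
)
print(ci_watch)
```

Writing the contract first makes the cheapest decision path easier to see: in this example the condition can likely be checked with a counter, with no model call until it trips.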

Use deterministic filters whenever possible. Fields like sender, repository, label, attendee, time window, status, and thread are inexpensive to check. Use small classifiers when the signal is unclear. Wake the expensive agent only when a genuine handoff is likely.

The goal is not to make the agent less intelligent. The goal is to avoid spending reasoning on events that do not warrant it.

## FAQ

### Why do agents use so many tokens for small tasks?

Small tasks often carry considerable hidden context: system prompts, tool manifests, memory summaries, and chat history. If the runtime starts a heavy session for every minor event, the fixed cost dominates.

### Is a cheaper model the best solution?

Sometimes, but cheaper models do not address broad wakeups, unnecessary cron checks, or oversized context packets. Routing and filtering often reduce costs more than switching models does.

### What is tokens per useful wakeup?

This is the total token cost needed to produce one useful interruption. For proactive agents, this is often a more accurate measure than cost per model call because many checks can occur before the user sees value.

### Should local agents always prefer local models?

No. Local models are beneficial for privacy, control, and predictable workloads. Hosted models can still be better for complex reasoning. The key design choice is to wake either model only when the event requires reasoning.

## Further reading

- https://www.reddit.com/r/hermesagent/comments/1tnrq5z/telegram_excessive_use_of_tokens/
- https://www.reddit.com/r/hermesagent/comments/1tmfz9b/agent_recommendation/
- https://www.reddit.com/r/ClaudeCode/comments/1tmm4sd/how_i_ran_a_9hour_autonomous_goal_session_with/
- https://www.reddit.com/r/AI_Agents/comments/1tnbtz8/which_agentmodel_failure_bothers_you_more_not/
- https://www.reddit.com/r/better_claw/comments/1tn9err/run_a_fully_local_ai_agent_for_0_no_bs/
- https://github.com/qordinate-ai/watchbench