Treat Agent Configuration as a Measured System, Not Vibes

Treat Agent Configuration as a Measured System, Not Vibes

If you don’t have a query-able history of your conversations with agents, then you lack a critical tool in improving their effectiveness.

I used to think that I wanted my agent to learn and remember facts about me. It felt cute and personalized. It learned (some of) my preferences over time. But it was a coat of paint over the same problem: it never got better at working the way I wanted it to work.

So, I had the agent build a scraper that streamed every session, prompt, invocation, etc. into Loki and Prometheus. I made scheduled and on-demand query scripts that emit dated markdown audit reports.

I now chat with those reports and continually improve my configuration.

Case in point: Last week I refactored my AGENT.md based on such a report; the audit told me which AGENT.md rules were dead weight, so I deleted them.

This week I noticed the agent was punting work to follow-ups instead of finishing tasks. Worse: features could never be “done,” and it kept pushing scope downstream, including bugs in its own unmerged code.

Instead of guessing, I went back to the data:

  • Assistant “follow-up” mentions: 2.37× higher post-refactor (per 1k messages)
  • “Leave for later” / “defer”: 2.52×
  • User frustration words: 3.18× (If you don’t curse at your agent, I question how much you’re really trying to use it)

Every issue was concentrated in a single model, not a global shift in behavior.

Root cause: the trimmed rules read as “minimum code, nothing speculative,” and so the model interpreted that as license to defer … everything.

Fix: a surgical 3-line edit clarifying that “no new features” doesn’t mean “stop finishing the task.” Then I wrote a script to re-measure on a 7-day cadence so I’d actually know whether it worked.

It did. And I have the data to prove it, including a significant drop in user frustration words.

Treating agent configuration as a measured system, not vibes, turned a frustrating week into a diagnosable, fixable, verifiable loop.

Measure. Manage. Done …

… And I didn’t have to wait for a new model version.

Originally shared on LinkedIn, May 16, 2026.