Well, I’ve spent a few weeks now playing around with Claude on two projects:

  • An MCP for Remember the Milk
  • An MCP for contacts management

I haven’t read a lot of commentary on how to code with AI, and I’ll say a little more in a bit about the kind of AI-related commentary I have read. Instead, I just let myself stumble into the challenges and tried to figure my way around them. The biggest ones were:

  • Session length. Once either project took on any real complexity, the amount of progress I could make in a session slowed down. Claude would end conversations for going on too long, often in the middle of a debugging session, and all the context would get dumped.
  • Context preservation. If you have to chain together a series of sessions to get one feature done, then you need some way for Claude to remember what it was up to.
  • “Behavioral.” Sometimes Claude does dumb or misguided things. Describing them would be as fun for you as making you listen to me talk about a dream I had, so I won’t.

So as I’d pick up on a problem, I’d look for a way to address it, and largely landed on:

  • User preferences that stress collaborative testing, step-by-step examination of a problem, and not moving on until the user has said it’s okay.
  • Specific project instructions that explain what the sources of truth are, where the local code lives, how to go about selecting what to work on, and further reinforcement that we don’t write scripts to test scripts that test the code, etc.
  • Reliance on continuity tools of some kind (a rough sketch of the kind of note I mean follows this list).
    • Markdown notes of what it has gotten done so far, what it needs to know to return to work on a problem, etc.
    • Continuity notes in GitHub issues, so it can pull the original issue, then record and catch up on its own progress with comments on the issue.
  • Remembering to explicitly dig out API docs and include them in the prompt with little nudges: “I see a parameter asking for a location id …”
  • Continuity prompts: “You were doing this, you recorded progress here, as always you can read about this entire project here.”
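
To make the continuity-note idea concrete, here’s a minimal sketch of the shape such a tool can take; the file name, entry fields, and helper names are illustrative assumptions, not my actual setup. One function appends a timestamped “what got done / what’s next” entry at the end of a session, and another reads the most recent entries back so they can be pasted into the next conversation.

```python
# continuity.py -- hypothetical sketch of a session continuity-note helper.
# The file name and entry fields are illustrative assumptions, not a spec.
from datetime import datetime, timezone
from pathlib import Path

NOTES = Path("CONTINUITY.md")

def record_progress(done: str, next_step: str, gotchas: str = "") -> None:
    """Append a timestamped entry describing what got done and what comes next."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = (
        f"\n## Session {stamp}\n"
        f"- Done: {done}\n"
        f"- Next: {next_step}\n"
        f"- Gotchas: {gotchas or 'none'}\n"
    )
    with NOTES.open("a", encoding="utf-8") as f:
        f.write(entry)

def latest_entries(n: int = 3) -> str:
    """Return the last n session entries, ready to paste into a new conversation."""
    if not NOTES.exists():
        return "No continuity notes yet."
    sections = NOTES.read_text(encoding="utf-8").split("\n## ")
    return "\n## ".join(sections[-n:])
```

The GitHub-issue variant is the same idea with the note living in issue comments instead of a local file, so the agent can pull the original issue and its own progress from one place.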

Luke sent me an article that very thoughtfully dissected this process:

“… the people I’ve seen most excited and effective about agentic work were deeply involved in constantly correcting and recognizing bugs or loops or dead ends the agent was getting into, steering them away from it, while also adding a bunch of technical safeguards and markers to projects to try and make the agents more effective. When willingly withholding these efforts, their agents’ token costs would double as they kept growing their context windows through repeating the same dead-end patterns; oddities and references to non-existing code would accumulate, and the agents would increasingly do unhinged stuff like removing tests they wrote but could no longer pass.

“I’ve seen people take the blame for that erratic behavior on themselves (‘oh I should have prompted in that way instead, my bad’), while others would just call out the agent for being stupid or useless.”

… then turned it around:

“It may sound demeaning, like I’m implying people lack awareness of their own processes, but it absolutely isn’t. The process of adaptation is often not obvious, even to the people doing it. There are lots of strategies and patterns and behaviors people pick up or develop tacitly as a part of trying to meet goals. Cognitive work that gets deeply ingrained sometimes just feels effortless, natural, and obvious. Unless you’re constantly interacting with newcomers, you forget what you take for granted—you just know what you know and get results.”

“By extension, my supposition is that those who won’t internalize the idiosyncrasies and the motions of doing the scaffolding work are disappointed far more quickly: they may provide more assistance to the agent than the agent provides to them, and this is seen as the AI failing to improve their usual workflow and to deliver on the wonders advertised by its makers.”

Phil Agre wrote about this years ago: “Most user interfaces are terrible. When people make mistakes it’s usually the fault of the interface. You’ve forgotten how many ways you’ve learned to adapt to bad interfaces.”

Both the contacts MCP and the Remember the Milk MCP were easy assignments: the contacts MCP just keeps a SQLite database and operates under some rules about when to nudge the user to check in on a contact and what to do when the user reports a contact. The RTM MCP takes advantage of a comprehensive, mostly straightforward API, mapping each tool to an endpoint with a pretty obvious purpose.
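
To give a sense of what “mapping each tool to an endpoint” looks like in practice, here’s a minimal sketch rather than my actual code. It assumes the FastMCP helper from the Python MCP SDK and glosses over RTM’s request-signing and auth handshake; the point is that each tool is little more than a pass-through to a single API method.

```python
# rtm_mcp_sketch.py -- illustrative only; omits RTM's api_sig signing and auth flow.
import os

import requests
from mcp.server.fastmcp import FastMCP

RTM_REST = "https://api.rememberthemilk.com/services/rest/"
mcp = FastMCP("rtm-sketch")

def rtm_call(method: str, **params) -> dict:
    """Forward a call to one RTM API method and return the parsed response."""
    params.update(
        method=method,
        api_key=os.environ["RTM_API_KEY"],
        auth_token=os.environ["RTM_AUTH_TOKEN"],
        format="json",
    )
    resp = requests.get(RTM_REST, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()["rsp"]

@mcp.tool()
def add_task(name: str, list_id: str, timeline: str) -> dict:
    """Create a task; this maps directly onto rtm.tasks.add."""
    return rtm_call("rtm.tasks.add", name=name, list_id=list_id, timeline=timeline)

if __name__ == "__main__":
    mcp.run()
```

The contacts MCP is the same pattern pointed at a local SQLite file instead of a remote API.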

I won’t go into a ton of detail about the bad derails that still happened. The content of those derails is less the point than the simple fact that, even after a couple of weeks of evolving a pretty elaborate set of rituals to avoid memory issues, session-length issues, and behavioral issues, there was still always a nagging sense of “when will this go wrong, and how?”

For instance, the underlying Claude service itself is flaky. After a while I’d developed a pretty good sense of when a conversation was going long, but every now and then I’d get two or three messages into one, right at the point where the furniture had been rearranged in some function, and get a session-length error. There went the context, and all the sources of truth were out of sync with the code base.

Or it’d report writing the continuity note but truncate it. Or, or, or.

A boss once said about an early scripting effort of mine, “it’s not that the bear dances well, but that it dances at all.”

That’s sort of where I’m at. In the end, I agree with this:

“… what is imagined is powerful agents who replace engineers (at least junior ones), make everyone more productive, and that will be a total game changer. LLMs are artifacts. The scaffolding we put in place to control them are how we try to transform the artifacts into tools; the learning we do to get better at prompting and interacting with the LLMs is part of how they transform us. If what we have to do to be productive with LLMs is to add a lot of scaffolding and invest effort to gain important but poorly defined skills, we should be able to assume that what we’re sold and what we get are rather different things.”

“That gap implies that better designed artifacts could have better affordances, and be more appropriate to the task at hand. They would be easier to turn into productive tools. A narrow gap means fewer adaptations are required, and a wider gap implies more of them are needed.

“Flipping it around, we have to ask whether the amount of scaffolding and skill required by coding agents is acceptable. If we think it is, then our agent workflows are on the right track. If we’re a bit baffled by all that’s needed to make it work well, we may rightfully suspect that we’re not being sold the right stuff, or at least stuff with the right design.”

… because as I painstakingly evolved my “get working code out of this thing” workflow, there were two things on my mind:

  • Learning how to write all this for myself would keep me from having the tool I’ve imagined any time soon.
  • If I’m sitting here in my living room working through this process, with its readily identified limitations, then people starting from a place of more sophistication, with more ability to do the meta-work of hiding all the tool-building I’m doing, will probably be getting on that. In fact, they are. I added an MCP called “vibe check” that arrested a few spinouts before they happened by making the LLM ask itself what its original prompt was before it does anything. The number of memory-persistence MCPs is growing.

On that latter bullet, I guess this puts me on the vaguely optimistic side about all this: My Brompton is a great implementation of a bike. Penny-farthings are, given all we know today, a terrible implementation of a bike. If I tried to do an eight-mile ride downtown on a penny-farthing, I’d say “bikes don’t work for commuting shortish distances.” With the Brompton, on the other hand, the gearing is excellent and does a lot to offset those 16” wheels. It’s fine for six or eight miles. Given the kind of Schwinn I had in grade school, with a banana seat, a big lever for a shifter, and high handlebars for that cool chopper look, I’d probably say “bikes aren’t great for commuting medium distances, but are fine for a spin around the neighborhood, and I can see how a few design changes could make them better.”

The metaphor breaks down quickly the second you shift out of “does it work” questions and move into asking “what is the return, given the consumption of resources?”, whether that’s electricity, water, developer hours lost to correcting spinouts, support tickets, or the human toll of inevitable misapplications at all levels of society.

The one-armed bandit

So, I’m glad to be sitting here writing instead of sitting on top of Claude, waiting for it to invent a non-existent API call, forget that a given parameter is an id and not a string, crash and burn with a partially edited private function I hadn’t realized was in there, or take it upon itself to push a broken image into production.

My two MCPs work pretty well, harmonizing with each other the way I hoped they would and giving me a conversational interface into personal planning that does a good job of “remembering” stuff and getting it back in front of me when I need it. Claude itself has written READMEs that vastly oversell what it is accomplishing, because it has never greeted a project it has worked on with anything other than intimations of brilliance and completeness.

I had a good experience working on these things in a few ways. I got working tools, learned a bit more about what the tech is and isn’t good for, and, just as I was deep in the process, MCPs flared up as a security topic at work. It was very, very useful to be able to listen patiently, knowing something about how these things work, and to have a useful opinion or two about how to characterize the risk they pose.

During this period, there was no avoiding The AI Discourse, and that was less fun. The discursive environment is up to the same shit it is with everything, and my deep annoyance with rhetorical moves that involve psychologizing people with whom we disagree has found no relief.

But for as much as the experience was edifying and interesting, the image that kept coming to mind was a slot machine, or a video poker game.

I once ended work early during a week in Las Vegas, gave myself an allowance to try gambling, and had a pretty terrible experience with it all. Years later I read that humans are susceptible to gambling addiction because we’re very inclined to seek patterns, very inclined to misinterpret things as patterns, and easily unhinged by rewards we believe we can unlock if we just figure out the pattern.

If you’ve never gambled but have sat up all night trying to beat that one boss, getting worse and worse at it as you get more and more tired and frustrated, that’s sort of how a bad session with Claude felt. Yes, these things “work,” but for particular definitions of “work” that have to include the derails, setbacks, workarounds, constant vigilance, and the ever-present awareness that Anthropic, OpenAI, Google, or whichever could twist some tuning screw down in the clockwork and alter the thing’s operating model in a way that throws it all out of sync, creating more digressions to unwind.

In the end, that level of engagement with the toolmaking tool to get the tools I was trying to make was sort of exhausting. I kept thinking of Bill Murray in Groundhog Day, waking up every morning to Sonny and Cher, wondering how far I’d make it.

Funny enough, Claude itself sort of pointed that out to me, because a few weeks ago I added something to my user preferences called “Focus Guardian,” which is a set of directives to catch me spiraling on stuff that is not what I want to spend my time on. As I dove into yet another session to add just one more feature, with all the uncertainty of whether it’d be a quick walk in the park or a long slog, the Focus Guardian protocol kicked in and asked, “Are you doing this because it furthers one of the things that brings you joy, or because you’re curious about some new tool or over-optimizing? Is there a writing project you could tackle instead?”

I guess AI “works.”