Loading…
Loading…

A personal AI setup with two flat-rate ends and a thin metered middle. Brain, hands, and a fifteen-minute production fix from my phone.
Friday afternoon. A user pinged me: the welcome discount code on Fluent (getfluent.academy) wasn't working at checkout. I was on my phone, nowhere near a laptop. I opened Telegram and sent a note to TARS, asking it to check the Lemon Squeezy API and the discount logic. Within a minute, TARS came back with the diagnosis: a wrong variant ID in the configuration. I switched to Claude Code on my phone, opened an SSH session into my workstation over Tailscale, fixed the variant ID, and deployed to production. Then I asked TARS to run a checkout end-to-end the way a real customer would. It did. Discount applied. Issue closed.
Total time: about fifteen minutes. Laptops involved: zero.
TARS is one of two AI agents I run on my home workstation. The other is CASE. The names are not decoration. They come from Interstellar, one of my all-time favorite films, and the choice was deliberate. In the film, TARS and CASE are not the brains of the mission. Cooper and Brand are the brains. TARS and CASE are the hands: articulated, capable, loyal, occasionally witty, and the ones who actually go into the wave, hold the ship steady during docking, and do the physical work while the humans decide. They are walking, talking pairs of hands.
That division is exactly what I want from a personal AI stack. Claude is the brain that thinks with me. TARS and CASE are the hands that work for me. Everything in this article follows from that one idea.
One note before I go deeper: everything below is my personal setup, on my own hardware at home. It runs my personal projects (Fluent and a handful of side projects) and has nothing to do with my work at Tricentis, our networks, or our data. This is hobby infrastructure that happens to be genuinely useful.
Here is the part that took me a while to figure out, and that I rarely see written down honestly.
Agents are conceptually amazing and economically punishing. Hand an agent a real task and it will loop. It retries, it reasons, it calls tools, it second-guesses itself, it tries again. That loop is exactly what you want, because that is what makes it an agent and not a chatbot. But every iteration is tokens, and tokens are money on a pay-per-call API. Even with a "cheap" model like Qwen3.6 or one of the smaller OpenRouter options, an autonomous run on a non-trivial task can rack up costs fast enough to make you flinch the next time you fire it off. So most people end up doing what I did at first: throttle the agent, restrict its scope, give it shorter tasks, and quietly stop using it for the things it would actually be best at. That is a self-defeating outcome.
The setup I have landed on solves this with two flat-rate ends and a thin pay-per-token middle.
Claude Max is one end. It is where the heavy synchronous work happens: deep research, complex coding, brainstorming, architecture, writing (this article included). Claude Code is unleashed on anything non-trivial that needs real engineering judgment, like Friday's variant ID fix and deploy. The cost is a fixed monthly number, no meter anxiety, and the quality is high enough that I never feel like I am compromising to save tokens.
Local LLMs are the other end. On boulderlin, my home workstation, I run Ollama with gemma4:26b-64k, a Modelfile-derived variant of gemma4:26b with a 64K context window baked in. About 20GB sits in the VRAM of the RTX 5080, with some spillover into RAM, splitting roughly 77/23 GPU/CPU. Two agents share that backend: TARS (Hermes Agent) and CASE (OpenClaw). Both are reachable from anywhere via Telegram, Tailscale, and SSH. The agents do the asynchronous, ambient, agentic work: monitoring, scripting, ad-hoc checks, the loops that would burn API credits. The marginal cost of a 50-turn run is the electricity it took to run it. I never throttle them.
Pay-per-token APIs are not gone. They are just demoted. I keep an OpenRouter account active for benchmarking new models when they drop, or for the occasional comparison test against my local setup. That is the right shape: the metered tier becomes the exception, used surgically, not the default that powers your whole workflow.
Stated economically, the same brain-hands split from the top of this article holds: Claude Max is the brain that thinks with me, on a flat fee. TARS and CASE are the hands that work for me, on flat electricity costs. The metered APIs are the lab where I evaluate what to bring into the stack next.
Once you have a brain and a pair of hands, the next question is who does what. The answer that emerged for me, after some trial and error, is simple enough to fit on a sticky note: Claude Code authors. The agents operate.
This is not a hard rule. The local agents are perfectly capable of writing small Python scripts, one-off automations, quick utilities. Gemma4 at 26B parameters is plenty for that kind of work. But the moment a task involves real architectural judgment, multi-file refactoring, debugging something subtle, or reasoning across an unfamiliar codebase, I do not even try the local models. I open Claude Code and go straight to the brain. The result is better, the iteration is faster, and the cost is already paid for under Max.
The local agents pick up the output and run it. They host the cron jobs. They handle ad-hoc execution. They monitor logs. They check APIs. They retry when something flakes. They send me Telegram pings when something needs my attention. They are, in the most literal sense, the operators of the code that Claude and I write together.
Friday's Fluent incident is the cleanest example of this pattern in motion, because the handoff happened twice in fifteen minutes:
Brain, hands, brain, hands. The same pattern, twice in one incident, and the only interface I touched was my phone.
There is a second, less obvious benefit to this split that I only noticed after a few weeks of running the setup: the agents persist, but Claude Code sessions do not. When I close a Claude Code session, that conversation is over. The thinking is gone unless I deliberately captured it. The agents, by contrast, keep running on boulderlin twenty-four hours a day, with their own scripts, their own scheduled jobs, their own state. They are the durable layer. Claude is the spike of intense cognition that comes in, does the hard part, and leaves the artifacts behind for the agents to operate. That asymmetry is the architecture, not a bug.
Here is the part of the setup I am still figuring out, and I think it is worth being honest about because it is genuinely interesting.
I have two agents running side by side on the same machine, against the same Ollama backend, on the same model. By every reasonable expectation they should behave more or less identically. They do not.
CASE feels faster. It responds quicker, completes tasks with less ceremony, and gets to "done" with fewer turns. If I need a quick check or a one-shot answer, CASE is usually the one I ping.
TARS feels more tenacious. When a task gets messy, TARS does not give up easily. It retries, it adjusts its approach, it keeps grinding until it nails the result. If I hand off something with real complexity or ambiguity, TARS is more likely to come back with the right answer rather than a graceful failure.
Same model. Same hardware. Same prompt budget on my side. Different agent personalities. The only variable is the framework around the model: Hermes Agent for TARS, OpenClaw for CASE. Different prompt scaffolding, different tool definitions, different orchestration loops, different defaults. That is enough to produce two genuinely distinct working partners. It is also a quiet reminder that in the agentic era, the model is only part of the story. The framework is the rest, and it shapes behavior more than most people give it credit for.
I have not picked a winner. I do not know yet whether I will. It is possible the right answer is to keep both and route by task type: CASE for fast checks, TARS for tenacious work. It is also possible one of them will pull ahead as I lean on it harder over the next few weeks and I will quietly retire the other. Right now, running both is cheap (they share the same backend), informative, and a useful demonstration of something most people overlook: two implementations of the same idea, on the same model and the same hardware, behave measurably differently under load. The framework is not a thin wrapper around the model. The framework is half the system.
The Fluent fix is the most dramatic example, but it is not the daily one. Most of what this setup gives me is quieter, and that is the point. A short sample from the past couple of weeks:
Status checks from anywhere. I am out of the house, a backup job is running, a cron is overdue, a build is grinding away, or I just want to know if a long-running script finished. I open Telegram, ask CASE or TARS, and get a real answer thirty seconds later. No laptop, no VPN dance, no hunting for a logging dashboard. The agent runs the check on the workstation, reads the actual file or process or log, and tells me what is true right now. This sounds small. It is not. The cumulative effect of being able to ask my own infrastructure questions and get real answers from anywhere is that the workstation stops feeling like a place I have to go and starts feeling like a service I can use.
Claude Code from my phone, against the actual production environment. This one still surprises me every time. SSH into boulderlin over Tailscale, launch Claude Code, and I have the full development environment in my pocket. Real files, real repos, real deploy scripts, real keys. Friday's Fluent fix worked because of this. So did a recent late-evening tweak to a side project where I was already in bed and decided I might as well ship the change instead of waiting until morning. The phone is not the IDE, the workstation is. The phone is just the window into it.
Overnight experiments without watching the meter. I can hand TARS or CASE a longer task and walk away. Benchmark a model variant. Run a batch job. Test a new Modelfile against a corpus of prompts. The kind of thing that on a metered API would force me to either pre-budget carefully or babysit the run. On a local model, I just kick it off, close the laptop, and check the result in the morning. The agents handle the loop, the retries, the logging, the Telegram ping when it is done. I get the results. The cost is electricity.
None of these is the kind of moment that makes a great demo video. But together they change the relationship I have with my own infrastructure. The workstation is no longer a place. It is a colleague.
If you take one thing away from this article, take the frame.
Claude Max is the brain that thinks with me. TARS and CASE are the hands that work for me. The metered APIs are the lab where I evaluate what comes next. Two flat-rate ends, one thin metered middle, and a clear division of labor between cognition that needs depth and operations that need persistence. That is the architecture. Everything else is implementation detail.
What I do not have figured out yet: which agent framework wins long-term, how far I can push the local model before the latency floor stops being acceptable, and what changes when the next generation of open-weights models lands on my workstation later this year. Those are the threads I will pull in the next few articles in this series.
If you are building something similar, or have already gone further than I have, I want to hear about it. The most useful conversations I have had on this stack so far have been with people who pushed back on choices I thought were settled. Push back on these.