Lately I’ve gotten heavily back into making stuff, and it’s mostly because of LLMs.
I thought that I liked programming, but it turned out that what I like was making things, and programming was just one way to do that.
Since LLMs have become good at programming, I’ve been using them to make stuff nonstop, and it’s very exciting that we’re at the beginning of yet another entirely unexplored frontier.
There’s a lot of debate about LLMs at the moment, but a few friends have asked me about my specific workflow, so I decided to write it up in detail, in the hopes that it helps them (and you) make things more easily, quickly, and with higher quality than before.
I’ve also included a real (annotated) coding session at the end.
You can go there directly if you want to skip the workflow details.
For the first time ever, around the release of Codex 5.2 (which feels like a century ago) and, more recently, Opus 4.6, I was surprised to discover that I can now write software with LLMs with a very low defect rate, probably significantly lower than if I had hand-written the code, without losing the benefit of knowing how the entire system works.
Before that, code would quickly devolve into unmaintainability after two or three days of programming, but now I’ve been working on a few projects for weeks non-stop, growing to tens of thousands of useful lines of code, with each change being as reliable as the first one.
I also noticed that my engineering skills haven’t become useless, they’ve just shifted:
I no longer need to know how to write code correctly at all, but it’s now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.
On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices.
However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC.
Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.
One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.
Because of that, I’m going to drill very far down into the weeds in this article, going as far as posting actual sessions, so you can see all the details of how I develop.
Another point that should be mentioned is that I don’t know how models will evolve in the future, but I’ve noticed a trend:
In the early days of LLMs (not so much with GPT-2, as that was very limited, but with davinci onwards), I had to review every line of code and make sure that it was correct.
With later generations of LLMs, that went up to the level of the function, so I didn’t have to check the code, but did have to check that functions were correct.
Now, this is mostly at the level of “general architecture”, and there may be a time (next year) when not even that is necessary.
For now, though, you still need a human with good coding skills.
I’ve built quite a few things recently, and I want to list some of them here because a common criticism of LLMs is that people only use them for toy scripts.
These projects range from serious daily drivers to art projects, but they’re all real, maintained projects that I use every day:
The largest thing I’ve built lately is an alternative to OpenClaw that focuses on security.
I’ve wanted an LLM personal assistant for years, and I finally got one with this.
Here, most people say “but you can’t make LLMs secure!”, which is misunderstanding that security is all about tradeoffs, and that what my agent tries to do is maximize security for a given amount of usability.
I think it succeeds very well; I’ve been using it for a while now and really like the fact that I can reason exactly about what it can and can’t do.
It manages my calendar and intelligently makes decisions about my availability or any clashes, does research for me, extends itself by writing code, reminds me of all the things I used to forget and manages chores autonomously, etc.
Assistants are something that you can’t really explain the benefit of, because they don’t have one killer feature, but they alleviate a thousand small paper cuts, paper cuts which are different for each person.
So, trying to explain to someone what’s so good about having an assistant tends to get the reaction “but I don’t need any of the things you need”. That misses the point: everyone needs different things, and an agent with access to tools and the ability to make intelligent decisions is a great help for anyone.
I’m planning to write this up in more detail soon, as there were some very interesting challenges when designing it, and I like the way I solved them.
Maybe my naming recently hasn’t been stellar, but this is a small pendant that records voice notes, transcribes them, and optionally POSTs them to a webhook of your choice.
I have it send the voice notes to my LLM, and it feels great to just take the thing out of my pocket at any time, press a button, and record a thought or ask a question into it, and know that the answer or todo will be there next time I check my assistant’s messages.
It’s a simple thing, but the usefulness comes not so much from what it does, but from the way it does it.
It’s always available, always reliable, and with zero friction to use.
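The receiving side of a webhook like this is easy to integrate against. Here’s a minimal sketch of the POST (the JSON field names and the function names here are my assumptions for illustration; the pendant’s actual payload format isn’t documented in this post):

```python
import json
import urllib.request


def build_note_payload(transcript: str, device_id: str) -> bytes:
    # Illustrative payload shape; the pendant's real JSON fields
    # aren't documented here.
    return json.dumps({"device": device_id, "text": transcript}).encode()


def post_voice_note(webhook_url: str, transcript: str,
                    device_id: str = "pendant") -> int:
    # POST the transcribed note as JSON and return the HTTP status code.
    req = urllib.request.Request(
        webhook_url,
        data=build_note_payload(transcript, device_id),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Anything that accepts a JSON POST (an assistant’s inbox, a notes app, a plain log) can sit on the other end, which is what makes the “just a webhook” design flexible.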
I’m planning to write something about this too, but this one is more of an art piece:
It’s a ticking wall clock that ticks seconds irregularly, but is always accurate to the minute (with its time getting synced over the internet).
It has various modes, one mode has variable tick timing, from 500 ms to 1500 ms, which is delightfully infuriating.
Another mode ticks imperceptibly more quickly than a second, but then pauses for a second randomly, making the unsuspecting observer question their sanity.
Another one races to :59 at double speed and then waits there for thirty seconds, and the last one is simply a normal clock, because all the irregular ticking drives me crazy.
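One way the variable-tick mode could work (a sketch under my own assumptions, not the clock’s actual firmware, which isn’t shown here) is to draw random tick lengths and then rescale them so each minute’s ticks sum to exactly sixty seconds, which is what keeps the clock accurate at every minute boundary:

```python
import random


def variable_tick_schedule(n_ticks: int = 60, total_s: float = 60.0,
                           lo: float = 0.5, hi: float = 1.5) -> list[float]:
    # Draw n_ticks random durations in [lo, hi] seconds, then rescale
    # them so they sum exactly to one minute. Individual ticks may end
    # up slightly outside [lo, hi] after rescaling, but the mean draw
    # is 1.0 s, so the scale factor stays close to 1.
    raw = [random.uniform(lo, hi) for _ in range(n_ticks)]
    scale = total_s / sum(raw)
    return [d * scale for d in raw]
```

The firmware would then sleep for each duration in turn and re-sync the schedule against network time at the top of every minute.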
Pine Town is a whimsical infinite multiplayer canvas of a meadow, where you get your own little plot of land to draw on.
Most people draw… questionable content, but once in a while an adult will visit and draw something nice.
Some drawings are real gems, and it’s generally fun scrolling around to see what people have made.
I’ve made all these projects with LLMs, and have never even read most of their code, but I’m still intimately familiar with each project’s architecture and inner workings.
This is how:
For the harness, I use OpenCode.
I really like its features, but obviously there are many choices for this, and I’ve had a good experience with Pi as well. Whatever harness you use, it needs to let you:

- Use a different model for each agent.
- Let agents talk to each other (i.e. call subagents).
There are various other nice-to-haves, such as session support, worktree management, etc, that you might want to have depending on your project and tech stack, but those are up to you.
I’ll explain the two requirements above, and why they’re necessary.
You can consider a specific model (e.g. Claude Opus) as a person.
Sure, you can start again with a clean context, but the model will mostly have the same opinions/strengths/weaknesses as it did before, and it’s very likely to agree with itself.
This means that it’s fairly useless to ask a model to review the code it just wrote, as it tends to mostly agree with itself, but it also means that getting a different model to review the code will lead to a big improvement.
Essentially, you’re getting a review from a second set of eyes.
Different models will have different strengths and weaknesses here.
For example (and this is very specific to today’s models), I find Codex 5.4 pretty nitpicky and pedantic.
This isn’t something I want when I want to get code written, but it definitely is something I want for a review.
The decisions Opus 4.6 makes correlate quite well with the decisions I would have made, and Gemini 3 Flash (yes, Flash!) has even been very good at coming up with solutions that other models didn’t see.
Everyone has a different opinion on what model suits them for which job, and models tend to alternate (e.g. I used Codex as my main model back in November, switching back to Opus later).
To get the best results, you need a mix of all of them.
The workflow I use consists of different agents, and if the harness doesn’t have the ability to let agents talk to each other, you’ll be doing a lot of annoying ferrying of information between LLMs.
You probably want to cut down on that, so this is a very useful feature.
My workflow consists of an architect, a developer, and one to three reviewers, depending on the importance of the project.
These agents are configured as OpenCode agents (basically skill files, files with instructions for how I want each agent to behave).
I write these by hand, as I find it doesn’t really help to ask the LLM to write a skill: it would be like asking someone to write up instructions on how to be a great engineer, then handing those instructions back to them and saying “here’s how to be a great engineer, now be one”.
That obviously won’t make them better, so I write the instructions myself.
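For illustration, one of these agent files might look roughly like this. The frontmatter fields, file path, model string, and wording below are my assumptions based on OpenCode’s markdown-with-frontmatter agent format; check your harness’s documentation for the exact schema:

```markdown
---
description: Reviews diffs against the plan and critiques them
model: openai/gpt-5-codex
---

You are a code reviewer. You will be given a plan file and a diff.
Check that the diff actually implements the plan, and critique it for
correctness, security, and maintainability. Be specific in your
feedback, and don't nitpick style.
```

The point is that the instructions encode how you want each role to behave, which is why writing them yourself matters.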
The architect (Claude Opus 4.6, currently) is the only agent I interact with.
This needs to be a very strong model, typically the strongest model I have access to.
This step doesn’t consume too many tokens, as it’s mostly chat, but you want this to be very well-reasoned.
I’ll tell the LLM my main goal (which will be a very specific feature or bugfix, e.g. “I want to add retries with exponential backoff to Stavrobot so that it can retry if the LLM provider is down”), and talk to it until I’m sure it understands what I want.
This step takes the most time, sometimes even up to half an hour of back-and-forth until we finalize all the goals, limitations, and tradeoffs of the approach, and agree on what the end architecture should look like.
It results in a reasonably low-level plan, with a level of detail of individual files and functions.
For example, tasks might be “I’ll add exponential backoff to these three codepaths of these two components in this file, as no other component talks to the LLM provider”.
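The retry feature from that example is a standard pattern, and a minimal sketch of it looks like this (illustrative only; Stavrobot’s actual implementation isn’t shown in this post):

```python
import random
import time


def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0,
                      max_delay: float = 60.0):
    # Retry fn() with exponential backoff plus jitter, so a flaky
    # LLM provider gets progressively longer breathing room.
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries; surface the error.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter avoids synchronized retry stampedes.
            time.sleep(delay * random.uniform(0.5, 1.5))
```

The plan’s job is to pin down exactly which codepaths get wrapped in something like this, so the developer agent has no architectural decisions left to make.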
I know that some people in this step prefer to have the LLM write out the plan to a file, and then they add their feedback to that file instead of talking to the LLM.
This is a matter of personal preference, as I can see both approaches working equally well, so feel free to do the reviews that way if it suits you more.
Personally, I prefer chatting to the LLM.
To clarify, in this step I’m not just prompting, I’m shaping the plan with the help of the LLM.
I still have to correct the LLM a lot, either because it’s wrong or simply because it’s not doing things the way I’d do them, and that’s a big part of my contribution, as well as the part I get joy from.
This direction is what lets me call projects mine, because someone else using the same LLM would have come up with a different thing.
When I’m satisfied that we’ve ironed out all the kinks (the LLM is very helpful at this, asking questions for what it doesn’t know yet and giving me options), I can finally approve the plan.
I’ve asked the architect to not start anything until I actually say the word “approved”, as a few models tend to be overeager and go off to start the implementation when they feel like they understood, whereas I want to make sure I’m confident it understood.
Then, the architect will split the work into tasks, and write each task out into a plan file, usually in more detail (and at a lower level) than our chat, and call the developer to start work.
This gives the developer concrete direction, and minimizes the high-level choices the developer can make, as the choices have already been made for it.
The developer can be a weaker, more token-efficient model (I use Sonnet 4.6).
The plan shouldn’t give it much leeway into what it can do, and its job is strictly to implement the changes in the plan.
When it’s done, it calls the reviewers to review its work.
Each reviewer will independently look at the plan and diff of the feature that was just implemented, and critique it.
For this step, I will always use at least Codex, sometimes I’ll add Gemini, and on important projects I’ll add Opus as well.
This feedback goes back to the developer, which integrates it if the reviewers agree, or escalates to the architect if the reviewers disagree.
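That integrate-or-escalate rule can be thought of as a tiny routing function. This is a toy model of it (the real workflow is conversational, and the verdict values here are made up for illustration):

```python
from collections import Counter


def route_review_feedback(verdicts: dict[str, str]) -> str:
    # verdicts maps a reviewer name to "approve" or "request_changes".
    # Unanimous approval merges, unanimous change requests go back to
    # the developer, and a split verdict escalates to the architect.
    counts = Counter(verdicts.values())
    if counts["approve"] == len(verdicts):
        return "merge"
    if counts["request_changes"] == len(verdicts):
        return "developer"
    return "architect"
```

The escalation path matters because a disagreement between reviewers is usually a sign that a real design decision is hiding in the diff, and that decision belongs with the strongest model (and with me).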
I’ve found that Opus is very good at choosing the right feedback to implement, sometimes ignoring feedback because it’s too pedantic (i.e. hard to implement and unlikely to be a problem in practice).
Obviously, when I use objective assessments like “very good”, I really mean “I agree with it a lot”.
This way of working means that I still know every choice that was made above the function level, and can use that knowledge in subsequent runs.
I often notice the LLM recommend things that might be good in another codebase, but either won’t work or are suboptimal in my codebase, which shows that the LLM has some blind spots when researching the code.
I will often say “no, you should do this using Y”, at which point the LLM realizes that Y actually exists in the code and is a better way than the one it recommended.
On the flip side, when I’m not familiar enough with the technology to be on top of the architecture, I tend to not catch bad decisions that the LLM makes.
This leads to the LLM building more and more on top of those bad decisions, eventually getting in a state where it can’t untangle the mess.
You know this is happening when you keep telling the LLM the code doesn’t work, and it says “I know why! Let me fix it” while breaking things more and more.
That’s a real failure mode that has happened to me too many times now, which is why I ended up with this workflow.
For this reason, I try to understand things as much as I can while planning, even if I’m unfamiliar with the specific technology.
If I manage to steer the LLM well, it saves a lot of trouble later on.
Here’s an annotated transcript from a real session where I add email support to Stavrobot.
I’ve trimmed the tool calls and verbose bits, but the conversation and decision-making process is exactly as it happened.
I start by telling the LLM what I’d like to implement, at a very high level.
Sometimes I’ll give it more detail, especially if I already have an idea of how I want the implementation done.
The bot reads the code and finds all the relevant bits, and asks some questions.
In this session, I came with just a bit of an idea that hadn’t been thought through yet.
The LLM helps by asking specific questions, where I decide which way I want the implementation to go:
The LLM shapes the plan, giving it more detail:
I reply with a few things that I see it has missed.
This requires me to know the architecture well, and following this process keeps me up to date with all the changes at a moderately low level.
The LLM updates the plan and asks any final questions.
Sometimes I’ll remember something and tell the LLM before reading its message:
It adapts by incorporating my concern and repeating its message.
I do go back and read its message, as it usually has good questions and I don’t want to miss answering any of them.
The LLM needs me to explicitly say the word “approved” to proceed.
I remember something while the LLM is working and stop it to ask:
The LLM writes the plan, calls the developer, the reviewers, and eventually finishes.
I’ve omitted all the background tool/agent calls here for brevity.
I have an idea for improving the UX by allowing the bot to read incoming emails without configuring an outgoing SMTP server, in case the user wants to forward things like invoices or trip plans for it to read, but doesn’t want the bot to be able to send email.
The LLM scopes out effort.
If it’s a small change, I’ll usually do it as part of this session, otherwise I’ll write it to a GitHub issue for me to work on at a later time.
More tool/agent calls elided here.
I QA the feature and come back with issues:
The LLM goes off and fixes the problem.
Here I spot that it’s missed a better way of implementing the feature, and I suggest that it changes its implementation:
I have second thoughts about making the check generic, because of the special case.
The LLM thinks about it a bit and recommends something reasonable.
I QA and check again.
I realize that emails work slightly differently than phone numbers, and that the bot now ignores my custom email addresses.
I talk to the LLM about adding this:
The LLM misunderstood what I wanted, so I clarify with a concrete use case:
I ask it to make sure it takes care around a caveat.
And ask for some documentation changes.
I ask for a clarification to catch a potential gotcha:
The session continued for a bit with me doing more QA rounds, adding wildcard matching for email addresses, a question about SQLi, and catching a missing entry in the subagent allowlist.
The conversation went more or less as above, with me either catching an error or proposing an improvement, refining it with the LLM, and implementing it.
The whole feature took about an hour, start to finish, and I ended the session there as I was satisfied that the feature works well.
That’s the basic overview of my setup.
It’s nothing extremely fancy, but it works very well for me, and I’ve been really pleased with the reliability of the whole process.
I’ve been running Stavrobot 24/7 for close to a month now, and it’s been extremely reliable.
If you have any feedback or just want to chat, get me onBluesky, or email me directly.