Thoughts on LLMs for programming
I thought it might be a good idea to record my current views on where LLMs stand when it comes to writing code.
Code completion
This is probably the least disruptive and most effective way to use LLMs for writing actual code. The speedup is excellent with most popular languages, but there is still a noticeable drop-off in capability with niche languages. My scripting language of choice lately is nushell, and both Copilot and Cursor struggle with its syntax.
Staying in the flow can be a bit difficult sometimes, but for those of us used to LSPs it shouldn't be too bad. Cursor's keybindings are a bit better than Copilot's at getting out of the way when I'd like to ignore a suggestion. Windsurf's neovim plugin is also quite good.
Asking questions and analyzing code
Chat models make exploring large codebases much easier. However, they seem to have limited ability to dive deep enough into dependency links in the code. At a certain level of complexity they tend to give up and reach a conclusion prematurely. Typically I'll jump in at that point, look at where the model got to, then nudge it to explore a bit further. After a few more iterations we usually get to the bottom of it.
Anecdotally, analysis seems to be more reliable than code completion for niche languages - there is less of a drop-off in quality.
Doing well documented common tasks
This is really excellent. Adding a documentation bundle to Cursor in particular works extremely well: the LLM will find the relevant documentation and be immediately able to suggest the correct change.
On the flip side, many of these cases could simply use a script instead, and it's unclear how much you gain compared to, say, having a search box in your editor to find the relevant script in the docs. However, since in practice most documentation isn't very neatly organized and automation can be somewhat spotty, I think this is a net win.
Writing new code in an existing codebase
This is where things can be hit or miss. If I let the model jump straight into writing code, it will often make poor decisions - sometimes due to lacking context, other times because it's biased towards action instead of thinking about the best approach.
What I've found extremely useful is Cline's plan / act separation. By default, the chat starts out in plan mode, where I go back-and-forth with the LLM outlining the task and figuring out exactly what needs to be done. Once we get to a good understanding, I switch to act mode and let the LLM write the code.
You can simulate this in Cursor too by using "ask" mode and then switching to "agent". Claude Code recently added a "plan" mode as well.
I've not experimented sufficiently with Cursor rules or CLAUDE.md memories. Should try those out.
Writing new code when you are not familiar with the technology
Letting the LLM write the code directly is typically a bad idea here. Instead, I tend to ask the LLM and then proceed to do something which I suspect many would find tedious: re-typing the code it provides. I know, it sounds like a waste of time, but it does wonders to help me take things in slowly. I always have the LSP on as well and pause to read the documentation as I go. I keep asking an endless number of clarifying questions. I ask for documentation references (whenever possible) to ensure I dive deeper into the concepts, and I keep a healthy dose of skepticism in case of hallucinations.
Reasoning about complex bugs and behavior
Reasoning is hit-and-miss. In well-trodden areas (e.g. web development) the models do really well. In more niche areas (e.g. Rust code) there is some struggle. With technologies that have very little real-world training material (e.g. predicting terraform plan behaviour) it's quite bad.
Hands-off automation
Surprisingly enough, LLMs also tend to fail here. Even when it's possible to explain the exact steps to take, the error rate in some domains is too high to do something reliably dozens to hundreds of times without supervision.
Other than the obvious (using temperature=0), I suspect there are a couple of interesting things that can be done here:
- Providing a constrained set of tools specific to the problem at hand (a rough sketch follows the list below). The idea is to let the user maximize the percentage of the task that is fully deterministic and to prevent the model from doing something disastrous while unsupervised.
- Providing pen-and-paper-equivalent tools (checklists, tables, etc.). These should help the model keep track of the steps more reliably.
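To make this concrete, here is a minimal sketch in Python of what a constrained tool set plus a checklist tool could look like. The tool names, the dispatcher, and the no-shell policy are my own illustration (this is not llmcli's actual interface), but the shape is what I have in mind: every tool is narrow, deterministic, and safe to run unsupervised.

```python
# A minimal sketch of a constrained, problem-specific tool set for an
# unsupervised run. Tool names and behavior are illustrative only; the key
# property is that every tool is narrow and deterministic, and there is
# deliberately no generic shell-execution tool.

import json

CHECKLIST: dict[str, str] = {}  # step name -> "pending" | "done"

def checklist_init(steps: list[str]) -> str:
    """Create a checklist so the model can track its progress explicitly."""
    CHECKLIST.clear()
    CHECKLIST.update({step: "pending" for step in steps})
    return json.dumps(CHECKLIST)

def checklist_mark_done(step: str) -> str:
    """Mark one step as done and report what is still pending."""
    if step in CHECKLIST:
        CHECKLIST[step] = "done"
    pending = [s for s, status in CHECKLIST.items() if status == "pending"]
    return json.dumps({"pending": pending})

def read_file(path: str) -> str:
    """Read-only file access; writes would go through a separate, reviewed tool."""
    with open(path, encoding="utf-8") as f:
        return f.read()

# The only tools the model is allowed to call.
TOOLS = {
    "checklist_init": checklist_init,
    "checklist_mark_done": checklist_mark_done,
    "read_file": read_file,
}

def dispatch(tool_name: str, arguments: dict) -> str:
    """Deterministically execute a tool call coming back from the model."""
    if tool_name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {tool_name}"})
    return TOOLS[tool_name](**arguments)
```

However you describe these tools to the model (OpenAI-style function schemas, MCP definitions, etc.), the dispatcher ensures the model can only ever reach the functions above.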
Unfortunately, current CLI tools have a few issues:
- they seem to be geared more towards interactive use,
- they have a somewhat tedious way of defining new tools (MCP),
- they tend to include very powerful tools by default (e.g. shell execution).
I'm working on an easy-to-use CLI tool that solves this - llmcli.
If done right, I suspect this could be a huge multiplier in productivity. Right now most of the time spent using chat tools goes into interactively micromanaging and checking on the LLM, which caps the potential gains.
Things to explore
Updating documentation
Keeping documentation up-to-date is a common problem. In theory, LLMs should be able to help by finding the relevant documentation and suggesting updates.
Updating dependencies
Another huge problem. LLMs could potentially help here a bit by analyzing the changed parts of the dependency and suggesting updates. In concert with tests this could work incredibly well.
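A rough sketch of the loop I have in mind: let an LLM propose an upgrade patch based on what changed in the dependency, then only keep the patch if the test suite still passes. The ask_llm_for_patch helper below is a hypothetical placeholder, and the git/pytest wiring is just one possible setup.

```python
# Sketch only: the LLM-backed step is a placeholder, and the git/pytest
# commands are one way of wiring the "keep it only if the tests pass" idea.

import subprocess

def ask_llm_for_patch(dependency: str, changelog: str) -> str:
    """Placeholder: send the changelog plus the relevant call sites to an LLM
    and get back a unified diff. Not implemented here."""
    raise NotImplementedError

def tests_pass() -> bool:
    """Run the project's test suite and report whether it succeeded."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

def try_update(dependency: str, changelog: str) -> bool:
    patch = ask_llm_for_patch(dependency, changelog)
    subprocess.run(["git", "apply"], input=patch.encode(), check=True)
    if tests_pass():
        return True
    # Revert the working tree if the tests fail, so a human can take over.
    subprocess.run(["git", "checkout", "--", "."], check=True)
    return False
```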
Self-updating hierarchical memories
LLMs should be able to learn from their interactions and continuously update their memories. Memories should follow a hierarchy: the top-level memories are concepts found to generalize across the user's different projects and tasks, while the lower-level memories are specific to the current project. The hierarchy could also extend over time, with weeks and months having their own memories and compaction process - possibly with a bias towards recent memories.
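A toy sketch of the shape I'm imagining - the level names and the compaction rule (summarize older entries, keep recent ones verbatim) are made up purely to illustrate the idea:

```python
# Illustrative only: a memory hierarchy from general to specific, with a
# recency-biased compaction step at each level.

from dataclasses import dataclass, field

@dataclass
class MemoryLevel:
    name: str                      # e.g. "global", "project", "recent"
    entries: list[str] = field(default_factory=list)
    max_entries: int = 50

    def add(self, note: str) -> None:
        self.entries.append(note)

    def compact(self, summarize) -> None:
        """Fold older entries into one summary, biased towards recent notes."""
        if len(self.entries) <= self.max_entries:
            return
        cutoff = self.max_entries // 2
        recent = self.entries[-cutoff:]              # recent notes survive verbatim
        summary = summarize(self.entries[:-cutoff])  # e.g. an LLM call
        self.entries = [summary] + recent

@dataclass
class MemoryHierarchy:
    # Ordered from most general to most specific.
    levels: list[MemoryLevel] = field(default_factory=lambda: [
        MemoryLevel("global"),     # concepts that generalize across projects and tasks
        MemoryLevel("project"),    # conventions of the current codebase
        MemoryLevel("recent"),     # what happened over the last sessions/weeks
    ])

    def context(self) -> str:
        """Flatten the hierarchy into a prompt preamble."""
        return "\n".join(f"[{level.name}] {entry}"
                         for level in self.levels for entry in level.entries)
```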
Conclusion
While there's been a lot of progress, LLMs still leave a lot to be desired. I suspect that most of the upcoming innovation will focus on managing memory and context. Repeatable reliability, as well as capabilities that are not covered well by the training data, will likely continue to be a challenge. To address reliability, I suspect we'll compensate with more comprehensive documentation and tools that provide stronger guardrails for the LLMs. Out-of-distribution capabilities remain elusive and would likely require radical innovation.