Building with LLMs: What I've Learned After a Year
After spending the past year building products on top of large language models, I’ve accumulated a set of lessons that I wish someone had told me at the start. Not the hype-cycle takes, but the practical, in-the-trenches observations from actually shipping things.
Prompts are code; treat them like it
The single most important shift in my thinking was treating prompts as first-class engineering artifacts. Version them. Test them. Review them in PRs. When a prompt changes behavior in production, you need the same auditability you’d want for any other code change.
I’ve seen teams treat prompts as magic incantations — tweaking words until the output “feels right.” That doesn’t scale. What scales is systematic evaluation: a set of test cases, a scoring rubric, and a CI step that catches regressions.
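To make "version them, test them, review them" concrete, here is a minimal sketch of a prompt stored as a versioned artifact in the repository. The `PromptTemplate` class, its fields, and the `SUMMARIZE_V2` example are all hypothetical names chosen for illustration, not part of any particular library. The idea is that a content fingerprint gets logged with every model call, so a change in production behavior can be traced back to a specific prompt revision.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    """A prompt kept in version control like any other code (hypothetical structure)."""

    name: str
    version: str
    template: str

    def fingerprint(self) -> str:
        # Short content hash: any edit to the template text changes it,
        # so logs can record exactly which prompt produced an output.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def render(self, **variables: object) -> str:
        # str.format raises KeyError when a placeholder has no value,
        # which surfaces drift between the template and its call sites.
        return self.template.format(**variables)


# Hypothetical example prompt; the wording is an assumption for illustration.
SUMMARIZE_V2 = PromptTemplate(
    name="summarize",
    version="2.0.0",
    template="Summarize the following text in {max_sentences} sentences:\n\n{text}",
)

rendered = SUMMARIZE_V2.render(max_sentences=2, text="LLMs are useful but slow.")
log_tag = f"{SUMMARIZE_V2.name}@{SUMMARIZE_V2.version}#{SUMMARIZE_V2.fingerprint()}"
```

Attaching something like `log_tag` to each request is one way to get the auditability described above; a unit test that pins the expected fingerprint would also force prompt edits to go through review.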
Latency is the real product constraint
Most LLM product debates focus on capability: can the model do X? But in practice, the binding constraint is almost always latency. Users will tolerate a slightly worse answer if it arrives in 500 ms instead of 5 seconds. This means:
- Streaming is not optional. If your product shows a loading spinner for 3+ seconds, you’ve already lost.
- Cache aggressively. Many queries are semantically similar. A good caching layer can eliminate 30-40% of model calls.
- Use the smallest model that works. Don’t default to the frontier model. Profile your tasks and right-size.
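As a rough illustration of the caching point, here is a minimal sketch of a cache in front of a model call. It is a simplification: it only catches queries that become identical after normalization (case, punctuation, whitespace), whereas a truly semantic cache would typically compare embeddings. `CachedModel`, `normalize`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real API call.

```python
import hashlib
import re
from typing import Callable, Dict


def normalize(query: str) -> str:
    # Lowercase, drop punctuation, and collapse whitespace so trivially
    # different phrasings map to the same cache key.
    cleaned = re.sub(r"[^\w\s]", "", query.lower())
    return " ".join(cleaned.split())


class CachedModel:
    """Wraps a model call with an in-memory cache keyed on the normalized query."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def ask(self, query: str) -> str:
        key = hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cache miss: pay for the model call once, then remember the answer.
        self.misses += 1
        answer = self._call_model(query)
        self._cache[key] = answer
        return answer


def fake_model(query: str) -> str:
    # Stand-in for a real (slow, paid) model call; an assumption for this sketch.
    return f"answer to: {query}"


model = CachedModel(fake_model)
model.ask("What is our refund policy?")
model.ask("what is our refund policy")  # same key after normalization
print(model.hits, model.misses)  # prints: 1 1
```

Tracking `hits` and `misses` is the useful part in practice: it tells you what fraction of calls your own traffic would actually let you skip, rather than relying on someone else's number.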
Evals are your test suite
Traditional software has unit tests. LLM software has evals. If you’re building without evals, you’re building without tests: you won’t know what’s broken until a user tells you.
Start simple: 20 representative inputs, expected outputs, a script that scores them. You can get sophisticated later. The important thing is having something that runs automatically and tells you when quality has shifted.
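A starting point along those lines might look like the sketch below. Everything here is an assumption for illustration: `stub_model` stands in for a real model call, exact-match scoring is the simplest possible rubric, and the three cases and the `THRESHOLD` value are placeholders for your own representative inputs and quality bar.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    # Simplest possible rubric: 1.0 for a case-insensitive match, else 0.0.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_evals(
    model: Callable[[str], str],
    cases: List[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of the model over all cases."""
    if not cases:
        raise ValueError("need at least one eval case")
    scores = [scorer(model(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call; an assumption for this sketch.
    answers = {"Capital of France?": "Paris", "2 + 2 = ?": "4"}
    return answers.get(prompt, "unknown")


CASES = [
    EvalCase("Capital of France?", "paris"),
    EvalCase("2 + 2 = ?", "4"),
    EvalCase("Capital of Japan?", "Tokyo"),  # the stub gets this one wrong
]

THRESHOLD = 0.6  # hypothetical quality bar; tune it to your own baseline


def main() -> int:
    score = run_evals(stub_model, CASES)
    print(f"mean score: {score:.2f}")
    # A non-zero return value can be used as the exit code to fail a CI job.
    return 0 if score >= THRESHOLD else 1
```

Calling `sys.exit(main())` from a script entry point is one way to wire this into CI. Swapping `exact_match` for a fuzzier scorer later does not require changing the rest of the harness.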
The interface matters more than the model
I’ve built features where switching from GPT-4 to a fine-tuned smaller model made no noticeable difference to users — but changing the UI to show results incrementally instead of all-at-once transformed the experience.
This is the unsexy truth about LLM products: the model is a component, not the product. The product is the interface, the workflow, the way you handle errors and edge cases. The teams that win are the ones that obsess over the full experience, not just the model.
What’s next
We’re still in the early days of understanding how to build well with these tools. The patterns are stabilizing, but they’re not settled. I’ll keep sharing what I learn as the landscape evolves.
The most important thing? Build something, ship it, learn from real users. The gap between “playing with the API” and “running it in production” is where all the real lessons live.