LLM Fundamentals That Actually Matter in Production
There is no shortage of content explaining how large language models work. Attention mechanisms, transformer architecture, pre-training objectives. These are important if you are building models. But most engineers are not building models. They are building products on top of them. And the fundamentals that matter in production are not the same ones that matter in a research paper.
This is what I wish someone had told me before I started working with LLMs in production systems: the things that determine whether your product works have very little to do with how the model was trained, and everything to do with how you use it.
Tokens are not words
This sounds basic, but it trips up more engineers than you would expect. LLMs do not process words. They process tokens, which are subword units determined by the model's tokenizer. The word "unhappiness" might be three tokens. A simple URL might be ten. This matters because context windows are measured in tokens, not words. Costs are measured in tokens. Latency scales with tokens. If you are building a system that processes user input, you need to think in tokens, not characters or words. Ignoring this leads to truncated inputs, unexpected costs, and silent failures where your system quietly drops the end of a long document because it exceeded the context window.
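To make that concrete, here is a minimal sketch of token-aware input handling using the tiktoken library; the cl100k_base encoding, the 8,192-token budget, and the output reserve are illustrative assumptions, not properties of any particular model.

```python
# A minimal sketch of token-aware input handling, using the tiktoken library.
# The encoding, context budget, and output reserve below are assumptions.
import tiktoken

MAX_CONTEXT_TOKENS = 8_192  # assumed budget for the target model
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count tokens the way the model will, not characters or words."""
    return len(encoding.encode(text))

def fits_in_context(prompt: str, document: str, reserve_for_output: int = 1_024) -> bool:
    """Fail loudly instead of silently dropping the end of a long document."""
    used = count_tokens(prompt) + count_tokens(document)
    return used + reserve_for_output <= MAX_CONTEXT_TOKENS
```

A check like this is cheap to run on every request, and it turns the silent-truncation failure mode into an explicit error you can handle.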
Context windows are a constraint, not a feature
Marketing materials love to advertise large context windows. 128k tokens. A million tokens. Engineers sometimes interpret this as "you can throw everything in and the model will figure it out." It will not. Larger context windows do not mean the model pays equal attention to everything inside them. In practice, information in the middle of a very long context gets less attention than information at the beginning or end. This is not a bug; it is a property of how attention works. If your system relies on the model finding a specific detail buried in a 100-page document, you are building on shaky ground. Retrieval, chunking, and careful context construction matter far more than raw window size.
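For illustration, here is a rough token-based chunking sketch along those lines; the chunk and overlap sizes are assumptions to tune against your own retriever and model, not recommendations.

```python
# A minimal chunking sketch using tiktoken; chunk and overlap sizes are
# illustrative and should be tuned for your retriever and model.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text: str, chunk_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping, token-sized chunks for retrieval."""
    tokens = encoding.encode(text)
    step = chunk_tokens - overlap
    return [
        encoding.decode(tokens[start:start + chunk_tokens])
        for start in range(0, len(tokens), step)
    ]
```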
Prompts are code
Most engineering teams treat prompts as configuration. A string that someone writes, tests manually a few times, and ships. This is a mistake. Prompts are the interface between your system and the model. They are as critical as any API contract. And they are fragile. Small changes in wording can produce meaningfully different outputs. Adding or removing a single example can shift behavior. A prompt that works perfectly in testing can fail in production when it encounters input patterns you did not anticipate.
Treat prompts like code. Version them. Test them systematically. Review changes. Build evaluation pipelines that catch regressions before your users do. The teams that do this well ship faster and break less. The teams that do not spend their time debugging mysterious production issues that trace back to a prompt edit someone made on a Friday afternoon.
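A minimal version of that discipline might look something like this; the prompt, the regression cases, and the call_model hook are placeholders standing in for your own client and data.

```python
# A minimal sketch of treating prompts as versioned, testable artifacts.
# PROMPT_TEMPLATE, REGRESSION_CASES, and call_model are placeholders.
PROMPT_VERSION = "summarize_v3"

PROMPT_TEMPLATE = """You are a support assistant.
Summarize the ticket below in one sentence.

Ticket:
{ticket}
"""

REGRESSION_CASES = [
    {"ticket": "App crashes on login since the last update.", "must_contain": "crash"},
    {"ticket": "Refund requested for a duplicate charge.", "must_contain": "refund"},
]

def run_prompt_regression(call_model) -> list[str]:
    """Return a list of failures so a prompt edit cannot ship silently."""
    failures = []
    for case in REGRESSION_CASES:
        output = call_model(PROMPT_TEMPLATE.format(ticket=case["ticket"]))
        if case["must_contain"] not in output.lower():
            failures.append(f"{PROMPT_VERSION}: expected '{case['must_contain']}' in output")
    return failures
```

Run something like this in CI, and the Friday-afternoon prompt edit gets caught before your users catch it for you.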
Evaluation is the hardest problem
Ask any team building with LLMs what their biggest challenge is, and if they are being honest, the answer is evaluation. How do you know if your system is getting better or worse? Traditional software has clear correctness criteria. A function returns the right value or it does not. LLM outputs exist on a spectrum. They can be partially correct, technically accurate but unhelpful, fluent but wrong, or correct but formatted in a way that breaks your downstream system.
You need evaluation at multiple levels. Automated metrics give you coverage but miss nuance. Human evaluation gives you quality but does not scale. LLM-based evaluation (using one model to judge another) is increasingly common but introduces its own biases. There is no single right answer. But having no evaluation strategy at all, which is where many teams start, means you are flying blind. Every production LLM system needs at minimum a way to detect regressions, a set of representative test cases, and a feedback loop from real usage.
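As a sketch of what "multiple levels" can mean in practice, here is one way to pair a cheap automated check with an LLM-as-judge score; the rubric, the scoring scale, and the judge_model hook are assumptions, not a prescribed setup.

```python
# A minimal evaluation sketch: a cheap automated metric plus an LLM-as-judge
# score. judge_model is a placeholder for whatever client you use.
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    expected_facts: list[str]  # ground truth the answer must mention

def automated_score(output: str, case: EvalCase) -> float:
    """Coverage-style metric: fraction of expected facts present in the output."""
    hits = sum(fact.lower() in output.lower() for fact in case.expected_facts)
    return hits / len(case.expected_facts)

def judge_score(judge_model, question: str, output: str) -> float:
    """LLM-as-judge: ask a second model to rate the answer from 0 to 10."""
    rubric = (
        "Rate the answer to the question on a 0-10 scale for accuracy and "
        f"helpfulness. Reply with only the number.\nQuestion: {question}\nAnswer: {output}"
    )
    return float(judge_model(rubric)) / 10.0
```

Neither score is trustworthy on its own; together, tracked over a fixed set of representative cases, they are enough to tell you whether a change made things better or worse.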
Garbage in, garbage out still applies
This principle is as old as computing, and it is more relevant with LLMs than ever. The quality of your model's output is bounded by the quality of what you put in. That includes the data used for fine-tuning, the context you provide at inference time, and the ground truth you use for evaluation. If your retrieval system feeds the model irrelevant documents, the output will be confident and wrong. If your training data has systematic biases, the model will reproduce them. If your evaluation data does not represent real usage, your metrics will lie to you.
Engineers building on LLMs need to care deeply about data quality. Not just schema validation and null checks, but the semantic quality of the information flowing through the system. This is where most production LLM failures actually originate. Not from the model, but from what you gave it to work with.
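One small example of that kind of semantic gate, assuming your retriever attaches a similarity score to each chunk; the threshold is illustrative and has to be tuned against your own data.

```python
# A minimal sketch of semantic quality gating on retrieved context, assuming
# each chunk dict carries a "text" and a retriever-provided "score".
MIN_SIMILARITY = 0.75  # illustrative threshold, tune against your own data

def build_context(retrieved: list[dict], max_chunks: int = 5) -> str:
    """Keep only chunks relevant enough to help; irrelevant context tends to
    make the model confidently wrong rather than cautious."""
    relevant = sorted(
        (c for c in retrieved if c["score"] >= MIN_SIMILARITY),
        key=lambda c: c["score"],
        reverse=True,
    )
    if not relevant:
        raise ValueError("No sufficiently relevant context retrieved; refuse or fall back.")
    return "\n\n".join(c["text"] for c in relevant[:max_chunks])
```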
Latency and cost are product decisions
In a research setting, you can wait 30 seconds for a response and spend a dollar per query. In production, latency and cost are product decisions that shape your architecture. Do you use a large model for everything, or route simple queries to a smaller model? Do you cache responses? Stream them? Precompute common outputs? These are engineering decisions that have nothing to do with model architecture, but they determine whether your product feels fast or slow, whether it scales to a million users or breaks at ten thousand.
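Here is a toy sketch of the routing-and-caching side of those decisions; the model names, the token threshold, and the call_model signature are all assumptions standing in for your own stack.

```python
# A minimal routing-and-caching sketch. Model names, the token threshold,
# and the call_model signature are illustrative assumptions.
import hashlib
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
_cache: dict[str, str] = {}

def route_model(query: str) -> str:
    """Send short, simple queries to a smaller, cheaper model."""
    return "small-model" if len(encoding.encode(query)) < 100 else "large-model"

def answer(query: str, call_model) -> str:
    """Cache exact-match queries so repeated questions never hit the model twice."""
    key = hashlib.sha256(query.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model=route_model(query), prompt=query)
    return _cache[key]
```

Real systems usually need something smarter than an exact-match cache and a length heuristic, but even this much can cut cost and latency noticeably for repetitive traffic.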
The fundamentals are not glamorous
None of this is as exciting as a new model architecture or a benchmark breakthrough. Tokenization, context management, prompt engineering, evaluation pipelines, data quality, cost optimization. These are the fundamentals that separate production systems from demos. They are not glamorous, but they are what make the difference between an LLM product that works and one that almost works. And in production, almost is not good enough.
This article was written by me and reflects my own personal and professional experience. AI models were used to assist with revision and editing.