The Gap Between Prototype and Production
Most LLM projects die in the gap between prototype and production. A developer builds something impressive in a notebook — it answers questions, generates summaries, extracts data. Then the question becomes: how do we make this reliable, fast, observable, and affordable at scale?
That question is harder than it looks.
Production LLM systems are not just "deployed notebooks." They are distributed systems with stochastic outputs, high inference costs, and nondeterministic failure modes. Engineering them requires a stack of disciplines that most AI teams are still assembling.
This guide covers what we have learned building production LLM systems for enterprise clients across travel, healthcare, and ERP sectors.
Infrastructure Fundamentals
Model Selection and Hosting Strategy
Before writing a single line of orchestration code, you need to make a structural decision: hosted API or self-hosted model?
Hosted APIs (OpenAI, Anthropic, Google) are fast to integrate, require no infrastructure, and are continuously updated. They are the right default for most teams starting out. Per-token pricing is easy to forecast at low volume, but the bill becomes a significant line item at scale.
Self-hosted open models (Llama 3, Mistral, Qwen) require GPU infrastructure investment but offer complete control, no data egress risk, and lower per-token cost at high throughput. They are the right choice when you have regulatory constraints or very high call volumes.
Most serious enterprise deployments end up with a hybrid routing model: a fast, cheap model for high-volume simple tasks; a frontier model for complex reasoning tasks; and a self-hosted fine-tuned model for proprietary domain knowledge.
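The hybrid routing model can be sketched in a few lines. Everything here is illustrative: the model identifiers are placeholders, and the keyword heuristic stands in for whatever task classifier you actually use.

```python
# Minimal sketch of a hybrid routing layer. Model names and the
# classify_complexity heuristic are illustrative assumptions, not
# a production classifier.

MODEL_TIERS = {
    "simple": "small-fast-model",       # high-volume classification, extraction
    "complex": "frontier-model",        # multi-step reasoning
    "domain": "finetuned-local-model",  # proprietary domain knowledge
}

def classify_complexity(task: str) -> str:
    """Crude keyword heuristic standing in for a real task classifier."""
    lowered = task.lower()
    if any(kw in lowered for kw in ("plan", "analyze", "reason", "compare")):
        return "complex"
    if "booking" in lowered:  # hypothetical proprietary-domain signal
        return "domain"
    return "simple"

def route(task: str) -> str:
    """Return the model id the task should be sent to."""
    return MODEL_TIERS[classify_complexity(task)]
```

In practice the classifier is itself often a small model or a lookup on the calling feature, but the shape of the router stays the same.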
Inference Infrastructure
For self-hosted inference, use a purpose-built serving framework:
- vLLM — best throughput for batch use cases, continuous batching
- TGI (Text Generation Inference) — Hugging Face, strong streaming support
- Ollama — local development and lightweight deployments
For production, deploy on GPU-optimized cloud instances (A100s or H100s for larger models, T4s or A10s for smaller models). Use autoscaling with a queue-based architecture to handle burst traffic without running GPUs at idle.
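The queue-based pattern can be sketched with an in-process queue. This is a stand-in only: in production the queue would be a managed broker and the worker an inference server, with autoscaling driven by queue depth.

```python
import queue

# Sketch of queue-based request batching in front of a GPU worker.
# An in-process queue.Queue stands in for a managed broker here.

request_queue: "queue.Queue[str]" = queue.Queue()

def drain_batch(max_batch: int = 8) -> list[str]:
    """Pull up to max_batch pending prompts without blocking, so a GPU
    worker can process them as one batch instead of one at a time."""
    batch: list[str] = []
    while len(batch) < max_batch:
        try:
            batch.append(request_queue.get_nowait())
        except queue.Empty:
            break
    return batch
```

Burst traffic piles up in the queue rather than in GPU memory, and scaling workers up or down is a function of queue depth rather than raw request rate.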
Latency Architecture
LLM latency has two components: time-to-first-token (TTFT) and time-per-output-token (TPOT). Users perceive TTFT as "loading" and TPOT as "typing speed."
Strategies for production latency management:
- Streaming responses — send tokens to the client as they are generated; dramatically improves perceived latency
- Prompt caching — cache static prefix tokens; major providers now support this natively
- Speculative decoding — use a smaller draft model to propose tokens, then verify them with the larger model; 2-3x speedup
- Response caching — cache semantically similar queries; works well for FAQ-style applications
- Model quantization — 4-bit or 8-bit quantization; 2-4x memory reduction with minimal quality loss on most tasks
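The TTFT/TPOT decomposition makes the value of streaming concrete. A back-of-envelope model (numbers in the test are illustrative, not benchmarks):

```python
# Back-of-envelope latency model for an LLM response.

def total_latency_ms(ttft_ms: float, tpot_ms: float, n_tokens: int) -> float:
    """Wall-clock time for a full response of n_tokens: time to the first
    token, plus per-token time for each remaining token."""
    return ttft_ms + tpot_ms * max(n_tokens - 1, 0)

def perceived_wait_ms(ttft_ms: float, streaming: bool,
                      tpot_ms: float = 0.0, n_tokens: int = 0) -> float:
    """What the user experiences as 'loading': just TTFT when streaming,
    the full response time when not."""
    if streaming:
        return ttft_ms
    return total_latency_ms(ttft_ms, tpot_ms, n_tokens)
```

With a 300 ms TTFT and 20 ms TPOT over ~100 tokens, a non-streaming response feels like a 2.3-second wait; streaming turns that into a 300 ms wait followed by visible "typing."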
Prompt Engineering at Scale
Prompt engineering in production is different from prompt experimentation. In production, prompts are:
- Version-controlled assets, not strings in code
- Tested against regression suites before deployment
- Separated from application logic
- Localized for different regions and user types
Use a prompt registry — a versioned store for your prompt templates. This can be as simple as a database table with version history, or a dedicated tool like LangSmith or Humanloop.
Structure your prompts with explicit sections:
[SYSTEM]
You are a travel booking assistant for TourOxy. Your role is to...
[CONTEXT]
Current booking state: {{booking_state}}
User preferences: {{preferences}}
[INSTRUCTION]
Based on the above, {{task_description}}
[FORMAT]
Respond in the following JSON structure:
{{output_schema}}
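Rendering a sectioned template like the one above is a small amount of code. The `{{var}}` syntax mirrors the example; a production system would typically use a real template engine with escaping and missing-variable handling, which this sketch only approximates.

```python
import re

# Sketch of rendering {{var}} placeholders in a sectioned prompt template.
# A KeyError on a missing variable is deliberate: failing loudly beats
# silently sending a prompt with an unfilled slot to the model.

def render(template: str, variables: dict[str, str]) -> str:
    def substitute(match):
        return variables[match.group(1)]
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)
```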
Observability: What You Must Track
Running LLMs in production without observability is operating blind. You need visibility into:
Latency metrics: p50, p95, p99 for TTFT and total response time. Break down by model, prompt template, and user segment.
Quality metrics: This is harder. Build an evaluation pipeline that samples production outputs and scores them. Use an LLM judge for automated scoring, supplemented by human review on critical paths.
Cost tracking: Token usage by model, by user, by feature, by day. Set alerts for anomalous cost spikes — a prompt injection or runaway loop can spike costs in minutes.
Error rates: Track model errors, timeout rates, guardrail triggers, and output parse failures separately.
Input/output logging: Log all inputs and outputs (with PII filtering) for debugging and for building evaluation datasets. Storage is cheap. The ability to reproduce a failure is not.
Guardrails and Output Reliability
LLM outputs are probabilistic. You cannot guarantee what a model will produce, but you can build systematic defenses:
Input validation: Before sending to the model, validate that the input is within expected parameters. Reject inputs that exceed context limits, contain injection patterns, or fall outside the intended use case.
Output parsing with fallbacks: Parse structured outputs with schema validation. If the model fails to produce valid JSON, have a fallback strategy — retry with a more explicit prompt, fall back to a simpler response, or escalate to a human workflow.
Semantic guardrails: Use a classifier or lightweight model to check outputs for toxicity, hallucination indicators, off-topic responses, or policy violations before returning to the user.
Confidence scoring: Not all models expose logprobs, but when available, use them to flag low-confidence outputs for review rather than serving them directly.
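The parse-with-fallback pattern can be sketched as follows. `call_model` is a hypothetical stand-in for your model client, and the `required_keys` check is a crude substitute for full JSON Schema validation.

```python
import json

# Sketch of structured-output parsing with retry-then-escalate fallback.
# call_model is any callable prompt -> raw string (a hypothetical client).

def parse_with_fallback(call_model, prompt: str, required_keys: set[str],
                        max_retries: int = 2):
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            data = json.loads(raw)
            if required_keys <= data.keys():
                return data
        except json.JSONDecodeError:
            pass
        # Retry with a more explicit instruction appended.
        attempt_prompt = (prompt + "\nRespond with ONLY valid JSON "
                          "containing keys: " + ", ".join(sorted(required_keys)))
    return None  # caller escalates: simpler response or human workflow
```

Returning `None` rather than raising keeps the escalation decision with the caller, which is where the "simpler response vs. human workflow" choice belongs.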
Cost Engineering
At scale, LLM costs are a significant line item. Engineering cost discipline from the start is far easier than retrofitting it later.
Key cost levers:
- Model routing: Route to the smallest capable model. A simple classification task does not need GPT-4; it needs a well-prompted GPT-3.5 or a local Mistral.
- Context window management: Trim context aggressively. Every unnecessary token costs money and adds latency.
- Caching: Semantic caching of frequent queries (using vector similarity) can reduce API calls by 30-60% in many applications.
- Batching: For non-real-time use cases, batch requests rather than processing one at a time.
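Semantic caching in particular is worth sketching. Here `embed` is a toy bag-of-words stand-in for a real embedding model, and the 0.9 similarity threshold is an assumed tuning knob; the cache logic itself is the same with real vectors.

```python
import math

# Minimal semantic cache sketch. embed() is a toy stand-in for a real
# embedding model; the threshold is an assumed tuning knob.

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words embedding: word -> count."""
    vec: dict[str, float] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[dict[str, float], str]] = []

    def get(self, query: str):
        """Return a cached answer if any stored query is similar enough."""
        qv = embed(query)
        for vec, answer in self.entries:
            if cosine(qv, vec) >= self.threshold:
                return answer
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))
```

A production version swaps in real embeddings and a vector index; the cache-hit check and the threshold tuning problem carry over unchanged.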
Deployment and Release Management
LLM systems need a deployment discipline that accounts for their non-deterministic nature:
Shadow mode: Run the new model or prompt version in parallel with production, logging outputs, before switching traffic.
Canary deployments: Route 5-10% of traffic to a new model version while monitoring quality and error metrics before full rollout.
Rollback capability: Always be able to roll back to the previous model or prompt version within minutes. LLM regressions are often subtle and only visible at scale.
A/B testing framework: For user-facing features, build A/B testing into the LLM layer so you can measure the business impact of model changes, not just the technical metrics.
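Canary and A/B routing share one mechanical requirement: assignment must be sticky, so a user does not bounce between versions mid-session. A common way to get that is hashing the user id into buckets; the function name and the 10% split below are illustrative.

```python
import hashlib

# Sketch of deterministic canary routing: hashing the user id makes
# assignment sticky — the same user always lands on the same version.

def assign_version(user_id: str, canary_percent: int = 10) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

The same bucketing drives A/B tests: log the assigned version on every call record, and the business-impact comparison becomes a group-by over your existing observability data.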
What Separates Prototype from Production
The engineering gap between a promising LLM demo and a system that operates reliably at scale is real and significant. The teams that close it fastest are the ones that treat LLM systems as distributed systems first — with all the observability, reliability engineering, and operational discipline that entails — and AI applications second.
The model is only 20% of the work. The other 80% is the system around it.