Shipping Production AI: 20 Lessons Building RogueAI

Most AI demos die on the way to production. The model that wowed in a Loom video falls apart the moment a real user mistypes a query, latency spikes at 8am on a Monday, the hosted LLM rate-limits you mid-month, or the bill lands.

For the last two years I have shipped AI rather than demoed it. RogueAI.de is my personal AI lab — 20+ systems built end-to-end on Hetzner Germany, covering RAG pipelines, AI agents, LoRA fine-tuning, document AI, a meeting copilot, and offline voice dictation. No client work, no commercial product. Real systems running in production today.

Here is what that actually taught me — about infrastructure, cost, UX, and the gap between a working prototype and a working product.

Why I built 20+ production AI apps instead of polishing one

The conventional advice is “build one thing well.” For two years I did the opposite. One application teaches you one domain. Twenty teach you the patterns that hold across domains — and those patterns are the real reusable asset.

A pension calculator works on structured numerical data under regulatory constraints, running Monte Carlo simulations across thousands of market paths. A tax document AI is OCR plus extraction plus rule-based reconciliation. A meeting copilot ingests audio transcripts and reasons about them after the fact. A voice dictation tool handles real-time speech on a strict latency budget. CompliBot, with its 38 endpoints, handles thousands of documents across compliance workflows. Each is a different shape of engineering problem.

The data shapes differ. The user expectations differ. The latency tolerances differ. What stays constant by the twentieth build is the architecture. It comes down to a small number of decisions you make once and reuse forever.

The infrastructure decision behind shipping production AI: Hetzner Germany

Every RogueAI system runs on Hetzner Germany. Not AWS, not GCP. The reason is not ideology — it is unit economics and EU data residency.

The typical SaaS stack — cloud LLM, cloud vector store, cloud everything — burns money fast. A RogueAI application on a Hetzner VPS costs less per month than the equivalent AWS configuration costs per day. Across 20+ applications, that is the difference between a sustainable lab and an unaffordable one.

The architecture is consistent across the portfolio: Docker containers in isolated networks, Caddy as reverse proxy, automated health checks with automatic recovery, models deployed locally via Ollama or whisper.cpp where latency demands it, and no US cloud APIs anywhere in the data path. (For more on the German Mittelstand audit context this serves, see Cybersecurity Consulting Germany.)
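As a rough illustration of "health checks with automatic recovery", here is a minimal sketch in Python. It is not the RogueAI implementation: the `HealthMonitor` class, the `probe` and `restart` hooks, and the threshold of three consecutive failures are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthMonitor:
    """Counts consecutive failed probes and triggers recovery at a threshold."""

    probe: Callable[[], bool]    # hypothetical: returns True when the service answers
    restart: Callable[[], None]  # hypothetical: e.g. restarts the container
    max_failures: int = 3        # assumed threshold, tune per service
    consecutive_failures: int = 0

    def tick(self) -> str:
        """Run one check; intended to be called on a timer."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that raises (timeout, refused connection) counts as a failure.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
            return "healthy"

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.restart()
            self.consecutive_failures = 0
            return "restarted"
        return "degraded"
```

Requiring several consecutive failures before restarting keeps a single slow response from bouncing a service that is otherwise fine.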

There are tradeoffs. No managed Kubernetes. No managed embedding service. You provision Postgres yourself. At small-to-medium scale those constraints are features — they force architectural simplicity.

When hosted LLMs help with shipping production AI, and when they do not

The biggest cost variable is the LLM call. Ten dollars per million tokens sounds cheap until a single user burns through millions of tokens in an afternoon.

I split RogueAI applications into three classes:

Hosted LLM (Anthropic, OpenAI): for applications where users tolerate higher latency and the value per call is high. Document AI workloads — tax documents, complex contracts — fit here. Quality matters more than cost.
Self-hosted Ollama: for applications where the volume is high and the per-call complexity is low. Voice dictation runs entirely on local models because the user expects sub-second response and the network round-trip itself would break the UX.
Hybrid: for applications where the first 90 percent of the work can be done with a small self-hosted model and only edge cases need a hosted LLM. Meeting copilots fit here — local model handles the bulk of summarisation, hosted LLM handles requests where quality matters most.
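The hybrid class above can be sketched as a local-first router that escalates only when a draft fails a quality check. This is an illustrative sketch under assumptions: `answer_with_escalation` and the `is_good_enough` check are hypothetical names, and how "good enough" is decided depends on the application.

```python
from typing import Callable, Tuple

# A model is anything that maps a prompt to a completion.
Model = Callable[[str], str]


def answer_with_escalation(
    prompt: str,
    local_model: Model,
    hosted_model: Model,
    is_good_enough: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try the local model first; escalate only when the draft fails the check.

    Returns (route, answer) so the caller can log which path was taken.
    """
    draft = local_model(prompt)
    if is_good_enough(prompt, draft):
        return "local", draft
    # Edge case: pay for the hosted model only here.
    return "hosted", hosted_model(prompt)
```

Returning the route alongside the answer makes it cheap to measure what share of traffic actually stays local.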

The architectural lesson: never commit to a single LLM provider. Build the application so the LLM is a swappable component. I have moved RogueAI applications between hosted and self-hosted more than once, and the cost of doing so depends entirely on whether the abstraction was right from the start.
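One way to keep the LLM swappable is to have application code depend on a narrow interface rather than on a vendor SDK. The sketch below uses Python's `typing.Protocol`; the `LLM` interface, the two stub backends, and `summarise` are hypothetical names for illustration, and a real backend would wrap an actual client behind the same method.

```python
from typing import Protocol


class LLM(Protocol):
    """The only surface the application is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class LocalStub:
    """Stand-in for a self-hosted backend (assumption: a real one would call a local model)."""

    def complete(self, prompt: str) -> str:
        return f"[local] {prompt}"


class HostedStub:
    """Stand-in for a hosted backend (assumption: a real one would call a vendor API)."""

    def complete(self, prompt: str) -> str:
        return f"[hosted] {prompt}"


def summarise(llm: LLM, transcript: str) -> str:
    # Application logic sees only the interface, never the provider.
    return llm.complete(f"Summarise the following meeting transcript:\n{transcript}")
```

Swapping providers then means changing which object is passed to `summarise`, not editing the call sites.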

Cost management is a production AI feature, not a chore

The first time I shipped a RogueAI application without per-user token limits, one misbehaving user generated a bill larger than that user’s notional monthly subscription. Cost runaway is not a bug; it is a missing feature.

Every RogueAI application now has:

Per-user token budgets per session, per minute, per day
Aggressive caching at the prompt and at the embedding layer
Hard ceilings that fail loudly rather than fail expensively
Real-time cost dashboards that surface anomalies within minutes
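A per-user budget with a hard ceiling can be sketched as a sliding-window counter that raises instead of silently letting the call through. The names `TokenBudget` and `BudgetExceeded`, and the injectable `clock`, are assumptions for this example rather than the actual RogueAI code.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class BudgetExceeded(RuntimeError):
    """Raised instead of making the call: fail loudly, not expensively."""


class TokenBudget:
    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the window can be tested
        self._events: Dict[str, Deque[Tuple[float, int]]] = defaultdict(deque)

    def charge(self, user: str, tokens: int) -> int:
        """Record usage and return the remaining budget, or raise."""
        now = self.clock()
        events = self._events[user]
        # Drop usage that has aged out of the window.
        while events and now - events[0][0] >= self.window:
            events.popleft()
        used = sum(count for _, count in events)
        if used + tokens > self.limit:
            raise BudgetExceeded(
                f"{user}: {used} + {tokens} tokens exceeds {self.limit} "
                f"per {self.window}s"
            )
        events.append((now, tokens))
        return self.limit - used - tokens
```

Per-session, per-minute, and per-day limits can then be three instances with different windows, all checked before the model call is made.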

This is unglamorous work. It is also what separates a production system from a science fair project. Every prototype that crosses into production grows up on cost discipline before it grows up on anything else.
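Prompt-layer caching from the list above can be as simple as keying completions on a hash of the model name and the prompt. This is a minimal in-memory sketch: `PromptCache` is a hypothetical name, the whitespace normalisation is an assumption that will not suit every prompt, and a production cache would also need expiry and persistent storage.

```python
import hashlib
from typing import Callable, Dict


class PromptCache:
    """In-memory cache in front of any prompt -> completion function."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str) -> str:
        # Assumption: collapsing whitespace is safe for these prompts.
        normalised = " ".join(prompt.split())
        return hashlib.sha256(f"{model}\n{normalised}".encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        k = self.key(model, prompt)
        if k in self.store:
            self.hits += 1
            return self.store[k]
        self.misses += 1
        result = self.llm(prompt)  # the only place money is spent
        self.store[k] = result
        return result
```

Including the model name in the key prevents a model swap from serving answers produced by the previous model.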

RAG architecture for shipping production AI: what worked, what did not

Retrieval-augmented generation is the default architecture for AI over documents. CompliBot, with 38 endpoints handling thousands of documents, is the largest RAG system in the RogueAI portfolio, and several other systems use lighter-weight RAG patterns. What I learned from applying it to compliance documents, tax forms, meeting transcripts, and CRM records:

Chunking matters more than the embedding model. Spending two hours tuning chunk boundaries for a specific document type produces better results than upgrading from a free embedding model to a paid one.
Recency beats recall in most domains. Users would rather get a slightly less precise answer from this week’s document than a perfectly precise answer from a document that has been superseded.
Citations are the product, not a feature. A RAG application that cannot show its sources is not deployable to a regulated industry. Build citation rendering before you build the answer renderer.
Hybrid retrieval (semantic + keyword) wins. Semantic search alone consistently misses exact-match queries that users actually type. Keyword search alone misses paraphrased queries. Together they cover both.
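The article does not say how the semantic and keyword results are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used in RogueAI. It needs only the two ranked lists of document IDs, not their raw scores; `k = 60` is the constant conventionally used with RRF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document IDs; each list is ordered best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gains more from appearing near the top of any list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a vector search and a keyword search.
semantic_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Documents found by both retrievers ("a" and "c" here) rise above documents found by only one, which is the behaviour hybrid retrieval is after.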

For RAG, the simplest thing that works is usually the right answer: PostgreSQL with pgvector, local or hosted embeddings, hybrid retrieval, and citation rendering as a first-class component.
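To make citations first-class, each chunk can carry its source from the moment it is created. The sketch below packs paragraphs into chunks and keeps the source name with each one. `Chunk`, `chunk_by_paragraph`, `render_citation`, and the 800-character limit are assumptions for illustration; real chunk boundaries need tuning per document type, as noted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    source: str  # document identifier shown to the user
    index: int   # position of the chunk within that document
    text: str


def chunk_by_paragraph(source: str, text: str, max_chars: int = 800) -> List[Chunk]:
    """Pack whole paragraphs into chunks of roughly max_chars, never splitting one."""
    chunks: List[Chunk] = []
    buffer: List[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Flush before the buffer would grow past the limit.
        if buffer and size + len(para) > max_chars:
            chunks.append(Chunk(source, len(chunks), "\n\n".join(buffer)))
            buffer, size = [], 0
        buffer.append(para)
        size += len(para)
    if buffer:
        chunks.append(Chunk(source, len(chunks), "\n\n".join(buffer)))
    return chunks


def render_citation(chunk: Chunk) -> str:
    return f"[{chunk.source}, section {chunk.index + 1}]"
```

Because the source travels with the chunk, the answer renderer never has to reconstruct where a passage came from.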

LoRA fine-tuning as a production AI tool

A handful of RogueAI applications use LoRA fine-tuning rather than off-the-shelf hosted models. The pattern that worked: train a small, specialised LoRA for a narrow task (image generation in a specific brand style, content classification with a niche taxonomy, structured-output extraction for one document family) instead of expecting a general-purpose model to do it well.

Fine-tuning fits one specific niche: when the per-call cost of a hosted LLM is too high for the volume, and the task is narrow enough that a small model can match a large one given the right training data. It does not fit everywhere, but where it fits, it transforms the unit economics.
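The volume argument can be made concrete with a break-even calculation. The function below is a sketch and the figures in the usage note are invented for illustration, not RogueAI costs: it asks how many calls per month are needed before the fixed cost of running a self-hosted fine-tuned model is repaid by the per-call saving.

```python
import math


def breakeven_calls_per_month(
    hosted_cost_per_call: float,
    fixed_cost_per_month: float,
    selfhosted_cost_per_call: float = 0.0,
) -> float:
    """Calls per month above which self-hosting is cheaper than the hosted LLM."""
    saving_per_call = hosted_cost_per_call - selfhosted_cost_per_call
    if saving_per_call <= 0:
        # Self-hosting never pays off if each call is not cheaper.
        return math.inf
    return fixed_cost_per_month / saving_per_call
```

With made-up figures of 0.50 per hosted call, 0.25 per self-hosted call, and 100 per month of fixed cost, the break-even is 400 calls per month. Below that volume the hosted model is the cheaper choice, and the calculation says nothing about whether the small model is good enough for the task.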

Multi-modal patterns when shipping production AI

Two RogueAI applications handle non-text input. The patterns are very different from each other:

Voice dictation is real-time and entirely offline. The end-to-end latency budget is 200 ms; past that, the user notices the lag. Self-hosted whisper.cpp is mandatory; the network alone consumes too much of the budget. The Electron tray app pattern (hotkey to start dictation) works because the local model can warm up while the user is mid-thought.
Meeting copilots are post-hoc. The latency budget is minutes rather than milliseconds. This is where hosted LLMs earn their cost — quality matters more than speed.

Document AI workloads — tax documents, compliance documents, custom batch pipelines — sit in between: offline batch processing where throughput matters more than latency. The general lesson for multi-modal work: pick your latency budget first, then pick the model. Reverse that order and you get beautiful demos that do not deploy.
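"Latency budget first, then the model" can be written down as a selection rule: filter candidates by whether they fit the budget, and only then compare quality. Everything in this sketch is hypothetical, including the `ModelOption` fields, the 150 ms network round trip, and the idea of a single quality score taken from your own evaluation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModelOption:
    name: str
    latency_ms: float  # measured model latency, excluding network
    quality: float     # score from your own evaluation, higher is better
    remote: bool       # True if a network round trip is required


def pick_model(
    options: Iterable[ModelOption],
    budget_ms: float,
    network_round_trip_ms: float = 150.0,  # assumed value, measure your own
) -> Optional[ModelOption]:
    """Return the best-quality model that fits the budget, or None."""

    def total_latency(option: ModelOption) -> float:
        overhead = network_round_trip_ms if option.remote else 0.0
        return option.latency_ms + overhead

    # Step 1: the budget decides what is even eligible.
    fitting = [o for o in options if total_latency(o) <= budget_ms]
    # Step 2: quality only ranks the survivors.
    return max(fitting, key=lambda o: o.quality, default=None)
```

With a 200 ms budget a remote model is ruled out before its quality is ever considered, which matches the reasoning behind running dictation locally.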

UX patterns that worked across all 20 production AI apps

A few UX decisions hold up across every application I have shipped:

Streaming responses by default. Users tolerate three seconds of progressive text far better than three seconds of spinner.
Show retrieved context. When the AI answers a document question, render the source paragraphs underneath. Users trust answers they can verify.
Make AI suggestions opt-in, not opt-out. Auto-completing a tax document is intrusive; offering a “summarise this” button is not.
Always offer the manual path. AI is faster when it works, but every user occasionally needs the deterministic-tool version. Never make the AI path the only path.
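Streaming by default mostly means treating the model output as an iterator and re-rendering as each piece arrives. The sketch below uses a fake token stream so that it runs without any model; `fake_token_stream` and `render_progressively` are hypothetical names, and a real client would yield tokens from its own streaming response.

```python
from typing import Callable, Iterable, Iterator


def fake_token_stream(text: str) -> Iterator[str]:
    """Stand-in for a streaming model response: yields one word at a time."""
    for word in text.split(" "):
        yield word + " "


def render_progressively(tokens: Iterable[str], render: Callable[[str], None]) -> str:
    """Call render() with the text so far after every token; return the full text."""
    shown = ""
    for token in tokens:
        shown += token
        render(shown)  # the UI updates here instead of waiting for the end
    return shown
```

The user sees the first words as soon as the first token arrives, which is why three seconds of streamed text feels shorter than three seconds of spinner.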

These are not AI insights. They are UX insights. Most of this work is UX work that happens to involve LLMs.

Where I drew the line: ship vs polish

I shipped most RogueAI applications at “good enough,” not “polished.” The rule was simple: an application that solves a real problem 80 percent of the time, today, beats one that solves it 95 percent of the time, six months from now.

The applications that aged best are the ones I shipped fastest and iterated on. The ones that aged worst are the ones I tried to polish before launch — by the time they were “ready,” the problem had moved.

So ship at 80 percent. Talk to the people using it. Improve the parts they actually care about. The other 15 percent of polish you imagined was usually for problems no one had.

What I would do differently next time I ship production AI

If I were starting RogueAI from scratch today:

Build the cost-control layer before the first AI call. Every project, day one.
Default to hybrid retrieval on every RAG system. Pure semantic search has bitten me too many times.
Standardise on a single observability layer across all applications. Right now each application has its own logging story; debugging at the portfolio level is harder than it should be.
Invest more in evaluation harnesses before scaling. The applications that are easiest to maintain are the ones with deterministic test cases I can run on every model swap.
Document architectural decisions as I make them. Twenty applications later, the institutional memory problem is real.
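An evaluation harness of the kind described in the list above does not need to be elaborate. The sketch below runs a fixed set of cases against any prompt-to-text function and reports which required substrings are missing. `EvalCase`, `evaluate`, and the substring check are assumptions for illustration; real cases would be drawn from the application's own documents and failure history.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: Tuple[str, ...]  # substrings a correct answer has to include


def evaluate(model: Callable[[str], str], cases: Sequence[EvalCase]) -> Dict[str, object]:
    """Run every case against the model and collect the failures."""
    failures: List[Tuple[str, List[str]]] = []
    for case in cases:
        output = model(case.prompt).lower()
        missing = [s for s in case.must_contain if s.lower() not in output]
        if missing:
            failures.append((case.prompt, missing))
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }
```

Running the same cases before and after a model swap turns "does it still work?" into a diff of two reports.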

All obvious in retrospect. None of it was obvious before I had shipped a dozen applications and watched them age.

The RogueAI portfolio as proof of shipping production AI

I built RogueAI to do real work, not to prove a point. But the portfolio became the proof. For recruiters and clients who want to know what production AI actually looks like, the answer is 20+ real applications running today — not another architecture deck.

If you are scoping production AI in 2026 and want pragmatic input from someone who has shipped a lot of it, that is what I do.

The RogueAI portfolio is live at rogueai.de — case studies for pension, tax document, voice, meeting copilot, fine-tuning, and CRM workflows are inspectable today. For AI engineering engagements, get in touch. Also see FwChange.com for firewall change automation.
