
Shipping Production AI: 20 Hard Lessons from Building RogueAI

Most AI demos die on the way to production. The model that wowed in a Loom video stops making sense when a real user mistypes a query, when latency spikes at 8am Monday, when a hosted LLM rate-limits you mid-month, when the bill arrives.

I have spent the last two years shipping production AI rather than demoing it. RogueAI.de is my personal AI lab: 20+ systems built end-to-end on Hetzner Germany, covering RAG pipelines, AI agents, LoRA fine-tuning, document AI, a meeting copilot, and offline voice dictation. No client work, no commercial product. Just real systems running in production today.

This is what shipping production AI actually taught me — about infrastructure, about cost, about UX, about the gap between a working prototype and a working product.

Why I built 20+ production AI apps instead of polishing one

The conventional advice for shipping production AI is “build one thing well.” For two years I did the opposite, because a single application teaches you a single domain. Twenty-plus applications teach you the patterns that hold across domains, and those patterns are the actual reusable assets.

A pension calculator is structured numerical data with regulatory constraints — Monte Carlo simulations across thousands of market paths. A tax document AI is OCR plus extraction plus rule-based reconciliation. A meeting copilot ingests audio transcripts and reasons about them post-hoc. A voice dictation tool handles real-time speech with strict latency budgets. CompliBot — 38 endpoints — handles thousands of documents across compliance workflows. Each one is a different shape of AI engineering problem.

The data shapes are different. The user expectations are different. The latency tolerances are different. What is the same — what stays the same as you build the twentieth one — is the architecture. Shipping production AI well comes down to a small number of decisions you make once and reuse forever.

The infrastructure decision behind shipping production AI: Hetzner Germany

Every RogueAI system runs on Hetzner Germany. Not AWS, not Azure, not GCP. The reason is not ideology — it is unit economics and EU data residency.

For shipping production AI, the typical SaaS architecture (cloud LLM + cloud vector store + cloud everything) burns money fast. A RogueAI application running on a Hetzner VPS costs less per month than the equivalent AWS configuration costs per day. Across a portfolio of 20+ applications, that gap decides whether the lab stays an affordable hobby or becomes unsustainable.

The architecture is consistent across the portfolio: Docker containers in isolated networks, Caddy as reverse proxy, automated health checks with automatic recovery, locally-deployed models via Ollama or whisper.cpp where latency demands it, no US cloud APIs anywhere in the data path. (For more on the German Mittelstand audit context this serves, see Cybersecurity Consulting Germany.)

There are tradeoffs. Hetzner does not have managed Kubernetes. There is no managed embedding service. You provision Postgres yourself. For shipping production AI at small-to-medium scale, those constraints are features — they force architectural simplicity.

When hosted LLMs help shipping production AI, and when they do not

The biggest cost variable in shipping production AI is the LLM call. Ten dollars per million tokens sounds cheap until you discover a single user can burn through millions of tokens in an afternoon.
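To make that arithmetic concrete, here is a small sketch. The ten-dollar rate comes from the paragraph above; the three million tokens and the 22 working days are hypothetical figures chosen only for illustration, and `llm_cost_usd` is a made-up helper name.

```python
def llm_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of a given number of tokens at a flat per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    # Hypothetical heavy user: three million tokens in one afternoon.
    afternoon = llm_cost_usd(3_000_000, usd_per_million=10.0)
    # The same afternoon repeated on 22 working days (also hypothetical).
    month = afternoon * 22
    print(f"one afternoon: ${afternoon:.2f}, one month: ${month:.2f}")
```

One user at that pace already costs more than most flat subscriptions would bring in, which is why the split below matters.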

I split RogueAI applications into three classes:

  • Hosted LLM (Anthropic, OpenAI): for applications where users tolerate higher latency and the value per call is high. Document AI workloads — tax documents, complex contracts — fit here. Quality matters more than cost.
  • Self-hosted Ollama: for applications where the volume is high and the per-call complexity is low. Voice dictation runs entirely on local models because the user expects sub-second response and the network round-trip itself would break the UX.
  • Hybrid: for applications where the first 90 percent of the work can be done with a small self-hosted model and only edge cases need a hosted LLM. Meeting copilots fit here — local model handles the bulk of summarisation, hosted LLM handles requests where quality matters most.

The architectural lesson for shipping production AI: do not commit to a single LLM provider. Build the application so the LLM is a swappable component. I have moved RogueAI applications between hosted and self-hosted multiple times, and the cost of doing so depends entirely on whether the abstraction was right from the start.
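One way to keep the LLM swappable is to have application code depend on a tiny interface rather than on a vendor SDK. The sketch below is an assumption about what such an abstraction could look like, not the RogueAI code: `LLMClient`, `EchoClient`, `HybridClient`, and the length-based routing rule are all hypothetical names and choices, and the stand-in backend only echoes its input instead of calling Ollama or a hosted API.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class LLMClient(Protocol):
    """Anything that can turn a prompt into text."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoClient:
    """Stand-in backend for illustration; a real one would wrap a model call."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


@dataclass
class HybridClient:
    """Route to the local model unless `needs_hosted` flags the prompt."""

    local: LLMClient
    hosted: LLMClient
    needs_hosted: Callable[[str], bool]

    def complete(self, prompt: str) -> str:
        backend = self.hosted if self.needs_hosted(prompt) else self.local
        return backend.complete(prompt)


def summarise(llm: LLMClient, text: str) -> str:
    """Application code sees only the interface, never a provider SDK."""
    return llm.complete(f"Summarise: {text}")


if __name__ == "__main__":
    router = HybridClient(
        local=EchoClient("local"),
        hosted=EchoClient("hosted"),
        needs_hosted=lambda prompt: len(prompt) > 2000,  # hypothetical rule
    )
    print(summarise(router, "short meeting notes"))
```

Because `HybridClient` satisfies the same interface as a single backend, moving an application between hosted, self-hosted, and hybrid becomes a change at the point where the client is constructed.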

Cost management is a production AI feature, not a chore

The first time I shipped a RogueAI application without per-user token limits, a single misbehaving user generated a bill larger than the customer’s notional monthly subscription. Cost runaway is not a bug — it is a missing feature.

Every RogueAI application now has:

  • Per-user token budgets per session, per minute, per day
  • Aggressive caching at the prompt and at the embedding layer
  • Hard ceilings that fail loudly rather than fail expensively
  • Real-time cost dashboards that surface anomalies within minutes
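The budget and hard-ceiling items can be sketched as a small in-memory guard that is consulted before every LLM call. This is a minimal illustration under assumptions, not the production layer: `TokenBudget` and `BudgetExceeded` are hypothetical names, only the per-minute and per-day windows are modelled (a per-session limit would be one more window), and a real deployment would keep the counters in shared storage rather than in process memory.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class BudgetExceeded(RuntimeError):
    """Raised before the LLM call happens, so the failure is loud and free."""


class TokenBudget:
    """Sliding-window token limits per user (per minute and per day)."""

    def __init__(self, per_minute: int, per_day: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Window length in seconds mapped to the token limit for that window.
        self.limits: Dict[float, int] = {60.0: per_minute, 86_400.0: per_day}
        self.clock = clock
        self.usage: Dict[str, Deque[Tuple[float, int]]] = defaultdict(deque)

    def _spent(self, user: str, window: float, now: float) -> int:
        return sum(tokens for ts, tokens in self.usage[user] if now - ts < window)

    def charge(self, user: str, tokens: int) -> None:
        """Record `tokens` for `user`, or raise if any window would overflow."""
        now = self.clock()
        events = self.usage[user]
        longest = max(self.limits)
        # Forget events that are older than the longest window.
        while events and now - events[0][0] >= longest:
            events.popleft()
        for window, limit in self.limits.items():
            if self._spent(user, window, now) + tokens > limit:
                raise BudgetExceeded(
                    f"{user}: {tokens} tokens would exceed {limit} per {int(window)}s"
                )
        events.append((now, tokens))
```

Calling `charge` with an estimated token count before the request, and refusing the request on `BudgetExceeded`, is what turns a runaway bill into an error message.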

This is unglamorous work. It is also what separates shipping production AI from running a science fair project. Every prototype that crosses into production has to grow up on cost discipline before it grows up on anything else.

RAG architecture for shipping production AI: what worked, what did not

Retrieval-augmented generation is the default architecture for shipping production AI on documents. CompliBot — 38 endpoints handling thousands of documents — is the largest RAG system in the RogueAI portfolio, and several other systems use lighter-weight RAG patterns. What I learned doing it across compliance documents, tax forms, meeting transcripts, and CRM records:

  • Chunking matters more than the embedding model. Spending two hours tuning chunk boundaries for a specific document type produces better results than upgrading from a free embedding model to a paid one.
  • Recency beats recall in most domains. Users would rather get a slightly less precise answer from this week’s document than a perfectly precise answer from a document that has been superseded.
  • Citations are the product, not a feature. A RAG application that cannot show its sources is not deployable to a regulated industry. Build citation rendering before you build the answer renderer.
  • Hybrid retrieval (semantic + keyword) wins. Semantic search alone consistently misses exact-match queries that users actually type. Keyword search alone misses paraphrased queries. Together they cover both.

For RAG, the simplest thing that works for shipping production AI is usually the right answer: PostgreSQL with pgvector, local or hosted embeddings, hybrid retrieval, and citation rendering as a first-class component.
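The post does not say how the semantic and keyword results are merged. One common option is reciprocal rank fusion, sketched here as an assumption rather than as the method RogueAI uses. The function takes ranked lists of document ids (for example one from a pgvector similarity query and one from a full-text query) and needs no score normalisation; `k = 60` is the conventional default for this technique, and the document ids are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    semantic = ["a", "b", "c"]  # placeholder ids from a vector search
    keyword = ["c", "a", "d"]   # placeholder ids from a keyword search
    print(reciprocal_rank_fusion([semantic, keyword]))
```

Documents found by only one retriever still appear in the fused list, which is the property that lets the combination cover both exact-match and paraphrased queries.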

LoRA fine-tuning as a production AI tool

A handful of RogueAI applications use LoRA fine-tuning rather than off-the-shelf hosted models. The pattern that worked: train a small specialised LoRA for a narrow task — image generation in a specific brand voice, content classification with a niche taxonomy, structured-output extraction for one document family — rather than expecting a general-purpose model to do it well.

Fine-tuning fits production AI in a specific niche: when the per-call cost of a hosted LLM is too high for the volume, and the task is narrow enough that a small model can match a large one with the right training data. It does not fit everywhere, but where it does fit, it transforms the unit economics.

Multi-modal patterns when shipping production AI

Two RogueAI applications handle non-text input. The patterns are very different from each other:

  • Voice dictation is real-time and entirely offline. The latency budget is 200 ms end-to-end; anything slower and the user notices the lag. Self-hosted whisper.cpp is mandatory because the network round-trip alone would consume too much of that budget. The Electron tray app pattern (a hotkey starts dictation) works because the local model can warm up while the user is mid-thought.
  • Meeting copilots are post-hoc. The latency budget is minutes rather than milliseconds. This is where hosted LLMs earn their cost — quality matters more than speed.

Document AI workloads (tax documents, compliance documents, custom batch pipelines) sit in between — offline batch processing where throughput matters more than latency. The general lesson for shipping production AI in multi-modal contexts: pick your latency budget first, then pick the model. Reversing that order produces beautiful demos that do not deploy.
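"Latency budget first, then model" can be read as a filter followed by a choice: discard every model that cannot meet the budget, then take the best of what remains. The sketch below illustrates that ordering. The `ModelOption` type, the model names, and all latency and quality numbers are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str
    p95_latency_ms: float  # measured end-to-end, including any network hop
    quality: float         # any task-specific score where higher is better


def pick_model(options: Sequence[ModelOption], budget_ms: float) -> Optional[ModelOption]:
    """Filter by latency budget first, then choose the highest-quality survivor."""
    within_budget = [o for o in options if o.p95_latency_ms <= budget_ms]
    return max(within_budget, key=lambda o: o.quality, default=None)


if __name__ == "__main__":
    # Hypothetical candidates and numbers, for illustration only.
    candidates = [
        ModelOption("local-small", p95_latency_ms=120, quality=0.6),
        ModelOption("hosted-large", p95_latency_ms=2500, quality=0.9),
    ]
    print(pick_model(candidates, budget_ms=200))     # real-time dictation
    print(pick_model(candidates, budget_ms=60_000))  # post-hoc summarisation
```

A return value of `None` is the useful case: it says no candidate fits the budget, so the product requirement or the architecture has to change before any model is chosen.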

UX patterns that worked across all 20 production AI apps

A few UX decisions hold up across every application I have shipped:

  • Streaming responses by default. Users tolerate three seconds of progressive text far better than three seconds of spinner.
  • Show retrieved context. When the AI answers a document question, render the source paragraphs underneath. Users trust answers they can verify.
  • Make AI suggestions opt-in, not opt-out. Auto-completing a tax document is intrusive; offering a “summarise this” button is not.
  • Always offer the manual path. AI is faster when it works, but every user occasionally needs the deterministic-tool version. Never make the AI path the only path.

These are not AI insights. They are UX insights. Shipping production AI well is mostly UX work that happens to involve LLMs.

Where I drew the line: ship vs polish

I shipped most RogueAI applications at “good enough” rather than “polished.” The decision rule for shipping production AI was simple: an application that solves a real problem 80 percent of the time, today, is more valuable than an application that solves it 95 percent of the time, six months from now.

The applications that have aged best are the ones I shipped fastest and iterated on. The ones that aged worst are the ones I tried to polish before launch — by the time they were “ready,” the problem had moved.

For shipping production AI, ship at 80 percent. Talk to the people using it. Improve the parts they actually care about. The other 15 percent of polish you imagined was usually for problems no one had.

What I would do differently next time I ship production AI

If I were starting RogueAI from scratch today:

  • Build the cost-control layer before the first AI call. Every project, day one.
  • Default to hybrid retrieval on every RAG system. Pure semantic search has bitten me too many times.
  • Standardise on a single observability layer across all applications. Right now each application has its own logging story; debugging at the portfolio level is harder than it should be.
  • Invest more in evaluation harnesses before scaling. The applications that are easiest to maintain are the ones with deterministic test cases I can run on every model swap.
  • Document architectural decisions as I make them. Twenty applications later, the institutional memory problem is real.
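The evaluation-harness item can be sketched as a list of deterministic cases that run against any callable model, so the same suite can be repeated on every model swap. This is a minimal illustration under assumptions: `EvalCase`, `run_evals`, and the example case are hypothetical, and the checks are plain predicates on the output text rather than graded scores.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # deterministic predicate on the model output


def run_evals(model: Callable[[str], str], cases: Sequence[EvalCase]) -> List[str]:
    """Run every case against `model` and return the names of failing cases."""
    failures: List[str] = []
    for case in cases:
        try:
            passed = case.check(model(case.prompt))
        except Exception:
            # A crash in the model or the check counts as a failed case.
            passed = False
        if not passed:
            failures.append(case.name)
    return failures


if __name__ == "__main__":
    cases = [
        EvalCase("extracts-year", "Which year is on the form? Form date: 2024-03-01",
                 lambda out: "2024" in out),
    ]

    def candidate_model(prompt: str) -> str:  # stand-in for a real model call
        return "The year is 2024."

    print(run_evals(candidate_model, cases))
```

An empty list means the candidate model passes the suite; anything else names the behaviours that regressed.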

Most of these are obvious in retrospect. None of them were obvious before I had shipped a dozen production AI applications and watched them age.

The RogueAI portfolio as proof of shipping production AI

I built RogueAI to do real work, not to prove anything. But the portfolio has become the proof. For recruiters and clients who want to know what shipping production AI actually looks like, the answer is to look at 20+ real applications running today rather than to read another architecture deck.

If you are scoping production AI in 2026 and want pragmatic input from someone who has shipped a lot of it, that is what I do.


The RogueAI portfolio is live at rogueai.de — case studies for pension, tax document, voice, meeting copilot, fine-tuning, and CRM workflows are inspectable today. For AI engineering engagements, get in touch. Also see FwChange.com for firewall change automation.
