We've Shipped AI to Production 14 Times. Here's What We Learned.

Most AI projects fail in the last mile. After 14 production deployments, we've learned where the real problems are—and they're not where most teams look.

David Okonkwo
March 8, 2024

Over the past two years, we've helped 14 companies get AI systems into production. Not demos. Not prototypes. Actual production systems handling real traffic.

Here's what we've learned. Most of it isn't about AI.

The Demo Trap

Every AI project starts with a demo that looks amazing. GPT-4 writes surprisingly good code. Claude analyzes documents better than most humans. Stable Diffusion generates images that would have seemed like magic five years ago.

The problem is the demo isn't the product. A demo needs to work once, on cherry-picked inputs, in front of people who want to be impressed. A product needs to work every time, on whatever garbage users throw at it, for people who will be annoyed when it doesn't.

The gap between those two things is where most AI projects die.

Where Projects Actually Fail

We've seen projects fail at every stage. But the distribution isn't what you'd expect. Very few fail because the model isn't good enough. Most fail because of everything around the model.

About 40% of the failures we've seen happen in data preparation. The model works great on your test data, but your test data doesn't look like real data. Users type weird things. Historical records have inconsistencies nobody documented. Edge cases you didn't think about turn out to be 30% of traffic.

About 30% fail in production operations. The model that runs in 200ms on your laptop takes 3 seconds in production because you didn't think about batching. Costs spiral because nobody set up proper caching. The system goes down on Sunday night and nobody knows how to debug it.
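To make the caching point concrete, here is a minimal sketch of wrapping a model call in a response cache. Everything in it is an assumption for illustration: `CachedModel` and `model_fn` are hypothetical names, the cache is a plain in-memory dict, and normalizing by lowercasing only makes sense when the task is not case-sensitive.

```python
import hashlib


class CachedModel:
    """Wraps a model call with an in-memory cache keyed on the normalized input."""

    def __init__(self, model_fn):
        self.model_fn = model_fn  # hypothetical: any callable taking str, returning str
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text):
        # Collapse whitespace and case so trivially different inputs share one entry.
        # Assumption: the task is not case-sensitive.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def __call__(self, text):
        key = self._key(text)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        result = self.model_fn(text)
        self.cache[key] = result
        return result
```

Tracking `hits` and `misses` matters as much as the cache itself: the hit rate tells you whether caching is actually reducing spend on your real traffic. A production version would likely need eviction and a shared store, which this sketch leaves out.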

About 20% fail in product design. The model is technically impressive, but users don't trust it, or don't understand how to use it, or would rather do the task manually.

Only about 10% fail because the core model doesn't work. And even then, it's usually fixable with better prompts or more training data.

What We Do Differently

We've developed a process that front-loads the hard parts. Before we write any model code, we spend 2-4 weeks on what we call "production scaffolding."

First, we build the data pipeline. Not a one-time data export—an actual pipeline that can ingest, validate, and version data continuously. This forces us to confront data quality issues early, when they're cheap to fix.
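As a rough illustration of the validation step, the sketch below checks each incoming record and keeps the rejects alongside a reason instead of silently dropping them. The schema (`id`, `text`), the length limit, and the names `validate_batch` and `ValidationReport` are all hypothetical; a real pipeline would have its own rules.

```python
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("id", "text")  # hypothetical schema
MAX_TEXT_CHARS = 10_000           # hypothetical limit


@dataclass
class ValidationReport:
    valid: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (record, reason) pairs


def validate_batch(records):
    """Split records into valid ones and rejects, recording why each reject failed."""
    report = ValidationReport()
    for record in records:
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            report.rejected.append((record, f"missing fields: {missing}"))
        elif not isinstance(record["text"], str) or not record["text"].strip():
            report.rejected.append((record, "empty or non-string text"))
        elif len(record["text"]) > MAX_TEXT_CHARS:
            report.rejected.append((record, "text too long"))
        else:
            report.valid.append(record)
    return report
```

Keeping the rejection reasons is the useful part: counting them per batch shows which data quality problems are common enough to fix at the source.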

Second, we build the serving infrastructure with placeholder models. A model that returns random responses, but does so through the actual API, with actual latency monitoring, actual cost tracking, and actual error handling. This sounds wasteful, but it means we catch operational issues months before they become crises.
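A placeholder setup along these lines might look like the sketch below. It is an illustration under stated assumptions, not the infrastructure described above: `PlaceholderModel`, `ServingWrapper`, the canned responses, and the flat `cost_per_call` are all hypothetical, and metrics are held in memory rather than sent to a monitoring system.

```python
import random
import time

CANNED_RESPONSES = ["yes", "no", "unsure"]  # hypothetical placeholder outputs


class PlaceholderModel:
    """Stands in for the real model: same interface, random answers."""

    def predict(self, text):
        return random.choice(CANNED_RESPONSES)


class ServingWrapper:
    """Records latency, cost, and errors around any object with a predict() method."""

    def __init__(self, model, cost_per_call=0.0):
        self.model = model
        self.cost_per_call = cost_per_call  # assumption: a flat cost per request
        self.latencies_ms = []
        self.total_cost = 0.0
        self.errors = 0

    def handle(self, text):
        start = time.perf_counter()
        try:
            output = self.model.predict(text)
            self.total_cost += self.cost_per_call
            return {"ok": True, "output": output}
        except Exception as exc:
            # Return a structured error instead of letting the request crash.
            self.errors += 1
            return {"ok": False, "error": str(exc)}
        finally:
            # Latency is recorded for failed requests too.
            self.latencies_ms.append((time.perf_counter() - start) * 1000)
```

Because the wrapper only depends on `predict()`, the real model can later replace `PlaceholderModel` without touching the monitoring or error handling.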

Third, we build the feedback loop. Users will always find cases where the model is wrong. The question is whether you learn from those cases or whether they just accumulate as complaints. We build explicit mechanisms for collecting failures, prioritizing fixes, and measuring improvement.
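The three mechanisms (collecting failures, prioritizing fixes, measuring improvement) can be sketched very simply. The names `FeedbackLog` and `failure_rate`, and the idea of prioritizing purely by frequency, are assumptions for illustration; real prioritization would usually also weigh severity.

```python
from collections import Counter


class FeedbackLog:
    """Collects reported failures and ranks failure categories by frequency."""

    def __init__(self):
        self.failures = []  # (category, input_text) pairs

    def report_failure(self, category, input_text):
        self.failures.append((category, input_text))

    def priorities(self):
        # Most frequent categories first; ties keep first-seen order.
        return Counter(category for category, _ in self.failures).most_common()


def failure_rate(failures, total_requests):
    """Fraction of requests reported as failures; compare across releases."""
    return failures / total_requests if total_requests else 0.0
```

Comparing `failure_rate` before and after a fix, on comparable traffic, is one simple way to check that a change actually improved things.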

Only then do we start on the actual model work.

The Counterintuitive Part

This approach takes longer upfront. A team doing "pure model development" will show more impressive demos in month one.

But by month six, we're usually ahead. The other team is dealing with production fires, data quality issues, and user complaints that require redesigning things they thought were done. We've already solved those problems.

It's not more work. It's the same work, sequenced differently.

Specific Recommendations

If you're building an AI system, here's our tactical advice:

Spend at least a month on data quality before touching models. Invest in monitoring from day one—you should know immediately when the model's behavior changes. Build escape hatches: ways for users to override the model when it's wrong. Start with the simplest possible model (often: prompt engineering with a foundation model) and add complexity only when you have data proving it's necessary.
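For the monitoring recommendation, one simple way to notice a change in model behavior is to track the rate of some observable signal over a rolling window and compare it to a known baseline. This is a sketch under assumptions: `BehaviorMonitor` is a hypothetical name, the signal (for example, how often the model falls back to "I don't know") is yours to choose, and the window and tolerance values are placeholders.

```python
from collections import deque


class BehaviorMonitor:
    """Flags drift when a rolling rate moves too far from a baseline rate."""

    def __init__(self, baseline_rate, window=100, tolerance=0.10):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # keeps only the latest `window` events

    def record(self, flagged):
        self.recent.append(1 if flagged else 0)

    def current_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def drifted(self):
        # Only judge once the window is full, to avoid alerts on a handful of samples.
        if len(self.recent) < self.recent.maxlen:
            return False
        return abs(self.current_rate() - self.baseline_rate) > self.tolerance
```

A check like `drifted()` would typically feed an alert, so that a shift in behavior is noticed within one window of traffic rather than through user complaints.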

Don't optimize for demo impressiveness. Optimize for how quickly you can learn from production traffic and improve.

The Real Skill

After 14 deployments, we've concluded that successful AI engineering is mostly not about AI. It's about building robust, observable systems that can handle messy real-world inputs.

The teams that succeed are the ones that treat the model as one component in a larger system—not the thing that makes everything else work, but one thing that needs to work alongside everything else.

That's less glamorous than "we built an AI that can do X." But it's what actually ships.
