The chasm between the demo and production

Why most AI projects look simple in the demo and take 10x longer in real life, and what changes when you are the executive at the wheel.


In February of this year, a Cloudflare engineer made headlines around the world. In one week, with help from Claude and roughly US$ 1,100 in tokens, he reimplemented 94% of the Next.js API from scratch. He called it vinext. Builds 4x faster. Bundles 57% smaller. One engineer. One AI model. Seven days.

Anyone who read the title walked away impressed. Anyone who read the end of the official post found a different story. Cloudflare itself spells it out: the project “is experimental, isn’t even one week old, and has not yet been battle-tested with any meaningful traffic at scale”. And it recommends that, if you’re evaluating it for production, you “proceed with appropriate caution”. Features are missing. Static pre-rendering is missing. The track time that separates an experiment from a product is missing.

The vinext story is genuinely impressive. But it’s also the cleanest case I’ve seen of the pattern that’s dominating the conversation about AI: the part AI accelerates is the easy part. The hard part is still hard. And executives, seeing only the headline, are making decisions that ignore that chasm.

The illusion of simple

Every AI demo lives in a visible layer. In a conversational agent, it’s the smooth chat in the messaging app. In a coding copilot, it’s the smart autocomplete on screen. In a framework rebuilt from scratch, it’s “look, the homepage runs”. That layer is polished, charming and, in 2026, genuinely excellent. It makes the C-level lean forward in their chair.

The problem is that this visible layer is usually between 5% and 10% of the actual project. The other 90% or more lives below the waterline, in things that don’t fit in a demo: integration with legacy systems, scale, observability, predictable cost, security, compliance, rollback, quality policy, on-call team, SLA contract. Boring things. Expensive things. Things AI, today, barely accelerates.

When executives only see the top, they ask the wrong questions. “If it already exists, why is it slow?” “If a vendor delivers for the neighbor, why would it be different for us?” “If one engineer built Next.js in a week, why does my team need six months for a WhatsApp agent?” These questions have answers. But the answer doesn’t fit in the demo, and that’s why no one wants to hear it.

The honest, slightly ironic answer is: because you’re measuring the iceberg by the tip.

The example that runs through this piece: a WhatsApp agent at an SMB and at a large network

I’ve spent the last year talking to a lot of technology executives. In Brazil, with leaders of companies of very different sizes, from education to retail, from healthcare to industry. Abroad, on a recent trip to Silicon Valley, talking to people who run platforms at global scale. In almost every one of those conversations, the conversational agent example comes up at some point. It’s where the chasm between demo and production becomes most visible, so I’ll use that case to illustrate.

Picture two companies. The first is a small school that wants an agent to qualify leads on WhatsApp and book a visit. The second is a network of hundreds of branches that wants the same agent, now serving the entire country, integrated with the central enrollment system, the corporate CRM and the ERP.

For someone watching the screen, the product is the same. The lead sends a message, the bot replies, qualifies and books. Identical demos. The visible layer is practically the same.

The layer underneath scales non-linearly.

| Dimension | SMB | Large network |
| --- | --- | --- |
| Monthly volume | 50 to 500 leads | 50,000 to 500,000 leads |
| WhatsApp provider | Unofficial API | Meta’s official Cloud API or enterprise BSP |
| Cost per conversation | Flat monthly fee | US$ 0.06 to 0.12 per conversation, linear with volume |
| AI cost | US$ 5 to 50 per month | US$ 1,500 to 10,000 per month |
| Time to MVP | 4 to 6 weeks, 1 dev | 8 to 12 months, team of 4 to 8 |
| SLA | “Just make it work” | 99.9% contractual uptime with penalty |
| Compliance | Basic LGPD/GDPR | DPO, DPIA, audit trail, right to erasure |

Not a single line of that table shows up in a 20-minute demo.

The real complexity sits in three places executives rarely see.

First, WhatsApp. Unofficial providers work up to about a thousand conversations per month. At commercial volume, Meta detects the automated pattern and bans the number permanently, taking the whole operation down with it. The enterprise solution is the official WhatsApp Business Platform, with each message template approved by Meta (hours to a week per approval), with linear cost per initiated conversation, with the account under a watched quality policy. Migrating from one to the other isn’t switching an environment variable. It’s redoing the flow, with a window of instability.

Second, integration with internal systems. At an SMB, integration is one REST call to the CRM, two days. At a corporation, integration is the project. SAP ERP with custom SOAP adapters. Java legacy CRM with no current documentation. Enrollment system built in 2008, maintained by a single person who’s leaving in three months. Operations dependent on local Excel sheets. Each of those bridges takes 2 to 8 weeks, with constant blockers waiting on the customer’s internal team. In typical projects, 70% of the technical effort is integration, not AI.

Third, AI cost scales non-linearly when poorly designed. Without prompt caching and without model routing, an operation of 100,000 leads per month blows past US$ 5,000 in AI bills. With the right optimizations (the fixed part of the prompt cached, a cheap model for triage, a big model only for complex cases, a summary of long history), it drops to US$ 1,500. At an SMB, that gain doesn’t pay for the effort. At an enterprise, not doing it is a bad financial decision, and it shows up at the quarterly close, usually with the CFO on the phone.
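The arithmetic behind that drop can be sketched in a few lines. Every number below (per-conversation prices, the share of complex cases, the cache saving) is an illustrative assumption, not a vendor quote; the point is the shape of the calculation, not the exact figures.

```python
# Back-of-the-envelope model of the optimization described above:
# cheap model for triage, big model only for complex cases, and the
# fixed prompt prefix served from cache. All prices and shares are
# assumed for illustration, not vendor quotes.

CHEAP_MODEL_COST = 0.01   # US$ per conversation on the triage model (assumed)
BIG_MODEL_COST = 0.05     # US$ per conversation on the big model (assumed)
COMPLEX_SHARE = 0.20      # fraction of leads escalated to the big model
CACHE_SAVINGS = 0.20      # share of the bill saved by caching the fixed prefix

def monthly_ai_cost(leads: int, optimized: bool = False) -> float:
    """Estimate the monthly AI bill for a lead-qualification agent."""
    if not optimized:
        # Naive design: every conversation hits the big model, nothing cached.
        return leads * BIG_MODEL_COST
    per_lead = (1 - COMPLEX_SHARE) * CHEAP_MODEL_COST + COMPLEX_SHARE * BIG_MODEL_COST
    return leads * per_lead * (1 - CACHE_SAVINGS)

print(f"naive:     US$ {monthly_ai_cost(100_000):,.0f}")
print(f"optimized: US$ {monthly_ai_cost(100_000, optimized=True):,.0f}")
```

With these assumed numbers, 100,000 leads go from roughly US$ 5,000 to roughly US$ 1,400 per month. Your own traffic mix and provider pricing will move the figures, but the structure of the saving (routing plus caching) is what an executive should ask to see in the budget.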

The most expensive intuition I keep hearing in technology committees, across different tables, is the same: if a vendor delivers this for a small company, doing it for a large network should be similar. It isn’t. It’s typically a project 5 to 10x longer in timeline and 10 to 20x higher in cost, with a completely different risk profile. And that difference isn’t in the chat the lead sees. It’s in everything that holds the chat up.

Why the confusion is structural, not naivety

I want to be clear: the demo is genuinely impressive. It’s not theater, it’s not a smokescreen. It’s the state of the art of the visible layer, and whoever built it did good work. The problem isn’t the demo. The problem is what we project from it.

AI has evolved faster than the tooling around it. Today we have spectacular models, but we’re still crawling on a few critical points.

Tests for non-deterministic systems. A prompt that worked yesterday may degrade tomorrow without anyone noticing. Eval frameworks are still young, and almost no product team has the maturity to use them well.

Prompt observability in production. How many conversations went off the rails last week? How many were hallucinations? How many ended with an angry customer? Honest answer for 9 out of 10 operations: nobody really knows.

Data security in RAG. A sensitive document indexed into a vector store can leak through a well-crafted prompt injection. The controls for this are still manual and depend heavily on the implementer.

Predictable cost at scale. Most projects discover the real cost only after the first monthly close, because modeling token consumption in real scenarios is hard, and providers change pricing without notice.

Behavior rollback. You changed the prompt and quality dropped? Can you go back to the exact previous state, with the same result, in production? Almost no one can.
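One way to make that rollback possible is to treat every prompt change as an immutable, versioned release instead of an edit in place. A minimal sketch of the idea, with hypothetical names (`PromptRegistry`, `publish`, `activate`) and no real provider API:

```python
# Minimal sketch of prompt versioning: every change becomes an immutable,
# content-addressed release, so "roll back" means repointing production at
# a previous version hash. Names and structure are illustrative.
import hashlib
import json

class PromptRegistry:
    def __init__(self):
        self._versions = {}   # version hash -> prompt payload
        self._active = None   # version currently serving production

    def publish(self, template: str, model: str, temperature: float) -> str:
        # Pin everything that shapes behavior: template, model, parameters.
        payload = {"template": template, "model": model, "temperature": temperature}
        version = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()[:12]
        self._versions[version] = payload
        return version

    def activate(self, version: str) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown prompt version {version}")
        self._active = version

    def active(self) -> dict:
        return self._versions[self._active]

registry = PromptRegistry()
v1 = registry.publish("You qualify leads for {school}.", "small-model", 0.2)
v2 = registry.publish("You qualify and book visits for {school}.", "small-model", 0.2)
registry.activate(v2)
# Quality dropped after the change? Repoint production at the previous release:
registry.activate(v1)
```

Note what gets pinned: not just the template text, but the model name and its parameters, because “the exact previous state” includes all of them. If the provider deprecates the pinned model underneath you, no registry saves you, which is exactly the kind of pothole described above.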

We’re in the middle of the journey. The tools are coming: more robust eval frameworks, semantic tracing, prompt governance, sandboxing, stable prompt cache standards. But they aren’t yet mature like, say, a well-built CI/CD pipeline for traditional services. Anyone using only the state of the art of the model without the rest of the stack is building on soft ground, and the headline only tells you about the part of the ground that looks firm.

Back to vinext: what the headline doesn’t tell you

The Cloudflare case is a concentrated version of the same trap.

Reimplementing 94% of the API of a dominant framework in a week, with US$ 1,100 in tokens, is a real milestone. It shows the cycle of “explore architecture, generate code, run tests, iterate” got an order of magnitude faster than it was in 2023. That is true and relevant.

But vinext, today, doesn’t run everything that Next.js runs. It doesn’t have static pre-rendering. It hasn’t been tested at real traffic scale. Cloudflare itself says, on the same page, that if you’re considering production, you should “proceed with appropriate caution”. And that’s what Cloudflare got right: owning the experimental status publicly.

The road from “the homepage runs” to “it runs production around the world” is the actual work. That road didn’t take a week. It will probably take many months, maybe years, and it will require everything AI doesn’t accelerate: covering the 6% that’s missing (always the gnarliest), finding and fixing hundreds of regressions in edge cases, maintaining compatibility with third-party plugins, supporting production at thousands of different customers, building community, documenting, providing support.

Same trap as the conversational agent: the part AI accelerates is the easy part. The hard part is still hard.

Driving an F1 on a track you don’t know

The image I keep using to describe this feeling is driving a Formula 1 car on a new track that’s still full of potholes.

As a technology executive in 2026, the car is absurd. Acceleration that was impossible two years ago. A small team today delivers what a big team delivered before. Demos in hours. Prototypes in days. Whole systems in weeks. The feel on the throttle is one of superpower, and that’s not exaggeration, it’s description.

The track, however, is less familiar. The emerging playbooks I mentioned above are evolving fast, but none of them yet has the maturity that traditional software engineering accumulated over decades. In practice, every project becomes an exercise in adaptation: tuning the test for a system that isn’t deterministic, stitching observability between prompts and logs, setting the spending limit before token consumption blows past budget, deciding how to roll back when what changed was a prompt, writing the incident protocol for the moment the model hallucinates. You can do it well, and it gets a little easier each quarter. But it still demands active awareness at every decision, because the off-the-shelf shortcuts from traditional engineering haven’t been replicated here yet.

And there are still potholes. Regulation is still in motion (LGPD, AI Act, sector-specific rules). Vendors changing prices from one week to the next. Models being deprecated before you finish the integration. Prompt cache standards changing. A context window that was impossible yesterday becoming the default today, and rearranging your architecture.

Driving slow wastes the car. You let competitors pass and miss the real competitive window AI opened.

Driving fast without reading the track is a crash. Data leaks, WhatsApp operation banned, AI bill exploding, LGPD fine, lawsuit, headline in the press.

The job of a technology executive, today, is to read the track while accelerating. It isn’t choosing between speed and safety. It’s holding both at the same time, with a cool head, knowing the tools that would make this comfortable are still being built.

What a competent executive does in this moment

A few things I’ve been applying and have seen work.

Question the first intuition. “Looks simple” became a red flag, not comfort. Every time a project looks simple, I ask: what’s below the waterline? If no one can answer, it’s because no one looked.

Separate visible from invisible in every budget. A pretty demo says nothing about infrastructure, integration, governance. I explicitly ask, of the team and of vendors: how much will observability cost? How much will the on-call cost? How much will integration with each internal system cost? How much will it cost to keep this running when the model is deprecated? How much will the audit cost? Those numbers either show up or they don’t. If they don’t, the budget isn’t ready.

Ask for the numbers on real volume, on the systems that connect, on the SLA committed, on the cost at scale. Anyone answering “it depends” without ever having done it is bluffing. Anyone answering with concrete ranges, showing prior experience at that scale, is selling the actual work.

Reserve budget for the invisible part. In an enterprise conversational agent, observability, compliance and integration cost more than AI itself. Accepting this at the approval phase is what separates the project that delivers from the project that becomes a nightmare six months in.

Accept that doing it right takes time. It’s not inefficiency, it’s the price of not crashing. An enterprise project of 8 to 12 months isn’t three SMB projects stitched together. It’s a different project, with different complexity and different risk. I’ve written about this from another angle, looking at the risk of confusing individual speed with real productivity.

Distinguish experiment from product. Cloudflare did the right thing by calling vinext experimental. Many AI vendors don’t make that distinction. They sell the experiment as the product, and when the first serious problem hits production, the customer is alone on the track.

The question that separates who delivers from who presents

The right question, in a technology committee today, isn’t “can they build an AI agent?” or “can they reimplement this in a week?”. The visible layer got cheap, and almost everyone can build some version of it.

The right question is “can they operate this at scale, under SLA, integrated with 12 systems, in compliance with 4 regulations, with continuous governance, with clear rollback, with predictable cost, with an incident plan, with a continuity plan?”. That second question filters out 95% of the vendors showing up in this market, and it should filter a good chunk of internal decisions too.

We’re living through the best moment in history to build software. The window is real, and closing it through excess caution would waste a generational advantage. But the work isn’t showing the demo. The work is shipping production.

Accelerating is easy. Accelerating safely, knowing where the potholes are, is the work.


If this topic interests you, I’d love to exchange ideas. Find me on LinkedIn.
