Roughly 80 percent of enterprise AI projects fail, and the pattern is predictable. This page analyzes four named AI implementation failures (Klarna, IBM Watson, Volkswagen Cariad, McDonald's) and shows how engineer-led AI consulting plans around each failure mode.
Each of these AI project failures is public record. Each followed the same arc. Each could have been avoided with different engineering discipline at the start. Together they explain why most enterprise AI consulting engagements never ship to production.
Klarna deployed an OpenAI-powered chatbot to replace roughly 700 customer service agents. The CEO publicly claimed the AI did the work of 700 full-time employees. By mid-2025, customer satisfaction had dropped, repeat contact rates had spiked, and Klarna was rehiring human agents.
They measured the wrong metrics. Volume metrics (tickets resolved per hour, time to first response) looked excellent. Quality metrics (CSAT on disputes, fraud cases, and complex issues) were the actual failure point. The operational numbers masked the quality erosion until customer trust was already damaged.
Operational metrics and quality metrics get baselined separately before anything ships. Both get tracked continuously. Edge cases are tested before production, not discovered in it. A quality metric trending down is a hard stop, even if volume metrics look fine.
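To make that hard stop concrete, here is a minimal sketch of a dual-metric gate in Python. The metric names (csat_disputes, tickets_per_hour), the baseline values, and the tolerances are hypothetical illustrations, not Klarna's actual instrumentation.

```python
from dataclasses import dataclass

@dataclass
class MetricBaseline:
    name: str
    baseline: float
    tolerance: float  # max acceptable drop, as a fraction of baseline

def quality_gate(baselines: list[MetricBaseline], current: dict[str, float]) -> list[str]:
    """Return every quality metric that has slipped below its baseline.

    Any breach is a hard stop, regardless of how volume metrics look.
    """
    breaches = []
    for m in baselines:
        value = current[m.name]
        if value < m.baseline * (1 - m.tolerance):
            breaches.append(f"{m.name}: {value:.2f} vs baseline {m.baseline:.2f}")
    return breaches

# Hypothetical numbers: CSAT on disputes is the quality metric;
# tickets/hour is volume and is deliberately absent from the gate.
quality = [MetricBaseline("csat_disputes", baseline=4.2, tolerance=0.05)]
observed = {"csat_disputes": 3.6, "tickets_per_hour": 140.0}  # volume looks great

if breaches := quality_gate(quality, observed):
    raise SystemExit(f"HARD STOP, quality regression: {breaches}")
```

The shape is the point: volume metrics can still be reported everywhere else, but they never appear in the gate condition, so a great throughput number cannot outvote a sinking quality number.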
MD Anderson Cancer Center partnered with IBM to build an AI oncology advisor using Watson. After five years and $62 million, the project was canceled before it was ever used on a real patient. Internal documents revealed Watson had recommended drugs that could cause fatal hemorrhage for patients with severe bleeding.
Three failures stacked. The AI was trained on hypothetical patient scenarios, not real clinical data. The training data came from one hospital's practice, embedding bias directly into the model. And the system was built on top of unstructured, incomplete medical records it could not reliably parse.
Data comes first. Before any AI work begins, the underlying data is audited for completeness, consistency, and production-reality match. Training uses real data, not synthetic. Scope stays narrow. One validated capability ships before the next one gets built on top.
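As an illustration of what "audited first" can mean in practice, here is a rough first-pass audit sketch using pandas. The column names and the specific checks (completeness, duplication, degenerate columns) are assumptions for the example; a real audit would also verify that the data matches production reality.

```python
import pandas as pd

def audit_training_data(df: pd.DataFrame, required: list[str]) -> dict:
    """Cheap first-pass audit: completeness, duplication, degenerate columns.

    A failing audit blocks model work until the data is fixed.
    """
    present = [c for c in required if c in df.columns]
    return {
        "rows": len(df),
        "missing_required_columns": sorted(set(required) - set(present)),
        "null_fraction_by_column": df[present].isna().mean().round(3).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "constant_columns": [c for c in present if df[c].nunique(dropna=True) <= 1],
    }

# Hypothetical records standing in for real production data.
records = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "diagnosis": ["a", None, None, "b"],
    "treatment": ["x", "x", "x", "x"],
})
print(audit_training_data(records, ["patient_id", "diagnosis", "treatment", "outcome"]))
```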
Volkswagen launched Cariad to build one unified AI-driven operating system for all 12 VW brands. The plan included custom AI, proprietary silicon, and full replacement of legacy platforms. By 2025, it was publicly called automotive's most expensive software failure.
Big-bang scope. Everything was built in parallel, with each piece depending on the next before any of them had been validated in production. When one piece slipped, everything downstream slipped with it. There was no increment small enough to prove the approach before scaling it.
Every engagement begins with a narrow, validated first deliverable. Scope expansion only happens after the first increment is measurably working in production. No "unified platform" engagements. No "build everything, ship later" bets. One capability at a time, proven, then expanded.
McDonald's deployed AI voice bots to drive-thrus across more than 100 locations. Viral videos showed the AI adding 18,000 water cups to orders, misreading menu items, and failing at accents and background noise. The program was terminated in mid-2024.
The AI was tested in ideal conditions and deployed into a messy reality. Background noise, multiple speakers, accents, and mid-order corrections were the actual environment, not edge cases. And when the AI failed, it kept failing instead of escalating to a human or pausing.
Adversarial testing runs before production deployment. Systems are tested against the messy reality they will actually operate in, not the clean demo environment. Graceful degradation is a day-one requirement. When confidence drops below threshold, the system escalates or pauses instead of failing loudly.
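A minimal sketch of that degradation rule, with hypothetical thresholds (a 0.85 confidence floor, a pause after three consecutive failures), might look like this:

```python
from enum import Enum

class Action(Enum):
    ANSWER = "answer"
    ESCALATE = "escalate_to_human"
    PAUSE = "pause_lane"

def route(confidence: float, recent_failures: int,
          answer_floor: float = 0.85, pause_after: int = 3) -> Action:
    """Degrade gracefully instead of failing loudly.

    Below the confidence floor, hand off to a human. Repeated
    failures in a row pause the system rather than letting it flail.
    """
    if recent_failures >= pause_after:
        return Action.PAUSE
    if confidence < answer_floor:
        return Action.ESCALATE
    return Action.ANSWER

assert route(confidence=0.95, recent_failures=0) is Action.ANSWER
assert route(confidence=0.60, recent_failures=1) is Action.ESCALATE
assert route(confidence=0.95, recent_failures=3) is Action.PAUSE
```

The design choice worth noting: pausing outranks escalating, and answering is the fallthrough only when both guards pass, so the system's default under uncertainty is to stop, not to guess.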
Four different industries. Four different use cases. The same handful of mistakes, repeated. These are the AI consultant red flags buyers should watch for.
Volume-based metrics masked quality erosion until customer trust was already gone. Operational throughput is not output quality.
Unstructured, incomplete, or biased data foundations cannot be fixed by more sophisticated AI on top. The foundation decides the ceiling.
Projects that try to replace or build everything at once have no increment small enough to prove the approach before scaling it.
Systems validated only against ideal conditions fail the moment real-world messiness arrives. The edge case is the job, not the exception.
When the AI fails, what happens? In most failed projects, nobody designed the answer. The system fails loudly instead of escalating or pausing.
Strategy without implementation. Roadmaps delivered, handoffs to junior teams, no one on the ground when something breaks in production.
These principles guide every HelixAI engagement, from AI implementation consulting to Agentic Commerce Optimization. They exist because we have read the postmortems and we are not interested in adding our name to them.
Every deliverable passes a measurable quality bar before it moves forward. The bar is defined before the work starts. Nothing ships below threshold, regardless of deadline pressure.
Before any AI work begins, the underlying data is audited for completeness, structure, and real-world accuracy. Weak data foundations cannot be fixed by better models on top.
No big-bang engagements. Every project begins with a narrow, validated first deliverable. Scope expands only after the first increment is measurably working in production.
You work directly with the principal engineer. No handoffs to junior staff. Three active engagements maximum per quarter to keep quality high. When something breaks at 2am, the person who built it is the one fixing it.
If an AI consultant cannot speak to these failure modes directly and specifically, walk away. This test applies to us too.
The same engineering discipline that prevents AI implementation failures is what powers HelixAI's Agentic Commerce Optimization work. ACO is the productized version of what we do for merchants and agencies: dimension-covering product content, six-gate validation, and feed submission to ChatGPT, Perplexity, and Google AI shopping surfaces.
Multi-team AI rollouts, custom multi-agent systems, team training, AI integration across existing systems. Engineer-led engagements. Three per quarter. Starting at $10,000. Learn about consulting.
Productized service for Shopify, WooCommerce, and BigCommerce catalogs. Measurable before-and-after match rates across real agent queries. Founding Partner pricing during early access. Learn about ACO.
Direct answers to the questions buyers ask before hiring an AI consultant.
Three active engagements per quarter. Starting at $10,000. By application only.