How to Evaluate an AI Agent Platform Before You Commit to One

88% of AI agent pilots fail to reach production. Here's the structured evaluation framework CTOs need before choosing an AI agent platform.

June 22, 2026 · ~9 min read · Auxot Team

A Digital Applied analysis of 120+ enterprise AI agent deployments found that 88% of pilots never reach production. The platform wasn’t always the reason — but it was frequently a contributing factor. Teams ran evaluations that prioritized demo polish over structural fit, then discovered — after contracts were signed and workflows built — that they were locked into a vendor whose data handling, pricing model, or governance tooling didn’t actually fit how they work.

This post is a framework for running a real evaluation before you commit: the questions to ask, the answers that actually differentiate platforms, and the failure modes that only surface after you’re six months in.

Why Evaluations Fail

The standard enterprise evaluation process looks like this: watch a demo, run a feature checklist, do a two-week trial, pick the option that looked cleanest. That process is optimized for the vendor’s benefit, not yours.

The questions that matter — where does your data actually live, what happens when you need to switch models, what do your audit logs look like in a compliance review, what’s the realistic cost at scale — rarely surface in vendor conversations. Vendors know how to make their platform look good on the dimensions they control. The structural questions require you to push past the demo.

The market is also crowded in a way that makes comparison harder. A Vellum analysis published this month listed 13 enterprise AI agent platforms. A separate piece from Totalum ranked 12 tools for production builders. Every one of them claims to be enterprise-grade and production-ready. Most of them are optimized for one specific context — a GCP shop, a developer-led team, a no-code buyer — and wrong for everyone else.

The framework below gives you a structured way to filter fast.

The Six Questions That Matter

1. Where does your data actually live?

This is the first question, not a secondary one. When you send a query to an AI agent platform, that query passes through a routing and governance layer before it reaches the model. The question is: whose infrastructure is that layer running on?

Most platforms are SaaS-first: your queries, context, agent configurations, and conversation history live on the vendor’s servers. For a SaaS startup building an internal chatbot, that’s often fine. For a healthcare organization, a law firm, a financial services company, or any team handling data under a confidentiality agreement, it’s a structural blocker — and vendors will spend your entire demo avoiding saying so directly.

The model call itself is a separate question. With a hosted model such as Claude or GPT-4, the inference call goes to the model provider; with a local model, it stays on your own hardware. Either way, the governance layer (routing, logging, agent config, context storage, conversation history) is distinct from the model call, and it needs to live somewhere. Ask specifically: “Where does the governance layer run? Can we self-host it?”

If the answer is “everything is in our cloud,” verify that answer is compatible with your data handling requirements before proceeding.

2. What are the lock-in vectors?

Vendor lock-in in AI platforms comes in three forms. Most evaluation processes catch one and miss the other two.

Model lock-in: Your agents are hard-wired to a single model provider. If Anthropic raises prices, changes rate limits, or deprecates a model version, your options are limited. A well-designed platform treats model routing as configuration — you specify which model each agent calls, and changing it requires no code changes and no vendor support ticket. Ask vendors directly: “If I want to move my data-processing agents from Claude to GPT-4o while keeping my client-facing agents on Claude Sonnet, show me how that routing configuration works.”
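To make "routing as configuration" concrete, here is a minimal sketch in Python. The routing table, the agent names, and the model labels are all hypothetical and do not reflect any vendor's real schema; the point is only that the agent-to-model mapping lives in data rather than in agent code.

```python
# Hypothetical routing table: agent name -> model label.
# These names are illustrative, not any vendor's real schema.
ROUTING = {
    "client-facing": "claude-sonnet",
    "data-processing": "claude-sonnet",
}
DEFAULT_MODEL = "claude-sonnet"


def resolve_model(agent: str, routing: dict[str, str]) -> str:
    """Return the model an agent should call, falling back to the default."""
    return routing.get(agent, DEFAULT_MODEL)


# Moving one class of agents to another provider is a one-line config change;
# the code that calls resolve_model() stays the same.
updated = {**ROUTING, "data-processing": "gpt-4o"}

print(resolve_model("data-processing", updated))  # gpt-4o
print(resolve_model("client-facing", updated))    # claude-sonnet
```

If a vendor's answer involves editing agent code or filing a ticket rather than changing something that looks like this table, routing is probably not configuration on that platform.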

Runtime lock-in: Your agent definitions, prompt configurations, context files, and workflow logic are stored in a proprietary format. If you need to migrate, you’re rebuilding from scratch. Ask: “Can I export my agent definitions in a portable format? What does migration off your platform look like?”
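One way to picture a "portable format" is a plain JSON export that survives a round trip. The sketch below assumes a made-up agent schema (name, model, system prompt, context files); a real platform will have its own fields, so treat this as an illustration of the property to test for, not as any product's export format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AgentDefinition:
    # Hypothetical fields; a real platform will have its own schema.
    name: str
    model: str
    system_prompt: str
    context_files: list[str]


def export_agents(agents: list[AgentDefinition]) -> str:
    """Serialize agent definitions to plain JSON that any tool can read."""
    return json.dumps([asdict(a) for a in agents], indent=2, sort_keys=True)


def import_agents(payload: str) -> list[AgentDefinition]:
    """Rebuild agent definitions from the exported JSON."""
    return [AgentDefinition(**item) for item in json.loads(payload)]


agents = [
    AgentDefinition(
        name="contract-review",
        model="claude-sonnet",
        system_prompt="Summarize the key obligations in this contract.",
        context_files=["playbook.md"],
    )
]

# A portable export should survive a round trip without losing anything.
restored = import_agents(export_agents(agents))
print(restored == agents)  # True
```

During a trial, the equivalent check is to export your agents, read the file in a text editor, and confirm that nothing you configured is missing or encoded in a form only the vendor can interpret.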

Infrastructure lock-in: The platform runs only on the vendor’s cloud. There’s no self-hosted option. This one has compounding consequences: it limits your data residency options, gives the vendor full pricing leverage at renewal, and means your audit surface extends to their security posture, not just your own.

None of these lock-in vectors is automatically disqualifying, but each one should be an explicit part of your decision, not something you discover afterward.

3. What does governance actually look like?

Every platform claims governance features. The specifics matter far more than the label.

Logging: Are all model calls logged? Are prompts and completions stored? Where, for how long, and who can access them? Ask the vendor to walk you through how you’d pull the logs for a specific agent conversation from 60 days ago during a compliance audit. Not a hypothetical answer — an actual demonstration of the UI or API endpoint you’d use. If they can’t show you that in the demo, the logging isn’t production-ready.
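The 60-day question is really a question about retention. As a rough sketch, assuming a hypothetical audit record with a conversation ID and a timestamp, the lookup you are asking the vendor to demonstrate behaves something like this:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class LogRecord:
    # Hypothetical audit record; field names are assumptions.
    conversation_id: str
    agent: str
    timestamp: datetime
    prompt: str
    completion: str


def find_conversation(records, conversation_id, retention_days, now):
    """Return a conversation's records, oldest first, if still within retention."""
    cutoff = now - timedelta(days=retention_days)
    matches = [
        r for r in records
        if r.conversation_id == conversation_id and r.timestamp >= cutoff
    ]
    return sorted(matches, key=lambda r: r.timestamp)


now = datetime(2026, 6, 22, tzinfo=timezone.utc)
records = [
    LogRecord("conv-1", "intake", now - timedelta(days=60), "q1", "a1"),
    LogRecord("conv-2", "intake", now - timedelta(days=3), "q2", "a2"),
]

# With 90-day retention the 60-day-old conversation is still retrievable.
print(len(find_conversation(records, "conv-1", retention_days=90, now=now)))  # 1
# With 30-day retention it is already gone.
print(len(find_conversation(records, "conv-1", retention_days=30, now=now)))  # 0
```

The retention window is the parameter to pin down in writing: a platform can log everything and still fail an audit if the records expire before the auditor asks for them.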

Access control: Can you restrict which agents a given user or team can run? Can you set per-team model allowlists — so your customer-facing agents can’t call the most expensive model unless explicitly configured? Can you revoke access at the agent level, not just the account level?
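A per-team model allowlist can be sketched in a few lines. The team names, model labels, and the ModelNotAllowed exception below are all invented for illustration; what matters is the deny-by-default behavior, where a team with no entry can call nothing.

```python
# Hypothetical policy: which models each team may call.
ALLOWLISTS = {
    "support": {"claude-haiku", "claude-sonnet"},
    "research": {"claude-sonnet", "gpt-4o"},
}


class ModelNotAllowed(Exception):
    """Raised when a team requests a model outside its allowlist."""


def authorize(team: str, model: str) -> None:
    """Deny by default: unknown teams have an empty allowlist."""
    allowed = ALLOWLISTS.get(team, set())
    if model not in allowed:
        raise ModelNotAllowed(f"team {team!r} may not call {model!r}")


authorize("research", "gpt-4o")  # passes silently

try:
    authorize("support", "gpt-4o")
except ModelNotAllowed as exc:
    print(exc)  # team 'support' may not call 'gpt-4o'
```

When you test a platform, try the failing case yourself: request a model the team should not have and confirm the call is refused rather than logged and allowed.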

Cost visibility: Do you have per-agent, per-user cost breakdowns? Can you set spend limits that halt agent execution, not just generate an alert, when a threshold is hit? The runaway agent incidents from earlier this month (an agent that bankrupted its operator while scanning a network range; a financial agent compromised in a €0.01 test transaction) share a root cause: no operational circuit breakers.
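The difference between an alert and a circuit breaker is easiest to see in code. This is a minimal sketch with a hypothetical SpendGuard class and a made-up per-call cost, tracked in integer cents to avoid floating-point rounding; a real platform would enforce this in its governance layer rather than in your agent loop.

```python
class BudgetExceeded(Exception):
    """Raised to stop an agent before it spends past its limit."""


class SpendGuard:
    def __init__(self, limit_cents: int):
        self.limit_cents = limit_cents
        self.spent_cents = 0

    def charge(self, cost_cents: int) -> None:
        """Record a call's cost, refusing any call that would breach the limit."""
        if self.spent_cents + cost_cents > self.limit_cents:
            raise BudgetExceeded(
                f"limit of {self.limit_cents} cents would be exceeded "
                f"(already spent {self.spent_cents})"
            )
        self.spent_cents += cost_cents


guard = SpendGuard(limit_cents=100)
calls_made = 0
try:
    for _ in range(1000):      # stands in for a runaway agent loop
        guard.charge(30)       # hypothetical cost per call, in cents
        calls_made += 1
except BudgetExceeded as exc:
    print(f"halted after {calls_made} calls: {exc}")
```

Here the loop stops after three calls because the fourth would exceed the limit. An alert-only system would have let all 1,000 calls through and sent an email about it.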

4. What is the realistic operational burden?

Self-hosted and cloud-hosted platforms carry fundamentally different operational profiles. The right choice depends on your team’s capacity, not a universal best practice.

A SaaS platform eliminates infrastructure management but makes the vendor’s uptime, security practices, and update cadence your problem by proxy. A self-hosted platform gives you full control but puts the update, backup, and scaling responsibility on your team.

The question isn’t “hosted or self-hosted?” — it’s “what does realistic maintenance look like for a team of our size, and do we have capacity for it?”

A self-hosted platform that requires a dedicated DevOps resource to maintain has a real cost. A SaaS platform that requires your security team to review every vendor update has a real cost. Both are legitimate tradeoffs, but both need to be named explicitly in your evaluation.

Ask the vendor: “Walk me through what a typical maintenance week looks like for a team of our size. What breaks, what requires manual intervention, what can we automate?”

5. What does the cost structure look like at scale?

Vendor pricing is optimized to look cheap at demo time and expensive at scale. The patterns to watch for:

  • Per-seat pricing that multiplies whenever a new team member needs access: a platform that looks affordable at 10 users costs ten times as much at 100
  • Credit systems where cost per call is opaque until you’re mid-month and already over budget
  • Inference costs bundled into the platform price, removing your ability to optimize model routing or negotiate directly with model providers
  • SaaS subscription fees that compound on top of the model API costs you’re already paying

The evaluation question: “Walk me through what our monthly invoice looks like when we have 50 agents running across 100 users, each making 200 calls per day.” Get a specific number. If they can’t give it, the pricing model is designed to obscure cost until after you’re committed.
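It helps to run the arithmetic yourself before the vendor does. The sketch below reads the scenario as 100 users each making 200 calls per day over a 30-day month, and it plugs in purely hypothetical prices (a $30 seat fee and $0.01 per call). The 50 agents do not change the total unless the vendor also charges per agent. Substitute the vendor's real numbers and your own reading of the scenario.

```python
def monthly_invoice_cents(users, calls_per_user_per_day, days,
                          seat_price_cents, cost_per_call_cents):
    """Seat fees plus usage fees for one month, in integer cents."""
    seats = users * seat_price_cents
    calls = users * calls_per_user_per_day * days
    usage = calls * cost_per_call_cents
    return seats + usage


# All prices below are hypothetical placeholders, not real vendor pricing.
total = monthly_invoice_cents(
    users=100,
    calls_per_user_per_day=200,
    days=30,
    seat_price_cents=3000,     # assumed $30 per seat per month
    cost_per_call_cents=1,     # assumed $0.01 per call
)

print(f"${total / 100:,.2f}")  # $9,000.00
```

Under these assumptions, 600,000 calls account for $6,000 of the $9,000 total, so usage fees cost twice as much as seats. That ratio is why a quote that only mentions the seat price tells you very little.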

6. What happens when things go wrong?

Most platforms look fine on day one. The failure modes show up when something breaks — an agent misfires on a production task, a compliance audit surfaces a data question, a key employee leaves, a vendor changes pricing or announces a shutdown.

On failures: How do you roll back an agent that’s behaving incorrectly? Is there version history on agent configurations? What does incident response look like if there’s a data exposure on the vendor’s side?
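Version history and rollback can be pictured as an append-only list of configurations. The AgentConfigHistory class below is a hypothetical sketch, not any platform's API; it shows one reasonable design, where a rollback adds a new version instead of deleting the bad one, so the record of what went wrong is preserved.

```python
class AgentConfigHistory:
    """Append-only history of an agent's configuration versions."""

    def __init__(self, initial: dict):
        self._versions = [dict(initial)]

    @property
    def current(self) -> dict:
        return dict(self._versions[-1])

    def update(self, **changes) -> int:
        """Store a new version and return its index."""
        self._versions.append({**self._versions[-1], **changes})
        return len(self._versions) - 1

    def rollback(self, version: int) -> int:
        """Re-apply an earlier version as a new entry, keeping the audit trail."""
        if not 0 <= version < len(self._versions):
            raise IndexError(f"no such version: {version}")
        self._versions.append(dict(self._versions[version]))
        return len(self._versions) - 1


history = AgentConfigHistory({"model": "claude-sonnet", "temperature": 0.2})
history.update(temperature=0.9)   # version 1: the change that misbehaves
history.rollback(0)               # version 2: a copy of version 0

print(history.current)  # {'model': 'claude-sonnet', 'temperature': 0.2}
```

Ask the vendor whether their rollback works this way or overwrites history, because the second option leaves nothing to show an auditor after an incident.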

On exits: How do you migrate off the platform? What’s the data export format? Are there contractual lock-in clauses — minimum terms, data deletion timelines, portability limitations?

The Fable AI post-mortem from last week (a well-regarded AI startup that shut down, taking its customers’ work with it) is a useful reminder that exit planning is part of platform evaluation, not a worst-case scenario you can ignore.

The Questions That Cut Through Demos

Vendor demos are optimized to avoid awkward moments. These specific questions surface the real answers:

“Show me the audit log for a specific agent conversation from last week. Walk me through exactly where that data is stored and how I’d pull it for a compliance review.” Watching the vendor navigate this in real time tells you more than any feature list.

“We want to route our data-processing agents to GPT-4o and keep our client-facing agents on Claude Sonnet. Show me exactly how that routing configuration looks.” If they need to schedule a follow-up call to answer this, model flexibility is not real.

“If we stop paying tomorrow, what data do we get back and in what format?” The answer tells you everything about how the vendor thinks about portability.

“We’re going to send you a list of our model call costs from the last 30 days. Show us what those costs would look like on your platform.” Forces a concrete cost comparison on your actual usage, not their benchmark scenario.

“Our legal team needs to sign off on data handling. Send us your data processing agreement and tell us where our queries are stored and for how long.” If they can’t produce a DPA quickly, data handling is not an engineering priority for them.

What This Framework Eliminates

Running through these six questions systematically will eliminate most platforms quickly. SaaS-only platforms fail the data residency question for compliance-constrained buyers. Platforms without model flexibility fail the lock-in question. Platforms with opaque logging fail the governance question. That’s fine — the market is crowded, and most options are wrong for most buyers.

The goal isn’t a comprehensive comparison of every platform on every dimension. It’s a structured way to stop doing evaluation theater and start asking the questions that determine whether a platform will work for you in production — not just in a demo.


If you’re in evaluation mode, Auxot is designed to answer every question in this framework directly: self-hosted governance layer, configurable model routing, full audit logs, access control, and operational overhead built for small technical teams. You can deploy it and run the evaluation yourself — no sales call required. Get started at /install or work through the setup tutorials at /tutorials.