Pillar guide · 11 min read

Tech Due Diligence on an AI Startup

How to run technical diligence on an AI-first startup. Training data, model moat, inference economics, vendor lock-in, and the questions the seller will dodge.


Written by The Beyond M&A team

Practitioners across Tech DD, integration, and AI-native deal tooling

Last reviewed 20 May 2026


Executive summary

AI startups invert the classic diligence ranking. Code quality matters less; training data lineage, model dependencies, and inference unit economics matter more. This guide is the playbook our advisors use when the target's competitive moat is the model, not the codebase.

  • 01 The model is rarely the moat. Training data, evaluation infrastructure, and inference cost discipline usually are.
  • 02 Vendor lock-in on a single foundation-model provider is the most common Day-1 surprise post-close.
  • 03 Customer data used for training without an explicit licence is the most common repricing event.

Diligence on an AI-first company looks superficially like diligence on any SaaS company: same architecture review, same security pass, same engineering interviews. The differences are quiet but expensive.

What's actually being acquired

Most AI startups, when you decompose them, fall into one of three categories. Wrappers take a foundation model and apply prompt engineering plus a workflow surface. Distillers fine-tune an open-weights model on proprietary data. Builders train their own models from scratch. The diligence priorities are radically different across the three; before any work begins, classify the target accurately.

A wrapper's value is the workflow, the data, and the distribution. A distiller's value is the dataset and the training pipeline. A builder's value is the team, the compute access, and the evaluation harness. Buying a wrapper at builder multiples is the most common overpay in this market.

Training data lineage

For each dataset that touched a deployed model, the seller should produce:

  1. Source — open dataset, scraped corpus, licensed corpus, or customer data.
  2. Licence — the legal basis for the company's use of the data.
  3. Customer consent — for any customer data, the contractual clause that permitted its use in training.
  4. Reproducibility — whether the dataset is versioned and the training run is reproducible from it.

A target that produces this in 48 hours has its house in order. A target that needs two weeks of back-and-forth to produce it almost always has gaps, and those gaps tend to surface in post-close litigation.
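The four items requested above can be tracked as one record per dataset, which makes the gaps easy to list. The following is a minimal sketch, not a prescribed template: the `DatasetLineage` fields, the source labels, and the sample datasets are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical source labels mirroring the four source types in the list above.
VALID_SOURCES = {"open", "scraped", "licensed", "customer"}

@dataclass
class DatasetLineage:
    name: str
    source: str                    # one of VALID_SOURCES
    licence: Optional[str]         # legal basis for use; None if undocumented
    consent_clause: Optional[str]  # contract clause permitting training use
    versioned: bool                # the dataset is versioned
    reproducible: bool             # the training run can be reproduced from it

def lineage_gaps(d: DatasetLineage) -> list[str]:
    """Return the lineage items the seller has not yet evidenced."""
    gaps = []
    if d.source not in VALID_SOURCES:
        gaps.append("source")
    if not d.licence:
        gaps.append("licence")
    # Consent is only required when customer data is involved.
    if d.source == "customer" and not d.consent_clause:
        gaps.append("customer consent")
    if not (d.versioned and d.reproducible):
        gaps.append("reproducibility")
    return gaps

# Illustrative entries only; not taken from any real data room.
datasets = [
    DatasetLineage("support_tickets_v3", "customer", "MSA", None, True, True),
    DatasetLineage("public_corpus", "open", "CC-BY-4.0", None, True, False),
]
report = {d.name: lineage_gaps(d) for d in datasets}
for name, gaps in report.items():
    print(name, "->", gaps or "complete")
```

An empty gap list for every dataset that touched a deployed model corresponds to the "house in order" case; any non-empty list is a follow-up request to the seller.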

Inference economics

Pull the last six months of foundation-model invoices (OpenAI, Anthropic, Google, etc.) and divide each month's total by that month's active-user count to get inference cost per active user. Then ask the engineering team for the per-feature breakdown: which product features generate which share of inference cost. A feature whose share of cost far exceeds its share of usage is where the unit-economics traps sit.
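The arithmetic is simple enough to sketch. The example below assumes the invoices and usage counts have already been exported into plain dictionaries; the function names, feature names, and every figure are made up for illustration.

```python
def cost_per_active_user(invoices: dict[str, float],
                         active_users: dict[str, int]) -> dict[str, float]:
    """Monthly foundation-model spend divided by that month's active users."""
    return {
        month: invoices[month] / active_users[month]
        for month in invoices
        if active_users.get(month)  # skip months with no user count
    }

def feature_mismatch(cost: dict[str, float],
                     usage: dict[str, int]) -> dict[str, float]:
    """Cost share minus usage share per feature.

    A positive value means the feature consumes more of the inference
    bill than its share of usage would suggest.
    """
    total_cost = sum(cost.values())
    total_usage = sum(usage.values())
    return {
        feature: cost[feature] / total_cost - usage.get(feature, 0) / total_usage
        for feature in cost
    }

# Illustrative figures only.
invoices = {"2026-01": 42_000.0, "2026-02": 51_000.0}
active_users = {"2026-01": 6_000, "2026-02": 6_000}
per_user = cost_per_active_user(invoices, active_users)

feature_cost = {"chat": 10_000.0, "summarise": 30_000.0}
feature_usage = {"chat": 75_000, "summarise": 25_000}
mismatch = feature_mismatch(feature_cost, feature_usage)

print(per_user)
print(mismatch)
```

In this made-up data the hypothetical "summarise" feature accounts for 75% of cost but 25% of usage, which is the kind of mismatch worth raising with the engineering team.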

Vendor lock-in

What happens if the seller's primary foundation-model provider raises prices 3×? Tightens a rate limit? Discontinues a model? A healthy AI product can route to a second provider with a feature-flag flip and re-evaluate quality automatically. Most cannot. This is the most common post-close cost surprise.
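As a rough picture of what "route to a second provider with a feature-flag flip" can look like, here is a minimal sketch. It assumes each provider is wrapped behind a plain function from prompt to text; the `Router` class, the `force_fallback` flag, and the stub providers are hypothetical and do not represent any vendor's SDK.

```python
from typing import Callable

# Hypothetical provider interface: prompt in, completion text out.
Provider = Callable[[str], str]

class Router:
    def __init__(self, providers: dict[str, Provider], primary: str, fallback: str):
        self.providers = providers
        self.primary = primary
        self.fallback = fallback
        self.force_fallback = False  # the "feature flag"

    def complete(self, prompt: str) -> tuple[str, str]:
        """Return (provider_name, output), falling back if the primary fails."""
        order = [self.fallback] if self.force_fallback else [self.primary, self.fallback]
        last_error = None
        for name in order:
            try:
                return name, self.providers[name](prompt)
            except Exception as exc:  # e.g. rate limit hit, model discontinued
                last_error = exc
        raise RuntimeError("all providers failed") from last_error

def pass_rate(router: Router, cases: list[tuple[str, str]]) -> float:
    """Share of eval cases whose output contains the expected substring."""
    hits = sum(expected in router.complete(prompt)[1] for prompt, expected in cases)
    return hits / len(cases)

# Stub providers standing in for real API clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("rate limited")

def stub_secondary(prompt: str) -> str:
    return f"answer: {prompt.upper()}"

router = Router({"primary": flaky_primary, "secondary": stub_secondary},
                primary="primary", fallback="secondary")
used, output = router.complete("paris")

router.force_fallback = True  # flip the flag, then re-evaluate quality
score = pass_rate(router, [("paris", "PARIS"), ("rome", "ROME")])
print(used, output, score)
```

In diligence, the useful question is whether the target has anything equivalent: a single seam where the provider can be swapped, plus an evaluation set that runs automatically after the swap.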

Frequently asked

Is open-source model usage a risk?

Sometimes. Llama, Mistral, and Qwen models carry usage clauses that vary by version and by commercial scale. Read the licence for each model in production, not just the model family.

What's the right way to test the model in diligence?

Build a private evaluation harness against the target's own benchmark set, then re-run it on a leading competitor model with the same prompts. The delta tells you whether the moat is the model, the prompts, or neither.
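A stripped-down version of such a harness might look like the sketch below. It assumes both models are callable as plain functions and that exact-match scoring is acceptable for the benchmark; real benchmarks often need fuzzier graders. The benchmark items and stub models are invented for illustration.

```python
from typing import Callable

# Hypothetical model interface: prompt in, answer text out.
Model = Callable[[str], str]

def accuracy(model: Model, benchmark: list[tuple[str, str]]) -> float:
    """Exact-match accuracy over (prompt, expected_answer) pairs."""
    correct = sum(model(prompt).strip() == expected for prompt, expected in benchmark)
    return correct / len(benchmark)

def compare(target: Model, competitor: Model,
            benchmark: list[tuple[str, str]]) -> dict[str, float]:
    """Run the same prompts through both models and report the delta."""
    t = accuracy(target, benchmark)
    c = accuracy(competitor, benchmark)
    return {"target": t, "competitor": c, "delta": t - c}

# Illustrative benchmark and stub models; not real evaluation data.
benchmark = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
    ("largest planet", "Jupiter"),
]
target_answers = {"2+2": "4", "capital of France": "Paris",
                  "3*3": "9", "largest planet": "Jupiter"}
competitor_answers = {"2+2": "4", "capital of France": "Paris",
                      "3*3": "9", "largest planet": "Saturn"}

result = compare(lambda p: target_answers.get(p, ""),
                 lambda p: competitor_answers.get(p, ""),
                 benchmark)
print(result)
```

A delta near zero on the target's own benchmark suggests the model itself is not the differentiator; a large delta that persists under the same prompts points toward the model or its training data.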
