In 2014, the DeVos family invested $100 million in Theranos. Lisa Peterson, who managed their investments, later testified that she never visited any of Theranos’s testing labs, didn’t speak with any Walgreens officials, and didn’t hire outside experts to verify the startup’s claims.

She wasn’t alone. Investor after investor put millions into Theranos without verifying whether the technology actually worked. Dan Mosley asked for audited financials—there were none—and invested $6 million anyway. Bryan Tolbert admitted the Hall Group invested $5 million despite not understanding the technology.

The result: a $9 billion valuation built on claims nobody had independently verified. When the fraud unraveled, those investments were wiped out.

Theranos wasn’t even an AI company. For actual AI investments, due diligence is harder—the technology is complex, performance claims are easy to inflate, and data moats are difficult to verify. But it’s also more important.

Why AI Due Diligence Is Different

Traditional technical due diligence focuses on code quality, architecture, and scalability. AI adds new dimensions:

  • Model performance: Does the AI actually work as claimed, or just on cherry-picked demos?
  • Data moats: Is the data advantage real and defensible, or easily replicated?
  • Technical debt: AI systems accumulate debt differently than traditional software—models drift, pipelines rot, and “temporary” workarounds become permanent.
  • Team depth: AI requires specialized skills that are hard to evaluate without expertise.

Gartner predicts that by 2025, more than 75% of venture capital and early-stage investor executive reviews will be informed by AI and data analytics. But AI tools for due diligence don’t replace domain expertise—they augment it. You still need someone who knows what questions to ask.

The Checklist

1. Validate the Claims

Key Questions:

  • What specific claims are being made about AI capabilities?
  • What metrics support these claims?
  • Can we independently verify performance?

Red Flags:

  • Vague claims (“our AI is highly accurate”) without numbers
  • Performance only demonstrated on curated demos, not production data
  • Cherry-picked examples that don’t represent typical use cases
  • Unwillingness to share evaluation methodology
  • The “Theranos pattern”: impressive presentations but no independent verification

What to Request:

  • Clear performance metrics on held-out test data
  • Methodology for calculating those metrics
  • Comparison to relevant baselines (how much better than simple approaches?)
  • Sample of model outputs for your own manual review

The Harvard Law School Forum noted that Theranos “sold exaggerated and outright false claims” while investors failed to verify basic facts. In AI companies, demo performance almost never matches production performance—always ask for production metrics.
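
If the company will share raw predictions on a held-out set, even a short script makes the baseline comparison concrete. A minimal sketch, assuming you receive a CSV with label and prediction columns (the file and column names here are placeholders):

```python
# Sketch: sanity-check claimed accuracy against a naive baseline on held-out data.
# Assumes a CSV with 'label' and 'prediction' columns -- adjust to the actual export.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("holdout_predictions.csv")
y_true, y_pred = df["label"], df["prediction"]

# Naive baseline: always predict the most frequent class.
majority_class = y_true.mode().iloc[0]
baseline_acc = (y_true == majority_class).mean()

print(f"Reported model accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(f"Model F1 (macro):        {f1_score(y_true, y_pred, average='macro'):.3f}")
print(f"Majority-class baseline: {baseline_acc:.3f}")
```

If the model barely beats the majority-class baseline, the headline accuracy figure says more about class imbalance than about the model.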

2. Assess the Data

Key Questions:

  • What data does the AI use?
  • Where does it come from?
  • Is it legally obtained and usable?
  • How defensible is the data advantage?

Red Flags:

  • Data sourced through questionable means (scraping that violates terms of service, unclear licensing)
  • Small datasets for complex problems (deep learning typically needs millions of examples)
  • No data quality processes
  • Single-source dependency
  • The Theranos parallel: most of the company’s tests were never approved by the FDA, and investors didn’t check.

What to Request:

  • Data inventory and sources
  • Data licensing agreements
  • Data quality metrics and processes
  • Data collection and labeling procedures

The most common AI startup failure mode is running out of quality data to improve. Assess whether they can keep collecting what they need.
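
A few lines of profiling on a sample extract also go a long way toward testing the “data quality processes” answer. A minimal sketch, assuming a labeled CSV sample (file and column names are placeholders):

```python
# Sketch: quick data-quality profile of a training-data sample.
# File and column names are placeholders for whatever extract you receive.
import pandas as pd

df = pd.read_csv("training_sample.csv")

print(f"Rows: {len(df)}, columns: {df.shape[1]}")
print(f"Exact duplicate rows: {df.duplicated().sum()}")
print("Share of missing values per column (worst 10):")
print(df.isna().mean().sort_values(ascending=False).head(10))

# Label balance: a heavily skewed target often explains inflated accuracy claims.
if "label" in df.columns:
    print(df["label"].value_counts(normalize=True))
```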

3. Evaluate the Models

Key Questions:

  • What type of models are used?
  • How were they trained and validated?
  • What are the known limitations?
  • How often are they retrained?

Red Flags:

  • Can’t explain why they chose their model architecture
  • No validation on held-out data (trained and tested on the same examples)
  • No monitoring for model drift
  • Models only exist in Jupyter notebooks, not production code
  • Overfitting to specific test cases
  • “We use GPT-4” as the primary differentiator (that’s not a moat—anyone can call the OpenAI API)

What to Request:

  • Model architecture documentation
  • Training and validation procedures
  • Performance breakdown on different segments/use cases
  • Production monitoring dashboards
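
The drift-monitoring red flag above is easy to spot-check if the company can export a feature’s training distribution and recent production values. A minimal sketch using the Population Stability Index, with synthetic stand-in data:

```python
# Sketch: a basic drift check comparing a feature's training distribution to
# recent production values using the Population Stability Index (PSI).
# The arrays below are synthetic stand-ins; in practice they come from logs.
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between two samples of one feature; values above ~0.2 are
    commonly read as significant drift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], np.min(actual))    # widen edges so nothing falls outside
    edges[-1] = max(edges[-1], np.max(actual))
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)          # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)   # stand-in for training data
prod_feature = rng.normal(0.4, 1.2, 2_000)     # stand-in for recent production data
print(f"PSI: {population_stability_index(train_feature, prod_feature):.3f}")
```

A team with real production monitoring should be able to show you something equivalent on a dashboard, not compute it for the first time during diligence.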

4. Review Code and Architecture

Key Questions:

  • Is the codebase maintainable?
  • Can the team ship updates quickly?
  • What’s the ML infrastructure maturity?

Red Flags:

  • Models only exist in notebooks, never productionized
  • No CI/CD for ML code
  • No model versioning (can’t reproduce previous results)
  • No reproducible training (results vary each time)
  • Single person dependency (“only Alice understands the model”)

What to Request:

  • Code repository access
  • Architecture documentation
  • Deployment procedures
  • Development workflow overview
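
One cheap test of the “reproducible training” claim is to look for seed pinning in the training code and re-run it twice. A minimal sketch of what that pinning looks like, assuming a PyTorch stack (the framework choice is an assumption):

```python
# Sketch: the kind of seed pinning you'd expect in a repo that claims
# reproducible training. PyTorch is assumed; TensorFlow/JAX have equivalents.
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic GPU kernels, accepting some slowdown.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```

If two runs of the training script with the same seed produce materially different metrics, the reproducibility story deserves more scrutiny.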

5. Assess the Team

Key Questions:

  • Does the team have real ML expertise?
  • Can they maintain and improve the system?
  • What’s the bus factor?

Red Flags:

  • Team talks about AI in buzzwords, not specifics (anyone can say “neural network”)
  • No one with production ML experience
  • Key ML contributors recently left
  • Over-reliance on contractors or agencies (fine for building, concerning for maintaining)

What to Assess:

  • Team backgrounds and publications (if applicable)
  • Ability to explain technical decisions and trade-offs
  • Evidence of independent problem-solving (not just following tutorials)
  • Hiring pipeline for ML roles

ICAEW analysis of Theranos noted that “gathering and validating objective data to back up management claims” should have been basic due diligence. In AI companies, ask the engineering team to explain their architecture—if they can’t, that’s a problem.

6. Check Scalability and Operations

Key Questions:

  • Can the system handle 10x more users/data?
  • What’s the operational burden?
  • What’s the cloud cost trajectory?

Red Flags:

  • No load testing for ML endpoints
  • Inference costs not well understood (GPU bills can explode)
  • No plan for scaling data pipelines
  • Manual processes for model updates (fine for MVP, concerning for scale)

What to Request:

  • Architecture diagram with scale considerations
  • Current and projected infrastructure costs
  • Operational runbooks
  • Incident history
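
A back-of-envelope cost model is a quick test of whether the team understands its own inference economics. A sketch with illustrative numbers only; every figure below is an assumption to replace with the company’s actual throughput and pricing:

```python
# Sketch: back-of-envelope GPU inference cost per 1,000 requests.
# Every number here is an illustrative assumption -- replace with real figures.
gpu_hourly_cost = 1.50            # $/hour for one inference GPU (assumed)
requests_per_second_per_gpu = 20  # sustained throughput under load (assumed)
monthly_requests = 50_000_000     # projected traffic (assumed)

requests_per_gpu_hour = requests_per_second_per_gpu * 3600
cost_per_1k_requests = gpu_hourly_cost / requests_per_gpu_hour * 1000
monthly_cost = monthly_requests / 1000 * cost_per_1k_requests

print(f"Cost per 1,000 requests: ${cost_per_1k_requests:.4f}")
print(f"Projected monthly inference cost: ${monthly_cost:,.0f}")
```

If projected growth turns that monthly figure into something that dwarfs gross margin, scalability is a pricing problem as much as an engineering one.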

7. Evaluate IP and Competitive Moat

Key Questions:

  • What’s actually proprietary?
  • How easy would it be to replicate?
  • What’s the sustainable advantage?

Red Flags:

  • Primary differentiator is access to foundation models anyone can use
  • No unique data, methodology, or domain expertise
  • Easy-to-replicate feature engineering
  • No patents, trade secrets, or defensible processes

What to Assess:

  • Unique data assets
  • Proprietary algorithms or training techniques
  • Domain-specific innovations that took time to develop
  • Customer switching costs

Scoring Framework

We rate each area 1-5:

  • 5: Excellent (industry best practices)
  • 4: Good (minor improvements possible)
  • 3: Adequate (no critical issues)
  • 2: Concerning (significant gaps)
  • 1: Critical (fundamental problems)

Minimum Thresholds:

  • No area rated 1 (any critical failure is a red flag)
  • Average 3+ for early-stage companies
  • Average 4+ for later-stage or acquisitions
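
Encoding those thresholds keeps them applied consistently across reports. A minimal sketch; the area names and scores below are examples only:

```python
# Sketch: applying the scoring thresholds consistently across reports.
# Area names and scores below are examples only.
scores = {
    "claims": 4, "data": 3, "models": 3, "code": 2,
    "team": 4, "operations": 3, "moat": 3,
}

average = sum(scores.values()) / len(scores)
critical = [area for area, s in scores.items() if s == 1]

print(f"Average score: {average:.1f}")
if critical:
    print(f"Critical failure in: {', '.join(critical)} -- red flag regardless of average")
elif average >= 4:
    print("Meets the later-stage / acquisition threshold")
elif average >= 3:
    print("Meets the early-stage threshold")
else:
    print("Below minimum threshold")
```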

Common Due Diligence Mistakes

Accepting Demo Performance as Production Performance

Demos are optimized to impress. Research shows that over 80% of AI projects fail, often because reality doesn’t match initial demonstrations. Always ask for production metrics on real user data.

Not Testing Edge Cases

AI systems often fail on edge cases. Request performance breakdowns by segment, input type, and difficulty. Ask what happens with unusual inputs.
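
If you already have the held-out predictions from step 1, the segment breakdown is a short extension of that script (the segment column is hypothetical):

```python
# Sketch: per-segment accuracy breakdown on the held-out predictions from step 1.
# The 'segment' column (e.g., customer tier, input language) is hypothetical.
import pandas as pd

df = pd.read_csv("holdout_predictions.csv")
by_segment = (
    df.assign(correct=df["label"] == df["prediction"])
      .groupby("segment")["correct"]
      .agg(["mean", "count"])
      .rename(columns={"mean": "accuracy", "count": "n"})
      .sort_values("accuracy")
)
print(by_segment)
```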

Underestimating Data Dependencies

As noted in the data assessment step, running out of quality data is the most common AI startup failure mode. Can they keep collecting the training data they need? What happens when a data source changes its terms?

Ignoring the Human Element

Many “AI” systems are heavily human-augmented. Ask what percentage is actually automated versus human-in-the-loop. Neither is bad, but you should know what you’re buying.

Proxy Due Diligence

VC Factory analysis identifies “proxy due diligence”—relying on other investors’ checks—as a primary cause of failures. Just because a famous VC invested doesn’t mean the technology works. Theranos had name-brand investors who all assumed someone else had done the technical verification.

Not Involving ML Experts

Traditional software engineers often miss AI-specific issues. Include someone with production ML experience in the evaluation. The cost of missing a critical issue far exceeds the cost of expert evaluation.

Sample Report Structure

  1. Executive Summary
    • Overall assessment
    • Key strengths
    • Key risks
    • Investment recommendation
  2. Claims Validation
    • Summary of claims made
    • Evidence reviewed
    • Status: Verified / Partially Verified / Unverified / Disputed
  3. Technical Assessment
    • Data quality and defensibility
    • Model performance and methodology
    • Architecture and code quality
    • Operational maturity
  4. Team Assessment
    • Relevant experience
    • Key person dependencies
    • Gaps and hiring needs
  5. Risk Register
    • Identified risks
    • Severity and likelihood
    • Mitigation recommendations
  6. Appendices
    • Detailed findings
    • Supporting evidence
    • Interview notes

When to Get Outside Help

Consider third-party technical due diligence when:

  • The investment is significant (€1M+)
  • AI is core to the value proposition
  • Your team lacks deep ML expertise
  • The target company seems evasive about technical details
  • You’ve seen too many AI demos that didn’t match reality

The lessons from Theranos, Wirecard, and FTX are clear: impressive presentations and name-brand backers don’t replace actual verification. For AI companies, that verification requires technical expertise.


Need help evaluating an AI company? Get in touch for independent technical due diligence.