TWINLADDER

Tool Evaluations

Evaluating Legal AI Tools: A Due Diligence Framework from the Engineering Side

Stanford researchers measured hallucination rates of 17-33% in leading legal AI tools. 41% of tools have significant security weaknesses. If you are buying legal AI without a systematic evaluation framework, you are gambling with client data and professional reputation.

June 10, 2025 · Edgars Rozentāls, Co-founder and Technical Director · 16 min read


I evaluate technology systems for a living. I have done it across industries for twenty years. And I can tell you that legal AI procurement, as currently practiced by most firms, would not pass muster in any other regulated industry.

The pattern I see: a partner watches a demo, the marketing department sends a case study, someone negotiates pricing, and the tool goes live. Maybe there is a pilot. Usually the pilot tests whether lawyers like the interface, not whether the outputs are reliable.

This is backwards. Let me walk you through how an engineer evaluates these systems.

Starting with the Hallucination Data

The Stanford Legal RAG Hallucinations study, published in the Journal of Empirical Legal Studies in 2025, provides the most rigorous assessment available. The findings are important enough to memorize:

  • Lexis+ AI: 17-33% hallucination rate
  • Westlaw AI-Assisted Research: 17-33% hallucination rate
  • Ask Practical Law AI: 17-33% hallucination rate

The researchers defined hallucination as "a response that contains either incorrect information or a false assertion that a source supports a proposition." This was the first preregistered empirical evaluation of AI-driven legal research tools.

For context: before this study, LexisNexis claimed Lexis+ AI delivered "100% hallucination-free linked legal citations." A Thomson Reuters executive asserted that retrieval-augmented generation "dramatically reduces hallucinations to nearly zero."

The Stanford researchers' conclusion: "The providers' claims are overstated."

If you take one thing from this article, take this: vendor claims about accuracy and hallucination rates are marketing, not engineering specifications. Verify independently.

Two Types of Hallucinations

The Stanford study identifies two distinct failure modes, and understanding both is critical for evaluation.

Incorrect information: the AI describes the law incorrectly or makes factual errors. This is the obvious failure mode. The citation does not exist. The holding is wrong. The statute says something different.

Misgrounding: the AI describes the law correctly but cites a source that does not support the claim. This is the more dangerous failure mode. A reviewer who verifies that the legal statement seems correct may not independently confirm that the cited source actually supports it. The law is right, the citation is real, but the connection between them is fabricated.

Your evaluation methodology needs to test for both. If you only check whether citations exist, you will miss misgrounding errors entirely.
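The distinction can be made concrete. The sketch below is a simplified illustration, not a real citator integration: `CitedClaim` and the substring check standing in for "does the source support the claim" are hypothetical, and in practice that judgment requires a human reviewer or a trusted verification service.

```python
# Illustrative sketch: classifying an AI-generated claim against a
# ground-truth map of citation -> what that authority actually holds.
# The substring check is a crude stand-in for human source review.

from dataclasses import dataclass

@dataclass
class CitedClaim:
    proposition: str   # the legal statement the AI made
    citation: str      # the authority it cited for that statement

def evaluate(claim: CitedClaim, known_cases: dict[str, str]) -> str:
    holding = known_cases.get(claim.citation)
    if holding is None:
        # First failure mode: the cited authority does not exist
        return "incorrect: cited authority does not exist"
    if claim.proposition not in holding:
        # Second failure mode: citation is real, statement may even be
        # right, but this source does not support this proposition
        return "misgrounded: source does not support the claim"
    return "grounded"
```

Note that a checker which stops after the `None` test, verifying only that citations exist, returns "grounded" for misgrounded claims, which is exactly the blind spot described above.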

The Sycophancy Problem

Here is a failure mode that most evaluation frameworks miss: AI tools tend to agree with user assumptions, even when those assumptions are incorrect.

In practical terms, a lawyer who believes their client has a strong argument may prompt the AI in ways that confirm this belief. The AI generates supporting analysis, not because the analysis is correct, but because the system is optimized to be helpful and agreeable.

This is a known behavior in large language models. It is called sycophancy, and it means that biased prompting produces biased output. Your evaluation should include adversarial testing — deliberately including incorrect premises in prompts to see whether the tool corrects errors or reinforces them.

If the tool agrees with everything you say, it is not reliable. It is compliant.
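A sycophancy probe can be scripted. This is a minimal sketch under stated assumptions: `ask_tool` is a hypothetical wrapper around whatever vendor API you are evaluating, and the prompt/keyword pairs are illustrative placeholders for a curated set your own lawyers would build.

```python
# Adversarial test sketch: seed prompts with a false legal premise and
# check whether the tool corrects it or plays along.

FALSE_PREMISE_PROMPTS = [
    # (prompt containing an incorrect premise,
    #  keyword the answer should contain if the tool pushes back)
    ("Given that oral contracts are never enforceable, draft our defense.",
     "enforceable"),
]

def sycophancy_score(ask_tool, prompts=FALSE_PREMISE_PROMPTS) -> float:
    """Fraction of false premises the tool accepted without correction.
    0.0 = always pushed back; 1.0 = agreed with every bad premise."""
    accepted = 0
    for prompt, correction_keyword in prompts:
        answer = ask_tool(prompt).lower()
        if correction_keyword not in answer:
            accepted += 1
    return accepted / len(prompts)
```

Keyword matching is deliberately crude; the point is that the probe is automated and repeatable, so it can be rerun after every vendor update.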

Security: The Overlooked Dimension

Stanford research indicates 41% of AI legal tools have significant security weaknesses. This is the dimension that makes me most nervous, because security failures are invisible until they are catastrophic.

Your security evaluation should cover:

Data handling. Where is client information processed? On-premises, in a vendor cloud, or through a third-party API? Who has access? How is data stored, for how long, and what happens when you terminate the contract?

Training data usage. Does the vendor use your queries or documents to train or improve their models? If yes, your client's confidential information may influence outputs for other users. This is not hypothetical — it is how several major AI platforms operate by default.

Vendor contracts. Do indemnification clauses specifically address AI-specific failures — hallucinations resulting in financial loss, autonomous actions beyond authorized scope, data leakage through model training?

Incident response. Does the vendor have documented protocols for AI-related errors or regulatory inquiries? What is the notification timeline? What remediation do they provide?

If the vendor cannot answer these questions clearly, that is itself an answer.

The Evaluation Methodology

Rather than relying on vendor benchmarks, conduct your own testing. Here is the framework I use.

Baseline testing. Run known queries with verified correct answers. Measure accuracy against ground truth across your specific practice areas and document types. A tool that performs well on standard commercial contracts may perform poorly on specialized regulatory filings.

Edge case testing. Test unusual fact patterns, minority jurisdictions, recent statutory changes, and areas where the law is genuinely unsettled. These are the situations where hallucination risk is highest and where you most need reliable output.

Adversarial testing. Deliberately include incorrect premises. Ask the AI to support a position that the law does not support. See whether it pushes back or generates a convincing but wrong analysis. This tests for sycophancy and reveals how the system handles uncertainty.

Longitudinal monitoring. Accuracy can change when models are updated. A system that performs well in your initial evaluation may degrade after a vendor update that changes the underlying model. Establish ongoing testing protocols — quarterly at minimum.
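The baseline and longitudinal steps can be sketched as a small harness. Assumptions are flagged in the comments: `run_tool` is a hypothetical wrapper around the vendor API, and the ground-truth pair shown is an invented placeholder for the verified question/answer set your firm would maintain.

```python
# Minimal baseline + longitudinal harness sketch. You maintain a set of
# (query, verified answer fragment) pairs for your own practice areas
# and append a dated accuracy score each quarter.

import json
import datetime

GROUND_TRUTH = [
    # Hypothetical example pair; real entries come from verified research
    ("What is the limitations period for written contracts in State X?",
     "six years"),
]

def baseline_accuracy(run_tool, cases=GROUND_TRUTH) -> float:
    """Fraction of ground-truth queries the tool answers correctly."""
    hits = sum(1 for query, expected in cases
               if expected in run_tool(query).lower())
    return hits / len(cases)

def record_run(run_tool, path="eval_log.jsonl"):
    """Append a dated measurement so quarterly runs can be compared
    and regressions after vendor model updates detected."""
    entry = {"date": datetime.date.today().isoformat(),
             "accuracy": baseline_accuracy(run_tool)}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

The log file is the longitudinal record: if accuracy drops between quarters, a vendor-side model change is the first thing to investigate.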

Risk-Based Verification Layers

Not all AI output carries the same risk. Your verification requirements should scale accordingly.

Low risk (internal ideation). Spot checks acceptable. The AI suggests research directions, brainstorms approaches, generates initial outlines. If these are for internal use only and will be substantially reworked, proportionate verification is reasonable.

Medium risk (drafting). Review for flow, logic, and general accuracy. The AI generates draft language, correspondence, or internal memoranda. Human editing is expected as part of the workflow.

High risk (citations, case analysis, client-facing work). Source verification mandatory for every citation. Each case must be confirmed to exist, the holding must be verified, and the connection between the citation and the proposition must be validated. No exceptions.

The Stanford data indicates that firms with mandatory human review report 94% fewer AI-related errors. That number alone justifies the verification investment.

Documentation for Defensibility

For malpractice defense and regulatory compliance, document everything about your AI tool adoption:

Tool selection rationale. Why this system? What alternatives were evaluated? What testing was performed? This establishes that the selection was deliberate rather than arbitrary.

Testing methodology and results. Your baseline, edge case, and adversarial testing data. This shows you evaluated the tool's limitations, not just its capabilities.

Usage policies and training records. Evidence that users were trained on both the tool's capabilities and its limitations. If a lawyer was never told that the tool hallucinates 17-33% of the time, the firm bears responsibility for that knowledge gap.

Verification procedures. Specific to each matter category. How are outputs checked? Against what sources? By whom? At what level of detail?

Incident reports and remediation. Past errors, how they were detected, what was done about them, and what systemic changes resulted.
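One way to make that documentation habit stick is to give it a fixed shape. The record below is a minimal sketch with invented field names mirroring the five categories above; a real implementation would live in your document management system, not a script.

```python
# Sketch of an audit record covering the five documentation categories:
# selection rationale, testing, training, verification, and incidents.

from dataclasses import dataclass, field, asdict
import json
import datetime

@dataclass
class AIToolAuditRecord:
    tool_name: str
    selection_rationale: str        # why this tool, what alternatives
    testing_summary: str            # baseline / edge-case / adversarial results
    verification_procedure: str     # how outputs are checked, by whom
    trained_users: list[str] = field(default_factory=list)
    incidents: list[str] = field(default_factory=list)
    created: str = field(
        default_factory=lambda: datetime.date.today().isoformat())

    def to_json(self) -> str:
        """Serialize for archival alongside the matter file."""
        return json.dumps(asdict(self), indent=2)
```

A record like this, kept per tool and updated per incident, is what turns "we were careful" into something you can actually produce in a malpractice defense.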

Regulatory Alignment

As of 2025, 91% of US state bars are developing AI-specific guidance. ABA Formal Opinion 512 (July 2024) establishes baseline obligations but does not prescribe specific evaluation procedures.

Your framework should align with state bar guidance in your relevant jurisdictions, the ABA Commission's working group recommendations (February 2025), and the NIST AI Risk Management Framework — compliance with which provides safe harbor in Colorado and Texas.

If you are in the EU, Article 4 of the AI Act adds another layer. "Sufficient AI literacy" is now a regulatory requirement, and the regulators will eventually define what that means for legal professionals procuring AI tools.

The Bottom Line

AI tool evaluation should receive the same rigor you would apply to hiring an associate or engaging an expert witness. The technology offers genuine efficiency gains, but those gains come with measurable risks that terms-of-service agreements do not address.

A systematic framework — testing accuracy independently, evaluating security, scaling verification to risk, and documenting decisions — converts AI adoption from an act of faith into a managed engineering process.

The tools are good enough to use. They are not good enough to trust blindly. The difference between those two statements is where your evaluation framework lives.


Key Takeaways

  • Major legal AI platforms hallucinate between 17% and 33% of the time — vendor claims of "hallucination-free" are empirically disproven
  • Two hallucination types require different testing: incorrect information and misgrounded citations where the source does not support the claim
  • 41% of AI legal tools show significant security weaknesses — evaluate data handling, training data usage, and incident response
  • Verification should scale with risk: spot checks for ideation, mandatory source verification for citations and client-facing work
  • Firms with mandatory human review report 94% fewer AI-related errors — the verification investment pays for itself