The Stanford RegLab Legal AI Accuracy Study is a peer-reviewed empirical evaluation of the accuracy of commercial AI legal research tools, published in 2024 by researchers at Stanford Law School's RegLab and CodeX center. Its full citation is: Magesh, Varun, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, and Daniel E. Ho. "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools." Stanford RegLab / Stanford CodeX, 2024.
The study is the first independent, large-scale attempt to benchmark accuracy across multiple commercial legal AI platforms using a standardized question set evaluated by practicing attorneys. Prior to its publication, the only available accuracy data on commercial legal AI came from vendor self-assessments and marketing materials. The study introduced a comparative "mistake rate" — the proportion of AI responses containing factual errors, including hallucinated case citations, incorrect holdings, and inaccurate procedural statements — that enabled direct comparison across platforms.
Its findings generated substantial industry response, including public disputes from two major legal information vendors, a revised methodology paper from the authors, and policy responses from bar associations that cited the study's findings in guidance on AI use in legal practice.
Before this study existed, lawyers and law firms evaluating legal AI tools had no independent accuracy benchmarks to work from. Vendor demonstrations showed the tools performing well on carefully selected examples. Marketing materials cited internal "accuracy rates" without methodology disclosure. Procurement decisions relied heavily on brand reputation and anecdotal reports from peer firms.
The Stanford study changed that baseline. Its core finding — that even the best-performing commercial legal AI tools produced factual errors in approximately 17% of tested responses — has direct practical implications for how lawyers should use these tools.
To understand the stakes, consider the context. In Mata v. Avianca (S.D.N.Y. 2023), an attorney filed a brief containing six AI-generated fake case citations. The court imposed sanctions of $5,000 on the law firm and the attorneys personally. The Federal Rules of Civil Procedure (FRCP) do not distinguish between errors caused by carelessness and errors caused by AI tool failures: the filing attorney bears responsibility regardless of source. A 17% error rate in legal research outputs means that roughly one in six AI-generated research results contains an error, and because the attorney cannot know in advance which one, every result must be checked before a competent attorney can rely on it.
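The per-response figure compounds across a filing. As a rough illustration, the sketch below assumes each AI-generated result carries an independent 17% chance of containing an error. That independence is an assumption made here for the arithmetic, not a finding of the study, and the function name is hypothetical.

```python
def prob_at_least_one_error(n_results: int, error_rate: float = 0.17) -> float:
    """Chance that at least one of n_results contains an error.

    Assumes errors are independent across results, which is a
    simplification made for illustration only.
    """
    if n_results < 0:
        raise ValueError("n_results must be non-negative")
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return 1.0 - (1.0 - error_rate) ** n_results


# Show how the risk grows with the number of AI-derived results relied on.
for n in (1, 3, 6, 10):
    chance = prob_at_least_one_error(n)
    print(f"{n:>2} results: {chance:.0%} chance of at least one error")
```

Under these assumptions, a filing that relies on six AI-generated results already has roughly a two-in-three chance of containing at least one error, which is why checking a sample of the results is not enough.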
Several law firms updated their AI acceptable use policies specifically in response to the Stanford study, including requirements that all AI-generated legal citations be independently verified before inclusion in any court filing, brief, or client advice. At least one state bar association (New York State Bar Association) issued guidance on AI in legal practice in 2024 that referenced independent accuracy benchmarks as a factor attorneys should consider when selecting tools.
The study also put pressure on vendors to improve their products. Both Thomson Reuters (Westlaw AI) and RELX (Lexis+ AI) made public representations about product improvements in the period following the study's publication, though independent follow-up benchmarking has been limited.
For solo practitioners and small firms without dedicated legal technology procurement resources, the study provides a specific data point: the best-performing platforms showed 17% mistake rates on the tested question set. Treat that figure as the starting assumption about any AI legal research output until the result has been independently verified.
How It Works (Technical)
The study's methodology was designed to produce results that are both rigorous and replicable. Understanding the methodology is important for interpreting the findings correctly — including their limitations.
Question set construction: The researchers developed a set of 202 legal questions drawn from real attorney research tasks. Questions were designed to span multiple practice areas, jurisdictions, and question types (case law research, statutory interpretation, procedural rules). The questions were written to reflect genuine attorney queries rather than edge cases designed to trip up AI systems.
Platform selection and testing: The researchers tested several commercial platforms, including Westlaw AI (Precision), Lexis+ AI, Practical Law AI, and CoCounsel (then operating as Casetext's AI product), alongside a raw GPT-4 baseline without legal-specific grounding. Each platform received the same 202 questions. Responses were captured and evaluated blind by two independent lawyer-reviewers.
Evaluation criteria: Each response was evaluated for factual accuracy. The "mistake rate" was defined as the proportion of responses containing any factual error — including hallucinated citations (citations to cases that do not exist), incorrect case holdings, incorrect procedural rules, or material misstatements of legal standards. The evaluation was binary: the response either contained a factual error or it did not.
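The binary rule described above can be sketched in a few lines. This is a minimal illustration, not the researchers' code: the `ReviewedResponse` schema, its field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReviewedResponse:
    """One platform response after lawyer review (hypothetical schema)."""
    question_id: int
    hallucinated_citation: bool = False
    incorrect_holding: bool = False
    incorrect_procedural_rule: bool = False
    misstated_legal_standard: bool = False

    def has_error(self) -> bool:
        # Binary rule: any single factual error marks the whole response.
        return (
            self.hallucinated_citation
            or self.incorrect_holding
            or self.incorrect_procedural_rule
            or self.misstated_legal_standard
        )


def mistake_rate(responses: list[ReviewedResponse]) -> float:
    """Proportion of responses containing at least one factual error."""
    if not responses:
        raise ValueError("need at least one reviewed response")
    return sum(r.has_error() for r in responses) / len(responses)


# Made-up review results for illustration only, not data from the study.
sample = [
    ReviewedResponse(1),
    ReviewedResponse(2, hallucinated_citation=True),
    ReviewedResponse(3, incorrect_holding=True, misstated_legal_standard=True),
    ReviewedResponse(4),
]
print(f"mistake rate: {mistake_rate(sample):.0%}")
```

Note that response 3 has two errors but counts once: under a binary rule, the number of errors within a response does not change the rate.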
Key findings by platform:
- GPT-4 (raw, without legal grounding): 88% mistake rate
- Westlaw AI (Precision): 33% mistake rate
- Lexis+ AI: 17% mistake rate
- Practical Law AI: 17% mistake rate
- CoCounsel (Casetext): 17% mistake rate
The 88% mistake rate for raw GPT-4 established a baseline for what an ungrounded large language model does when asked legal questions without access to verified legal databases. The commercial platforms' lower error rates demonstrate the value of grounding AI responses in verified legal databases — but also confirm that grounding does not eliminate errors.
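A rate measured on 202 questions carries sampling uncertainty, which matters when comparing platforms or extrapolating to other questions. The sketch below computes a 95% Wilson score interval for a proportion. It assumes 34 erroneous responses out of 202 (about 17%); that count is an illustrative assumption, since the exact per-platform counts are not given above, and nothing here reflects how the study itself reported uncertainty.

```python
import math


def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0 or not 0 <= errors <= total:
        raise ValueError("need total > 0 and 0 <= errors <= total")
    p = errors / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumption for illustration: 34 of 202 responses contained an error (~17%).
low, high = wilson_interval(34, 202)
print(f"observed: {34 / 202:.1%}, 95% interval: {low:.1%} to {high:.1%}")
```

Under that assumption the interval runs from roughly 12% to 23%, so a platform measured at 17% on this question set could plausibly sit several points higher or lower on a comparable set of questions.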
Vendor disputes and methodology revision: Thomson Reuters and RELX both publicly disputed the study's methodology in statements issued in Q3 2024. Their objections centered on the representativeness of the question set (arguing that the 202 questions were weighted toward question types where their platforms underperformed) and the binary evaluation method (arguing that partial errors were treated the same as complete fabrications). The Stanford research team published a response paper in Q4 2024 maintaining their core findings while acknowledging that the question set represented one domain of legal research rather than all possible uses. The core finding — that all tested commercial platforms showed material error rates — was not retracted.
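The vendors' objection to binary evaluation is easier to see with numbers. The sketch below scores the same set of reviewed responses two ways: the binary rule, and a severity-weighted alternative of the kind the objection implies. The severity categories, the 0.5 weight for partial errors, and the sample labels are hypothetical and do not come from the study or from either vendor.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical severity grades; the weights are illustrative only."""
    NONE = 0.0
    PARTIAL = 0.5      # e.g. a real case with an imprecise holding
    FABRICATED = 1.0   # e.g. a citation to a case that does not exist


def binary_mistake_rate(labels: list[Severity]) -> float:
    """Share of responses with any error, regardless of severity."""
    if not labels:
        raise ValueError("need at least one label")
    return sum(label is not Severity.NONE for label in labels) / len(labels)


def weighted_error_score(labels: list[Severity]) -> float:
    """Average severity weight across responses."""
    if not labels:
        raise ValueError("need at least one label")
    return sum(label.value for label in labels) / len(labels)


# Made-up labels: 6 clean, 3 partial errors, 1 fabrication.
labels = [Severity.NONE] * 6 + [Severity.PARTIAL] * 3 + [Severity.FABRICATED]
print(f"binary mistake rate:  {binary_mistake_rate(labels):.0%}")
print(f"weighted error score: {weighted_error_score(labels):.0%}")
```

On this made-up data the binary rule reports 40% while the weighted score reports 25%. Neither number is wrong; they answer different questions, and a lawyer who must verify any response containing an error has practical reasons to care about the binary one.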
How Legal AI Vendors Address It
Westlaw Precision AI (Thomson Reuters) combines GPT-4 with Westlaw's verified legal database and citation validation features. The 33% mistake rate in the original study reflects its performance before a series of product updates that Thomson Reuters released in late 2024. Thomson Reuters has since claimed improvements but has not published an independently audited post-update accuracy rate. Limitation: because the product is updated frequently, any specific accuracy benchmark goes stale quickly; lawyers should treat the Stanford figure as a dated baseline rather than a measurement of current performance.
Lexis+ AI (LexisNexis) performed at the 17% mistake rate tier in the Stanford study, tied for best performance among the tested platforms. Lexis+ AI grounds responses in LexisNexis's database of verified case law and statutes. LexisNexis has invested in citation-checking features that surface the source documents underlying each AI response, allowing attorney verification. Limitation: the 17% error rate on the tested question set does not mean the tool is accurate 83% of the time across all legal tasks — the question set covered specific research question types, and performance varies by question complexity and practice area.
CoCounsel (Thomson Reuters, formerly Casetext) matched the 17% mistake rate in the original study. CoCounsel's architecture grounds responses in verified legal databases and provides source citations with each research answer. Post-acquisition by Thomson Reuters, CoCounsel has been integrated more closely with Westlaw infrastructure. Limitation: as with all platforms, the tested accuracy reflects a specific question set at a specific time; performance on a particular lawyer's specific research question will vary.
Paxton AI was included in expanded testing with results that varied by question type and jurisdiction, performing comparably to the leading platforms on some question categories and worse on others. Paxton AI is positioned for government and regulatory practice. Limitation: its performance profile in the Stanford testing was less consistent across question types than the top-performing platforms.
Harvey AI was not included in the original Stanford study (it was not publicly available in the same form at the time of testing). Subsequent internal benchmarking by Harvey has been published in marketing materials but has not been independently verified by researchers using the Stanford methodology. Limitation: the absence of independent benchmark data means Harvey's accuracy claims rest on vendor self-assessment.
How Lawyers Should Apply the Findings
- Treat 17% as your starting assumption, not your ceiling. The best-performing platforms showed 17% error rates on a structured question set. For your specific research question, which may be more complex, more jurisdictionally specific, or in a less-covered practice area, the error rate may be higher. Verify every substantive AI research output before relying on it in a filing or client advice.
- Use the study to frame vendor conversations, not to make final decisions. The study tested specific platforms at a specific point in time on a specific question set. Use the findings to ask vendors pointed questions: "What is your current error rate on [specific question type]?" "How has your platform's accuracy changed since the Stanford study?" If a vendor cannot provide methodology-backed answers, treat that as a limitation to weigh.
- Require source citations for every AI research output. The tested platforms that performed best all provided source citations grounding their responses in verified databases. An AI research answer with no source citation has no independently verifiable accuracy, so do not use it. Require that any AI research tool you use surfaces the underlying case or statute for every factual claim.
- Document your verification process for AI-assisted filings. Before filing any document that includes AI-assisted research, verify each cited authority independently (in Westlaw, Lexis, or a government database), note the verification in your work file, and ensure that your firm's AI acceptable use policy documents this requirement.
- Monitor for updated independent benchmarks. The Stanford study was the first; it will not be the last. As independent accuracy benchmarking becomes more common, more current and comprehensive data will be available. Track publications from Stanford RegLab, RAND, and the Georgetown Law Institute on academic AI research for updates.
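
The verification step recommended above (check each cited authority independently and note the check in the work file) could be tracked with a simple log. The sketch below is one possible shape, not a prescribed format: the `CitationCheck` record, its fields, and the placeholder citations are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CitationCheck:
    """Work-file record for one cited authority (hypothetical schema)."""
    citation: str
    verified_in: Optional[str] = None   # e.g. "Westlaw", "Lexis", "government database"
    verified_by: Optional[str] = None   # who performed the check
    verified_on: Optional[date] = None  # when the check was performed

    def is_verified(self) -> bool:
        # A check counts only if source, reviewer, and date are all recorded.
        return all((self.verified_in, self.verified_by, self.verified_on))


def unverified(checks: list[CitationCheck]) -> list[str]:
    """Citations that still need independent verification before filing."""
    return [c.citation for c in checks if not c.is_verified()]


# Placeholder citations for illustration; these are not real cases.
work_file = [
    CitationCheck("Placeholder v. Example A", "Westlaw", "associate", date(2025, 1, 15)),
    CitationCheck("Placeholder v. Example B"),
]

outstanding = unverified(work_file)
if outstanding:
    print("Do not file yet. Unverified authorities:", outstanding)
else:
    print("All cited authorities have a recorded verification.")
```

Requiring all three fields before a citation counts as verified keeps the log useful as evidence of process if a filing is later questioned.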