
2026/04/21
Last reviewed: 2026/05/18
TL;DR · AI hallucination (when a legal AI system fabricates citations, misquotes holdings, or invents authority that does not exist) remains a documented risk in 2026, including in premium tools. Stanford HAI's peer-reviewed study found Westlaw AI hallucinated on 33% of queries and Lexis+ AI on 17%. The cost of getting it wrong is no longer theoretical: at least 486 documented court cases worldwide have involved AI-generated false citations, with $50,000+ in court-assessed fines, and ABA Formal Opinion 512 now reads Model Rule 1.1 Comment 8 as requiring AI competence. This guide explains how hallucination happens, why "100% hallucination-free" marketing claims are misleading, and exactly how lawyers should verify before filing.
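The term gets used loosely, often by the vendors whose tools hallucinate. Stanford's RegLab and Human-Centered AI Institute (HAI), in the most rigorous independent benchmark of legal AI to date, defined hallucination as a response that is either incorrect (it describes the law incorrectly or makes a factual error) or misgrounded (it describes the law correctly but cites a source that does not support the claim).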
The second category is the one lawyers underestimate. A perfectly real citation to a real case can still be a hallucination if the case doesn't say what the AI claims it says. That is not a technicality. Mata v. Avianca, where ChatGPT cited cases that do not exist and attached real-sounding holdings to them, is the failure everyone remembers; a real case cited for something it never held is harder to spot, because the citation itself checks out.
In Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (Journal of Empirical Legal Studies, 2025), the Stanford team tested over 200 open-ended legal queries against three leading platforms:
| Tool | Hallucination Rate | Accuracy Rate | Vendor Marketing Claim |
|---|---|---|---|
| Westlaw AI-Assisted Research | ~33% | 42% | "Grounded in good law" |
| Lexis+ AI | ~17% | 65% | "100% hallucination-free linked legal citations" |
| Ask Practical Law AI | ~17% | — | "Reliable" |
| GPT-4 (no legal grounding, baseline) | Higher than all three | — | N/A |
The headline finding: the most premium tools available to lawyers in 2024–2025 hallucinated on one in six to one in three queries. Marketing claims of "100% hallucination-free" did not survive empirical testing.
Vendors have iterated since the Stanford study. Both LexisNexis and Thomson Reuters have shipped product updates and contest some of the methodology. Harvey, a premium platform not included in the Stanford benchmark, reports an internal hallucination rate of approximately 0.2%. As of mid-2026, no independent third-party benchmark has confirmed any vendor's internal claims at production scale.
The reasonable lawyer's position: the trend line is improving, the marketing remains ahead of the reality, and verification remains the lawyer's responsibility.
Large language models — including those underneath legal AI tools — generate text by predicting which words are statistically likely to come next, given the prompt and the training data. They do not "look up" answers in the way a human researcher does. They synthesize.
That synthesis is excellent when the underlying training data is dense and consistent (e.g., generating fluent prose, summarizing well-trodden concepts). It is fragile when the answer requires precision (e.g., a specific case citation, a statutory subsection number, a holding from a recent decision the model hasn't seen).
Modern legal AI tools attempt to mitigate this with Retrieval-Augmented Generation (RAG): before answering, the system pulls relevant documents from a verified database (case law, statutes, treatises) and grounds the response in those documents.
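As a rough illustration of the retrieve-then-ground pattern, the sketch below ranks a tiny in-memory corpus by word overlap with the query and builds a prompt that confines the model to the retrieved sources. The corpus, document ids, and function names are all hypothetical, and word-overlap ranking is only a stand-in for whatever search a real product uses; it does not depict any vendor's pipeline.

```python
import re

# Hypothetical three-document corpus; real systems index case law, statutes, treatises.
corpus = {
    "doc-a": "The limitations period for breach of contract claims is set by statute.",
    "doc-b": "Negligence requires duty, breach, causation, and damages.",
    "doc-c": "A motion to dismiss tests the legal sufficiency of the complaint.",
}


def tokenize(text):
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Return the ids of the k documents sharing the most tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc_id: len(query_tokens & tokenize(documents[doc_id])),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_prompt(query, documents, k=2):
    """Assemble a prompt that asks the model to answer only from retrieved sources."""
    doc_ids = retrieve(query, documents, k)
    sources = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"{sources}\n"
        f"Question: {query}"
    )


print(build_grounded_prompt("limitations period for contract claims", corpus, k=1))
```

Even in this toy version, the answer can only be as good as the document the retrieval step happens to pick.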
RAG reduces hallucination. It does not eliminate it. The retrieval step can pull the wrong document. The generation step can misinterpret what it retrieved. The citation step can attach the right citation to the wrong proposition. Each link in the chain has its own failure rate, and they compound.
This is why Stanford's "misgrounded" category matters so much. A tool can retrieve a real case, generate a confident-sounding answer, and cite the case correctly — and still be wrong about what the case actually held. That is the failure mode that gets lawyers sanctioned, because it looks completely legitimate on the page.
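The compounding of per-step failure rates can be made concrete with a small calculation. The sketch below assumes the retrieval, generation, and citation steps fail independently, which is a simplification, and the per-step rates are illustrative placeholders rather than measurements of any product.

```python
from math import prod


def chain_error_rate(stage_error_rates):
    """Probability that at least one stage fails, assuming independent stages."""
    return 1.0 - prod(1.0 - p for p in stage_error_rates)


# Hypothetical error rates for retrieval, generation, and citation.
# Placeholders for illustration only, not figures from the Stanford study.
illustrative_rates = [0.05, 0.05, 0.05]
print(f"{chain_error_rate(illustrative_rates):.3f}")
```

Under these assumed numbers, three steps that are each wrong 5% of the time produce an end-to-end error rate of about 14%, which is why a pipeline can look reliable step by step and still fail often overall.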
Roberto Mata sued Avianca Airlines after a metal serving cart struck his knee on an international flight. The case was routine. The defense was routine. The brief his attorneys filed was not.
In March 2023, Steven A. Schwartz of Levidow, Levidow & Oberman P.C. used ChatGPT to research the response to Avianca's motion to dismiss. ChatGPT generated a brief citing six cases.
None of them existed.
The opinions attributed to real judges — Judge Barrington D. Parker of the Second Circuit, Judge Reggie B. Walton — were entirely fabricated, with invented holdings and invented quotations. ChatGPT had put words into real judges' mouths.
What turned this from an embarrassment into a sanction was what happened next. When Avianca's counsel notified the court that they could not locate the cases, Schwartz did not withdraw. When the court itself could not locate the cases and ordered him to produce them, Schwartz returned to ChatGPT, asked it to confirm the cases were real, accepted the AI's reassurance ("the cases I provided are real and can be found in reputable legal databases such as LexisNexis and Westlaw"), and submitted ChatGPT-generated "excerpts" of the fake decisions to the court. Those excerpts were themselves fabricated.
Judge P. Kevin Castel held a packed-courtroom sanctions hearing on June 8, 2023, and issued a 46-page opinion on June 22. Schwartz, his colleague Peter LoDuca, who had signed the brief without reading it, and the firm itself were sanctioned $5,000, jointly and severally. The opinion was sharp: Judge Castel called one of the fabricated "legal analyses" gibberish, and emphasized that the lawyers had acted in "subjective bad faith" not because they used ChatGPT, but because they failed to correct course when the fabrications were exposed.
The court was explicit on this point. From the opinion:
"Technological advances are commonplace and there is nothing inherently improper about using a reliable artificial intelligence tool for assistance. But existing rules impose a gatekeeping role on attorneys to ensure the accuracy of their filings."
The lesson was never "don't use AI." The lesson was "you remain responsible for what you sign."
That lesson has become regulatory. In July 2024, the American Bar Association issued Formal Opinion 512, its first formal ethics opinion on generative AI in legal practice. The opinion grounds AI use in existing Model Rules, primarily Rule 1.1 (competence), Rule 1.6 (confidentiality), Rule 1.5 (fees), and Rules 5.1 and 5.3 (supervisory responsibility). Comment 8 to Rule 1.1, which requires lawyers to keep abreast of relevant technology, does not name generative AI, but Opinion 512 reads it as covering these tools.
State bars followed. As of mid-2026, more than 30 state bars have issued AI guidance, and many treat Mata as the regulatory blueprint.
The deterrent effect of a $5,000 sanction has been less than complete. In the 30 months following Judge Castel's order, at least 15 additional documented cases have involved attorneys submitting AI-generated briefs containing fabricated citations. Damien Charlotin's AI Hallucination Cases Database — a public, ongoing record of judicial sanctions for AI-generated false citations — has documented 486 cases worldwide, implicating 128 lawyers and 2 judges, with cumulative court-assessed fines exceeding $50,000.
A few representative incidents:
Park v. Kim (2d Cir. 2024): The Second Circuit imposed sanctions and dismissed an appeal in part because of AI-generated citations. The opinion explicitly noted that Mata had been decided eight months earlier and that the attorney had been on constructive notice of the hallucination risk. The court rejected any claim of novelty.
Wadsworth v. Walmart: AI-generated citations submitted in federal litigation. Sanctions imposed.
Johnson v. Dunn (N.D. Ala. 2025): A large, well-regarded firm submitted AI-generated false citations in a motion. The court declared in its sanctions opinion that monetary penalties alone are proving insufficient and that "something more is needed." The court declined to accept as mitigation the fact that a practice group co-leader, rather than the signing attorney, had introduced the hallucination.
The pattern is now clear enough to be considered industry-wide notice: lawyers signing briefs are personally responsible for verifying AI output, regardless of who in the firm produced the draft.
Different categories of tools have different exposure to hallucination, and different mitigation strategies.
These tools generate novel answers to novel questions, often citing case law. They are the highest-risk category and the category Stanford benchmarked. Mitigation approaches on the market in 2026:
Westlaw Precision (Thomson Reuters) — Uses RAG over the Westlaw database. Includes KeyCite citation validation, which checks whether cited cases have been overruled or treated negatively. Even with these mitigations, Stanford found ~33% hallucination rate, though product iterations since the study may have improved this.
Lexis+ AI (LexisNexis) — Uses RAG over the Lexis database. Marketed as "100% hallucination-free linked legal citations," a claim Stanford's testing did not support. Tested rate ~17%. Recent versions emphasize transparent source linking.
CoCounsel (Casetext / Thomson Reuters) — Originally Casetext, acquired by Thomson Reuters for $650M. Designed around grounding responses in known reliable sources. Stanford tested an adjacent Thomson Reuters product (Ask Practical Law AI) rather than CoCounsel directly.
Harvey AI — Enterprise platform with custom legal models trained on case law. Claims a 0.2% internal hallucination rate. Not yet independently benchmarked at this rate. Multi-layered verification system including embedding-based document matching and real-time Shepardization.
Paxton AI — Newer entrant emphasizing citation transparency and verification workflows.
A useful framing: vendors are competing on hallucination rate, and rates appear to be improving. But none of the rates published in independent third-party studies in 2024–2025 reached zero, and a 17% rate on a legal research tool means roughly one in every six queries produces a false claim. Verification is not optional.
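To see what a per-query rate implies over a research session, the sketch below computes the expected number of flawed answers and the chance of hitting at least one. It assumes each query fails independently at the same rate, which real usage will not match exactly; the function names and the ten-query session are illustrative.

```python
def expected_flawed(rate, n_queries):
    """Expected number of hallucinated answers across n_queries."""
    return rate * n_queries


def prob_at_least_one(rate, n_queries):
    """Chance that at least one of n_queries is hallucinated, assuming independence."""
    return 1.0 - (1.0 - rate) ** n_queries


# 0.17 is the Lexis+ AI rate reported in the Stanford study; 10 queries is an assumption.
rate, session = 0.17, 10
print(f"expected flawed answers: {expected_flawed(rate, session):.1f}")
print(f"chance of at least one:  {prob_at_least_one(rate, session):.0%}")
```

At a 17% rate, a ten-query session yields about 1.7 flawed answers on average and roughly an 84% chance of at least one, under the independence assumption.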
Contract review AI — Spellbook, Luminance, Kira, LawGeex — operates against a known input document and a known playbook of preferred clauses and risks. The output is much less likely to fabricate authority because it isn't generating novel legal arguments. The failure mode shifts from fabrication to over- or under-flagging clauses, which a careful reviewer can audit against the source document.
Tools like Everlaw, Relativity AI, and Briefpoint handle large document populations. Hallucination risk is lower in classification tasks, higher in summarization and document-timeline generation, where the AI is constructing a narrative.
LawyerAI scores each tool's accuracy across these categories independently. See our methodology for how the 5-Dimension framework weights accuracy by primary use case.
Every output from a generative legal AI tool — including the premium ones, including the ones with the most aggressive "hallucination-free" marketing — should pass through this protocol before any work product leaves your office.
For every case cited, pull it up in Westlaw, Lexis, or the appropriate court's free database (PACER, state court systems, Google Scholar for federal). If the citation does not resolve to a real case, it is a hallucination. If it resolves but the case is not what the AI says it is, it is a misgrounded hallucination — equally dangerous.
This is not optional even for the tools that claim citation validation. KeyCite and Shepard's flag overruled or negatively treated cases; they do not flag cases that never existed in the first place.
A common AI failure: the citation is real, the case is real, but the holding the AI attributes to the case is something the case never said. Skim the actual opinion. If you cannot quickly find the proposition the AI claims the case stands for, the AI is probably hallucinating the proposition.
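For quotations specifically, part of this check can be mechanized. The sketch below tests whether a quoted passage appears in the opinion text after normalizing case, curly quotes, and line breaks. The function names and the sample opinion text are hypothetical, and a match only confirms the words are present; it says nothing about paraphrased holdings, which still need a human read.

```python
import re


def normalize(text):
    """Lowercase, straighten curly quotes, and collapse whitespace and line breaks."""
    for curly, straight in (("\u201c", '"'), ("\u201d", '"'), ("\u2018", "'"), ("\u2019", "'")):
        text = text.replace(curly, straight)
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears(quote, opinion_text):
    """True if the quoted passage occurs verbatim (after normalization) in the opinion."""
    return normalize(quote) in normalize(opinion_text)


# Invented placeholder text standing in for an opinion pulled from a primary source.
opinion = "The court held that the claim\nwas time-barred under the applicable statute."

print(quote_appears("the claim was time-barred", opinion))
print(quote_appears("the claim was timely", opinion))
```

A quote that fails this check is a strong signal to read the opinion closely before relying on the proposition.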
In Mata, one fabricated case was attributed to a court that wasn't operating in the cited year. In real (non-hallucinated) cases, the equivalent failure mode is the AI citing a case that has been overruled. Run KeyCite or Shepard's on every case. Note any negative treatment.
If a research answer is high-stakes, run the same query through a second tool (a different vendor, or your traditional research workflow). Disagreement between tools is a signal to dig deeper. Agreement is suggestive but not dispositive — both tools can be wrong in the same direction if the underlying training data is biased.
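One simple way to surface disagreement is to compare the authorities each tool cites for the same question. The sketch below assumes the citations have already been copied out of each tool's answer by hand; the function names and the citation strings are placeholders, not real cases.

```python
import re


def normalize_citation(cite):
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return re.sub(r"\s+", " ", cite).strip().lower()


def compare_authorities(tool_a_cites, tool_b_cites):
    """Group citations by whether both tools, or only one, relied on them."""
    a = {normalize_citation(c) for c in tool_a_cites}
    b = {normalize_citation(c) for c in tool_b_cites}
    return {
        "both": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Placeholder citation strings with a made-up reporter abbreviation.
tool_a = ["Alpha v. Beta, 1 Rep. 100", "Gamma v. Delta, 2 Rep. 200"]
tool_b = ["alpha v. beta,  1 Rep. 100", "Epsilon v. Zeta, 3 Rep. 300"]

result = compare_authorities(tool_a, tool_b)
print(result["only_a"], result["only_b"])
```

Citations that appear in only one tool's answer are the first candidates for a closer look, though shared citations still need to be pulled and read.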
The recurring lesson of Mata, Park v. Kim, and Johnson v. Dunn is the same: courts hold the signing attorney responsible, regardless of who in the firm produced the draft. Read the brief. Pull the cases. If you have not personally verified the citations in a document over your signature, you should not sign it.
A 30-minute verification pass on a 10-page brief is the cost of using these tools responsibly. It is also a fraction of the time AI saves on drafting, which is why the math still works. The lawyers who lose are the ones who treat the AI's confidence as a substitute for verification.
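The protocol above can be captured in a simple verification log. The sketch below is one possible shape for such a record, with hypothetical class and field names; the decision to treat an empty log as unverified is an assumption, and a firm's actual policy may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CitationCheck:
    citation: str
    resolves_to_real_case: bool = False  # found in a primary source
    holding_matches_claim: bool = False  # the opinion says what the AI claims
    still_good_law: bool = False         # citator run, no disqualifying treatment
    notes: str = ""

    def passed(self) -> bool:
        return (
            self.resolves_to_real_case
            and self.holding_matches_claim
            and self.still_good_law
        )


@dataclass
class VerificationLog:
    document: str
    signing_attorney: str
    attorney_personally_reviewed: bool = False
    checks: List[CitationCheck] = field(default_factory=list)

    def unresolved(self) -> List[str]:
        """Citations that have not yet passed every check."""
        return [c.citation for c in self.checks if not c.passed()]

    def ready_to_file(self) -> bool:
        # Assumption: a log with no checks recorded counts as unverified.
        return (
            bool(self.checks)
            and not self.unresolved()
            and self.attorney_personally_reviewed
        )


log = VerificationLog(document="draft brief", signing_attorney="A. Attorney")
log.checks.append(CitationCheck("Placeholder citation 1", True, True, False))
print(log.ready_to_file(), log.unresolved())
```

Keeping the signing attorney's review as its own field mirrors the point the courts keep making: verified citations are not enough if the person signing has not looked.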
The Stanford study is most damning not in its hallucination rate findings — those will improve with iteration — but in its catalog of vendor marketing claims that did not hold up:
Lexis: "Lexis+ AI delivers 100% hallucination-free linked legal citations."
Casetext: "CoCounsel does not make up facts, or 'hallucinate.'"
Thomson Reuters: "We avoid hallucinations by relying on the trusted content within Westlaw."
Each of these claims is, on the empirical evidence available in 2024–2025, false or at best aspirational. Vendors hold themselves out as having solved a problem they have meaningfully reduced but not eliminated.
This matters for two reasons. First, lawyers are not technical evaluators; many take "hallucination-free" claims at face value and relax their verification practices accordingly. A one-in-three hallucination rate per query does not mean that one in three briefs is defective, but it does mean that a lawyer who skips verification on a research-heavy filing runs a substantial risk of carrying a defective citation or proposition into it.
Second, the gap between marketing and reality is itself a methodology question. LawyerAI scores Accuracy on independent verification where available, not on vendor self-report. When marketing claims diverge from third-party testing, we score by the test, not the claim. Vendors who provide independent benchmark data — or invite independent testing — earn higher Accuracy scores. Vendors who do neither earn an "Indicative" label on Accuracy until verification is possible.
Three changes most firms should consider, if they have not already:
1. Adopt a verification policy in writing. A documented internal protocol — every citation verified against a primary source, every signed brief reviewed by the signing attorney, no AI-generated brief filed without a verification log — is what state bars are starting to ask about in disciplinary investigations. The policy doesn't have to be elaborate. It has to exist.
2. Train associates on what hallucination actually looks like. The Mata lawyers did not file fake cases because they were lazy. They filed fake cases because they did not understand that ChatGPT could fabricate citations confidently. Comment 8 to Rule 1.1 (competence with relevant technology), as read by Formal Opinion 512, requires this understanding. An hour-long training session, repeated annually, would likely have prevented many of the documented AI sanctions.
3. Pick tools by their hallucination posture, not by their marketing. A tool that publishes independent benchmark data is a tool whose vendor is willing to be measured. A tool that publishes only its own internal metrics is a tool whose vendor wants you to trust without verifying. The first signals confidence. The second signals something else.
The premium tools — Harvey, Westlaw Precision, Lexis+ AI, CoCounsel — are all meaningfully better than ChatGPT for legal research, and the trajectory is improving. None of them are good enough, in mid-2026, to remove the lawyer's verification burden. Treating them as if they were is what gets attorneys sanctioned.
Vendors have iterated. LexisNexis and Thomson Reuters have shipped product updates since 2024, and both contest aspects of the Stanford methodology. As of mid-2026, no independent third-party benchmark has been published at the same rigor confirming the improvements. The reasonable default: assume non-zero hallucination, verify accordingly.
Harvey claims a 0.2% internal hallucination rate, materially lower than the Stanford-tested rates for Westlaw (33%) or Lexis+ AI (17%). The claim has not been independently verified at production scale. For Big Law firms with the budget for Harvey's enterprise tier, the architecture appears more robust; for everyone else, the practical answer is the same — verify before you file.
No. RAG meaningfully reduces hallucination by grounding answers in a real document database, but it does not eliminate it. The retrieval step can pull the wrong document. The generation step can misinterpret what it retrieved. The Stanford-tested tools all use RAG and still hallucinated.
Comment 8 requires lawyers to "keep abreast of changes in the law and its practice, including the benefits and risks associated with relevant technology." In the AI context, the ABA's July 2024 Formal Opinion 512 and the cascade of state bar opinions interpret this as requiring lawyers to understand how AI tools work, including the risk of hallucination, before using them in client representations.
No. Mata, Park v. Kim, and Johnson v. Dunn all confirm that the signing attorney bears responsibility regardless of who in the firm produced the draft. Johnson v. Dunn explicitly rejected the argument that a practice group co-leader's hallucination was a defense for the signing attorney.
LawyerAI is an independent directory of AI tools for lawyers, scored across 5 dimensions. We do not accept affiliate commissions. Featured placement is clearly disclosed and does not influence editorial scoring. For methodology questions, score revision requests, or editorial contact: editor@lawyerai.directory.
This guide is version 1.0, published 2026-05-18. Material revisions will be versioned and dated.