We respect attorney-client confidentiality. No tracking pixels in our emails.

What the Stanford RegLab data actually shows about AI hallucination rates in legal research tools. 17%–88% variance by platform, 27 documented sanctions cases, and a 7-point verification checklist.
2026/05/18
On June 22, 2023, two New York lawyers were sanctioned by a federal judge for filing a brief with six fabricated case citations, all generated by ChatGPT. By May 2026, 27 similar sanctions cases had been documented in US courts, along with four in other countries. The legal AI hallucination problem hasn't gone away. It has only become harder to see, because the tools got better at making fake citations look real.
Most "best legal AI" lists are written by vendors or affiliates. This one isn't.
This risk report follows four rules that govern everything published on LawyerAI:
LawyerAI does not accept vendor payment that influences scores. No vendor relationship changes the accuracy data we report or the risk levels we assign. If a vendor offers us placement in exchange for a favorable rating, we decline.
Every tool has real limitations, including the ones we recommend. The tools with the lowest hallucination rates in this report still hallucinate. A 17% error rate means roughly one in six citations contains an error, and because you cannot know in advance which one, every citation still needs manual verification. We say that plainly.
Pricing is published transparently — if a vendor won't publish it, we say "not published." For hallucination rates, the same rule applies: if independent data doesn't exist, we say so. We do not substitute vendor claims for independent evidence.
Accuracy data from independent third parties only — vendor self-reported figures are labeled as such. Throughout this report, every data point is sourced. "Stanford RegLab 2024, independent" means it came from an independent study. "Vendor-reported" means the vendor said it themselves and it has not been independently verified.
The hallucination risk in legal AI research varies from 17% to 88% depending on the tool, and that gap has life-altering consequences for the lawyers who use them. The platforms with the lowest documented error rates are Lexis+ AI and CoCounsel, both at 17% per the Stanford RegLab 2024 independent study. Westlaw Precision AI measured 33% in the same study. General-purpose models like GPT-4 tested at 88% — nearly nine out of ten citations contained errors on the controlled benchmark.
For federal court brief filings, only tools with independent accuracy data should be used — and every citation should still pass the 7-point verification checklist in Chapter 7 of this report. For internal memos and unfiled research, the risk tolerance can be higher, but verification cannot be skipped entirely. No current legal AI tool eliminates hallucination. The question is how much risk you are willing to carry and whether your malpractice insurance policy has a position on it.
LawyerAI evaluates every tool across five dimensions: Accuracy, Speed, Usability, Value, and Security. Each dimension is scored 1.0–5.0 based on independent data where available, supplemented by structured testing. Hallucination rate is the primary driver of the Accuracy score for legal research tools. A full explanation of how each dimension is weighted and sourced is at /methodology.
For this risk report, Accuracy is the controlling dimension. Speed, usability, and value matter — but they are secondary to whether the tool puts fabricated citations in your brief.
| Platform | Hallucination Rate | Data Source | Grounded Retrieval | Best For |
|---|---|---|---|---|
| Lexis+ AI | 17% | Stanford RegLab 2024 (independent) | Yes | Federal court research, accuracy-critical work |
| CoCounsel | 17% | Stanford RegLab 2024 (independent) | Yes | Litigation research with Westlaw subscription |
| Westlaw Precision AI | 33% | Stanford RegLab 2024 (independent) | Yes | Broad corpus research with verification workflow |
| Harvey AI | Not published | No independent data as of May 2026 | Vendor-reported | Transactional work, large firm deployment |
| Paxton AI | Not published | No independent data as of May 2026 | Unknown | Government, public sector legal research |
| vLex Vincent | Not published | No independent data as of May 2026 | Unknown | International and multi-jurisdiction research |
| GPT-4 (baseline) | 88% | Stanford RegLab 2024 (independent) | No | Not appropriate for legal citation research |
Table current as of May 2026. Independent hallucination rate data for Harvey AI, Paxton AI, and vLex had not been published as of this report's last-reviewed date. LawyerAI will update this table when independent data becomes available.
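For firms that keep their approved-tool list in code or a spreadsheet export, the table above can be held as structured data so that "not published" stays distinct from a measured rate. This is a minimal sketch, not part of LawyerAI's methodology: the `ToolRecord` class and the `tools_with_independent_data` helper are hypothetical names, and the values are copied from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolRecord:
    name: str
    # None means "not published". It must never be read as zero.
    hallucination_rate: Optional[float]
    source: str


STANFORD = "Stanford RegLab 2024 (independent)"
NO_DATA = "No independent data as of May 2026"

# Values copied from the comparison table in this report.
TOOLS = [
    ToolRecord("Lexis+ AI", 0.17, STANFORD),
    ToolRecord("CoCounsel", 0.17, STANFORD),
    ToolRecord("Westlaw Precision AI", 0.33, STANFORD),
    ToolRecord("Harvey AI", None, NO_DATA),
    ToolRecord("Paxton AI", None, NO_DATA),
    ToolRecord("vLex Vincent", None, NO_DATA),
    ToolRecord("GPT-4 (baseline)", 0.88, STANFORD),
]


def tools_with_independent_data(tools, max_rate):
    """Return names of tools with a measured rate at or below max_rate.

    Tools without independent data are excluded rather than assumed safe.
    """
    return [
        t.name
        for t in tools
        if t.hallucination_rate is not None and t.hallucination_rate <= max_rate
    ]


print(tools_with_independent_data(TOOLS, 0.20))  # ['Lexis+ AI', 'CoCounsel']
```

Storing the missing rates as `None` rather than `0.0` keeps the filter from quietly treating an unmeasured tool as a perfect one.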
The legal industry's reckoning with AI hallucination began in a federal courtroom in the Southern District of New York. On June 22, 2023, the court sanctioned attorneys Steven Schwartz and Peter LoDuca of the firm Levidow, Levidow & Oberman over a brief they had filed in Mata v. Avianca, Inc. The brief cited eleven cases as precedent for their client's position. Six of those cases did not exist.
The citations were not obviously fake. They followed proper Bluebook format. They named real courts, real reporters, real years. They had page numbers, docket numbers, and everything that a citation should have. What they did not have was any underlying reality. The cases had been generated by ChatGPT, and ChatGPT had invented them with the same confidence it uses when discussing things that actually exist.
When opposing counsel noticed inconsistencies and the court ordered the attorneys to produce the cited cases, Schwartz initially doubled down, asking ChatGPT to confirm the citations were real, which it did. The model confirmed its own fabrications. After a sanctions hearing, Judge P. Kevin Castel imposed a $5,000 penalty, jointly and severally, on the two attorneys and their firm, and ordered them to notify the judges who had been falsely named as authors of the fabricated opinions. The order was published and became the template that other courts would use in the 27 documented sanctions cases that followed.
What made Mata v. Avianca a watershed moment was not just that it happened. It was that the attorneys genuinely did not know they were using fabricated citations. Schwartz testified that he had believed ChatGPT was a legal research database — a search engine of law, not a generative model that produces plausible-sounding text. This misunderstanding is still widespread. Many attorneys who use legal AI tools do not understand the difference between retrieval (finding a real document) and generation (creating a plausible-sounding document). That gap in understanding is where sanctions happen.
By May 2026, the 27 documented US sanctions cases span federal district courts in multiple circuits and state courts in eleven states. Four cases in other countries, including one in the UK courts and one before an EU tribunal, have been documented in public filings. The actual number of AI hallucination incidents that did not result in sanctions (because opposing counsel didn't notice, because the brief was withdrawn before filing, or because the matter settled) is impossible to quantify. The sanctions cases are the visible surface of a much larger problem.
The question the legal industry has not fully answered is how to build a workflow that treats AI hallucination as a structural risk rather than an individual competence failure. Schwartz and LoDuca were sanctioned. But the failure was systemic: they were using a tool designed for general conversational purposes to perform a specialized, high-stakes task that requires factual accuracy as a prerequisite to professional responsibility.
The most significant independent hallucination study of legal AI tools published to date is the Stanford RegLab's 2024 comparative analysis. The study tested major legal AI platforms against a controlled benchmark of case law research tasks — questions that required the model to identify, cite, and characterize real cases. This is the most relevant real-world analog to what lawyers actually do in legal research.
The findings revealed a range that should concern every attorney using AI tools for citation work:
GPT-4 (general-purpose baseline): 88% case mistake rate. On the Stanford benchmark, nearly nine out of ten citations produced by GPT-4 contained material errors — wrong citations, misattributed holdings, or cases that do not exist. This is the baseline for what an ungrounded large language model does with legal research. It does not retrieve cases. It generates plausible-looking citations based on patterns in its training data. The 88% figure is not surprising once you understand the architecture; it is alarming only if you thought GPT-4 was doing research.
Westlaw AI-Assisted Research: 33% error rate. Westlaw's AI research features, which pull from one of the largest proprietary legal databases in existence, produced a 33% error rate on the Stanford benchmark. At this rate, roughly one in three citations contains a material error, and since nothing marks which ones, every citation needs manual verification before it can be filed or relied upon. That is not a minor inconvenience. It is a professional responsibility issue that requires a systematic verification workflow for every AI-assisted research session.
Lexis+ AI: 17% hallucination rate. Lexis+ AI tested at 17% in the Stanford study, tied with CoCounsel for the lowest rate of any major platform in the benchmark. This is meaningfully better than Westlaw's 33%, and it is dramatically better than GPT-4's 88%. It is not, however, zero. A 17% error rate means roughly one in six citations contains an error, so every citation still requires verification.
CoCounsel: 17% (tied with Lexis+ AI). CoCounsel, which requires a Westlaw subscription and uses Westlaw's corpus with a different AI layer, also tested at 17%. This result suggests that the AI layer matters independently of the corpus — CoCounsel's results were substantially better than Westlaw's native AI features despite using the same underlying case law database.
The Stanford benchmark is a controlled environment, not field conditions. Real-world hallucination rates may differ based on the complexity of the question, jurisdiction coverage, the age of the relevant case law, and the edge-case nature of the research task. The benchmark results are the best independent data available, not a guarantee of performance in any specific matter.
Why vendor self-reported accuracy numbers cannot substitute for independent data: vendors have obvious financial incentives to report favorable accuracy figures. Several legal AI vendors publish accuracy claims on their marketing materials without specifying the benchmark used, the question types tested, the jurisdiction coverage, or whether the benchmark was designed by the vendor or by a third party. Until a vendor submits to an independent study with a published methodology and allows the results to be reported regardless of outcome, their self-reported accuracy figures should be treated as marketing, not evidence.
Understanding hallucination risk requires understanding that not all hallucinations are the same. There are three distinct categories, each with a different detection difficulty and a different risk profile for practicing attorneys.
Category 1: Fabricated Citations. This is the Mata v. Avianca category — the AI invents case names, docket numbers, reporter volumes, and page numbers that do not exist. The citation looks completely real. "Smith v. Jones, 742 F.3d 891 (7th Cir. 2019)" has the right format, the right reporter, the right circuit designation, and a year that falls within the relevant period. The case simply doesn't exist in any database.
Fabricated citations are the most dangerous category for brief filings because they are most likely to result in sanctions. They are also, paradoxically, the easiest category to catch — a citator search on Westlaw or Lexis will immediately reveal that the case doesn't exist. The problem is that attorneys who trust the AI tend to skip the citator check. The AI's confident presentation of the citation is interpreted as confirmation that the case exists.
Category 2: Misattributed Holdings. This category is subtler and arguably more dangerous than fabricated citations in practice. The case exists. The citation format is correct. The AI has simply stated the wrong holding. It might correctly cite Bell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007), but then characterize the pleading standard it established incorrectly: understating the plausibility requirement, conflating it with the more permissive "no set of facts" standard from Conley v. Gibson that it retired, or applying it to a context where it doesn't control.
Misattributed holdings are harder to catch because attorneys stop checking once they confirm the case exists. If a citator search confirms that Twombly is a real Supreme Court case from 2007 — which it is — the attorney may not read the actual case text to verify that the AI's characterization of its holding is accurate. This is the silent failure mode: technically real citations being used to support propositions the cases don't actually stand for.
Category 3: Phantom Statutes. The third category covers regulatory and statutory hallucinations — AI inventing statutory provisions, regulatory sections, or administrative interpretations that do not exist. This is more common in regulatory practice and administrative law than in case law research, and it is particularly dangerous because statutes and regulations are less commonly memorized by practitioners than landmark case names.
A litigator encountering a citation to "Smith v. Jones" they've never heard of is likely to verify it. A transactional attorney working in a specialized regulatory area encountering "17 C.F.R. § 275.206(4)-7(b)(3)(ii)(C)" may not notice if the AI has added a subsection that doesn't exist, especially if the surrounding regulatory structure is accurate. The phantom statute category is growing as AI tools are used more frequently in regulatory compliance work — an area where /solutions/in-house teams are significant users of AI research tools.
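One way to make a phantom subsection harder to miss is to break a regulatory citation into every level that has to exist in the official text. The sketch below assumes a deliberately narrow citation pattern that covers the example above and similar forms. The `subsection_paths` helper is a hypothetical name, and the code only lists what to look up. It does not confirm that any provision exists.

```python
import re

# Narrow, illustrative pattern: "<title> C.F.R. § <section><subsections>".
# The optional "(4)-7" style suffix is treated as part of the section number.
CFR_PATTERN = re.compile(
    r"(?P<title>\d+) C\.F\.R\. § "
    r"(?P<section>[\d.]+(?:\(\d+\)-\d+)?)"
    r"(?P<subs>(?:\([^)]+\))*)"
)


def subsection_paths(citation: str) -> list[str]:
    """Return each nesting level of a C.F.R. citation, shallowest first.

    Every returned path must be confirmed against the official text;
    a hallucinated citation often fails only at the deepest level.
    """
    match = CFR_PATTERN.fullmatch(citation.strip())
    if match is None:
        raise ValueError(f"unrecognized C.F.R. citation: {citation!r}")
    paths = [f"{match['title']} C.F.R. § {match['section']}"]
    for level in re.findall(r"\(([^)]+)\)", match["subs"]):
        paths.append(f"{paths[-1]}({level})")
    return paths


for path in subsection_paths("17 C.F.R. § 275.206(4)-7(b)(3)(ii)(C)"):
    print(path)
```

Run on the example citation, this yields five paths, from the section itself down to the final `(C)` level, which gives the reviewer a concrete list to tick off against the regulation.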
The explanation for why Lexis+ AI and CoCounsel achieve 17% error rates while GPT-4 achieves 88% lies in a technical architecture called Retrieval-Augmented Generation — RAG. Understanding how RAG works is essential to understanding both why the best tools are better and why no tool is safe to use without verification.
Ungrounded generation (GPT-4's baseline behavior for legal research) works like this: the model has been trained on a vast corpus of text including legal decisions, law review articles, and other legal materials. When asked for case citations, it generates text that looks like citations based on patterns in that training data. It has no mechanism to check whether the citation refers to a real case. It produces the most statistically plausible-looking output, which often resembles a real citation but may not be one.
RAG works differently. Before generating a response, the model retrieves actual documents from a connected database — in legal AI tools, this is the Lexis or Westlaw corpus. The model then generates its response based on the retrieved documents rather than from training data alone. If asked for cases on a topic, it finds real cases in the database first, then writes about them. This anchors the output to reality in a way that ungrounded generation cannot.
Why RAG reduces hallucination: the model is generating from real retrieved documents, so a citation tied to a retrieved document points to a case that actually exists in the database, as long as the retrieval step works correctly. This is why Lexis+ AI and CoCounsel achieve 17% versus GPT-4's 88%.
Why RAG does not eliminate hallucination: the remaining 17% error rate reveals the limit of grounded retrieval. Even when a model retrieves real documents, it can still misread, mischaracterize, or mis-extract the holding from those documents. It can retrieve a case that is related but not directly on point, then overstate its applicability. It can retrieve the right case but quote from the wrong section. It can retrieve multiple cases and conflate their holdings. The generation step still introduces error even when the retrieval step is accurate.
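The retrieve-then-generate pattern, and the limit just described, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two-case corpus, the word-overlap `retrieve` function, and the `ungrounded_citations` check are hypothetical stand-ins, not how Lexis+ AI, CoCounsel, or any vendor implements retrieval. The corpus text is placeholder wording, not a statement of the holdings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    citation: str
    text: str  # placeholder summary text for the toy retrieval step


# Toy corpus standing in for a real legal database.
CORPUS = [
    Case("Bell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007)",
         "a complaint must plead facts making the claim plausible"),
    Case("Ashcroft v. Iqbal, 556 U.S. 662 (2009)",
         "plausibility pleading applies to all civil actions"),
]


def retrieve(query: str, corpus: list[Case], k: int = 2) -> list[Case]:
    """Rank cases by how many query words appear in their text."""
    words = set(query.lower().split())
    scored = [(len(words & set(c.text.lower().split())), c) for c in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [case for _, case in scored[:k]]


def ungrounded_citations(draft: list[str], retrieved: list[Case]) -> list[str]:
    """Return citations in a draft that were not among the retrieved cases."""
    grounded = {case.citation for case in retrieved}
    return [citation for citation in draft if citation not in grounded]


retrieved = retrieve("plausible pleading complaint", CORPUS)
draft = [
    "Bell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007)",
    "Smith v. Jones, 742 F.3d 891 (7th Cir. 2019)",  # the invented example
]
print(ungrounded_citations(draft, retrieved))
```

The check flags the invented Smith v. Jones citation because nothing retrieved backs it. It says nothing about whether the draft describes Twombly correctly, which is the part of the error rate that grounding does not remove.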
The residual errors concentrate in ambiguous questions, cross-jurisdiction issues, and edge cases where the legal standard is genuinely contested. On clean, well-defined research questions with settled law and strong precedent, grounded tools perform considerably better. On novel questions, emerging areas of law, or research tasks that require synthesizing conflicting authorities, the error rate rises.
A full technical explanation of retrieval-augmented generation is at /glossary/retrieval-augmented-generation. For the practical purpose of managing hallucination risk, the takeaway is: grounded retrieval tools are substantially safer than ungrounded ones, but "substantially safer" is not the same as "safe enough to skip verification."
The American Bar Association's 2025 Technology Survey provides practitioner-level data on the frequency of AI hallucination in actual legal practice. The survey found that 41% of attorneys who reported using AI tools for legal research had encountered at least one hallucinated citation in the past year. This is self-reported data, which means the actual incidence rate is almost certainly higher — attorneys who do not catch hallucinations before filing cannot report them.
The 27 documented US sanctions cases from 2023 through May 2026 represent only the incidents that were discovered by opposing counsel or the court and that also resulted in formal sanction. Cases where the AI-generated citation went undetected, where the brief was withdrawn after discovery, or where the judge addressed the issue informally are not included in the 27 figure.
The bar's response to this data has been primarily through ethics guidance rather than prohibition. The ABA's Model Rule 1.1, Competence, has been interpreted by multiple state bar ethics committees to require attorneys to understand the limitations of tools they use — including AI research tools. Attorneys who file AI-generated citations without verification may face competence challenges under Rule 1.1 independent of any court sanctions. Several state bars, including California and New York, have issued formal ethics opinions specifically addressing AI verification requirements. None have prohibited AI use outright; all have required human verification of AI-generated research before reliance.
For /solutions/big-law firms deploying AI research tools at scale, the ABA survey's 41% annual encounter rate implies that any firm of moderate size will have multiple attorneys encountering hallucinated citations each year. Without a firm-wide verification protocol, whether any given hallucination reaches a filing depends entirely on individual attorney diligence, which is an unacceptable risk profile for an Am Law 100 firm.
This chapter assesses each major legal research AI platform specifically on hallucination risk. These assessments are structured as risk analysis, not product recommendations. A lower hallucination rate does not mean a tool is appropriate for every use case. It means the citation error risk is lower, so fewer corrections should be expected during verification, not that verification can be skipped.
Lexis+ AI — 17% error rate (Stanford RegLab 2024, independent)
Lexis+ AI is the co-leading tool in independent hallucination benchmarks. Its 17% error rate on the Stanford RegLab 2024 study reflects its retrieval-augmented architecture operating against LexisNexis's federal and state case law corpus, which covers all fifty states and all federal circuits with robust historical depth.
What the 17% figure means operationally: for every six citations Lexis+ AI generates, statistically one will contain a material error. In practice, this means that a research memo relying on ten Lexis+ AI citations should be expected to contain one or two errors before verification. This is substantially better than the alternative, but it means that every citation requires a citator check before any filing.
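The arithmetic behind these figures is short enough to write out. The sketch below assumes the benchmark rate applies uniformly to every citation and that errors are independent of one another, a simplification that real research sessions will not match exactly. Both function names are hypothetical.

```python
def expected_errors(n_citations: int, error_rate: float) -> float:
    """Expected number of erroneous citations in a memo."""
    return n_citations * error_rate


def prob_at_least_one_error(n_citations: int, error_rate: float) -> float:
    """Chance that at least one citation is wrong, assuming independence."""
    return 1.0 - (1.0 - error_rate) ** n_citations


# Ten citations at the 17% benchmark rate.
print(round(expected_errors(10, 0.17), 2))          # 1.7
print(round(prob_at_least_one_error(10, 0.17), 2))  # 0.84
```

Under these assumptions, a ten-citation memo averages 1.7 errors and has roughly an 84% chance of containing at least one, which is why the citator check applies to every citation and not a sample.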
The 17% rate is a benchmark figure measured on controlled research tasks. For complex, cross-jurisdiction questions or research in areas where Lexis's corpus coverage is thinner (some international law, tribal law, specialized agency decisions), the error rate may be higher. Conversely, for straightforward federal case law research in well-developed areas of law, the error rate is likely lower.
Lexis+ AI's pricing starts at $149/month for the basic tier as of May 2026 (vendor-reported). Enterprise pricing is not published; firms report negotiated rates. Full review at /item/lexis-plus-ai. LawyerAI Accuracy score: 4.5/5.0.
CoCounsel — 17% error rate (Stanford RegLab 2024, independent)
CoCounsel, originally developed by Casetext and now a Thomson Reuters product, requires a Westlaw subscription and operates against Westlaw's case law corpus with its own distinct AI layer. Its 17% result in the Stanford benchmark is particularly notable because it demonstrates that the AI component matters independently of the corpus. Westlaw's native AI features tested at 33%; CoCounsel using the same Westlaw corpus tested at 17%. The difference is the AI architecture, not the data.
CoCounsel's limitation for many practitioners is the subscription dependency. A CoCounsel subscription requires an active Westlaw subscription, meaning the total cost of using CoCounsel is Westlaw + CoCounsel, which can exceed $600/month for individual practitioners. For firms already paying for Westlaw at enterprise rates, the incremental CoCounsel cost is more manageable.
The same caveat applies as with Lexis+ AI: 17% is a benchmark figure. Cross-jurisdiction research, regulatory edge cases, and novel legal questions will test higher. Every citation requires verification regardless of the tool's benchmark performance. Full review at /item/cocounsel. LawyerAI Accuracy score: 4.5/5.0.
Westlaw Precision AI — 33% error rate (Stanford RegLab 2024, independent)
Westlaw Precision AI tested at 33% in the Stanford RegLab 2024 study. This is a meaningful difference from the 17% benchmark — roughly twice the error rate. Operationally, at 33%, one in three citations generated by Westlaw's native AI features contains a material error. A ten-citation research memo should be expected to contain three errors before verification.
Westlaw's corpus is one of the largest and most comprehensive in legal research. The 33% figure is not a corpus problem — it is an AI layer problem, as demonstrated by CoCounsel's 17% using the same corpus. For practitioners who use Westlaw for its corpus depth and citator tools (KeyCite remains the most comprehensive citator in US legal research), the correct approach is to use Westlaw for corpus access and citator verification while treating the AI-generated citation suggestions as starting points requiring full verification, not endpoints.
Westlaw Precision pricing is not published for individual tiers; enterprise contracts are negotiated. Individual practitioner pricing has been reported in the $250–$450/month range (practitioner-reported, not independently verified). Full review at /item/westlaw-precision. LawyerAI Accuracy score: 3.5/5.0.
Harvey AI — No independent hallucination rate published as of May 2026
Harvey AI has deployed widely across Am Law 100 firms and US government agencies. It represents a different use case from Lexis+ AI and CoCounsel — it is primarily a document drafting and analysis tool rather than a citation research tool. However, it is being used for legal research tasks in many firm deployments.
Harvey reports using "grounded retrieval" in its product documentation, but no independent study of Harvey's hallucination rate on legal research tasks has been published as of May 2026. LawyerAI cannot assign an independent accuracy figure to Harvey for citation research. Vendor-reported accuracy figures exist but are labeled as vendor-reported throughout our methodology.
The implication for risk management: firms using Harvey for legal research citation work should apply the full 7-point verification checklist to every citation, and should not use Harvey's output for brief filings without independent citator verification. Harvey's strongest documented use cases are transactional document review and drafting — not case citation research. Full review at /item/harvey-ai. LawyerAI Accuracy score for citation research: not rated (insufficient independent data).
Paxton AI — No independent hallucination rate published as of May 2026
Paxton AI is a newer entrant focused primarily on government and public sector legal research. It has no independent hallucination rate published as of this report. Its corpus coverage is more limited than Westlaw or Lexis, which may affect both retrieval accuracy and hallucination risk on questions requiring broad case law coverage.
LawyerAI recommends treating Paxton AI as requiring full citation verification for any filing. Full review at /item/paxton-ai.
vLex Vincent — No independent hallucination rate published as of May 2026
vLex has a genuine competitive advantage in international and multi-jurisdiction legal research, with corpus coverage of jurisdictions where Westlaw and Lexis coverage is limited. For international law, comparative law, and cross-border matters, vLex's retrieval breadth may justify its use even without independent hallucination data.
The absence of independent accuracy data should be treated as it is: an absence of data, not evidence of safety. Until an independent study benchmarks vLex's hallucination rate, it should not be used for filed documents without full citation verification.
Every AI-researched citation — regardless of which tool generated it — should pass all seven of the following verification steps before it appears in a filed document. This checklist applies whether the citation came from a tool with a 17% error rate or an 88% error rate. The checklist is calibrated to the fact that hallucinations occur even in the best tools, and the consequences of a filed hallucination are professional responsibility issues regardless of which tool produced it.
The /glossary/citation-validation page has the full technical background on each step.
Step 1: Does the case exist? Run the case name and citation through Westlaw's KeyCite or Lexis's Shepard's citator. A case that does not exist will not appear. This is the foundational check that catches Category 1 hallucinations (fabricated citations). It takes approximately thirty seconds per citation.
Step 2: Is the citation format correct? Verify the reporter, volume, page number, court, and year against the citator result. AI tools sometimes get the case right but the citation slightly wrong — wrong reporter abbreviation, off-by-one page number, or incorrect year. An incorrect citation format is a professional presentation problem even if the underlying case is real.
Step 3: Is the holding accurately stated? Read the actual case text — not the AI's summary of it. The AI's characterization of what the case holds is the primary source of Category 2 hallucinations (misattributed holdings). The only way to verify a holding is to read the opinion at the relevant pages. For lengthy opinions, the AI's page citations (if provided) allow you to jump directly to the relevant section, but you must read the original.
Step 4: Is the case still good law? Run KeyCite or Shepard's for subsequent history. Cases get overruled, distinguished, or limited. An AI tool trained on data with a cutoff date may cite cases that have since been significantly limited. A red flag in KeyCite or a red stop sign in Shepard's is a bright-line stop signal.
Step 5: Is the case from the correct jurisdiction and at the correct precedential level? Verify that the court and jurisdiction are what the AI represented. AI tools sometimes cite persuasive authority as if it were binding, or cite a lower court decision when a controlling appellate decision exists. Confirm that the court's hierarchy is what you intended to cite.
Step 6: Does the case support the specific proposition cited? This is the most demanding check. Read the specific passage of the opinion that the AI claims supports your proposition. AI tools frequently cite cases that are related to the topic but do not directly support the specific point being argued. A case about contract formation does not necessarily support every proposition about contract interpretation.
Step 7: Has the AI paraphrased rather than quoted? If the AI gave you a quotation from the case, verify the quote against the original text word for word. If the AI paraphrased the case rather than quoting it directly, treat the paraphrase as an AI interpretation — not as the court's words. Paraphrases have a higher error rate than direct quotations because they require the AI to re-express the court's language, which is another opportunity for error.
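A firm that wants an audit trail for the seven steps could track them per citation. The following is a minimal sketch under that assumption. The `Step` enum and `CitationCheck` class are hypothetical names, and the code records only what a human reviewer has confirmed; it performs no verification itself.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Step(IntEnum):
    """The seven checklist steps, numbered as in this report."""
    EXISTS = 1
    FORMAT_CORRECT = 2
    HOLDING_ACCURATE = 3
    GOOD_LAW = 4
    JURISDICTION_AND_LEVEL = 5
    SUPPORTS_PROPOSITION = 6
    QUOTES_VERIFIED = 7


@dataclass
class CitationCheck:
    citation: str
    passed: set = field(default_factory=set)

    def mark_passed(self, step: Step) -> None:
        """Record that a human reviewer completed this step."""
        self.passed.add(step)

    def outstanding(self) -> list:
        """Steps not yet confirmed, in checklist order."""
        return [step for step in Step if step not in self.passed]

    def cleared_for_filing(self) -> bool:
        """True only when all seven steps have been confirmed."""
        return not self.outstanding()


check = CitationCheck("Bell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007)")
check.mark_passed(Step.EXISTS)
check.mark_passed(Step.GOOD_LAW)
print(check.cleared_for_filing())  # False
```

Because `cleared_for_filing` requires every step, a citation that passed only the existence and good-law checks stays blocked until the holding, jurisdiction, proposition, and quotation checks are also recorded.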
The decision about which legal AI tool to use for research — and how much verification to apply — should be driven by the filing risk of your output, not by which tool is cheapest or most convenient.
Branch 1: Federal court brief filing. Use only tools with independent accuracy data: Lexis+ AI or CoCounsel. Apply the full 7-point verification checklist to every citation. Do not substitute AI-generated citations for independently verified ones in any document filed with a federal court. /solutions/litigation teams should have a firm policy that addresses this specifically.
Branch 2: Internal memo or unfiled legal research. Verified grounded tools (Lexis+ AI, CoCounsel, Westlaw Precision) can be used with spot-check verification rather than full 7-point verification. Apply Steps 1, 3, and 4 at minimum. Tools without independent accuracy data (Harvey AI, Paxton AI) can be used for initial research framing, but should not be the primary citation source for any research that will be cited in advice.
Branch 3: Due diligence research (not filed). Any grounded retrieval tool with playbook-level review is appropriate. The risk is different from litigation: due diligence errors affect deal analysis, not court filings. Apply Steps 1 and 4 (existence and good law) at minimum. Full verification is recommended for legal conclusions that will drive transaction decisions.
Branch 4: International or multi-jurisdiction research. vLex has the broadest international corpus coverage. In the absence of independent hallucination data, apply the full 7-point checklist to any vLex-generated citations for filed documents. For purely informational multi-jurisdiction surveys, grounded retrieval with Step 1 verification is the minimum.
Branch 5: Consumer or informational legal content (not advice). General AI tools with heavy verification are acceptable for informational content that is not legal advice. Do not publish AI-generated legal content without a licensed attorney's review of the underlying legal conclusions, regardless of the tool used.
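The minimum verification steps in Branches 1 through 4 can be written down as a lookup table for a firm policy tool. This is an illustrative sketch: the use-case keys and the `minimum_steps` function are hypothetical names, and the fallback to the full checklist for unrecognized use cases is an assumption of this example, not a rule stated above. Branch 5 is left out because its requirement is attorney review, not a subset of steps.

```python
FULL_CHECKLIST = (1, 2, 3, 4, 5, 6, 7)

# Minimum checklist steps per use case, taken from Branches 1-4.
MINIMUM_STEPS = {
    "federal_brief": FULL_CHECKLIST,        # Branch 1
    "internal_memo": (1, 3, 4),             # Branch 2
    "due_diligence": (1, 4),                # Branch 3
    "international_filed": FULL_CHECKLIST,  # Branch 4, filed documents
    "international_survey": (1,),           # Branch 4, informational surveys
}


def minimum_steps(use_case: str) -> tuple:
    """Return the minimum checklist steps for a use case.

    Assumption of this sketch: anything unrecognized gets the full
    checklist, so a typo in the use case can never lower the bar.
    """
    return MINIMUM_STEPS.get(use_case, FULL_CHECKLIST)


print(minimum_steps("internal_memo"))  # (1, 3, 4)
print(minimum_steps("unknown"))        # (1, 2, 3, 4, 5, 6, 7)
```

These are floors, not targets. A memo that later feeds a filing moves to the `federal_brief` row and needs all seven steps.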
1. Has my favorite legal AI tool hallucinated in independent tests?
As of May 2026, only three major tools have published independent hallucination rates from third-party studies: Lexis+ AI at 17%, CoCounsel at 17%, and Westlaw Precision AI at 33% (all Stanford RegLab 2024). For Harvey AI, Paxton AI, vLex, and most other legal AI tools, no independent hallucination study has been published. The absence of data does not mean the tool doesn't hallucinate — it means no one has independently measured it yet. Treat any tool without independent data as having an unknown hallucination rate, which requires the same verification caution as a high measured rate.
2. How do I verify a legal AI citation in 30 seconds?
The fastest minimum-viable verification: paste the citation into Westlaw's KeyCite or Lexis's Shepard's search bar. If the case appears, confirm that the citation format matches exactly and note any negative-treatment signal. If it doesn't appear at all, treat the citation as fabricated until you can locate the case another way. This covers Step 1 of the 7-point checklist. For brief filings, 30 seconds is not sufficient: all seven steps are required. For preliminary research, the existence check is the minimum floor. The /glossary/citation-validation glossary page has a detailed walkthrough.
3. Which tool is safest for federal court brief filings?
Lexis+ AI and CoCounsel have the lowest documented independent hallucination rates at 17% per Stanford RegLab 2024. Of the two, Lexis+ AI does not require a second subscription, which makes it more accessible for practitioners not already on Westlaw. However, "safest available" does not mean "safe without verification." At 17%, one in six citations still requires correction. The answer to which tool is safe for federal court brief filings is: any grounded retrieval tool paired with the full 7-point verification checklist. No tool used without verification is safe for federal court filings.
4. Is AI hallucination getting better over time?
The trend is toward improvement, but the pace is uneven and the problem has not been solved. The gap between GPT-4's 88% and the leading grounded tools' 17% demonstrates that architectural choices (grounded retrieval vs. ungrounded generation) matter more than raw model capability. The 17% floor for grounded tools has remained relatively stable despite improvements in underlying model capability — suggesting that the remaining error is an inherent challenge of language model generation, not just a training data or corpus problem. Year-over-year improvement in the 17% figure has been incremental. No tool has published an independent hallucination rate below 10% for legal research as of May 2026. See /glossary/llm for more on how language model architecture affects output accuracy.
5. Can I tell my firm to ban legal AI to avoid the hallucination risk?
You can, but you would be creating a different risk while avoiding this one. Attorneys at firms that prohibit AI tools face a competitive and workload disadvantage against firms using them effectively. The ABA's competence standard (Model Rule 1.1) has been interpreted by multiple bar associations to include understanding and appropriately using the tools available in legal practice — which increasingly includes AI research tools. A blanket ban does not eliminate the hallucination risk if attorneys use personal AI accounts outside firm systems; it only removes firm visibility into the risk. The better approach is a firm policy that specifies approved tools, required verification steps, and filing protocols. The /compare/westlaw-vs-lexis-ai comparison can help evaluate the two most independently verified options for a firm's approved tool list.
LawyerAI evaluations are independent. We do not accept payment that influences our editorial scores. Featured placements (when introduced) will be clearly labeled and will not affect our 5-dimension scoring methodology. Our rankings reflect product reality at time of writing. We re-review every quarter and update the last-reviewed date accordingly.
If you spot an error, email editorial@lawyerai.directory. We correct in public and credit the reporter.