Large language models have become the engine beneath virtually every meaningful legal AI product launched since 2023. Understanding what an LLM is — and what distinguishes a well-implemented legal LLM from a raw general-purpose model — is now a baseline competency for any lawyer evaluating or deploying AI tools in practice.
The practical stakes are high. An LLM used without legal-specific grounding or fine-tuning can produce plausible-sounding citations to cases that do not exist, misstate holdings, or generate contract clauses that are legally incorrect for a given jurisdiction. Stanford RegLab research published in 2024 on legal hallucination in general-purpose LLMs reported error rates on specific legal queries ranging from 58% for GPT-4 to 88% for Llama 2, meaning that even the strongest raw general-purpose model tested answered more than half of those queries incorrectly.
Conversely, LLMs that are properly grounded, fine-tuned on legal corpora, or architected with retrieval-augmented generation (RAG) produce dramatically better results. A separate 2024 Stanford evaluation of commercial legal research tools measured hallucination rates of roughly 17% for Lexis+ AI and roughly 33% for Westlaw's AI-Assisted Research: still imperfect, but a substantial improvement over ungrounded models.
For lawyers, this means the base model matters, but it is not the whole story. The architecture built around the model — the retrieval systems, grounding mechanisms, confidentiality controls, and legal corpus quality — determines whether an LLM-powered tool is appropriate for professional legal work.
How It Works
Large language models are neural networks with billions or even hundreds of billions of parameters: numerical values that encode learned relationships between words, concepts, and contexts. Training involves exposing the model to enormous text datasets (web crawls, books, academic papers, legal databases) and adjusting the parameters to reduce error on a single task, next-token prediction: given this text, which token comes next?
The result is a model that has encoded a statistical representation of language — and, implicitly, of the knowledge contained in that language. This is why an LLM can answer questions about contract law without having been explicitly programmed with legal rules: it has seen enough legal text during training to develop an internal representation of legal concepts.
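The next-token objective described above can be illustrated with a toy model. The sketch below is not how a production LLM is trained: a real model adjusts neural network parameters by gradient descent, whereas this example simply counts which token follows which. The sample sentence and all function names are invented for illustration. What it does share with real training is the quantity being minimized, the average negative log-probability assigned to the token that actually comes next.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the raw follow-counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

def cross_entropy(counts, tokens):
    """Average negative log-probability assigned to each actual next token."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_probs(counts, prev).get(nxt, 1e-9)  # tiny floor for unseen pairs
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# Invented sample text, split on whitespace as a stand-in for real tokenization.
corpus = "the lessee shall pay rent . the lessor shall deliver possession .".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "shall")  # "pay" and "deliver" are equally likely
loss = cross_entropy(model, corpus)
```

After seeing "shall", the toy model splits its probability evenly between "pay" and "deliver" because each followed "shall" once. It has no notion of which is legally correct in context, only of which is statistically likely.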
How legal AI products use LLMs:
At inference time (when a user submits a query), the LLM receives a prompt: a structured input containing the user's question, any relevant documents, system instructions, and context. The model generates its response one token (word fragment) at a time, sampling each token from a probability distribution conditioned on the prompt and on everything generated so far. Because each step is a sample rather than a lookup, identical prompts can yield different responses, and hallucination is an inherent architectural risk rather than a software bug.
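A minimal sketch of that sampling step may make the point concrete. The scores below are hypothetical values for the token following "The governing law of this Agreement is"; they are not taken from any real model, and the function names are illustrative. The temperature parameter is a common sampling control: lower values concentrate probability on the top-scoring token, higher values spread it out.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}

def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical scores, assumed for illustration only.
logits = {"Delaware": 2.0, "New York": 1.8, "England": 0.5}

rng = random.Random(0)  # seeded so the run is repeatable
samples = [sample_next_token(logits, temperature=1.0, rng=rng) for _ in range(200)]
```

Across the 200 draws, more than one jurisdiction appears even though the input never changes. Nothing in the sampling step checks which answer matches the contract actually being reviewed, which is why grounding and verification have to be added around the model.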
Legal AI tools add layers on top of this base process:
- Retrieval augmentation: Before the LLM generates a response, a retrieval system searches a legal database (case law, statutes, contracts) and injects the most relevant documents into the prompt. This grounds the LLM's output in actual legal sources rather than training-data recall.
- Fine-tuning: The base LLM is further trained on legal-specific data (court opinions, contract templates, legal memoranda) to improve its performance on legal tasks and reduce hallucination on domain-specific queries.
- Instruction tuning: The model is trained to follow specific legal task instructions (summarize this contract, identify the governing law clause, flag unusual indemnification terms) rather than just completing text.
- Confidentiality architecture: Enterprise legal AI tools implement access controls, data isolation, and zero-data-retention agreements to ensure client documents processed through the LLM are not used for further model training.
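The retrieval augmentation layer in the list above can be sketched in a few lines. This is an assumption-laden illustration, not any vendor's implementation: word overlap stands in for a real search index or embedding model, the three source texts are invented placeholders rather than statements of law, and the function names and prompt wording are hypothetical.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a real system would use a proper tokenizer."""
    return set(text.lower().split())

def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (stand-in for a real search index)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc["text"])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Inject the top-k retrieved sources into the prompt ahead of the question."""
    sources = retrieve(query, documents, k)
    context = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in sources)
    return (
        "Answer using only the sources below and cite them by id. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

# Invented placeholder sources, not real legal rules.
documents = [
    {"id": "S1", "text": "a tenant must give thirty days written notice before terminating a lease"},
    {"id": "S2", "text": "a landlord must return the security deposit within twenty one days"},
    {"id": "S3", "text": "a corporation must hold an annual meeting of shareholders"},
]
query = "how much notice must a tenant give before terminating a lease"
prompt = build_prompt(query, documents)
```

The resulting prompt would then be sent to the LLM. The design point is that the model is asked to answer from the injected sources, which can be kept current and checked by a reader, rather than from whatever it recalls from training.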
Examples in production:
Harvey AI is built on GPT-4 and GPT-4o, with additional legal fine-tuning and enterprise confidentiality controls, and has been deployed at major law firms including Allen & Overy (now A&O Shearman), Linklaters, and Paul Weiss. CoCounsel, developed by Casetext and now owned by Thomson Reuters, combines GPT-4 with Westlaw's legal corpus, grounding its outputs in one of the world's largest legal databases. Spellbook uses GPT-4 to power contract drafting assistance directly within Microsoft Word, with prompts specifically tuned for commercial contract tasks.
Key Considerations for Law Firms
Base model vs. implementation: When a vendor claims "our tool is powered by GPT-4," the base model tells you relatively little. The critical variables are what data was used for fine-tuning, how grounding is implemented, how outputs are checked for hallucination, and what confidentiality controls govern your client data. Evaluate the full stack, not just the foundation model name.
Legal corpus quality: A legal LLM is only as good as the legal data it was trained on or has access to at inference time. A model trained primarily on general web text will perform worse on specialized legal tasks than one trained or grounded on comprehensive case law, statutes, and secondary legal sources. Ask vendors specifically what legal corpora they use.
Confidentiality and data use: The core question every firm must answer before deploying an LLM-powered legal tool is: does my client data get used to train the model? This is both an ethics obligation (ABA Model Rule 1.6) and a practical risk. Reputable enterprise legal AI vendors provide explicit zero-data-retention commitments and contractual guarantees that client data will not be used for training.
Jurisdiction specificity: General LLMs are trained on globally sourced data, which creates risk for jurisdiction-specific legal tasks. An LLM may conflate English law with US law, or generate an answer correct for federal practice but wrong for California state court procedure. Legal AI tools designed for specific jurisdictions typically either fine-tune on that jurisdiction's data or implement guardrails that flag jurisdiction-specific uncertainty.
Transparency about the underlying model: Some legal AI vendors are opaque about which foundation model they use. This opacity makes independent accuracy assessment difficult. Prefer vendors who disclose their underlying model, grounding architecture, and the independent benchmarks that validate their accuracy claims.
Limitations and Risks
Hallucination is structural, not a software bug: Because LLMs generate text probabilistically, they will produce factually incorrect output even when they appear confident. This is not a problem that can be fully engineered away; it can only be reduced through grounding, retrieval augmentation, and validation systems. Any legal AI workflow that does not include human verification of LLM-generated citations and legal analysis is professionally and ethically problematic.
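One small piece of such a verification workflow can be sketched as follows. The example is a simplification under stated assumptions: the regular expression recognizes only a handful of reporter abbreviations, the citation strings are placeholders rather than references to real cases, and the `verified` set stands in for a lookup against a trusted citator or database. The function name is hypothetical.

```python
import re

# Simplified pattern: volume, reporter, first page (for example "111 F.3d 222").
# Only a few reporters are covered; a production pattern would need many more.
CITATION_RE = re.compile(r"\b\d{1,4}\s+(?:U\.S\.|F\.(?:2d|3d|4th))\s+\d{1,4}\b")

def unverified_citations(draft, verified):
    """Return citations found in the draft that are absent from the trusted set."""
    found = [" ".join(c.split()) for c in CITATION_RE.findall(draft)]  # normalize spacing
    return [c for c in found if c not in verified]

# Placeholder draft text and placeholder trusted set, assumed for illustration.
draft = "The rule is settled. See 111 F.3d 222 and 333 U.S. 444."
verified = {"111 F.3d 222"}

flagged = unverified_citations(draft, verified)
```

A check like this only routes suspect citations to a human reviewer. Confirming that a citation exists says nothing about whether the case supports the proposition it is cited for, so review of the analysis itself remains necessary.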
Training data cutoffs: LLMs are trained on data collected up to a specific date. New case law, statutory amendments, and regulatory developments after the training cutoff will not be reflected in the model's internal knowledge. This makes grounding in continuously updated legal databases (Westlaw, LexisNexis) essential for legal research applications.
Reasoning limitations: Current LLMs perform well on pattern recognition and text generation but are less reliable on multi-step legal reasoning that requires tracking complex conditional relationships across a long document. Tasks like analyzing cross-referenced definitions in a 200-page credit agreement remain challenging.
Overconfidence: LLMs typically do not express calibrated uncertainty. A model may deliver an incorrect legal conclusion with the same confident tone as a correct one. Lawyers must cultivate healthy skepticism about any LLM output and establish verification workflows rather than treating LLM outputs as authoritative.
Cost and latency at scale: Processing large legal documents through LLMs incurs computational cost and latency, both of which increase with document size. Firms running high-volume contract review workflows need to evaluate vendor pricing models carefully.
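A rough back-of-the-envelope estimate shows how document size drives cost under token-based pricing. Every number below is an assumption made for illustration: the four-characters-per-token ratio is only a rule of thumb for English text, and the prices, output length, and function names are hypothetical rather than any vendor's actual figures.

```python
import math

def estimate_tokens(num_chars, chars_per_token=4):
    """Approximate token count from character count (rule-of-thumb ratio, assumed)."""
    return math.ceil(num_chars / chars_per_token)

def estimate_review_cost(doc_char_counts, input_price_per_1k, output_price_per_1k,
                         output_tokens_per_doc=500):
    """Estimate total cost for one pass over a batch of documents."""
    input_tokens = sum(estimate_tokens(n) for n in doc_char_counts)
    output_tokens = output_tokens_per_doc * len(doc_char_counts)
    return (input_tokens / 1000) * input_price_per_1k \
        + (output_tokens / 1000) * output_price_per_1k

# Two hypothetical agreements of 600,000 and 200,000 characters,
# at hypothetical prices per 1,000 input and output tokens.
batch = [600_000, 200_000]
cost = estimate_review_cost(batch, input_price_per_1k=0.01, output_price_per_1k=0.03)
```

Under these assumed figures the input side dominates: the two documents account for about 200,000 input tokens against 1,000 output tokens. A workflow that re-sends the full document for every question multiplies that input cost by the number of questions asked.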