Vendor training on customer data refers to whether and how an AI legal tool provider uses content submitted by customers — including contracts, research queries, briefs, client communications, and uploaded documents — to train, fine-tune, or otherwise improve its underlying AI models. This practice, which is common across consumer AI products, raises distinct professional responsibility concerns when the content in question is protected by attorney-client privilege or subject to confidentiality obligations.
The question is not theoretical. Large language models improve through exposure to more data, and customer-submitted content is valuable training material because it reflects real-world usage patterns. The commercial incentive for vendors to use customer data for model improvement is therefore substantial, and it stands in direct tension with lawyers' professional responsibility to protect client confidences.
Understanding a vendor's actual data practices — not just its marketing claims — requires reading vendor terms of service and data processing agreements carefully, asking specific contractual questions, and obtaining written representations. The four-category taxonomy below provides a framework for evaluating any AI legal tool's data handling approach.
ABA Model Rule 1.6(c) requires a lawyer to make "reasonable efforts to prevent the inadvertent or unauthorized disclosure of, or unauthorized access to, information relating to the representation of a client." The comments to Rule 1.6 address the safeguarding and electronic transmission of client information and call for reasonable precautions that take into account the sensitivity of the information and the security of the method used.
If a lawyer uploads a client's draft merger agreement to an AI contract review tool that uses customer inputs for model training, the client's confidential commercial terms could influence that model's outputs for other users. Even with PII removal, the commercial terms, negotiating positions, and transaction structure of a sensitive deal may be identifiable or recoverable from a model trained on the data. The risk is not merely theoretical: research on model inversion attacks and training data extraction has demonstrated that large language models can sometimes reproduce training data when prompted in specific ways (Carlini et al., "Extracting Training Data from Large Language Models," USENIX Security 2021).
ABA Model Rule 5.3, which governs a lawyer's responsibilities regarding nonlawyer assistance, adds a further layer: a lawyer who uses an AI tool is responsible for supervising the work the tool performs, including ensuring that the tool's practices are consistent with the lawyer's professional obligations. A lawyer who has not verified a vendor's data training practices has arguably not met that supervision standard.
The practical stakes vary by matter type. For routine document assembly or generic legal research, the risk of training-related disclosure may be lower. For M&A work, patent prosecution, regulatory investigations, employment disputes involving named executives, or any matter where the specific factual content is commercially or personally sensitive, the training data question deserves explicit due diligence.
Several state bars have addressed this directly. The Florida Bar's Ethics Opinion 24-1, issued in January 2024, notes that lawyers must investigate whether AI tools use client data for training before using those tools on client matters. The State Bar of California issued practical guidance in 2023 recommending that lawyers "understand the data practices of any AI tool used on a client matter, including whether and how the vendor uses client data."
How It Works (Technical)
The mechanics of model training are important for evaluating vendor claims. There are four materially distinct practices:
Category 1 — No training, no retention (true zero retention): Customer inputs and outputs are processed in memory only. No data is written to persistent storage, so nothing remains after the session ends. This is the strongest privacy posture. It also means the vendor cannot debug errors using customer data, which creates a trade-off discussed in the Zero Data Retention glossary entry. Zero retention is typically available only at enterprise tiers and requires contractual verification.
Category 2 — No training, data retained temporarily: Customer data is stored for a defined period (hours, days, or up to 30 days) for purposes such as session continuity, abuse detection, and technical debugging. The vendor commits not to use this retained data for model training. This is the most common enterprise commitment: it provides less privacy assurance than true zero retention but allows the vendor to investigate reported errors. The key questions are: what is the retention period, who has access to retained data, and what controls prevent unauthorized use?
Category 3 — Opt-out training: The default is that customer data may be used for model training. Customers who want to opt out must affirmatively change a setting or execute an enterprise agreement. This is the default for many consumer and prosumer AI products. Lawyers using tools in this category on standard subscription plans should assume their inputs are available for training unless they have explicitly opted out and received written confirmation.
Category 4 — Training permitted, anonymized: Customer data is used for training after a PII removal process. The quality of anonymization is the critical variable. Legal documents often contain commercially sensitive information that is not "personal data" as defined by GDPR Article 4(1) (information relating to an identified or identifiable natural person, such as an individual's name) but is nonetheless confidential. Deal terms, regulatory positions, and litigation strategies are generally not personal data, yet they are squarely subject to the lawyer's duty of confidentiality. PII removal therefore does not protect confidential client information; it removes only identifiers that meet the legal definition of personal data.
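The four categories above can be sketched as a small decision function. This is a minimal illustration, not a legal test: the `VendorPractice` fields, the `classify` helper, and the `acceptable_for_confidential_matter` rule are all hypothetical names and assumptions introduced here, and the inputs would have to come from a vendor's actual written terms.

```python
from dataclasses import dataclass
from enum import Enum


class TrainingCategory(Enum):
    ZERO_RETENTION = 1        # Category 1: no training, no retention
    RETAINED_NO_TRAINING = 2  # Category 2: no training, temporary retention
    OPT_OUT_TRAINING = 3      # Category 3: training unless the customer opts out
    ANONYMIZED_TRAINING = 4   # Category 4: training on de-identified data


@dataclass(frozen=True)
class VendorPractice:
    """Hypothetical summary of what a vendor's written terms commit to."""
    trains_by_default: bool
    retention_days: int  # 0 means nothing is written to persistent storage
    anonymizes_before_training: bool = False
    opt_out_confirmed_in_writing: bool = False


def classify(p: VendorPractice) -> TrainingCategory:
    """Map a vendor's default practice onto the four-category taxonomy."""
    if not p.trains_by_default:
        if p.retention_days == 0:
            return TrainingCategory.ZERO_RETENTION
        return TrainingCategory.RETAINED_NO_TRAINING
    if p.anonymizes_before_training:
        return TrainingCategory.ANONYMIZED_TRAINING
    return TrainingCategory.OPT_OUT_TRAINING


def acceptable_for_confidential_matter(p: VendorPractice) -> bool:
    """Illustrative policy (an assumption, not a rule of professional conduct)."""
    category = classify(p)
    if category in (TrainingCategory.ZERO_RETENTION,
                    TrainingCategory.RETAINED_NO_TRAINING):
        return True
    if category is TrainingCategory.OPT_OUT_TRAINING:
        # Assume inputs are trainable unless the opt-out is confirmed in writing.
        return p.opt_out_confirmed_in_writing
    # Category 4: PII removal does not protect confidential commercial terms.
    return False
```

For example, `classify(VendorPractice(trains_by_default=False, retention_days=30))` yields Category 2. The design choice worth noting is that Category 4 is rejected outright in this sketch, reflecting the point that anonymization addresses personal data rather than confidentiality.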
How to read terms of service for training practices: Certain formulations in privacy policies and terms of service reliably signal each category. "We may use your inputs to improve our services" is Category 3 language — training unless you opt out. "We process your data only to provide the service" with no carve-outs for improvement or training is closer to Category 1 or 2, but should be verified in the data processing agreement. "Aggregated and de-identified data" means the vendor retains the right to train on processed versions of your data — this is not zero retention. "Enterprise accounts are subject to separate data processing terms" is accurate but incomplete — you must obtain and review those separate terms.
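As a rough aid to a first read, the formulations above can be turned into a simple phrase scan. The sketch below is a hypothetical heuristic: the patterns in `SIGNALS` and the `flag_clauses` function are assumptions made for illustration, real terms vary widely in wording, and a match or a non-match is no substitute for reading the terms and the data processing agreement in full.

```python
import re

# Hypothetical patterns paraphrasing the formulations discussed above.
SIGNALS = [
    (re.compile(r"\bimprove (our|the) (services?|products?|models?)\b", re.I),
     "Possible Category 3: training unless you opt out"),
    (re.compile(r"\baggregated? (and|or) de-?identified\b", re.I),
     "Possible Category 4: vendor may train on processed data"),
    (re.compile(r"\bonly to provide the services?\b", re.I),
     "Possibly Category 1 or 2: verify in the data processing agreement"),
    (re.compile(r"\bseparate data processing terms\b", re.I),
     "Incomplete: obtain and review the separate enterprise terms"),
]


def flag_clauses(text: str) -> list[tuple[str, str]]:
    """Return (sentence, note) pairs for sentences that match a known signal."""
    findings = []
    # Naive sentence split on whitespace that follows a period or semicolon.
    for sentence in re.split(r"(?<=[.;])\s+", text):
        for pattern, note in SIGNALS:
            if pattern.search(sentence):
                findings.append((sentence.strip(), note))
    return findings


if __name__ == "__main__":
    sample = ("We may use your inputs to improve our services. "
              "We retain aggregated and de-identified data.")
    for sentence, note in flag_clauses(sample):
        print(f"{note}\n    {sentence}")
```

A scan like this is best used to decide which clauses to raise with the vendor in writing, not to assign a category on its own.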
How Legal AI Vendors Address It
Harvey AI provides an enterprise-tier commitment that customer data is not used for model training. Harvey's enterprise agreements contain explicit contractual provisions to this effect, and the company has publicly described its data handling in legal industry forums as "no training on customer data." The commitment applies at enterprise pricing; standard accounts may have different terms. Verification requirement: obtain the current enterprise agreement, identify the specific contractual clause addressing training use, and confirm it covers all content types (prompts, uploaded documents, and AI outputs). Limitation: Harvey relies on underlying model providers (primarily Anthropic's Claude and OpenAI APIs), so verify that those providers' enterprise terms also prohibit training on API customer data. Both Anthropic and OpenAI do so at the API tier but not at the consumer product tier.
Lexis+ AI (LexisNexis) maintains explicit commitments against using customer legal research content for model training in its enterprise legal products. LexisNexis has a strong institutional interest in protecting the confidentiality of law firm research patterns because its business model depends on attorney trust, and the company's enterprise terms reflect this. Limitation: verify which specific content types are covered. Does the no-training commitment apply to uploaded documents and custom prompts, or only to Lexis database search queries?
LegalFly is EU-based and operates under GDPR's data minimization principle (Article 5(1)(c)), which limits processing to what is "necessary in relation to the purposes for which they are processed." LegalFly's privacy documentation reflects GDPR's restriction on secondary use of personal data without a separate legal basis (the purpose limitation principle, Article 5(1)(b)). For EU matters, GDPR provides a structural legal constraint on training use that is absent from US-based vendors' contracts. Limitation: GDPR applies only to personal data, so it provides a stronger constraint for documents containing personal data of EU individuals than for purely commercial documents.
Spellbook (Rally) includes no-training commitments in its enterprise contracts. The free and standard subscription tiers have different terms — lawyers using Spellbook on a standard subscription should review current terms rather than assuming enterprise commitments apply. Limitation: Spellbook's underlying model infrastructure has involved OpenAI APIs; as with Harvey, verify the full data chain including the underlying model provider's enterprise terms.
How Lawyers Should Verify Vendor Training Practices
- Ask the specific question in writing before signing any contract. Send the vendor a written question: "Does your platform use content I submit — including prompts, uploaded documents, and AI outputs — to train, fine-tune, or improve your AI models or any underlying AI models?" Request a written answer. A vendor that will not provide a written answer to this question should not be trusted with client data.
- Read the data processing agreement (DPA), not just the main terms of service. Enterprise AI vendors typically have a separate DPA or data processing addendum that contains the operative data handling commitments. The marketing page says "we take security seriously." The DPA says what the vendor is actually contractually obligated to do. If the vendor does not have a DPA, that itself is a red flag for legal use.
- Identify the full data chain, including underlying model providers. Most legal AI tools are built on top of foundation models from Anthropic, OpenAI, Google, or others. The legal AI vendor's no-training commitment covers the vendor layer but may not cover the underlying model provider. Verify that the underlying provider's API terms also prohibit training on customer inputs. Both Anthropic's and OpenAI's API terms prohibit training on customer API data as of 2025; confirm this for the current contract version.
- Require a contractual indemnification or notification obligation. Negotiate a contractual provision requiring the vendor to notify you within a defined period (e.g., 72 hours) if it discovers that customer data was inadvertently used for training purposes, and an indemnification provision for breach. This does not fully address the risk but creates accountability and aligns vendor incentives.
- Establish matter-level use policies for AI tools. Not all client matters carry the same confidentiality sensitivity. Develop internal policies specifying which categories of matters (by deal sensitivity, client instruction, jurisdiction, or practice area) require verification of vendor training commitments before AI tools may be used, and which matters may proceed with standard tool deployment under existing contracts.
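The data-chain and matter-level steps above lend themselves to a simple internal checklist. The sketch below is one possible shape for such a check, under stated assumptions: the `Provider` and `ToolDataChain` types, the `SENSITIVE_MATTER_TYPES` set, and the `may_use_tool` rule are hypothetical, and a firm's actual policy would define its own matter categories and evidence requirements.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """One layer in the data chain (the legal AI vendor or a model provider)."""
    name: str
    written_no_training_commitment: bool
    dpa_reviewed: bool


@dataclass
class ToolDataChain:
    vendor: Provider
    underlying_providers: list[Provider] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """List every layer that lacks a written commitment or a reviewed DPA."""
        problems = []
        for p in [self.vendor, *self.underlying_providers]:
            if not p.written_no_training_commitment:
                problems.append(f"{p.name}: no written no-training commitment")
            if not p.dpa_reviewed:
                problems.append(f"{p.name}: DPA not reviewed")
        return problems


# Hypothetical policy: matter types that require a fully verified data chain.
SENSITIVE_MATTER_TYPES = {"m&a", "patent prosecution", "regulatory investigation"}


def may_use_tool(matter_type: str, chain: ToolDataChain) -> bool:
    """Sensitive matters need a gap-free chain; other matters proceed (assumed policy)."""
    if matter_type.strip().lower() in SENSITIVE_MATTER_TYPES:
        return not chain.gaps()
    return True


if __name__ == "__main__":
    chain = ToolDataChain(
        vendor=Provider("ExampleVendor", True, True),
        underlying_providers=[Provider("ExampleModelCo", True, False)],
    )
    print(chain.gaps())
    print(may_use_tool("M&A", chain))
```

Modeling the vendor and each underlying provider as separate entries makes the point from the list explicit: a commitment at the vendor layer does not close a gap further down the chain.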