Why Document-Only Legal AI Misses Half the Case Record
by Ali Rind, Last updated: May 19, 2026 , ref:

In September 2025, Judge Victoria Kolakowski of Alameda County Superior Court watched a video exhibit submitted as a witness statement and noticed the witness's face barely moved. The voice was disjointed. The mannerisms repeated. The court eventually identified it as one of the first AI-generated deepfakes submitted as authentic evidence in a U.S. case, and dismissed Mendones v. Cushman & Wakefield over it.
The case is talked about as a deepfake story. It is also, quietly, a multi-modal evidence story. The judge had to evaluate video, audio, and the documents that came with the exhibit, together. A modern litigation matter routinely requires that same kind of cross-modal analysis. The case record has not been document-only for some time.
Most of the legal AI tools attorneys evaluate in 2026 are. That is the gap this post is about.
What the modern legal case record actually contains in 2026
Walk into a complex matter today and the file looks nothing like the file from ten years ago. Personal injury cases include medical records, billing summaries, deposition transcripts, and the videos of those depositions, interview recordings, hospital surveillance footage, and dashcam clips from the accident.
Employment cases now routinely include Slack and Teams exports, recorded all-hands meetings, voicemails, and HR interview recordings. Criminal defense files contain body cam footage, jail call audio, surveillance video, scanned police reports, and digital extractions from phones. Family law has voicemails, text screenshots, and increasingly, recordings submitted as proof of threats or violations.
None of this is unusual anymore. Surveillance footage, deposition recordings, drone footage, dashcam video, and body camera content are standard evidence types in commercial litigation, criminal defense, employment disputes, and internal investigations. The volume is going up, not down. For a deeper look at how this changes e-discovery specifically, see our guide on video evidence in e-discovery.
The point is not that documents are obsolete. They still anchor the record. The point is that the documents alone no longer tell the story, and a tool that only reads documents only sees part of the case.
Harvey, Spellbook, and CoCounsel: what these legal AI tools actually do
The category framing is the whole question here. Harvey's litigation product line is built around analyzing complaints, discovery, expert reports, and deposition transcripts. Harvey's own litigation product page describes deposition workflows that synthesize and summarize transcripts to generate cross-examination questions. The transcript is the input, not the video.
Spellbook is a Microsoft Word side-panel for contract drafting and redlining. Excellent at what it does. It does not handle audio or video at all.
CoCounsel, now part of Thomson Reuters, focuses on legal research, document comparison, timeline generation, and database integration. Categorized in industry roundups as restricted to textual content and specific database connections.
None of this is a weakness. These are accurate category descriptions. The tools were built for the part of the case that lives in text, and they are strong inside that boundary. The question for a lawyer evaluating them in 2026 is whether the part of the case that lives outside that boundary is small enough to ignore.
For most modern litigation, the answer is no.
5 legal AI workflows where document-only tools fall short
Concrete examples are more useful than abstractions. Here are five places the gap is operational, not theoretical.
Comparing the video deposition to the transcript
A witness says something on camera that the certified transcript does not capture cleanly, or a pause, hesitation, or off-record comment changes the meaning of the testimony. A document-only tool reads the transcript and stops. The video stays unindexed unless someone watches it.
Cross-referencing recorded client interview with the signed statement
The interview was recorded. The statement was typed up from it. Are they consistent? A modern multimodal AI compares the audio against the document automatically. A document-only AI can only read the statement.
Finding the moment in body cam footage that matches a witness account
A defense attorney has a four-hour body cam clip and a written witness statement claiming the officer issued a verbal warning at a specific point. Locating that moment manually means scrubbing through the video. With AI that can search across transcripts, detected objects, and audio together, it takes a single query. We covered the mechanics of this workflow in more detail in the AI chatbot for video evidence search post.
Reading scanned handwritten physician notes alongside structured EHR data
Medical malpractice and Social Security disability work both depend on this. The structured chart says one thing. The handwritten note in the margin says another. Document-only AI either misses the handwritten content or reads it without context. Multi-modal AI processes both as part of the same record.
Building a timeline that spans documents and recorded evidence
Reconstructing what happened on a given day usually means pulling timestamps from emails, photos, surveillance footage, body cam clips, and witness statements together. A document-only tool produces the document portion of the timeline. The full picture requires a tool that reads the rest.
In each of these, the answers the document-only AI gives are accurate as far as they go. They just leave large parts of the record untouched.
What multi-modal legal AI requires: 5 core capabilities
The phrase gets used loosely. A tool that can transcribe audio is not the same as a tool that can analyze it alongside the rest of the case record. A useful working definition has five components:
- Search inside video, not just video metadata. Knowing a file is named "interview_03.mp4" does not help. Knowing the speaker said "I never agreed to that" at 00:42:11 does.
- Speaker-separated audio analysis. Most legal audio has two or more voices, and the question of who said what is usually the whole point.
- OCR for handwritten and scanned content. A significant share of legal records still arrives this way, especially medical files, court exhibits, and older case material.
- Image and frame-level recognition. Faces, vehicles, license plates, weapons, and other objects relevant to the matter, identified and timestamped automatically.
- Source citation to the exact timestamp, page, or frame. Every answer ties back to a specific second of video, page of a document, or frame of an image. The reviewer can verify in one click instead of reading 40 pages of source material.
The last one matters more than it used to. After 2025's wave of court sanctions for hallucinated citations in legal filings, every credible legal AI platform now claims to ground its answers. The question is what the grounding actually points to. Citing a paragraph in a document is the standard. Citing a specific second of a video, with the clip viewable from the answer, is the multi-modal equivalent.
Do you need multi-modal legal AI, or is document-only enough?
Document-only AI is succeeding at the slice of the work it was built for. The catch is that the slice keeps shrinking as a percentage of what a modern matter actually contains. Lawyers evaluating AI in 2026 should know what their tool covers and what it will leave unindexed.
If your firm is staying with enterprise SaaS for the document work and wants the rest of the case record covered too, request a demo of VIDIZMO Intelligence Hub to see how multi-modal analysis works on a real matter from your practice.
People Also Ask
No. Harvey works on documents and deposition transcripts, not the video files themselves. Per Harvey's litigation product page, deposition workflows operate on transcripts, generating summaries and cross-examination questions from the text. Video analysis is not part of the current product surface.
Legal AI that can read and reason across video, audio, documents, and images as a single integrated record, rather than as separate file types. A multi-modal platform can compare a deposition video to a typed transcript, find a moment in body cam footage that matches a written statement, or process scanned handwritten notes alongside structured documents in one query.
No. Spellbook is a Microsoft Word side-panel built for contract drafting, redlining, and clause benchmarking. It is excellent for transactional document work and was not designed to process audio, video, or scanned image evidence.
Because the evidence in these files often contradicts or extends the written record, and document-only AI cannot see it. Body cam tone, deposition demeanor, off-record statements, and recorded interview content frequently change how a case is argued. Finding those moments manually across hours of footage is the workflow bottleneck multi-modal AI exists to solve.
About the Author
Ali Rind
Ali Rind is a Product Marketing Executive at VIDIZMO, where he focuses on digital evidence management, AI redaction, and enterprise video technology. He closely follows how law enforcement agencies, public safety organizations, and government bodies manage and act on video evidence, translating those insights into clear, practical content. Ali writes across Digital Evidence Management System, Redactor, and Intelligence Hub products, covering everything from compliance challenges to real-world deployment across federal, state, and commercial markets.

No Comments Yet
Let us know what you think