Guide

Is Claude Code reliable for law firms?

A practical guide for legal operations teams: what Claude Code does brilliantly, where its reliability breaks down for back-office work, and how to think about automating the firm without getting burned.

The Caddi Team • June 10, 2026

AI has arrived in the law firm, and it's no longer just about drafting. Operations and administrative teams, the people who run intake, filing, billing, and the inbox, are being asked whether tools like Claude Code can take repetitive work off their plates. It's a fair question, and the honest answer has two halves.

Claude Code is genuinely impressive. For research, drafting, summarizing, and technical tasks where a person reviews the result, it's a real productivity boost. But "reliable enough for a person to lean on" and "reliable enough to run your operations unattended" are different bars. This guide is written for the second one, and for the ops professional who wants to use AI well without putting the firm at risk.

Where law firms are trying to use Claude Code in operations

Before the frustrations, it helps to name the work. Most firms aren't trying to get Claude Code to practice law; they're trying to win back the hours their teams spend on repeatable, system-to-system tasks. The common candidates:

New-matter and client intake. Reading intake forms, running conflicts, opening the matter, and setting up the document workspace.
Document filing and naming. Profiling, naming, and filing documents and email into iManage or NetDocuments by the firm's conventions.
Time and billing prep. Assembling prebills, chasing missing time, and reconciling against the practice-management system.
Inbox triage. Sorting a shared inbox (new matters, client questions, court notices) and routing each message to the right place.
Deadlines and docketing. Pulling dates out of emails and orders and getting them onto the right calendar.

The reliability frustrations ops teams run into with Claude Code

If you've piloted Claude Code on this kind of work, the following will feel familiar. None of them means you're using it wrong; they're properties of the tool. Here's each one, and what's actually going on.

Claude Code isn't following your instructions

You give clear, numbered instructions and it honors most of them, then quietly drops one. That's because instructions are a prompt the model interprets and weighs against everything else in context, not a rule it's bound to execute. The longer and more detailed your instructions, the more likely a step gets dropped.

Claude Code keeps changing the format

The same task comes back formatted differently each time: a date as 2026-06-10 on one run and June 10, 2026 on the next, columns in a new order, a file named a new way. The model regenerates the output rather than filling a fixed template, so naming and structure drift even when the content is right. For a firm with a strict filing convention, that drift is a real problem.
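
For contrast, here is a minimal sketch of what "filling a fixed template" can look like in code. The naming convention, the helper names `normalize_date` and `build_filename`, the matter number, and the two accepted date formats are all assumptions made up for illustration; they are not any firm's or product's actual convention.

```python
from datetime import datetime

# Hypothetical list of formats the incoming dates might use (illustration only).
ACCEPTED_FORMATS = ("%Y-%m-%d", "%B %d, %Y")


def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise if no known format matches."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def build_filename(matter_id: str, doc_type: str, raw_date: str) -> str:
    """Fill a fixed template; the structure is the same on every run."""
    return f"{matter_id}_{doc_type}_{normalize_date(raw_date)}.pdf"


# Two spellings of the same date produce the same file name.
print(build_filename("M-1042", "engagement-letter", "June 10, 2026"))
print(build_filename("M-1042", "engagement-letter", "2026-06-10"))
```

Because the template is code rather than a prompt, a date in an unexpected format raises an error instead of silently producing a differently shaped name.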

Claude Code gives inconsistent results

Run the same task twice and you can get two different answers. This is inherent to how a large language model works: it samples a likely output rather than executing a fixed procedure, and small differences in context nudge the result. Inconsistency is fine when you're exploring; it's a blocker when the work has to come out the same way every time.

Claude Code produces different output every time

A closely related problem: even when the answer is correct, its shape (wording, ordering, structure) changes. Anything downstream that expects a fixed shape, such as a spreadsheet, a filing system, or a report template, breaks when the shape moves.

Claude Code won't follow your rules

You write the rules down, whether in the prompt or in a rules file, and it still goes off-script. A rules file is context the model reads and weighs, not an enforcement layer. When two rules seem to conflict, or a rule conflicts with the task, the model silently picks one, and you don't get to control which.
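
To show what an enforcement layer means in practice, here is a small sketch of a rule written as code rather than as guidance. The file-naming pattern and the function name `enforce_filename` are hypothetical, invented for this example.

```python
import re

# Hypothetical convention: <matter id>_<doc type>_<ISO date>.pdf
FILENAME_RULE = re.compile(r"[A-Z]-\d{4}_[a-z-]+_\d{4}-\d{2}-\d{2}\.pdf")


def enforce_filename(candidate: str) -> str:
    """Accept the name only if it matches the rule exactly; otherwise reject."""
    if FILENAME_RULE.fullmatch(candidate) is None:
        raise ValueError(f"Rejected, violates naming rule: {candidate!r}")
    return candidate


# A conforming name passes through unchanged.
print(enforce_filename("M-1042_engagement-letter_2026-06-10.pdf"))

# A non-conforming name is stopped instead of being filed.
try:
    enforce_filename("Engagement Letter June 10.pdf")
except ValueError as exc:
    print(exc)
```

A check like this cannot be reinterpreted: a name either matches or is rejected. Note that this pattern only checks the shape of the name; it does not confirm that the date is a real calendar date or that the matter exists.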

Claude Code makes mistakes

It generates plausible output, and plausible isn't the same as correct. Most of the time it's right; occasionally it's confidently wrong in a way that's easy to miss, and nothing flags the difference. On client-facing or filed work, those misses ship unless a person checks every run.

Claude Code hallucinations

Sometimes the most plausible-looking output is invented: a citation, a docket number, or a party name that reads as normal but isn't real. Hallucinations can be reduced but not eliminated, which is exactly why hallucination-sensitive legal work shouldn't depend on a model generating the answer on every run.
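
One common safeguard is to look a value up in a system of record instead of trusting generated text. The sketch below assumes a hypothetical set of known docket numbers; `KNOWN_DOCKETS`, `verify_docket`, and the numbers themselves are invented for illustration, and a real check would query the firm's docketing system.

```python
# Hypothetical stand-in for a docketing system (made-up numbers).
KNOWN_DOCKETS = {"1:24-cv-01234", "2:25-cv-00087"}


def verify_docket(number: str) -> bool:
    """Return True only if the docket number exists in the system of record."""
    return number.strip() in KNOWN_DOCKETS


print(verify_docket("1:24-cv-01234"))  # present in the set
print(verify_docket("1:24-cv-09999"))  # looks plausible, but is not in the set
```

The second number has exactly the right shape, which is what makes an invented value easy to miss by eye; only the lookup tells the two apart.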

Why this happens: Claude Code is non-deterministic by design

Every frustration above traces back to one root cause. Claude Code is built on a large language model, and language models are non-deterministic: they generate output by predicting what's most likely and sampling from a range of possibilities. The same input can produce different output on different runs. That's not a defect; it's the property that makes them creative, flexible, and good at language.
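
The difference between sampling and a fixed procedure can be shown with a toy example. The candidate strings and probabilities below are invented, and a real model samples token by token over a far larger space, so treat this as an illustration of the idea rather than a description of how Claude Code works internally.

```python
import random

# Made-up probabilities for how a model might format one date (illustration only).
candidates = ["2026-06-10", "June 10, 2026", "10 June 2026"]
weights = [0.6, 0.3, 0.1]


def sample_output(rng: random.Random) -> str:
    """Draw one candidate at random, weighted by its probability."""
    return rng.choices(candidates, weights=weights, k=1)[0]


def fixed_output() -> str:
    """A fixed procedure: always return the same pre-decided format."""
    return candidates[0]


rng = random.Random()  # unseeded, so the sampled list can differ between runs
print([sample_output(rng) for _ in range(5)])
print([fixed_output() for _ in range(5)])
```

The first list can change each time the script runs; the second never does. Even the most likely format is not the one you get on every draw.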

It's also why they're a strength for drafting and analysis and a liability for unattended operations. A person reviewing each result absorbs the variability. An automation running hundreds of times a week with no one watching cannot, and in a regulated environment, "mostly right" is not a standard you can attest to. The reliability problem isn't a prompt you haven't found yet; it's structural.

Dimension	Claude Code	Caddi
What runs in production	A model generates fresh output on every run	Deterministic code, generated once at setup
Same input, same output?	Not guaranteed; output can vary run to run	Yes; identical inputs yield identical results
Following your rules	Instructions are a prompt the model may reinterpret	Rules are compiled into the workflow, not re-read each time
Auditability	Hard to prove what happened or why	Full run-by-run audit trail (SOC 2)
Who maintains it	You re-prompt and babysit it	Built and maintained for you

Claude Code vs. Caddi on the dimensions that matter for unattended, regulated operational work.

Is Claude Code too complicated for a non-technical ops team?

There's a second, quieter barrier. Claude Code is a developer tool: it lives in a terminal, expects you to write and refine prompts, and leaves you to maintain whatever you build. For an engineer that's natural; for the paralegal, billing coordinator, or operations manager who actually runs the workflow, it's a steep climb.

That matters because the people closest to the work are usually the ones who can't build for it, so the automation either never happens or lands on one overloaded technical person. The goal for most ops teams isn't to become prompt engineers; it's to stop doing repetitive work. A tool that requires the former to get the latter is a hard sell.

Can you trust Claude Code with client and matter data?

For a law firm the bar isn't just "is the model secure"; it's whether you can scope and control access, prove exactly what happened with a client's data on every run, and get the same handling every time. Prompt-driven, run-to-run-variable execution is hard to audit and hard to constrain, which is the opposite of what your malpractice carrier and your clients expect.

The danger is rarely a dramatic breach. It's the quiet failure: a hallucinated value in a record, inconsistent handling that surfaces in an audit, or credentials embedded in an ad-hoc script that no one can fully account for. Those are the failures that draw findings and erode client trust, so the operating model has to make them structurally hard.

Why firms give up, and what reliable automation actually looks like

This is why so many firms prototype something promising in Claude Code and then quietly abandon it. The demo works; production doesn't. Connecting the DMS, the practice-management system, and the inbox is real engineering, reliability takes constant prompt-tuning, and every tool change breaks something the firm then has to fix. Giving up isn't a failure of effort; it's a sign that the tool is wrong for the last, hardest 80% of the job.

The fix is to change the order of operations. Caddi uses AI once, at setup, to understand a workflow you demonstrate over a screen share, the way you'd train a new hire. From then on it runs that workflow as deterministic code over real connections: the same inputs produce the same outputs every time, every run is audit-logged, and genuine exceptions are routed to a person. The reasoning happens at design time, where variability is fine; production is just code, where it isn't. No terminal, no prompts to maintain, and Caddi keeps the automation working as your tools change.
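
As a generic illustration of that pattern (and not Caddi's actual implementation), the sketch below runs one fixed routing rule, records every run, and hands anything it cannot route to a person. The `route_notice` rule, the field names, and the in-memory `audit_log` and `review_queue` lists are all hypothetical stand-ins.

```python
import json
from datetime import datetime, timezone
from typing import Optional

audit_log: list = []     # stand-in for durable audit storage
review_queue: list = []  # stand-in for a human review inbox


def route_notice(notice: dict) -> str:
    """Fixed rule: the same notice fields always produce the same route."""
    if notice.get("type") == "court_notice" and notice.get("matter_id"):
        return "docketing"
    raise ValueError("cannot route: missing type or matter_id")


def run(notice: dict) -> Optional[str]:
    """Apply the rule, record the outcome, and pass exceptions to a person."""
    entry = {"at": datetime.now(timezone.utc).isoformat(), "input": notice}
    try:
        entry["result"] = route_notice(notice)
    except ValueError as exc:
        entry["result"] = None
        entry["exception"] = str(exc)
        review_queue.append(notice)  # a person decides what happens next
    audit_log.append(json.dumps(entry, sort_keys=True))
    return entry["result"]


print(run({"type": "court_notice", "matter_id": "M-1042"}))
print(run({"type": "court_notice"}))  # no matter id, so it goes to review
print(len(audit_log), len(review_queue))
```

Every run leaves a log entry whether it succeeds or not, and the rule never guesses: an input it cannot handle is escalated rather than filled in.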

Caddi turns your screenshares into AI automations: show it the workflow once, and it runs as deterministic code across your tools, maintained for you.

Keep Claude Code for research, drafting, and supervised technical work; it's excellent there. The repetitive, rule-bound back-office workflows you want to run unattended and prove later are a job for deterministic automation, and that is exactly what Caddi is built for.

See deterministic automation in action

Caddi builds reliable automations from a screen recording and runs them across 70+ tools as deterministic code. Explore real workflows for law firms and RIAs & financial advisors, or book a demo to see your own workflow built live.

Frequently asked questions

Is Claude Code reliable enough for law firm operations?

For supervised work like research and drafting, yes. For unattended, rule-bound back-office workflows it's limited by non-determinism: the same input can yield different output from run to run, so it's hard to run without a person checking every result.

Why does Claude Code give inconsistent results?

Because it's a large language model that generates output probabilistically rather than executing a fixed procedure. The same task can produce different results on different runs.

Why won't Claude Code follow my firm's rules or naming conventions?

Rules and conventions in a prompt or rules file are guidance the model weighs, not an enforcement layer it must obey. They improve the odds but don't guarantee compliance, especially with many rules or long context.

Can I trust Claude Code with client and matter data?

For unattended work the concerns are auditability, scoped access, and consistent handling, areas where prompt-driven execution is weak. Deterministic automation with scoped access and a full audit trail (like Caddi, which is SOC 2 attested) is a stronger fit for client data.

Do I need to be technical to use Claude Code?

Effectively, yes: it's a developer tool driven from a terminal, and it requires prompt writing and ongoing maintenance. Non-technical ops teams typically get more from a done-for-you platform like Caddi, where you demonstrate the workflow and it's built and maintained for you.

What should a law firm use for reliable back-office automation?

Caddi. It captures a workflow once and runs it as deterministic code across your DMS, practice-management system, and inbox: identical every run, audit-logged, and maintained for you.