As someone knee-deep in building a company that uses AI for legal research, I’ve watched the recent “hallucination leaderboard” with fascination and dread. This isn’t an official contest anyone wants to win. It’s a running tally of artificial intelligence screw-ups: instances where chatbots invented fake facts or citations that ended up in real court filings.
Every time another lawyer gets burned by a chatbot’s fantasy case law, I shake my head. It’s wild that in 2025, we have a scoreboard tracking which AI models lead in making stuff up. And honestly, I’m pretty proud (and relieved) that Caseway isn’t anywhere on that list. This is one list that we don’t want to be on.
So what does this “leaderboard” show? In plain terms, it’s a database of instances where generative AI went off-script in legal documents. Think fake case citations, imaginary statutes, and other “facts” that don’t exist. It has logged a hundred-plus cases worldwide so far, and the count keeps climbing each month.
A massive chunk of these incidents comes from U.S. courts (no surprise, given how popular artificial intelligence is in the United States), but it’s not just an American problem. There are examples from Canada, Europe, Israel, Australia, and South Africa. Bad legal AI is a global problem, and it’s one of the main reasons we started Caseway.
When it comes to which artificial intelligence is causing the trouble, the leaderboard has some predictable all-stars (or rather, bad actors). Here they are!
OpenAI’s famous chatbot dominates the list. This makes sense because everyone and their cousin has tried using ChatGPT for quick legal research. If a platform is super widespread, it’s statistically going to cause the most havoc. And oh boy, has it caused havoc. Dozens of lawyers and even pro se litigants have cited “cases” ChatGPT invented out of thin air. Why do people use ChatGPT for legal research?? It’s insane.
It’s not only ChatGPT messing up, though. Anthropic’s Claude has a few entries on the list, including an ironic case where a law firm used Claude while defending Anthropic itself, but more on that in a sec. There’s also at least one instance of someone using Google’s Bard for legal research, with equally cringeworthy results.
In that Bard case, the user innocently thought Bard was like a “super-charged search engine.” They didn’t realize it would confidently output fake case law. Big oops. Even niche legal-specific software has made appearances. For example, one law firm’s team used a combination of Westlaw’s AI research platform and CoCounsel (an AI from Casetext) to draft a brief, which resulted in a medley of hallucinated citations and a very unhappy judge. In short, even models tailored for law can slip up.
Open-ended chatbots lead the hall of shame, but even more “serious” legal platforms and integrations have generated nonsense when pressured. The trend also seems to be accelerating: the more lawyers try these tools, the more bloopers get recorded. Over twenty new cases popped up in the database in one recent month, most from United States courts. It’s both scary and somewhat expected. Greater adoption of AI means more chances to foul up.
It’s worth unpacking why these large language models hallucinate in the first place. It’s something we grapple with daily at Caseway. The simple explanation is that these AIs don’t actually “know” facts the way a database does. They generate words based on patterns.
If you ask for a legal case on a specific obscure point and nothing relevant exists in its training data, a vanilla LLM will still produce an answer to avoid disappointing you. It might mash up legal-sounding jargon and names, essentially bullshitting in a very confident manner. The artificial intelligence isn’t lying in the human sense. It’s doing what it was built to do: predict a plausible sequence of words. Unfortunately, “plausible” isn’t good enough when a lawyer or self-represented person needs actual precedent or a judge needs real court decisions.
Many of us have noticed a trade-off in practice. The more an AI tries to be creative and helpful, the greater the risk of making something up if it doesn’t honestly know the answer. Conversely, if you make an AI super cautious (always refusing to answer anything it’s unsure about), it becomes less useful and sometimes frustrating. Finding that balance is tough.
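To make that concrete, here’s a toy illustration of the mechanism. This is not any real model, just the sampling idea with a made-up probability table: the generator only ever scores candidate next words by how plausible they sound, and turning up the “temperature” for creativity makes unlikely continuations more likely. Nothing in the loop ever asks whether the resulting case name is real.

```python
import random

# Toy next-word sampler. The "model" here is a made-up probability table;
# the point is that generation only ever asks "what sounds likely next?",
# never "does this citation actually exist?".
NEXT_WORD_PROBS = {
    ("the", "court"): [("held", 0.6), ("found", 0.3), ("ruled", 0.1)],
    ("court", "held"): [("in", 0.7), ("that", 0.3)],
    ("held", "in"): [("Smith", 0.5), ("Jones", 0.5)],  # plausible, possibly fictional
    ("in", "Smith"): [("v.", 1.0)],
    ("in", "Jones"): [("v.", 1.0)],
}

def sample_next(context: tuple, temperature: float = 1.0) -> str:
    """Pick the next word by plausibility alone. Higher temperature flattens
    the distribution, which is the 'more creative, more risky' knob."""
    candidates = NEXT_WORD_PROBS.get(context, [("...", 1.0)])
    words = [w for w, _ in candidates]
    weights = [p ** (1.0 / temperature) for _, p in candidates]
    return random.choices(words, weights=weights, k=1)[0]

text = ["the", "court"]
for _ in range(4):
    text.append(sample_next((text[-2], text[-1])))
print(" ".join(text))  # e.g. "the court held in Smith v. ..."
```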
Many LLMs draw on the entire internet, which mixes reliable and unreliable sources. With Caseway, we pull only from court decisions.
Some companies attempt hacks like hooking the artificial intelligence up to a database or search engine so it has real references to draw from. Retrieval-augmented generation (RAG) is the buzzword. That can reduce hallucinations, but it’s not foolproof, and it often makes the AI slower or more cumbersome to use. In my experience, every solution to tame hallucinations comes with a cost in either complexity or user experience.
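For the technically curious, here’s roughly the shape of a RAG pipeline. This is a minimal sketch under my own simplifying assumptions, not Caseway’s (or anyone else’s) production code: the tiny in-memory corpus, the crude keyword scoring, and the stubbed model call are all placeholders for illustration.

```python
# Minimal retrieval-augmented generation (RAG) sketch. Everything here is a
# stand-in: the tiny in-memory "corpus", the crude keyword scoring, and the
# stubbed model call. A real system would search an indexed library of
# verified court decisions and call an actual LLM.
CORPUS = [
    {"citation": "Example v. Sample, 123 F.3d 456 (2001)",
     "excerpt": "A contract requires offer, acceptance, and consideration."},
    {"citation": "Demo v. Test, 789 F.2d 101 (1995)",
     "excerpt": "Silence is not acceptance absent a duty to speak."},
]

def retrieve(question: str, limit: int = 3) -> list:
    """Rank documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(doc["excerpt"].lower().split())), doc)
              for doc in CORPUS]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0][:limit]

def model_call(prompt: str) -> str:
    """Stand-in for whatever LLM you use, stubbed so the sketch runs."""
    return "(model response grounded in the prompt above)"

def answer(question: str) -> str:
    docs = retrieve(question)
    if not docs:
        # Refusing beats improvising a citation that doesn't exist.
        return "I couldn't find a supporting decision for that question."
    context = "\n".join(f"[{d['citation']}] {d['excerpt']}" for d in docs)
    prompt = ("Answer using ONLY the excerpts below, citing the bracketed "
              "citation for every claim. If they don't answer the question, "
              f"say you don't know.\n\n{context}\n\nQuestion: {question}")
    return model_call(prompt)

print(answer("What makes a contract valid?"))
```

Even with that structure, the model can still paraphrase an excerpt badly or blend two cases together, which is part of why RAG helps but isn’t a silver bullet.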
The abstract idea of “AI hallucination” became a lot more real when it started spilling into actual court filings. One high-profile example that still blows my mind is the New York “ChatGPT lawyer” fiasco from 2023. If you missed it, here’s the lowdown. A lawyer had a federal case (Mata v. Avianca) and, short on time, let ChatGPT help with legal research. ChatGPT confidently supplied a list of court decisions that perfectly supported his arguments. The problem was that none of those cases were real.
They were literally figments of the AI’s imagination, complete with fake quotes and citations. The lawyer, who apparently didn’t know ChatGPT has this hallucination quirk, submitted those bogus cases in a brief. You can guess what happened next. The opposing counsel and the judge quickly figured out the citations were bunk. It was a considerable embarrassment.
The judge was livid. There was a hearing where the poor attorney had to admit a chatbot had duped him. He (and another lawyer on the team) got sanctioned and fined and had to notify all the real judges whose names were misused in the fake citations. That case became the poster child for AI gone wrong in law.
Fast-forward to 2025, and we have more war stories. Remember when I mentioned Anthropic’s Claude showing up on the leaderboard? In a recent twist, a team at a major law firm used Claude to draft an expert report in a lawsuit against Anthropic (talk about meta). Claude messed up the titles of some papers and made other minor factual errors in the report.
The mistakes were enough that the opposing side moved to throw out the expert’s testimony entirely. The judge hasn’t ruled yet, but it’s a bad look. An AI company’s AI software undermined its legal defence. This stuff would be comedy gold if real people’s jobs and reputations weren’t on the line.
From scouring the hallucination database, I’ve seen outcomes ranging from mild to severe for these incidents. A few lucky (or charming) lawyers got off with a stern warning from the judge. Many weren’t so fortunate. Courts have issued fines from a few hundred bucks up to tens of thousands.
In some cases, attorneys had to pay the other side’s legal fees for wasting everyone’s time. Judges have ordered mandatory continuing legal education classes on ethics and AI for the offenders. In one extreme scenario, a judge even contemplated referring a repeat offender to the bar for possible disbarment. In other words, misusing a chatbot can end your career real quick. It’s serious business.
Working on Caseway, I’ve been hyper-aware of all these failure cases. It’s part of why we built Caseway the way we did. For those unfamiliar, Caseway is a platform that specializes in legal research. But unlike a general-purpose model, it’s laser-focused on real court decisions.
We built it (and continue to fine-tune it) on a verified corpus of legal texts. If you ask Caseway a question about case law, it’s not winging it from a shallow internet training set. It’s retrieving answers from actual reported cases and statutes. The idea is to massively reduce the odds of a hallucination. Simply put, the artificial intelligence can’t easily spit out a case that doesn’t exist, because it’s drawing from a library of existing cases.
Now, I won’t claim Caseway is magically incapable of ever hallucinating. Any generative model can go off the rails under the right (or wrong) conditions. But so far, so good. The fact that Caseway hasn’t shown up in any courtroom horror story is a point of pride for our team. We’ve consciously decided to sacrifice a bit of the “anything goes” flexibility that something like ChatGPT has.
If Caseway doesn’t have a source-backed answer, it’s designed to admit it rather than fabricate. In day-to-day use, that might mean sometimes it gives a shorter answer or says it can’t find something. We’re okay with that. I’ll take a slightly less verbose AI if it means I’m not unknowingly citing Bob’s made-up Supreme Court ruling of 1857.
Building AI that won’t hallucinate so easily is challenging. It involves a lot of fine-tuning, guardrails, and yes, sometimes a manual review of outputs. It’s not as flashy as letting the AI just dream up whatever, but it’s worth it. Whenever I see a new entry on the hallucination leaderboard (assuming it continues not to be us), it reinforces our approach.
It’s like a running cautionary tale that keeps us honest. We’ve also learned from those cases, making sure Caseway’s answers reference real cases (so users can double-check instantly), and encouraging a culture of verification. AI is a tool, not a cheat code to skip doing your homework, and we want our users to remember that.
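That “double-check instantly” idea can even be partly automated. Here’s a rough sketch of a post-generation citation check, again under my own simplifying assumptions (the regex and the tiny lookup set are placeholders, not our production code): pull anything shaped like a reporter citation out of a draft and flag whatever can’t be matched to a case that’s actually on file.

```python
import re

# Rough post-generation citation check: extract anything shaped like a
# reporter citation and flag whatever we can't match to a case on file.
# The regex and the lookup set are deliberately simplified placeholders.
KNOWN_CITATIONS = {"123 F.3d 456", "789 F.2d 101"}

# Matches "volume Reporter page", e.g. "123 F.3d 456" or "999 U.S. 999".
CITATION_PATTERN = re.compile(r"\b\d+\s+[A-Z][A-Za-z.0-9]*\s+\d+\b")

def unverified_citations(draft: str) -> list:
    """Return citations in the draft that don't match any known case."""
    found = CITATION_PATTERN.findall(draft)
    return [c for c in found if c not in KNOWN_CITATIONS]

draft = "See 123 F.3d 456 and 999 U.S. 999 for support."
suspect = unverified_citations(draft)
if suspect:
    print("Could not verify:", suspect)  # -> Could not verify: ['999 U.S. 999']
```

A check like this won’t catch a real citation attached to the wrong proposition, which is why a culture of human verification still matters.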
In a high-stakes field like law, trust is everything. The lawyers who use our software must trust that it isn’t leading them astray. I find it satisfying that while the big generic models are up there on the hallucination hall of shame, our more targeted approach means Caseway isn’t. Frankly, not being famous for a disaster is the kind of low-key status I’m fine with! It might sound odd to brag about not being on a list, but it is a badge of honour in this context.
The hallucination epidemic in AI is real, and the legal industry is learning its lessons the hard way. As one of the folks building these software platforms, I’m convinced the solution isn’t to throw our hands up. It’s to build smarter tools and be more responsible with them.
Seeing others make headlines for AI failures has been both a warning and a motivator. Caseway’s absence from the hallucination leaderboard shows that, with the right approach, we can have AI we trust and stay off that wall of shame. And believe me, we intend to keep it that way.