What AI still gets wrong · Logic of Logic

This site reports on AI every day, which is exactly why this guide exists. Coverage without a failure catalog is advertising. What follows is the honest map: where current models break, why they break there and not elsewhere, and the short list of tasks where the correct amount of AI is still none.

One framing note before the list. These aren’t bugs awaiting a patch. Most trace straight back to what a language model is — a text predictor, as the first guide lays out — so they migrate and shrink across releases, but they don’t vanish. Anyone who tells you a current system “doesn’t hallucinate” is selling something.

Hallucination: the signature failure

A model asked for a fact it doesn’t reliably hold doesn’t say so — it produces the shape of an answer with plausible content inside. A citation to a paper that doesn’t exist, with a realistic title and author list. A court case that was never filed. A statistic with one digit quietly wrong. The prose around the error is impeccable, which is the trap: fluency and accuracy are produced by the same machinery, so the wrongness carries no tells.

Where it bites hardest, in rough order of observed damage:

Citations and quotes. The single most common professional embarrassment. Models reconstruct references from patterns; reconstruction invents. Every quote and citation gets checked against its source, no exceptions.
Numbers and dates. Especially aggregations (“how many X since 2020”) and anything after the model’s training cutoff.
The confident middle. Models are strongest on well-documented common knowledge and surprisingly decent at admitting total ignorance. The danger zone is between: topics documented enough to generate fluent specifics, not enough for the specifics to be right. Niche regulations, small-company details, local rules.
Agreeable error. Push back on a correct answer and the model may fold and “correct” itself into a wrong one. It optimizes for a satisfying conversation, not a won argument. Asking “are you sure?” tests its agreeableness, not the fact.
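The check-every-quote rule can be partly mechanized once the source text is in hand. What follows is a minimal sketch, assuming you have fetched the source yourself and pasted it in as plain text; the function names are illustrative and not taken from any library. It normalizes case, whitespace, and curly quotes, then tests whether the quoted words appear verbatim.

```python
import re
import unicodedata

# Map curly quotes to straight ones so typography does not hide a match.
_QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})


def normalize(text: str) -> str:
    """Lowercase, unify quote marks, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(_QUOTE_MAP)
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears_in_source(quote: str, source_text: str) -> bool:
    """True if the quote occurs verbatim (after normalization) in the source."""
    return normalize(quote) in normalize(source_text)
```

A True result only shows that the words occur in the text you supplied; it says nothing about whether that text is the real source, so opening the original remains part of the job. A False result may be a paraphrase or an extraction quirk, so treat it as a prompt to look, not a verdict.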

Retrieval — wiring the model to search and read sources before answering — converts much of this from remembering to reading and genuinely helps. It also fails in its own way: wrong page fetched, right page misread, fluent summary of an irrelevant document. Source links shift your job from impossible (auditing a model’s memory) to manageable (clicking the link and checking). Click the link.

The quieter limits

Hallucination gets the headlines; these cost more hours in practice:

Long-document blur. Big context windows accept whole contracts; attention across them isn’t uniform. The clause on page 41 can be skimmed past, leaving a “summarize the risks” answer that was really built on pages 1–30. For high-stakes documents, ask section by section.
Arithmetic and logic under the hood. Math routed to a real calculator or code is fine; math done “in the model’s head” is pattern-matching that fails unpredictably just past the familiar. If a tool doesn’t show its work, assume the math was done in-head.
Instruction decay. Ten constraints in, models drop some — usually the one you cared about. Long sessions drift; the fix is restating the brief or starting fresh, not arguing.
Sameness by default. Unprompted output regresses to the statistical mean of the internet: competent, generic, faintly familiar. It’s a floor-raiser, not an edge-giver — your voice and your judgment are exactly the parts it can’t supply (the prompting guide is about supplying them).
Stale world, eager tone. Training cutoffs mean the model’s built-in world is months old, while its tone never is. Anything time-sensitive — prices, laws, product capabilities, people’s job titles — needs live sources, not recall.
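For the long-document case, asking section by section can be as simple as batching pages so that no single question spans the whole contract. This is a rough sketch, assuming the document is already split into a list of page strings; the batch size of five, the prompt wording, and the function names are arbitrary choices for illustration, and no model is called here.

```python
from typing import Iterator


def page_batches(pages: list[str], batch_size: int = 5) -> Iterator[tuple[int, int, str]]:
    """Yield (first_page, last_page, text) so every page is covered exactly once."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]
        # Page numbers are 1-based for the human reading the prompt.
        yield start + 1, start + len(batch), "\n".join(batch)


def build_prompts(pages: list[str], question: str, batch_size: int = 5) -> list[str]:
    """Build one prompt per batch, each naming the page range it covers."""
    return [
        f"Pages {first}-{last} follow. {question}\n\n{text}"
        for first, last, text in page_batches(pages, batch_size)
    ]
```

With a 43-page contract and the default batch size, page 41 lands in its own short batch (pages 41 to 43) instead of at the tail of one long read. Smaller batches cost more calls; that trade is the point for high-stakes documents.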

When not to use it

Capability isn’t the bar; cost of a wrong answer is. A useful rule for one-person and small operations:

Use AI freely where errors are cheap and visible. Add verification where errors are costly. Keep it out entirely where errors are catastrophic, irreversible, or someone else’s to bear.
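The rule reads naturally as a three-way decision. Here is a minimal sketch, with hypothetical labels ("cheap", "costly", "catastrophic") standing in for a judgment only you can make. The rule as stated does not say what to do with errors that are cheap but hard to spot, so sending those to verification is an assumption of this sketch.

```python
from enum import Enum


class Policy(Enum):
    USE_FREELY = "use freely"
    VERIFY = "add verification"
    KEEP_OUT = "keep AI out"


def policy_for(cost: str, visible: bool,
               irreversible: bool = False,
               borne_by_others: bool = False) -> Policy:
    """Map the cost of a wrong answer to a usage policy.

    cost is one of "cheap", "costly", "catastrophic" (hypothetical labels).
    """
    if cost not in {"cheap", "costly", "catastrophic"}:
        raise ValueError(f"unknown cost level: {cost!r}")
    # Any one of these conditions is enough to keep AI out entirely.
    if cost == "catastrophic" or irreversible or borne_by_others:
        return Policy.KEEP_OUT
    # Free use requires errors that are both cheap and easy to spot.
    if cost == "cheap" and visible:
        return Policy.USE_FREELY
    # Costly errors, and (by assumption) cheap but hidden ones, get verified.
    return Policy.VERIFY
```

A brainstorm would come out as `policy_for("cheap", visible=True)`; a customer refund promise would set `borne_by_others=True` and land on KEEP_OUT regardless of the other inputs.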

Concretely, in 2026, still do these by hand or with a licensed human:

Final legal, tax, and medical judgment. Drafting a question list for your lawyer: excellent use. Acting on unverified AI legal advice: how businesses end up as cautionary tales. (Nothing on this site is legal, financial, or medical advice either — that’s the same principle, applied to us.)
Unreviewed customer-facing commitments. A bot empowered to promise refunds, delivery dates, or compliance positions is signing contracts on your behalf with a probabilistic pen. Courts and regulators have already declined to treat “the AI said it” as a defense.
Numbers that move money. Invoices, payroll, tax filings, quotes. AI can draft and cross-check; a human owns the final digits.
Anything you can’t check and can’t afford. The honest catch-all. If you lack the expertise to verify the output and the stakes are real, the model’s confidence is not a substitute for someone who actually knows.

Failure-shaped habits

The limits above compress into four working habits:

Match verification to stakes, not to vibes. Brainstorms ship unchecked; numbers, names, quotes, and claims get sourced. Decide the tier before reading the output — fluency erodes skepticism after.
Prefer reading over remembering. Paste documents, demand sources, use retrieval-backed tools for facts. Then actually open the sources.
Never use the model to verify itself. “Are you sure?” is theater. Verification is a source, a calculator, a test suite, or a human — something outside the prediction loop.
Keep the human where the cost lives. The pattern across every expensive AI failure of the past three years is the same: output flowed to a customer, a court, or a ledger with nobody in between. The fix costs minutes.
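Verification outside the prediction loop can be very small. This sketch covers the calculator case, assuming a model has drafted an invoice and reported a total: the total is recomputed from the line items with exact decimal arithmetic and compared. The line-item format and the function names are assumptions for illustration.

```python
from decimal import Decimal


def recompute_total(line_items: list[tuple[int, str]]) -> Decimal:
    """Sum quantity * unit price; prices are strings to avoid float rounding."""
    return sum(
        (Decimal(qty) * Decimal(price) for qty, price in line_items),
        Decimal("0"),
    )


def total_matches(claimed_total: str, line_items: list[tuple[int, str]]) -> bool:
    """True only if the model's claimed total equals the recomputed one exactly."""
    return Decimal(claimed_total) == recompute_total(line_items)


# Hypothetical invoice: (quantity, unit price)
items = [(3, "19.99"), (2, "4.50"), (1, "120.00")]
print(recompute_total(items))          # the figure a human signs off on
print(total_matches("189.97", items))  # a claimed total with one digit wrong
```

The comparison never asks the model whether it is sure. A mismatch goes to a person; a match still leaves the line items themselves to be checked against the order.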

How to read accuracy claims

Vendor pages and headlines will quote numbers at you — “95% accurate,” “passes the bar exam,” “PhD-level reasoning.” Three questions defuse most of them:

Measured on what? Benchmark tasks are clean, self-contained, and public — which means models may have effectively seen them in training. Your invoices, your contracts, and your customers are none of those things. Benchmark-to-desk slippage is the rule, not the exception.
What does the error rate mean at your volume? “95% accurate” is another way of saying one error every twenty runs, on average. On a workflow you run fifty times a week, that’s two or three errors a week: fine with review in the loop, corrosive without it.
Which way do the errors fall? A tool that errs by flagging too much for human review is safe to adopt; one that errs by confidently completing is not — at the same headline accuracy. The distribution of failures matters more than their count, and vendors rarely volunteer it. Ask, then pilot with your own material and count for yourself — the tool-choosing guide is the procedure.
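The volume question is plain arithmetic, and it is worth running with your own numbers. This is a small sketch, assuming each run fails independently with the same probability, which real workloads only approximate; the function names are illustrative.

```python
def expected_errors(accuracy: float, runs: int) -> float:
    """Average number of wrong outputs across the given number of runs."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * runs


def chance_of_at_least_one_error(accuracy: float, runs: int) -> float:
    """Probability that at least one run is wrong, assuming independent runs."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return 1.0 - accuracy ** runs


weekly_runs = 50
print(round(expected_errors(0.95, weekly_runs), 2))
print(round(chance_of_at_least_one_error(0.95, weekly_runs), 3))
```

At 95% accuracy and fifty runs a week this gives about 2.5 expected errors a week and roughly a 92% chance that any given week contains at least one. Neither figure says which way the errors fall; that still has to come from a pilot on your own material.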

Why an AI-news site tells you this

Because the honest version is the useful version. Models in 2026 are genuinely capable — this site is drafted with their help, reviewed by a human, every day — and the businesses getting real leverage are precisely the ones that know where the floor creaks. The failure catalog isn’t an argument against the tools. It’s the user manual the marketing leaves out, and when the ground shifts — when a limit on this page genuinely falls — the daily briefings here will report it, with sources.