The Detection Score Does Not Prove Anything. Most Institutions Treat It Like It Does.
When an institution receives an AI detection report showing that a student’s paper has an 84% likelihood of having been written in whole or in part by AI, it has to decide what to do. What happens next, and whether the institution takes action at all, depends largely on how it chooses to interpret that number. A review of academic integrity policies at many institutions across the United States shows that most are making the same mistake: they’re treating a probability estimate like a determination.
They’re two very different things. And the differences have significant consequences for students.
What a Detection Score Really Means
Detection tools provide probabilistic estimates. They measure statistical features of text, primarily perplexity and burstiness, and determine how closely those features resemble patterns found in samples of human-written and AI-written content. The result is an estimated probability that answers one question: how closely does the submitted text match the patterns of AI-generated content?
To be clear: the detection score doesn’t prove that a student used AI. It shows how closely a student’s text statistically matches a training distribution. Those are two different claims. One is a finding about what a student did; the other is a measurement of what the text looks like. Only the second is something a detection score can support.
Evidence for this comes from published false positive rates for major detection tools. Those rates vary widely, ranging from 8% to 17%, based on factors such as which tool was used, the style and genre of writing sampled, and the demographics of the student population. Independent validation studies, as opposed to vendor-provided benchmarks, consistently show wider error margins than marketing materials suggest.
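To make those rates concrete, here is a minimal sketch of the arithmetic. The cohort size of 1,000 honest submissions is a hypothetical figure chosen for illustration; the 8% and 17% rates are the ones cited above.

```python
def expected_false_flags(honest_submissions: int, false_positive_rate: float) -> float:
    """Expected number of honest papers a detector would flag as AI-written."""
    return honest_submissions * false_positive_rate


# Hypothetical cohort: 1,000 submissions, all written without AI.
HONEST_SUBMISSIONS = 1_000

low = expected_false_flags(HONEST_SUBMISSIONS, 0.08)   # lower cited rate
high = expected_false_flags(HONEST_SUBMISSIONS, 0.17)  # upper cited rate

print(f"Expected false flags: {low:.0f} to {high:.0f} of {HONEST_SUBMISSIONS} honest papers")
```

Under that assumption, somewhere between 80 and 170 students who did nothing wrong would be flagged.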
How Institutions Are Misusing the Tools
Most institutions deploy detection tools without first developing the procedures for interpreting results that would justify their use. They establish arbitrary thresholds, such as 25% or 50%, and treat any score that exceeds them as presumptive evidence of academic dishonesty. This is a fundamental misapplication. A threshold on a probabilistic output is a reason to investigate, not proof of misconduct.
Acting on a score above the threshold without speaking directly with the student, and without comparing the flagged assignment to other work the student has submitted, doesn’t satisfy any reasonable evidentiary standard for an academic misconduct determination.
When GPTZero or Turnitin’s AI detection feature returns a high score, the appropriate institutional response is to treat it as a reason to look more closely. On its own, the score doesn’t provide sufficient cause for taking action.
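One way to picture that separation is a rough sketch in which the score can open a review but is never an input to the determination itself. Everything here is hypothetical: the 0.50 threshold, the `HumanReview` record, and both function names are illustrative assumptions, not any institution’s or vendor’s actual procedure.

```python
from dataclasses import dataclass

# Hypothetical cutoff, used only to decide whether a human should look closer.
REVIEW_THRESHOLD = 0.50


@dataclass
class HumanReview:
    """Hypothetical record of the human steps taken after a flag."""
    spoke_with_student: bool = False
    compared_with_prior_work: bool = False
    found_supporting_evidence: bool = False


def needs_review(detection_score: float, threshold: float = REVIEW_THRESHOLD) -> bool:
    """A score above the threshold opens a review and nothing more."""
    return detection_score > threshold


def supports_determination(review: HumanReview) -> bool:
    """The detection score is deliberately not a parameter here.

    A determination depends only on what the human review established.
    """
    return (
        review.spoke_with_student
        and review.compared_with_prior_work
        and review.found_supporting_evidence
    )


flagged = needs_review(0.84)                            # True: look more closely
score_alone = supports_determination(HumanReview())     # False: no review done yet
```

The design choice worth noticing is the function signature: because `supports_determination` never sees the score, a high number cannot stand in for the conversation with the student or the comparison with earlier work.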
What’s Required to Support an Academic Misconduct Determination
Academic misconduct decisions can carry serious consequences, including grade penalties, transcript notations, possible expulsion, and in some jurisdictions, permanent records. The evidentiary standard for decisions with those consequences should be correspondingly high.
The real issue isn’t whether a detection tool flagged something. It’s whether, after a thorough human review of all work submitted by the student, there’s evidence supporting a conclusion that the student submitted work that wasn’t their own. Detection output can assist that review. It can’t substitute for it.
A tool that makes the review process more rigorous, rather than simply returning a number, is doing something more useful. Proofademic, for instance, reports at the sentence level, surfacing which specific sentences triggered detection and why. That level of detail lets reviewers examine the flagged sentences individually and compare them with other work the student has submitted. It shifts the process from “the score was high” to “these specific passages have these specific characteristics, and here’s how they compare to the student’s other submissions.”
That doesn’t solve the evidentiary problem entirely, but it’s a far better foundation than relying on a single percentage.
What Institutions Can Do
Document your review processes. If an institution intends to use detection scores as grounds for initiating academic integrity proceedings, the methodology used in evaluating those scores, including what the scores represent, how they’re weighted, and what other evidence is required before reaching conclusions, must be documented and accessible to students and faculty.
Stop treating thresholds as determinations. A 60% AI detection score doesn’t represent a 60% chance of academic dishonesty. It’s a probability estimate produced by a system that wasn’t designed as an adjudicative tool, and it’s often applied under conditions and to student populations that differ significantly from the data it was trained and validated on.
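A short Bayes’ rule sketch shows why a flag and a probability of misconduct are different quantities. It simplifies the detector to a yes/no flag at some threshold. The 10% prevalence of AI-written submissions and the 90% detection rate are assumptions picked purely for illustration; the 8% and 17% false positive rates are the ones cited earlier in this piece.

```python
def probability_ai_given_flag(
    prevalence: float,
    sensitivity: float,
    false_positive_rate: float,
) -> float:
    """Bayes' rule: share of flagged papers that were actually AI-written.

    prevalence          -- assumed fraction of submissions written with AI
    sensitivity         -- assumed fraction of AI-written papers the tool flags
    false_positive_rate -- fraction of human-written papers the tool flags
    """
    true_flags = prevalence * sensitivity
    false_flags = (1 - prevalence) * false_positive_rate
    return true_flags / (true_flags + false_flags)


# Illustrative assumptions, not measured values.
PREVALENCE = 0.10
SENSITIVITY = 0.90

for fpr in (0.08, 0.17):
    posterior = probability_ai_given_flag(PREVALENCE, SENSITIVITY, fpr)
    print(f"False positive rate {fpr:.0%}: P(AI | flagged) = {posterior:.1%}")
```

Under these assumptions, roughly 56% of flagged papers are AI-written at the 8% rate and roughly 37% at the 17% rate, so a substantial share of flags land on students who wrote their own work. Different assumptions give different numbers, which is the point: the answer depends on quantities the score itself doesn’t contain.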
The current wave of detection-first academic integrity policy will generate large numbers of contested cases and formal appeals over the next few years. Institutions that develop rigorous interpretive standards now will be better positioned than those that wait for litigation to force the issue.
A detection score is one tool among many. Institutions that rely on it as their primary tool are placing more weight on it than the evidence supports.

