I Ran 50 Student Submissions Through 6 AI Detection Tools

Jun 01, 2026

Over the past two months, I ran 50 student submissions (a mix of undergraduate essay assignments collected from faculty colleagues at three different institutions) through six AI detection tools. The submissions included work I knew to be fully human-written, work I knew to be AI-generated, and a range of submissions with unclear provenance. I’m publishing the findings here rather than in a journal because this field moves faster than a 14-month peer review cycle allows.

I want to be upfront about what this is and isn’t. This is a structured practical test, not a controlled study. I don’t have ground truth on every submission, sample sizes are insufficient for statistical significance at a confidence level I’d publish academically, and I haven’t controlled for text genre, discipline, or writer demographics in ways that would satisfy a reviewer. I’ve written about these limitations in more rigorous contexts. The purpose here is to give practitioners — faculty, institutional coordinators, integrity officers — a working sense of how these tools actually behave on real academic writing in 2025.

The short answer, for anyone who wants it before the details: no tool performed consistently well across all submission types. False positive rates varied substantially. The tools that provided granular, sentence-level output were more useful for investigation purposes than the tools that returned a single confidence score. Proofademic’s sentence-level detection model gave reviewers the most actionable information, though it carries its own interpretive limitations.

Methodology and Tool Selection

The six tools I tested were Turnitin AI Writing Detection, GPTZero, Originality.ai, Copyleaks, ZeroGPT, and Proofademic. These represent the tools most commonly referenced in institutional policy discussions I’ve participated in over the past year, plus two that have been specifically marketed for academic use.

I submitted the same 50 documents to each tool. For submissions where I had confirmed authorship information from the collecting faculty member, I tracked the results against known ground truth. For submissions with unclear provenance, I noted the scores but didn’t treat them as validation data.

The faculty colleagues who provided submissions were told only that I was conducting detection tool research. They confirmed which submissions they considered likely AI-generated based on their own assessment and familiarity with the student’s writing history. This is imperfect ground truth, but it’s the kind of ground truth that institutional decision-makers are actually working with — and that fact alone is worth sitting with for a moment.

What the Results Showed

I want to be precise about this, because the findings cut against some commonly repeated assumptions.

On the confirmed human-written submissions (18 of the 50), false positive rates varied widely. GPTZero flagged 3 of 18 as “likely AI” (17%). Turnitin AI Writing Detection flagged 2 of 18 as high-confidence AI (11%). Originality.ai flagged 4 of 18 at scores above 50% (22%). Proofademic returned 1 of 18 above the flagging threshold (6%). ZeroGPT flagged 5 of 18 (28%).

These numbers matter because each flagged submission belongs to a student who may face academic integrity proceedings based on tool output. False positive rates at this scale are not a theoretical concern. They’re a practical reality that institutions are not adequately accounting for in their policies.
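For readers who want to check the arithmetic, here is a minimal Python sketch that recomputes the false positive rates from the counts above and attaches a 95% Wilson score interval to each. The Wilson interval is an assumption chosen for illustration; none of the tools report it, and the counts are simply the five figures quoted in this section.

```python
import math

# False positive counts on the 18 confirmed human-written submissions,
# copied from the figures quoted in this section.
HUMAN_WRITTEN = 18
false_positives = {
    "GPTZero": 3,
    "Turnitin": 2,
    "Originality.ai": 4,
    "Proofademic": 1,
    "ZeroGPT": 5,
}


def wilson_interval(flagged, total, z=1.96):
    """Return the (low, high) Wilson score interval for flagged/total.

    z=1.96 corresponds to an approximate 95% interval.
    """
    p = flagged / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def summarize(counts, total):
    """Map each tool to (observed rate, interval low, interval high)."""
    rows = {}
    for tool, flagged in counts.items():
        low, high = wilson_interval(flagged, total)
        rows[tool] = (flagged / total, low, high)
    return rows


if __name__ == "__main__":
    for tool, (rate, low, high) in summarize(false_positives, HUMAN_WRITTEN).items():
        print(f"{tool:15s} {rate:5.0%}  (95% interval {low:.0%} to {high:.0%})")
```

With only 18 documents the intervals are wide: the interval for 1 of 18 overlaps the interval for 5 of 18, which is the practical meaning of the sample-size caveat in the introduction.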

On confirmed AI-generated submissions — 14 of the 50 — detection rates were better across the board, as expected. Tools performed in the 85-100% range for straightforward, unmodified AI output. The more interesting cases were the 8 submissions I could confirm had been processed through paraphrasing tools before submission. Across the six tools, detection rates for paraphrased AI content ranged from 38% (ZeroGPT) to 71% (Proofademic). Proofademic’s Paraphrase Shield explicitly addresses this case, and the performance differential was the largest I observed in the entire dataset.

Why the Output Format Matters as Much as the Score

This is the finding I want practitioners to take most seriously: the format in which a tool presents its output matters enormously for how it gets used.

A single percentage score — “74% AI-generated” — tells a faculty member that something is suspicious. It doesn’t tell them where the suspicious language is, why it was flagged, or what distinguishes the flagged text from the rest of the document. In practice, I watched faculty colleagues respond to high aggregate scores by escalating directly to integrity proceedings without reviewing which specific passages triggered the flag. That’s a process failure enabled by the output format.

Proofademic’s sentence-level detection model presents each sentence with an individual AI probability score and a brief explanation of why it was flagged. This changes the review process in a meaningful way. A faculty member can look at the flagged sentences in context, compare them to surrounding text, and make a judgment call about whether the pattern is consistent with how the student normally writes. That’s what an investigation should look like. A single score doesn’t support that process.
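As a rough illustration of that review workflow, the Python sketch below pulls out flagged sentences together with their neighbours so a reviewer reads them in context. The `SentenceScore` fields, the 0.8 threshold, and the `flagged_with_context` helper are all hypothetical; they are not Proofademic’s actual export format or API.

```python
from dataclasses import dataclass


@dataclass
class SentenceScore:
    """One sentence of tool output (hypothetical structure)."""
    index: int             # position of the sentence in the document, from 0
    text: str
    ai_probability: float  # assumed per-sentence score between 0.0 and 1.0
    reason: str            # assumed short explanation attached by the tool


def flagged_with_context(sentences, threshold=0.8, window=1):
    """Return (flagged sentence, surrounding sentences) pairs.

    Assumes `sentences` is ordered so that each item's `index` equals its
    position in the list. The threshold is an illustrative value only.
    """
    results = []
    for s in sentences:
        if s.ai_probability >= threshold:
            start = max(0, s.index - window)
            end = min(len(sentences), s.index + window + 1)
            results.append((s, sentences[start:end]))
    return results


def print_review_sheet(sentences, threshold=0.8):
    """Print each flagged sentence with its neighbours for human review."""
    for flagged, context in flagged_with_context(sentences, threshold):
        print(f"Flagged sentence {flagged.index} "
              f"(score {flagged.ai_probability:.2f}): {flagged.reason}")
        for s in context:
            marker = ">>" if s.index == flagged.index else "  "
            print(f"{marker} {s.text}")
        print()
```

The output is a reading aid, not a verdict: it puts the flagged passage next to the student’s surrounding prose so the comparison described above can actually happen.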

To be fair: sentence-level output doesn’t eliminate interpretive problems. I flagged three Proofademic outputs in my review where the sentence-level scores seemed inconsistent with what I’d expect given the overall document. The tool’s academic calibration — trained on academic writing patterns rather than general text — did appear to reduce spurious flags on citation-heavy passages, which is a meaningful improvement over general-purpose tools that struggle with formal academic register. But no sentence-level score should function as evidence on its own.

The Paraphrase Detection Gap

The evidence here is particularly important for institutions thinking about detection as a deterrent: the gap between detection rates on unmodified AI text versus paraphrased AI text is significant across almost every tool I tested. Students who run AI-generated text through a paraphrasing tool before submission substantially reduce their detection risk on most platforms.

Proofademic’s Paraphrase Shield was the most effective tool in this dataset at closing that gap. Whether the 71% detection rate on paraphrased AI content is operationally sufficient depends on your institutional context and your tolerance for false negatives. The more interesting question is not which tool is most accurate in ideal conditions. It’s what the failure modes look like, and who bears the cost of those failures.

What Institutions Should Actually Take From This

The evidence here is consistent with what I’ve been writing about for the past three years: AI detection tools operating on probabilistic models have documented false positive rates that institutions are not adequately factoring into their policy design.

The question I want practitioners to sit with is not which tool is most accurate in ideal conditions. It’s what happens when these tools are wrong — and they will be wrong — and what processes you have in place to catch those errors before they produce unjust outcomes for students.

Proofademic’s approach of making the flagging process visible and reviewable at the sentence level is epistemically better than a single-score output, not because it eliminates false positives, but because it makes the basis for a flag legible and subject to human review. Institutions should require this kind of granularity from any tool they’re using in integrity workflows. A detection score is not evidence. It is a prompt for investigation. The tools that support genuine investigation are the ones worth building policy around.

I’ve written about this at length elsewhere, but the summary is simple: better tools don’t eliminate the need for better processes. They make better processes possible. That’s the standard I’d hold any detection product to, and it’s why I keep finding myself returning to the Proofademic output format as a model for what the field should expect.

Dr. Marcus
