The Research on Assignment Design and AI Substitution Is Clear. Institutions Aren’t Reading It.
Last spring, my department spent most of a two-hour meeting arguing about which AI detection tool to recommend for instructors. Feature sets, pricing, student privacy. Two hours. Nobody asked whether the assignments students were submitting were even designed to resist AI substitution in the first place.
I’ve said this before in that room. I’ll say it here: if a large language model can fully complete your assignment, the problem isn’t the student. The problem is the assignment.
This sounds more radical than it is. The strongest predictor of AI substitution in academic work isn’t how sophisticated the AI is or how lenient the instructor is. It’s assignment design. When assessments don’t require first-person knowledge, course-specific context, or documented individual process, they’ll be substituted at high rates no matter what detection infrastructure you put in place. That finding comes from the contract cheating literature, predates generative AI by more than a decade, and hasn’t changed.
What the Contract Cheating Research Actually Found
Most institutional AI policy discussions have skipped past a fairly robust body of academic integrity research. It concerns contract cheating, which is what we called assignment substitution before ChatGPT made it free and anonymous.
Thomas Lancaster and Robert Clarke started publishing on this in the mid-2000s. Their core finding, replicated across multiple studies through the 2010s: assessment substitutability is the primary structural vulnerability in academic work. Any assignment that can be completed without knowledge of the student’s own experience, analytical process, or course context can be completed by someone else. Or now, something else.
Phillip Dawson’s work on assessment design and academic integrity extends this directly. His research draws a distinction between assessments that test for the presence of knowledge and assessments that require applying that knowledge in conditions only the enrolled student can replicate. The first type is vulnerable. The second, considerably less so.
The International Journal for Educational Integrity published research on this throughout the 2010s. None of it is obscure. It’s been sitting there, saying the same thing, while institutions kept asking which detection tool to license next.
What AI Substitution Reveals About Assessment Design
Generative AI didn’t create assignment substitutability. It made it free, fast, and anonymous. That’s a scale problem, not a new one.
Frankly, the list of vulnerable assignments is longer than most of us want to sit with. A five-paragraph essay on the causes of a historical event: substitutable. A reflective analysis connecting that event to the student’s family history: not. A generic business ethics case analysis: substitutable. A case analysis where the student has to connect the scenario to a company they’ve been tracking across ten weeks of course journals: considerably harder to replace with AI output.
The principle isn’t complicated. AI generates adequate responses to formal, topically general prompts. It struggles when assessments require individual grounding, real-time course-specific information, or documented process that only the enrolled student could have generated. The more an assignment can be answered without ever attending the class, reading the syllabus, or knowing who the student is, the more substitutable it is.
I’ve redesigned three of my own assignments over the past two years to stress-test this. In two of the three, AI still produced passing work with minor editing. That’s a humbling result. To be clear: I’m not claiming that design alone fixes everything. I’m saying it addresses the structural vulnerability in a way that deploying a detection tool does not.
There’s also something underplayed in these conversations. Students aren’t all gaming the system deliberately. Some are genuinely confused about what counts as using AI versus having AI write for them. Assessments that require documented process, like drafts, revision histories, or recorded reflection, give students fewer ambiguous spaces and give instructors something more concrete to evaluate than a detection percentage.
There’s a practical knock-on effect worth naming, too. The same design features that resist AI substitution tend to produce more engaged student work. That’s not a coincidence. Assignments that require students to draw on their specific experience or documented process are harder to phone in, regardless of the tool used to do it.
Why Institutions Reach for Detection Tools Instead
Detection tools are faster to deploy than curricular revision. That’s the honest answer and I think most faculty know it.
Redesigning assessments requires time, professional development, often committee approval, and sometimes a full reconceptualization of a course. The work lands on instructors. It’s slow, it’s costly in terms of faculty time, and the results are uneven across disciplines. A detection tool subscription has a rollout date, a vendor presentation, something to point to. “We’re revisiting our assessment design over the next eighteen months” doesn’t have any of that.
In fairness, large-enrollment courses with hundreds of students can’t easily run individualized assessments without structural support that most universities aren’t funding. Those pressures are real. But detection without assessment redesign doesn’t solve the substitutability problem. It covers it up. And the cover-up creates its own costs.
I’ll mention this because it’s relevant context: I pay for Proofademic out of my own professional development budget because my institution licenses Turnitin and considers the AI detection question resolved. It isn’t. And the individual instructors left to sort this out on their own time and resources are operating under a lot of pressure that the institutional response has largely ignored.
What Detection Can and Cannot Do
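I want to be precise here because colleagues frequently conflate two different problems: catching submitted AI output, and designing assessments that resist substitution in the first place.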
Detection makes sense for identifying students who submitted unedited or lightly edited AI output without meaningful effort to adapt it. That group exists, and the conversation that follows matters. In my own workflow I use Proofademic for exactly this purpose. The sentence-level breakdown gives me something I can discuss with a student: a specific passage, a flagged phrase, a structural pattern, rather than a single percentage I can’t explain. That’s what I need to document a case or have a conversation that holds up to scrutiny.
What detection doesn’t do is replace the harder work of designing assessments that require something only the student can provide. Students who can produce AI output that passes detection, through humanization, careful editing, or iterative prompting, will keep doing it. The ones most affected by aggressive detection deployment tend to be students writing in a second language, students with formal academic writing styles, and students who wrote their own work but produced text that statistical models read as AI-generated.
I’ve had those conversations. They don’t appear in a vendor’s slides on accuracy rates.
Turnitin’s AI module gives one score and no explanation. I can’t walk a student through a number. In my experience, Proofademic performs better than GPTZero on academic writing, and its sentence-level analysis is genuinely more useful for the conversations I need to have. But “better” isn’t “solved,” and no detection tool fixes the assessment design problem underneath it.
What a Better Institutional Response Looks Like
Faculty governance committees deciding on AI policy should read the assignment design literature before deploying detection at scale. That’s not an unreasonable thing to ask. Lancaster and Clarke’s work is widely cited. Dawson’s assessment design framework is available to anyone who looks. The research has been saying the same thing for fifteen years.
The educator-facing resources that take this research seriously are honest about what detection can and cannot do. That’s what I look for in any tool or framework I’m willing to put my name behind in a department meeting. Detection for triage. Assignment design for structural resistance. Both matter. Neither substitutes for the other.
What I keep watching instead is the same institutional cycle: detection tool deployed, limits discovered, equity concern raised, tool reconsidered, next vendor brought in. The assessments themselves don’t change.
The current moment is an opportunity to break that cycle. AI substitution has made the assessment design problem impossible to ignore in a way that contract cheating never quite managed to, partly because of scale and partly because the tool is sitting in every student’s browser. That visibility is an argument for addressing the structural problem, not for adding another surveillance layer over it.
I’ve said this before. I’ll keep saying it. The assessments that can’t be meaningfully replaced by AI are the ones worth defending. The ones that can be replaced needed rethinking anyway.
For colleagues working through AI policy in their departments, the AI and Academic Integrity resource at Proofademic is worth reading. It’s one of the more balanced treatments of the detection-versus-design question I’ve found, and it’s useful for faculty trying to have evidence-grounded conversations in rooms that keep defaulting to the tool question.
If you’re navigating this in your own institution, hit reply. I’d like to know what your department is actually doing.

