In May 2024, I argued [1] that journals should encourage authors to use large language models (LLMs), like ChatGPT, Claude, and Gemini, to assist with stylistic revision and clarity. This year, my colleagues and I are wrestling with whether and how we should restrict student use of LLMs in their coursework. Once again, I find myself more open to LLM use than many others. Moreover, the type of use I contemplate has broadened beyond mere editing to idea generation and drafting of content.
Expressing concern that unfettered LLM use undermines learning, proponents of restrictions call for conditions on use, such as requiring substantial student revision of LLM-generated text, integration of the student’s own thinking into the final product, and explicit acknowledgment of the LLM’s role.[2] Some of these colleagues also advocate a complete prohibition on work that is primarily LLM-generated.
My opposition to most LLM restrictions starts with pragmatic considerations. First, they seem unenforceable. In the LLM detection arms race, the detectors appear to be losing: although they are improving, high detection rates still come with an unacceptably high number of false positives.[3] Second, LLM restrictions are vague. What counts as substantial revision? What constitutes integration of the student’s own thinking into the work? Given the risk of substantial sanctions, such restrictions could limit student use far more than is intended or useful.
Most importantly, I do not believe that looser LLM restrictions will undermine learning because, based on my experience to date, LLMs seem incapable of the critical thinking that turns descriptive, summary text into a compelling argument. For a scientific paper, for example, the Background must argue that addressing the research question would generate new knowledge that confers benefits. The Methods must provide information to make the paper’s analysis replicable and make the case that the author’s judgments (e.g., the type of statistical techniques used) align with how other investigators would proceed. The Results must describe the findings, and the Discussion must apply them to the research question, convincing the reader that the conclusions are reasonable in light of both prior knowledge and the current work’s limitations. Throughout, the author must also cite external evidence for claims not supported directly by the current analysis.
An advantage of allowing, and even encouraging, students to use LLMs is that doing so frees them from many time-consuming tasks – crafting text in particular – so that they can focus on the soundness of their arguments and on carefully providing evidence for their assumptions and propositions, something I have argued [4] health economists should do more of.
There are caveats. First, LLM users remain responsible for the final product: they must ensure that it is free of hallucinations and that the LLM has not introduced plagiarized content. Making effective plagiarism-checking software widely available should be a priority. Second, research indicates that LLM use can undermine some aspects of student work. For example, students may over-rely on LLM-generated content [5], accepting it without careful consideration and thereby introducing errors.
Rather than restricting LLM use, the best way forward is to craft assignments that require students to engage in the kind of critical thinking LLMs cannot perform and then to focus assessment on those aspects of the work. I anticipate placing greater emphasis on these higher-level skills and holding my students to higher standards. For example, I plan to place even more weight on the need for students to annotate their citations by documenting the supporting passages in the cited texts. Because I expect that LLMs will make it possible for both teachers and students to focus more on these higher-level, critical thinking skills, I remain optimistic that increased LLM use by students will lead to more learning, not less.
Endnotes
1. The commentary concludes, “Journals should move away from policies that can seem to grudgingly permit chatbot use and instead enthusiastically promote the appropriate and responsible incorporation of chatbots into the author’s toolbox.”
2. For this article I used ChatGPT version 5 to summarize articles I reviewed as I researched the topic. ChatGPT did not generate any content for this column.
3. The article states, “While a false positive rate of 1-2% might seem low, the scale of educational assessment means this could still translate into a substantial number of false accusations. For example, consider an institution with: 20,000 students; Each taking 8 modules per year; With 3 assessments per module – That would amount to 480,000 assessments per year. Even a 1% false positive rate would generate approximately 4,800 false positives annually – a huge burden to investigate and manage and potentially damaging for student trust and wellbeing.”
4. The commentary notes that in law journal articles, “Footnotes paraphrase how a source supports a proposition and often specify the pages where that support appears… editors check every sentence in a manuscript to ensure that it has support, and that documentation is explicit, specific, and clear.” The commentary concludes, “… health economics would benefit from adhering to a similar standard. Authors would need to spend more time on documentation, but by being able to independently evaluate the connection between the cited evidence and its use, we could all have greater confidence in the resulting work.”
5. The abstract states, “Overreliance on AI occurs when users accept AI-generated recommendations without question, leading to errors in task performance in the context of decision-making... Our findings indicate that over-reliance stemming from ethical issues of AI impacts cognitive abilities, as individuals increasingly favor fast and optimal solutions over slow ones constrained by practicality.”