
ChatGPT failed my course: How bots may change assessment

Does ChatGPT spell the end of the essay? No, but it may improve assessment.

Chris Lee
Credit: Aurich Lawson | Getty Images

One of the most unpleasant aspects of teaching is grading. Passing judgment on people is never fun, and it’s even less fun when you’ve spent months interacting with those people on a daily basis. Discovering that your students have tried to get a leg up by using an AI chatbot like ChatGPT has made the process even more unpleasant. From a teacher’s perspective, it feels a bit like betrayal—I put in all this effort, and you respond by trying to do an end-run around the assessment.

Unfortunately, the bot-writing horse bolted long ago. The stable is not just empty; it’s on fire.

So what is the right response to ChatGPT in education? Is there even a single correct response?

Before we get into the whys and wherefores of ChatGPT, let’s jump to the conclusion: It’s important that we don’t enter into an arms race. I don’t want to spend my limited time and energy trying to detect the use of writing tools. I don’t want to pay large fees to access tools that detect bot-written text. I also don’t think avoiding bot output by taking a great leap backward to written exams is an acceptable solution—we already generate large numbers of students who can pass exams while not actually being able to apply what they “know.”

We have to ask what we want to assess. And do we really need to use things that bots can produce—essays or reports—as proxies for our assessment?

ChatGPT is just the latest

Students have been using writing aids for a long time. Grammarly and QuillBot have been rephrasing students’ sentences for as long as those tools have existed. Before that, asking more fluent friends to help get the phrasing and flow right was the norm. And essay mills have been churning out paid-for papers forever. In short, ChatGPT is new only in the sense that it is accessible—and far more prone to churning out nonsense.

But ChatGPT is also different from tools like QuillBot and Grammarly. Those tools suggest improvements to existing text—they don’t write an entire essay. From my perspective as a teacher of science and technology, these writing tools are something students can learn from if they want to improve their writing. In other words, before ChatGPT, none of the tools got in the way of me evaluating my students.

This is less true for my wife, who teaches high school English. English is the second language of most of her students, so vocabulary and grammar—the exact things QuillBot and Grammarly target—are still important assessment points. Her senior school students possess a wide range of English proficiency, with some being near-native speakers and others only capable of expressing simple ideas. In senior school, the focus shifts to building arguments and conveying them fluidly in different written and non-written forms. The earlier tools only helped if the assignment was written and if the student had already managed to structure an argument.

ChatGPT changes the calculus. You can see the appeal of the chatbot in the classes my wife teaches. For good students, it saves them from doing what they can already do; for struggling students, it turns an almost-certain fail into a chance of passing.

Part of the senior school assessment is a writing portfolio, and boy, has ChatGPT been working hard here. Some portfolios, including the reflection on building the portfolio, were entirely written by ChatGPT. Some of the reflections, which are supposed to describe a student’s experience in doing the assignment, even referred to a number of random pieces of writing not included in the portfolio, as ChatGPT has no context, only statistics.

The school administration is uncertain about how to proceed. Unlike plagiarism, the case against ChatGPT use is less clearly defined, and no one wants to face the threat of a lawsuit.

In the technical courses I teach, the appeal of ChatGPT is high, while its utility is less obvious. My students must perform research related to one of their practical projects, and that research is partially assessed in the form of a chapter of the project report. Can you really provide ChatGPT with a prompt that gives it enough context on a student research project?

Despite that question, I’ve also seen written material from ChatGPT. The suspected ChatGPT chapter was poorly sourced, superficial, voiceless, and dull. Within that chapter, different sections were supposedly written by different students, but the voice, style, and sourcing did not change at all. ChatGPT’s effort wasn’t all bad, though; the writing was clear and flowed nicely.

Effective and ineffective bot-writing

The writing portfolio for my wife’s class required only that pieces of writing were internally consistent. This is something ChatGPT can do. All of the writing portfolio cases were caught because the style did not match the student’s writing style in earlier handwritten work.

In my class, the students believed that the internal consistency ChatGPT offered trumped providing good sources, nuance, depth, and contrast. But they were wrong. The writing did not show evidence of good research skills. Combined with the lack of a style change between authors in the research chapter and the large change in style between the research chapter and other report chapters, it was easy to spot the people using the software.

This is where I have it easier than my wife. She is teaching style and structure, independent of a specific topic, while I am not. For my class, style and structure must be demonstrated in the description of an authentic task—a project. Authentic tasks require a process, and that process must be part of the assessment.

ChatGPT does not understand authentic tasks; it does not understand process; and it certainly has no context for what a student might have done in their project. All it has is a very good statistical model of word association. It can produce very fluid sentences on specific topics, and it will even be factually correct when its statistical associations happen to line up with the facts. But in a task that also involves building and testing real-world objects, ChatGPT will struggle to reproduce a reasonable process that leads to the final essay or report.
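To make the “statistical model of word association” idea concrete, here is a minimal sketch of that kind of purely statistical text generator: a toy bigram model in Python. This is my own illustration, not anything resembling ChatGPT’s actual architecture; it learns only which words tend to follow which, so it can produce fluent-sounding output while knowing nothing.

    import random
    from collections import Counter, defaultdict

    # Toy "word association" model: count which word follows which,
    # then generate text by sampling from those counts. It has no
    # notion of facts, context, or process, only co-occurrence.
    def train_bigrams(text):
        words = text.split()
        follows = defaultdict(Counter)
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
        return follows

    def generate(follows, start, length=12):
        word, output = start, [start]
        for _ in range(length):
            options = follows.get(word)
            if not options:
                break
            # The next word is chosen in proportion to how often it
            # followed the current word in the training text.
            word = random.choices(list(options), weights=list(options.values()))[0]
            output.append(word)
        return " ".join(output)

    corpus = ("the student built the circuit and tested the circuit "
              "and the student wrote the report")
    model = train_bigrams(corpus)
    print(generate(model, "the"))  # plausible-sounding, e.g., "the student built the report"

Scale that idea up by many orders of magnitude and you get something that writes convincingly about nearly anything, yet still has no idea what your students actually did.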

What next?

Some teachers are panicking and abandoning their recent assessment innovations in favor of in-class assessments. I don’t believe this is the solution. The teachers I’ve spoken to are not interested in permanently returning to the safety of written exams, though; this is an interim measure while they figure out how to deal with the influence of ChatGPT.

An in-class or exam-style assessment does not allow for well-researched writing. It does not allow the student the time to experiment with different ways of expressing their argument or even to check the validity of their argument against external sources. A well-researched essay takes more time than can be accommodated in an exam setting.

If we accept that and give students essay-like questions (or reports or some other form of writing assignment), the students can and will use ChatGPT. Consequently, we can expect poorly sourced, factually incorrect essays from students—something I’m already familiar with—but now they’ll have excellent grammar, no voice, and about as much color as an albino mouse.

What else distinguishes a student-written essay from a bot-written one? A good essay—even one with some mistakes—should have a trail of evidence leading to its production. No one naturally writes and sources an essay in one pass. Thus, it’s possible to require that students provide evidence of their process during production (a good habit to build anyway).

Of course, ChatGPT can produce essays, it can produce reflections, and it can produce logbooks, which are types of evidence you might ask of a student. But ChatGPT cannot create the full chain of research, from finding sources through to the final essay. Furthermore, it cannot link ideas to video evidence or audio presentations. In other words, it cannot produce coherence between the intermediate products and the final essay.

If students provide this evidence, does that mean they didn’t use ChatGPT? Not necessarily, but it does mean they had to take the time to ensure the arguments were well-supported and factually correct. They would have to verify that the statistically generated text happened to produce an argument they understood and agreed with.

As far as I am concerned, if the student acknowledges using ChatGPT, I will assess the product. This includes the arguments in the final product, the intermediate products (research notes, search terms and trees, and more), the sourcing and cross-checking, and additional non-written evidence, but not the quality of the writing. Admittedly, that level of assessment will take more time, which is why it needs to be turned into a high-stakes moment for the student so they think carefully before opting for it. Because of the time involved in evaluating such a project, there should be only one such assessment in an academic period (once per course or once per semester, for example).

One argument against a single, high-stakes assessment is that students don’t have an opportunity to develop the experience needed to pass it. But if you provide many tiny assessment moments that give them that experience, ChatGPT will pass your course for the student. Are you back to square one?

No, provided these low-stakes assessments don’t contribute to the grade. Each low-stakes moment allows the student to practice, which they will need to do if they want their writing to improve. To preserve enough of my time for the high-stakes assessment, my preferred solution is self- and peer-assessment.

These assessments still give the student ample opportunity to practice their writing and argument-wrangling skills. But if a teacher takes the time to build peer- and self-assessment tools and skills, they add an extra opportunity for students to improve their writing. Yes, the students can cheat all they want on these low-stakes moments. But if the students can write and argue well, they will pass your high-stakes assessment regardless of whether they cheated earlier.

For students who struggle with writing, however, avoiding hard work during the low-stakes assessments will increase the chance that they never develop the skills needed to pass the high-stakes assessment. Thus, their poor decision-making has consequences, a good life lesson.

Our bot future

For better or worse, bot writing is here to stay. Despite the teeth-gnashing, it’s a tool students will use. At some point, someone will probably figure out how to add a measure of “factual correctness” to the statistical model. It should also be possible to deepen the model so that coherence is maintained across several documents (e.g., a reflection that refers to actual documents in a portfolio in a seemingly meaningful manner).

As these changes occur, we will have to reconsider how we assess students and what skills we deem meaningful. The most important skill is learning how to learn. Everything else is gravy, so the real challenge is to ensure that this skill is developed even when a bot creates some of the output of the learning process.

The flip side is that skills that I love and treasure may become less important because “that’s the bot’s job.”


Chris Lee, Associate Writer
Chris writes for Ars Technica's science section. A physicist by day and science writer by night, he specializes in quantum physics and optics. He lives and works in Eindhoven, the Netherlands.