Tell Students to Skip ChatGPT for Essays: LLMs Will Plagiarize Internet Texts

Why the pretraining in ChatGPT can get students into serious academic problems

Feb 07, 2024

Welcome to The Predictive Edge

My first post will help you convince students of the dangers of using ChatGPT mindlessly when writing essays. Specifically, ChatGPT will randomly plagiarize pieces of the text it was trained on.

Generated with ChatGPT: AI randomly copies some text

My teaching approach is anything goes: students can use whatever tools help them solve their task or project. I think it is unreasonable to restrict technologies that students will have available in the future; we should be encouraging students and teaching them how to use those tools.

Instead, my strategy is to increase the level of difficulty substantially. For example, in future posts, I will explain how ChatGPT can guide undergraduate students, even those with little or no coding experience, to implement state-of-the-art models from the latest finance papers, neural networks included.

I do not have bad feelings about using the raw output of ChatGPT for writing. It’s just mediocre in general, with good pieces here and there.

Still, I do not think anyone benefits when students copy-paste the output of ChatGPT without reading it. I joked with my students that we want to avoid the equilibrium where they use ChatGPT to answer the questions, and I use ChatGPT to grade them.

Hence, we need some deterrents. What if we try to detect if an AI wrote the text? We cannot.

ChatGPT and AI Detectors

AI detectors do not work now, and their performance will likely worsen [1]. It is not helpful to elaborate too much. Just don’t use them.

They will mostly penalize minorities, international students, academic-style writing, and anyone using Grammarly or similar editing software.

Instead, let’s look at plagiarism, and I do not mean students delegating their homework to AIs.

ChatGPT likes to copy sentences from some websites, and students can inadvertently have those exact phrases in their essays.

ChatGPT and Plagiarism

Students are very mindful of plagiarism. They know it is one of the few things that can get them into serious trouble. For example, we recently witnessed how the Harvard president resigned after a plagiarism scandal [2].

Therefore, one great way to make students suspicious of ChatGPT's output is to show them they will commit plagiarism if they just copy-paste an essay from ChatGPT.

Not because the text is not their writing; we know we cannot detect that. Plus, if they can expertly prompt the model to generate an excellent essay, well, more power to them.

Instead, ChatGPT will plagiarize actual pieces of other writing.

I want you to try the following exercise. Ask ChatGPT to write an essay about anything. I chose the impact of increasing interest rates on stocks with high book-to-market ratios.

In future posts, I will write more about other drawbacks of using it for writing and how to use it effectively. For now, copy-paste the output into a plagiarism detector. I use Grammarly Pro.

Notice how all the plagiarism warnings show up? In this case, 8% of the text is plagiarized!

Moreover, it will plagiarize from texts that are relevant to the topic and should be cited if used. It is not just random sentences, and copy-pasting could get students into severe academic problems.

I explained this issue to my students during the first lecture, and they were terrified.

Plagiarism Detectors

Unlike AI detectors, plagiarism detectors work very well. The intuition is simple.

Think about the number of words in the English language.

Let’s simplify and assume it is on the order of 10,000 [3].

Now, suppose that for any word, there are, let’s say, 10 possible grammatically and logically correct follow-up words.

Given an exact phrase of 9 words, what is the probability of randomly generating the same sequence?

The first word has a 1/10,000 = 10^-4 chance of being chosen.
Each of the following 8 words has a 1/10 = 10^-1 chance of being chosen, assuming the previous word determines the pool of possible follow-up words.
With the simplifying assumptions, the probability of randomly finding a specific sequence of 9 words is 10^-4 × (10^-1)^8 = 10^-12, or one in a trillion.
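The arithmetic above can be checked with a short Python sketch. The function name `sequence_probability` and its parameters are hypothetical, chosen only for this illustration; the numbers are the simplifying assumptions from the text (a 10,000-word vocabulary and 10 plausible follow-up words), not measured properties of English.

```python
from fractions import Fraction


def sequence_probability(vocab_size: int, followups: int, length: int) -> Fraction:
    """Chance of randomly producing one specific phrase of `length` words.

    Assumes the first word is drawn uniformly from the vocabulary and each
    later word is drawn uniformly from `followups` plausible continuations.
    """
    first_word = Fraction(1, vocab_size)
    later_words = Fraction(1, followups) ** (length - 1)
    return first_word * later_words


# Values taken from the simplified example in the text.
p = sequence_probability(vocab_size=10_000, followups=10, length=9)
print(p)         # 1/1000000000000
print(float(p))  # 1e-12
```

Changing `length` shows how quickly the odds collapse: under these assumptions, every extra word makes an accidental match ten times less likely.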

Clearly, if an uncommon sequence of 9 words in an essay matches an existing text, it will trigger alerts. Language is more complicated, but you get the intuition.
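To make the intuition concrete, here is a minimal sketch of matching exact 9-word sequences between an essay and a known source. It is a toy illustration, not how Grammarly or any commercial detector is implemented; the function names `shingles` and `shared_phrases` and the sample `essay` text are assumptions made up for this example.

```python
import re


def shingles(text: str, n: int = 9) -> set:
    """Return every run of n consecutive words in the text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(essay: str, source: str, n: int = 9) -> set:
    """Return the n-word sequences that appear in both texts."""
    return shingles(essay, n) & shingles(source, n)


# The source sentence is the one quoted later in this post.
source = ("Higher interest rates can slow economic growth by making "
          "borrowing more expensive and reducing consumer spending.")

# Hypothetical essay that contains the source sentence verbatim.
essay = ("Value stocks are often cyclical. Higher interest rates can slow "
         "economic growth by making borrowing more expensive and reducing "
         "consumer spending. This hurts earnings.")

matches = shared_phrases(essay, source)
print(len(matches))  # 8 overlapping 9-word sequences
```

A paraphrase of the same idea shares no 9-word sequence with the source, which is why exact matches of this length are such a strong signal.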

I suspect most AI software detectors use a plagiarism detector under the hood.

Why does ChatGPT plagiarize?

ChatGPT belongs to the family of large language models (LLMs). LLMs are trained in two phases. In the first phase, pretraining, the model is trained to predict the most likely next word given a sequence of text. This pretraining stage is responsible for the inadvertently plagiarized output.

For example, let’s focus on the sentence: “Higher interest rates can slow economic growth by making borrowing more expensive and reducing consumer spending.”

ChatGPT copied this sentence from an internet article [4].

To understand why, we have to consider the context.

First, ChatGPT writes, “Stocks with high book-to-market ratios often belong to sectors that are sensitive to economic cycles, such as financials, industrials, and commodities.”

After that sentence, it begins writing: “Higher interest rates…”

What are the most likely words following this phrase according to its pretraining?

They are precisely “[higher interest rates] can slow down economic growth by making borrowing more expensive and reducing consumer spending.”

ChatGPT outputs this sentence verbatim because it is in the training data.

The model is functioning as expected. It correctly learned the most likely words conditional on the instructions, what it wrote so far, and the beginning of the sentence. However, this functioning results in copying verbatim from an existing text, a practice society currently deems wrong.
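The mechanism can be illustrated with a toy next-word model. This sketch is a deliberate simplification: it counts which word follows which in a tiny corpus and always picks the most frequent continuation, whereas ChatGPT uses a neural network over tokens and a far larger training set. The functions `train` and `complete` and the three-sentence corpus are hypothetical, and the assumption that the quoted sentence appears twice stands in for text that is repeated across web pages.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Count, for every word, which words follow it in the corpus."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model


def complete(model, prompt, max_words=30):
    """Greedily append the most frequent next word until none is known."""
    words = prompt.split()
    while len(words) < max_words and words[-1] in model:
        words.append(model[words[-1]].most_common(1)[0][0])
    return " ".join(words)


memorized = ("higher interest rates can slow economic growth by making "
             "borrowing more expensive and reducing consumer spending")

# Toy corpus (assumption): the memorized sentence appears twice,
# a competing sentence appears once.
corpus = [memorized, memorized, "lower interest rates can boost stock prices"]

model = train(corpus)
output = complete(model, "higher interest rates")
print(output == memorized)  # True
```

After "can", the toy model has seen "slow" twice and "boost" once, so it picks "slow" and then follows the memorized sentence to its end. Nothing in the procedure intends to copy; reproducing the source is simply the most likely continuation.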

In this post, we learned:

AI detectors do not work.
ChatGPT and other LLMs will plagiarize random text from the internet.
The pretraining is responsible for this behavior.
You can exploit this flaw to persuade your students not to use the raw output of these models.

Despite their limitations, LLMs can be incredibly advantageous when used correctly. In subsequent posts, I will write about how to use them effectively during classes and research. Please let me know if there is a particular topic you would like me to cover in the future.

Want to learn more about ChatGPT and support my writing? Consider buying a copy of my book. Presales are exceptionally important, according to my publisher.

https://www.amazon.com/Predictive-Edge-Generative-Financial-Forecasting/dp/1394242719

Thank you for subscribing!

[1] https://arstechnica.com/information-technology/2023/07/why-ai-detectors-think-the-us-constitution-was-written-by-ai/

[2] https://www.nytimes.com/2024/01/02/briefing/harvards-president-resigned-after-plagiarism-accusations.html

[3] https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2016.01116/full

[4] https://traders-trust.com/us-federal-interest-rates/

The Predictive Edge: ChatGPT, LLMs, ML, and Asset Pricing
