Shivam Dutt Sharma
13 min readJan 7, 2024

Navigating the Challenges: Teachers’ Struggles with Identifying LLM-Generated Text and Plagiarism

In the last few days, I have been visiting Kaggle for a lot of reasons. Most of the time, it has been to find relevant banking-related datasets to help train a Conversational AI product I have been building, one that gives users an analytical, insights-discovery platform where they can query data and unearth insights in a conversational format, instead of doing all the heavy lifting themselves (right from setting up Python / R notebooks to establishing rules and automating jobs / pipelines). But let me halt the Conversational AI story right here, because my agenda is to focus the lens on the latest concern on the scene, and one that has long been an area of interest for me: the issue of plagiarism in academics, where teachers today find it both difficult and concerning to distinguish between LLM-generated and manually written text in their students’ assignments and exams.

So, as I said, I have been visiting Kaggle for a lot of reasons, mostly dataset-hunting. This one time, though, I was tempted to click on Competitions after almost 6 months, only to see a very interesting competition. I shall not go into its details due to the data and code licensing restrictions that apply during an active competition (PS: if you happen to read this article while that competition is still active, I would request you not to share or amplify the code below, as that would be against the competition’s regulations; I am sharing it purely for knowledge purposes, so please just read it and refrain from sharing it. Ty :) ). To give you a high-level view, the competition is about identifying / differentiating between an LLM (ChatGPT) generated text and a manually written one.

And that is what prompted me to dive a bit deeper into this subject, unearth some of its roots, and analyse.

To begin with, I am sharing two paragraphs below. Each paragraph is a short essay on “Terrorism at the workplace”. One of them is LLM-generated while the other is written by me. How easy or difficult is it for you to identify which is which? Please leave your answer / response in the comments.

#1

Terrorism at the workplace is a grave concern that demands unwavering attention from employers, employees, and authorities alike. This ominous threat manifests in various forms, from physical violence to cyber-attacks, creating an atmosphere of fear and uncertainty. Employers must prioritize the safety and well-being of their workforce by implementing comprehensive security measures, conducting regular risk assessments, and providing relevant training on identifying and responding to potential threats. Promoting a culture of open communication and vigilance is paramount to fostering a resilient workforce that can collectively thwart any potential acts of terrorism. Collaboration with law enforcement agencies, the establishment of emergency response plans, and the utilization of advanced technologies for surveillance and threat detection are crucial components of a multifaceted strategy to mitigate the risk of terrorism at the workplace. By fostering a proactive and vigilant environment, organizations can contribute to creating a workplace where the safety of employees is a top priority, ensuring that the threat of terrorism is minimized, and individuals can carry out their professional duties in a secure and protected environment.

#2

Vigilance is the need of the hour. Whether it is at the border, defending the nation from the enemies, or our households ensuring safety and well-being of our women and children, or our workplaces where in the form of work pressure and politics, there is sometimes a subtle terrorism that prevails on the floor. What may sometimes originate in the form of a normal reprimand, can take the form of a severe intimidating remark too, often going unnoticed by the people around but impacting the stability and mental peace of the person on the receiving end. Whether it is around their personal ulterior motives, or around their genuine concerns for the sub-ordinate’s substandard performance, supervisors and HODs often end up losing their cool and also using threatening measures thinking that those measures may intimidate the person on the receiving end and lead him/her to either resort to their terms or withdraw from the position. In both the cases, it becomes a fundamental situational crisis for the victim and his/her safety and mental wellbeing is compromised. It is important that the organizations enforce strict measures on the floor towards making a safe and threat-free environment for the employees.

For some of you, it might be relatively easy to distinguish LLM-generated text from manually written text, thanks to your expertise in the written / spoken forms of the language and in linguistics in general. However, a layman who isn’t very savvy in the area of linguistics will largely fail in drawing the distinction.

Even primary school teachers who, say, have given their students an assignment to write essays on various topics have a hard time curbing plagiarism. Considering that students all over the world now have access to the ChatGPTs of the world, there is a high propensity that they may end up using these LLMs to write their assignments.
It won’t be too much to believe that even for an expert in linguistics, it sometimes becomes too difficult to differentiate between LLM-generated language and manually written (natural) language, because of the variety both an LLM and a human brain can bring to the table. If not in the near future, then maybe in the longer term, both will meet at the crossroads (meaning the AI will get so well developed that it will be able to mimic a human brain almost 100%).

That was about an expert in linguistics; here, we are dealing with primary school English teachers who may not have the deep technical finesse to differentiate between an LLM-generated text and a manually written one. Hence, it is an absolute knife-edge for them, IMO.

But truly, thanks to Machine Learning, the distinction can be made.

Nevermind :p

Before we look at the code, a note for beginners: we will be building a classification model that differentiates between an LLM-generated text and a manually written one.
I have used MultinomialNB and RandomForestClassifier models to do the job.

Training Data

Let’s see the kind of training dataset we will be working on. What follows is just a dummy representation of the actual data (the column names below are illustrative; the actual code later in the article uses the competition’s own column names such as text, generated and id). A small sample frame is sketched right after the column list.

The data shall have four columns, namely :-

  • essay_id : This will be the unique id assigned to the essay.
  • topic_id : This will be the unique id assigned to each of the few topics that the students have written an essay on.
  • essay_text : This will be the actual text of the essay.
  • llm_generated : This will be a binary value (either 1 or 0). 1 signifying that the essay was llm-generated and 0 signifying that the essay was manually written (not llm-generated).
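
To make the dummy representation concrete, here is a minimal sketch of such a frame. The rows, topics and snippets below are entirely made up for illustration:

import pandas as pd

# A tiny, made-up sample purely to illustrate the four columns described above
dummy_train = pd.DataFrame({
    "essay_id": [101, 102, 103, 104],
    "topic_id": [1, 1, 2, 2],
    "essay_text": [
        "Terrorism at the workplace is a grave concern that demands unwavering attention...",
        "Vigilance is the need of the hour. Whether it is at the border...",
        "Remote work has transformed how organizations think about productivity...",
        "Working from home is nice because I can take breaks whenever I want...",
    ],
    "llm_generated": [1, 0, 1, 0],  # 1 = LLM-generated, 0 = manually written
})

print(dummy_train)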

Classification Models

As they say, it is always better to go for ensembling methods and multiple classifiers rather than relying on a single expert model (towards the end of the article, I also sketch how the two models below could be combined into a voting ensemble).
To get better accuracy, I decided to train a classification model twice, once with MultinomialNB and once with RandomForestClassifier.

In each of the two cases, the accuracy does not go beyond 70% for now when I run the code on Kaggle, which is obviously a concern, but I am tackling that separately and will cover it in a different article.
Scrolling a few folds down, though, you will also encounter my surreal experience with an unreal 100% accuracy when implementing the MultinomialNB classifier model locally, as you will see a little later in the article.

MultinomialNB : The Multinomial Naive Bayes classifier is simple and effective for text classification tasks and has remained a popular choice for a long time. It is a probabilistic model famous for handling discrete data such as word counts. All in all, it is a very versatile algorithm, with applications ranging from spam detection to sentiment analysis. Here, I have employed it for a classification use case: looking at a group of essays written by students of a primary school and trying to identify which of them are LLM-generated and which are actually written manually by the students. I am not saying I am going to help the teachers with this! Or, am I? 😝
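
As a quick illustration of that probabilistic, count-based nature (a toy sketch with made-up sentences, not the competition data), MultinomialNB takes discrete word-count features and returns per-class probabilities:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, purely illustrative: 1 = "LLM-like", 0 = "human-like"
texts = [
    "comprehensive measures must be implemented to mitigate the risk",
    "honestly my boss just yelled at me again today",
    "organizations should foster a culture of proactive vigilance",
    "i was so tired i forgot my lunch at home",
]
labels = [1, 0, 1, 0]

# Word counts are the discrete features MultinomialNB is built for
counts = CountVectorizer().fit_transform(texts)

model = MultinomialNB()
model.fit(counts, labels)

# Per-class probabilities for each toy sentence
print(model.predict_proba(counts))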

RandomForestClassifier : Random Forest is a supervised machine learning algorithm which is very popular in the world of Data Science and Machine Learning for its use in classification and regression problems.
An interesting way to appreciate the merits of RandomForestClassifier is to draw an analogy with forests: a forest comprises numerous trees, and the more trees there are, the more robust it is.

One of the main advantages of RandomForestClassifier is that it can handle high-dimensional and sparse datasets, which are common in text analysis, the way Virat handled the pressure on Oct 23, 2023 (if you know you know). RandomForestClassifier models are also very popular for how well they deal with missing values, outliers, and imbalanced classes, all of which can affect the performance of other algorithms.
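
To make the forest analogy a bit more tangible, here is a small toy sketch (made-up data, not the essay dataset) showing how each tree inside a fitted RandomForestClassifier makes its own prediction, which the forest then aggregates:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy, reasonably high-dimensional data purely for illustration
X, y = make_classification(n_samples=200, n_features=50, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# Each individual tree votes on the first sample...
tree_votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_]).astype(int)
print("Votes from the 100 trees (class 0 vs class 1):", np.bincount(tree_votes))

# ...and the forest aggregates the trees' outputs (scikit-learn averages their predicted
# probabilities) to give its final answer
print("Forest prediction for the first sample:", forest.predict(X[:1])[0])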

Code Time

First attempt : MultinomialNB

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Loading the training data
train_data = pd.read_csv(r"-----path to the file which has the train dataset------")

# Splitting the data into training and validation sets
train_set, val_set = train_test_split(train_data, test_size=0.2, random_state=42)

# Preprocessing the text data using TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X_train = vectorizer.fit_transform(train_set["text"])
X_val = vectorizer.transform(val_set["text"])

# Creating the target labels
y_train = train_set["generated"] # Here "generated" is the column indicating whether the essay was generated by a student or an LLM
y_val = val_set["generated"]

# Training a simple Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Making predictions on the validation set
# Note: accuracy_score and classification_report expect discrete 0/1 labels, so we use predict() here;
# predict_proba() is used further below for the submission, which expects a probability score
predictions = classifier.predict(X_val)

# Evaluating the model
accuracy = accuracy_score(y_val, predictions)
report = classification_report(y_val, predictions)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

# Using the trained model to make predictions on the test set

test_data = pd.read_csv(r"-----path to the file which has the test dataset------")
X_test = vectorizer.transform(test_data["text"])
test_predictions = classifier.predict_proba(X_test)[:, 1]
# test_data["generated"] = test_predictions
submission_df = pd.DataFrame({"id": test_data["id"], "generated": test_predictions})
submission_df.to_csv("submission.csv", index=False)

On running this code on Jupyter, the accuracy I got was a perfect 1.0, i.e. 100%.

And this certainly got me worried since I wasn’t really expecting the model to spit a 1'er.
To get to the depth of it, I surfed more on why a MultinomialNB may give you a 100% accuracy.
So the reasons I found are :-

  • Data overlap / leakage : If information from the validation / test dataset is accidentally present in the training dataset, the model effectively gets to see the answers during training, which can artificially inflate accuracy. The idea is to ensure that there is no such leakage between the splits.
  • Overfitting : The MultinomialNB model you have trained may be a victim of overfitting. This means that the model may be capturing noise in the data and fitting to specific patterns that do not generalize well to new data. As a result, you may observe high accuracy on the training dataset but a low accuracy on the test dataset.
  • Imbalanced classes : Since this is a classification task, labels are expected in your training dataset. These labels are the categories that each record is grouped under, or assigned to. It may be that the records are very unevenly distributed across the labels, because of which the model simply predicts the majority class and still ends up with 100% accuracy (or whatever number is leaving you dumbfounded).
  • Incorrect / Incomplete implementation : Sometimes we just end up implementing a very crude form of a classification (or any other ML) model. What I mean is that while the choice of algorithm plays a crucial role, the quality of the data and the engineered features play an equally important role. Some of the recommendations made by Data Scientists around the world are: 1) ensure the data is well pre-processed, 2) apply feature scaling where needed, 3) split the data properly into training and testing datasets.
  • Small Dataset : If your dataset is small, the model may end up memorizing all the training examples, leading to a perfect accuracy on the training set but poor generalization beyond it. (A quick sanity check for the last few points is sketched right after this list.)
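
To guard against the imbalanced-classes, splitting and small-dataset points above, here is the kind of sanity check I would suggest. It is a hedged sketch that assumes the same train_data, text and generated columns used by the code in this article:

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_data = pd.read_csv(r"-----path to the file which has the train dataset------")

# 1) How imbalanced are the classes?
print(train_data["generated"].value_counts(normalize=True))

# 2) Stratify the split so both classes are proportionally represented
#    in the training and validation sets (then re-train the classifier on this split as before)
train_set, val_set = train_test_split(
    train_data, test_size=0.2, random_state=42, stratify=train_data["generated"]
)

# 3) On a small dataset, cross-validation gives a more honest estimate than a single split
pipeline = make_pipeline(TfidfVectorizer(max_features=5000, stop_words="english"), MultinomialNB())
scores = cross_val_score(pipeline, train_data["text"], train_data["generated"], cv=5, scoring="accuracy")
print("Cross-validated accuracy:", scores.mean())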

Closely re-visiting my model and the training dataset, I tried to see which of these reasons might hold true in my case. It turns out that my MultinomialNB model was suffering from both imbalanced classes and a small dataset, due to which the accuracy was a strange 100%.

Now that you and I are aware of the situation, what I’d suggest is that when you train your own model, you can take care of the last two points at your end, which will help you stay away from unreal 100% accuracy gimmicks.

Second attempt : MultinomialNB (adjusted)

Here, over and above the previous standard MultinomialNB code, I have made a few adjustments in the form of parameters:

  • Increased max_features to 10000 in TfidfVectorizer (max_features caps the vocabulary at the top 10,000 terms by frequency, setting an upper limit on the features used to characterize the texts during training / classification).
  • Set the ngram_range to (1, 2), which means I am asking the vectorizer to look for both unigrams and bigrams in the data. Restricting the bag-of-words representation to unigrams only would obviously affect the accuracy.
  • Set the alpha hyper-parameter’s value to 0.1 for smoothing. This smoothing technique is commonly used in text classification tasks, where some features (words) may not appear in all classes, which would otherwise result in zero probabilities.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Assuming you have already loaded and preprocessed your training data

# Split the data into training and validation sets
train_set, val_set = train_test_split(train_data, test_size=0.2, random_state=42)

# Preprocess the text data using TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=10000, stop_words="english", ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_set["text"])
X_val = vectorizer.transform(val_set["text"])

# Create the target labels
y_train = train_set["generated"] # Assuming "label" is the column indicating student or LLM
y_val = val_set["generated"]

# Train a Multinomial Naive Bayes classifier with tuned hyperparameters
classifier = MultinomialNB(alpha=0.1) # Adjust alpha based on your hyperparameter tuning
classifier.fit(X_train, y_train)

# Make predictions on the validation set
val_predictions = classifier.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, val_predictions)
print("Accuracy:", accuracy)

# Print classification report and confusion matrix for more insights
print("Classification Report:")
print(classification_report(y_val, val_predictions))

print("Confusion Matrix:")
print(confusion_matrix(y_val, val_predictions))

Third attempt : RandomForestClassifier

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Assuming you have already loaded and preprocessed your training data

# Split the data into training and validation sets
train_set, val_set = train_test_split(train_data, test_size=0.2, random_state=42)

# Preprocess the text data using TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=10000, stop_words="english", ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_set["text"])
X_val = vectorizer.transform(val_set["text"])

# Create the target labels
y_train = train_set["generated"] # Assuming "label" is the column indicating student or LLM
y_val = val_set["generated"]

# Train a Random Forest classifier with tuned hyperparameters
classifier = RandomForestClassifier(n_estimators=100, max_depth=50, random_state=42)
classifier.fit(X_train, y_train)

# Make predictions on the validation set
val_predictions = classifier.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, val_predictions)
print("Accuracy:", accuracy)

# Print classification report and confusion matrix for more insights
print("Classification Report:")
print(classification_report(y_val, val_predictions))

print("Confusion Matrix:")
print(confusion_matrix(y_val, val_predictions))
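
And since I spoke about ensembling earlier, here is a hedged sketch (not one of my actual attempts) of how the two classifiers could be combined with soft voting on the same TF-IDF features built above:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Reuses X_train, y_train, X_val, y_val from the attempts above
ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB(alpha=0.1)),
        ("rf", RandomForestClassifier(n_estimators=100, max_depth=50, random_state=42)),
    ],
    voting="soft",  # average the predicted probabilities of both models
)
ensemble.fit(X_train, y_train)

print("Ensemble accuracy:", accuracy_score(y_val, ensemble.predict(X_val)))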

In both my attempt 2 and attempt 3, the accuracy I was getting was again 100% on my local Jupyter notebook. So, as far as local validation goes, both attempts were more like nailing jelly to a wall.

However, on Kaggle, my 3rd attempt gave me a furtherance in accuracy over what I was achieving in my 1st attempt with MultinomialNB. I reckon you can guess why the accuracies have been different when I implement the models on my local machine versus when I execute them on Kaggle. The reason is simple: on my local machine the dataset I am using is quite small, hence the 100% accuracy (as I mentioned earlier in the grid of possible reasons). On Kaggle, however, the community (or the jury, I should rather say) validates your accuracy and qualifies it against a hidden dataset which is much bigger and more diverse.

I got to go and grab a cup of coffee now. I am not sure if I’d want to write anymore today. So, if you have a comment to drop, please do. I will try to look at it at the earliest and respond. I believe this is my first article of 2024, so for all of you who have read it this far: HAPPY NEW YEAR FRIENDS :)
The winters are gloomy, but there is a fire within.

And yes, if by any chance a teacher has read this piece till the end, I hope it made sense to you (you were my target audience). I have spent close to a week explaining to my Mom how ChatGPT really works, and guess what, I can finally see a curious student in her now :p (role reversals, coming of age haha)

Godspeed. Bye :)