
Detecting copied source code in student assignments is important. In this post I'll explore a few options available to instructors here at York University, though they may also work for instructors elsewhere.
Look, there are a few notable reasons why copied source code will appear in a student's assignment:
- Motivation
- Low skill
- Lack of time
- Convenience
Given the convenience of the ground-breaking ChatGPT or old-school sources like Stack Overflow, Discord groups or just an in-person chat in the university library, an unmotivated or poorly-prepared student will likely hand in copied code. But cheating has been going on forever... and will continue as long as the risk-reward balance is in favour of the cheaters.

For the most part, it falls on individual instructors and their teaching assistants to hold the line on academic integrity. Here I'll talk about a few interesting tools that can be brought to bear by those of us with the energy and motivation to pursue anti-cheating measures.
Virtual Programming Lab
All Virtual Programming Lab activities on eClass have a "similarity" tab. Select the maximum number of files to scan (I chose 400), choose "all files" for "Files to scan", leave "other sources to add to the scan" empty, and then click "Search".
VPL will then report back and give you numbered "clusters".
You can then select from among the submissions in each cluster and it will provide a two-column view with match symbols between the two files.
I haven't used this much, as I tend to use other tools like JPlag instead, so I still need to run some comprehensive tests to see whether VPL's similarity tool works well.
JPlag
JPlag comes to us from Karlsruhe, in Germany. Like MOSS, it's designed to be used primarily from the command line. I wrote a paper for CEEA 2024 called "A Teacher's Experience with Virtual Programming Lab" in which I described the use of JPlag.
The biggest advantage JPlag has over MOSS is that the processing is done on your local machine: the reports it generates are viewable in a browser locally and none of the student data gets sent elsewhere. As of December 2024, there are three main versions, with two distinct report types.
Counterintuitively, I'm using the older, legacy Version 2 for two reasons:
- The V2 reports do a better job of comparatively visualizing individual subsections of the students' code blocks. V4 doesn't appear to do so, and this is the topic of an outstanding note on the JPlag forum.
- The V2 reports are generated and visualized on the instructor's computer and not on or with an external tool. While the V4 visualizer on GitHub states that it runs in the browser, it's not as reassuring as the static local pages that are generated in V2.
(I have not yet tried Version 5)
To operate JPlag, I did the following:
- installed Java on my Windows computer
- installed PowerShell
- downloaded all my students' code submissions from eClass and put them in Downloads/plagiarismTest/projectCode
- downloaded JPlag version 2.12 into Downloads/plagiarismTest and renamed it jplag212.jar
- made a directory called Downloads/plagiarismTest/projectCode/baseCode and put the teacher template of the source code there
- (permissions may need to be adjusted on the .jar file)
- ran the following, assuming that the students have submitted Java source code (change java19 to match your own language):
java -jar .\jplag212.jar projectCode -l java19 -bc baseCode -r reportsProjects
The results are in the index.html file found in reportsProjects.
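For reference, the directory layout that these steps produce looks roughly like this (studentA and studentB are hypothetical names standing in for the per-student folders downloaded from eClass; reportsProjects is created by JPlag when the command runs):
Downloads/plagiarismTest/
    jplag212.jar
    projectCode/
        studentA/          (one folder per student submission)
        studentB/
        baseCode/          (the teacher template, excluded via -bc)
    reportsProjects/       (created by JPlag; open index.html from here)
If your students submitted something other than Java, change the -l option; running the .jar with no arguments should print the list of supported language names.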
MOSS
MOSS is the go-to standard for many instructors in Computer Science and Engineering. It's widely considered to have well-implemented algorithms. There are some graphical front-ends to it but, like JPlag, it's designed to be used from the command line.
MOSS has a glaring downside: data must be sent to a server in the United States. If you're in the EU or if you work at an institution that believes that data should not leave the country, then using the MOSS server could get you in trouble. Plus, the web connection to the server is not secured: it's plain HTTP.
To use MOSS you have to register with the server. That generally means sending an email from a Gmail account and waiting for a reply. The reply contains some text that includes your unique user ID (a XXX-digit number). There's no need to modify it; just copy and paste the relevant portion of the text into a text editor and save it as a file. It's a Perl script and should run from the terminal... it might require permission changes.
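As for the registration email itself, at the time of writing the MOSS homepage asks for a message body in exactly this form, with your own address substituted in (double-check the current instructions before sending, in case the format has changed):
registeruser
mail yourname@gmail.com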
Testing via macOS terminal:
perl moss.pl -l java *.java
It will then return a URL to an unsecured webpage containing the comparisons and analysis.
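If you've organized the submissions as one subdirectory per student, as in the JPlag example above, MOSS's directory mode pairs nicely with that layout: the -d flag treats each subdirectory as a single submission, and -b points at the teacher-provided template so that matches against the starter code are ignored. Something along these lines, where submissions/ and Template.java are placeholder names for your own download folder and starter file:
perl moss.pl -l java -b Template.java -d submissions/*/*.java
As before, MOSS replies with a URL to the report page.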
MATLAB mfilecompare

Unlike JPlag, this only seems to do one-to-one comparisons, but it could be helpful for narrowing down searches for cheating.
This is a new one for 2024, available as a free download from The MathWorks. It allows for comparison of source code found in MATLAB .m files. You need to download all the files from eClass and rename each file to the name of the subdirectory it came from in the eClass download. Then move all those files into the folder where the MATLAB script resides. Run the script and it will provide you with a 2D graph where each dot's white intensity is proportional to the similarity between the two student files that were compared.
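The renaming step is tedious to do by hand, so here's a rough PowerShell sketch of how it could be automated, assuming each student's eClass folder contains a single .m file; eclassDownload is a placeholder name for the folder of per-student subfolders, and the commands are run from the folder where the MATLAB comparison script resides:
# Copy each student's .m file into the current folder,
# renaming it after the student's eClass subfolder.
Get-ChildItem .\eclassDownload -Directory | ForEach-Object {
    $m = Get-ChildItem $_.FullName -Filter "*.m" | Select-Object -First 1
    if ($m) {
        Copy-Item $m.FullName -Destination ($_.Name + ".m")
    }
}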
Turn-it-In
While Turn-it-in is available on eClass (Moodle) at York, it's not designed to do source code checking. It might be able to do naïve checking of text, but it isn't aware of programming language syntax, comments, whitespace, etc. Given that a 2021 writeup on the Turn-it-in website categorizes source code checking as an "emerging trend", it's clear that the company missed the boat... by about two decades! That writeup belongs in 2001, not 2021. Turn-it-in appears to be happy to sit on its contracts and IP and do little to improve its product unless it's forced to.
Alternatives
Look, it's pretty clear that cheating tools have been given a boost lately, but the fundamentals have not changed: good grades are desirable, but they require time and hard work. Add in the convenience and accuracy of cheating tools and you've got all you need for cheating to happen at scale.
We need to assume that any student work that is created and submitted out in the open is compromised in an academic integrity sense. For this reason, it should be standard practice to run student work through automated cheat-detection tools to get, at minimum, a sense of whether the problem is likely to be present and to form a baseline for manual follow-up. In large classes this can be particularly effective, both because there is more data for the tools to sift through and because doing the same job without tools is simply impractical.

James Andrew Smith is a Professional Engineer and Associate Professor in the Electrical Engineering and Computer Science Department of York University’s Lassonde School, with degrees in Electrical and Mechanical Engineering from the University of Alberta and McGill University. Previously a program director in biomedical engineering, his research background spans robotics, locomotion, human birth, music and engineering education. While on sabbatical in 2018-19 with his wife and kids he lived in Strasbourg, France and he taught at the INSA Strasbourg and Hochschule Karlsruhe and wrote about his personal and professional perspectives. James is a proponent of using social media to advocate for justice, equity, diversity and inclusion as well as evidence-based applications of research in the public sphere. You can find him on Twitter. Originally from Québec City, he now lives in Toronto, Canada.