How to make a plagiarism detector in Python

Building a plagiarism detector is a good way to learn about string matching, file operations, and user interfaces. Along the way, you will also explore basic natural language processing techniques that you can use to enhance your applications.

In this article, you will learn how to build a plagiarism checker and explore useful features of the Difflib module with TipsMake.com!

The Tkinter and Difflib modules

To build a plagiarism detector, you will use the Tkinter and Difflib modules. Tkinter is a simple, cross-platform library that you can use to create graphical user interfaces quickly.

The Difflib module is part of the Python standard library. It provides classes and functions for comparing sequences such as strings, lists, and files. With it, you can build programs such as a text autocorrector, a simple version control system, or a text summarization tool.
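
As a small illustration of the autocorrect idea, the sketch below uses Difflib's get_close_matches() function to suggest likely spellings for a mistyped word. The word list and the suggest() helper are made up for this example and are not part of the plagiarism detector.

```python
from difflib import get_close_matches

# Hypothetical word list, used only for this illustration.
KNOWN_WORDS = ["ape", "apple", "peach", "puppy"]


def suggest(word, candidates=KNOWN_WORDS):
    """Return up to three candidates that resemble `word`, best match first."""
    return get_close_matches(word, candidates, n=3, cutoff=0.6)


print(suggest("appel"))
```

The cutoff argument is the minimum similarity score (between 0 and 1) a candidate needs in order to be returned, so raising it makes the suggestions stricter.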

How to build a plagiarism detector in Python

Import the required modules. Define a function, load_file_or_display_contents(), that takes entry and text_widget as arguments. This function loads a text file and displays its contents in a text widget.

 

Use get() to read the file path from the entry. If the user has not entered a path, call askopenfilename() to open a file dialog and select the file you want to check for plagiarism. If the user selects a file, delete any previous content of the entry, from start to end, and insert the selected path.

import tkinter as tk
from tkinter import filedialog
from difflib import SequenceMatcher

def load_file_or_display_contents(entry, text_widget):
    file_path = entry.get()

    if not file_path:
        file_path = filedialog.askopenfilename()

    if file_path:
        entry.delete(0, tk.END)
        entry.insert(tk.END, file_path)

Open the file in read mode and store its contents in the text variable. Clear text_widget and insert the text you just read.

        # Still inside the "if file_path:" block, so nothing is opened
        # when the user cancels the dialog.
        with open(file_path, 'r') as file:
            text = file.read()
            text_widget.delete(1.0, tk.END)
            text_widget.insert(tk.END, text)
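
The file-reading step can also be tried without any GUI. The following is a minimal sketch, not part of the original program: read_text_file() is a hypothetical helper, and the UTF-8 default is an assumption about your files. It returns None instead of raising an error when the file is missing or is not valid text.

```python
from pathlib import Path


def read_text_file(file_path, encoding="utf-8"):
    """Return the file's text, or None if it cannot be read as text."""
    try:
        return Path(file_path).read_text(encoding=encoding)
    except (OSError, UnicodeDecodeError):
        # Missing file, permission problem, or bytes that are not valid text.
        return None
```

A caller could check for None and show a message in the text widget rather than letting the exception reach Tkinter's event loop.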

Define a function, compare_text(), that compares two pieces of text and calculates their percentage similarity. Use Difflib's SequenceMatcher class to compare the strings. Pass None as the first argument (the isjunk function) so that no characters are explicitly treated as junk, followed by the two texts you want to compare.

Use ratio() to get the similarity as a floating-point number between 0 and 1, then multiply it by 100 to get the percentage. Use get_opcodes() to retrieve the list of operations that turn the first text into the second; you will use it to highlight the matching sections. Return the percentage together with this list.

def compare_text(text1, text2):
    d = SequenceMatcher(None, text1, text2)
    similarity_ratio = d.ratio()
    similarity_percentage = int(similarity_ratio * 100)

    diff = list(d.get_opcodes())
    return similarity_percentage, diff
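
To see where the percentage comes from, it may help to run the comparison on two tiny strings. According to the Difflib documentation, ratio() is computed as 2 * M / T, where M is the number of matching characters and T is the total number of characters in both strings. The sketch below repeats compare_text() in compact form so that it runs on its own; the sample strings are chosen only for illustration.

```python
from difflib import SequenceMatcher


def compare_text(text1, text2):
    """Compact copy of the article's function: (percentage, opcodes)."""
    d = SequenceMatcher(None, text1, text2)
    return int(d.ratio() * 100), d.get_opcodes()


# "abcd" and "bcde" share the 3-character block "bcd":
# 2 * 3 matching characters / 8 total characters = 0.75.
percentage, _ = compare_text("abcd", "bcde")
print(percentage)  # 75
```

Note that int() truncates rather than rounds, so a ratio of 0.999 would be reported as 99%.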

Define a show_similarity() function. Use get() to retrieve the text from both text boxes and pass it to compare_text(). Clear the result text box and insert the similarity percentage. Remove the "same" tag left over from any previous highlighting.

 

def show_similarity():
    text1 = text_textbox1.get(1.0, tk.END)
    text2 = text_textbox2.get(1.0, tk.END)
    similarity_percentage, diff = compare_text(text1, text2)
    text_textbox_diff.delete(1.0, tk.END)
    text_textbox_diff.insert(tk.END, f"Similarity: {similarity_percentage}%")

    text_textbox1.tag_remove("same", "1.0", tk.END)
    text_textbox2.tag_remove("same", "1.0", tk.END)

get_opcodes() returns a list of 5-tuples. Each tuple contains the opcode string, the start and end index in the first string, and the start and end index in the second string.

The opcode string can be one of four values: replace, delete, insert, and equal. You get replace when a section of the first string has been replaced by different text in the second. You get delete when a section exists in the first string but not in the second.

You get insert when a section is present in the second string but not in the first, and equal when the sections are identical. Store all these values in appropriately named variables. If the opcode string is equal, add the same tag to that range of text in both text boxes.

    for opcode in diff:
        tag = opcode[0]
        start1 = opcode[1]
        end1 = opcode[2]
        start2 = opcode[3]
        end2 = opcode[4]

        if tag == "equal":
            text_textbox1.tag_add("same", f"1.0+{start1}c", f"1.0+{end1}c")
            text_textbox2.tag_add("same", f"1.0+{start2}c", f"1.0+{end2}c")
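
The opcodes are easier to understand when printed for two short strings. The sketch below runs without Tkinter; the sample strings are illustrative, and equal_spans() is a hypothetical helper that collects the same index ranges the loop above passes to tag_add().

```python
from difflib import SequenceMatcher


def equal_spans(text1, text2):
    """Return (start1, end1, start2, end2) for every 'equal' opcode."""
    opcodes = SequenceMatcher(None, text1, text2).get_opcodes()
    return [(i1, i2, j1, j2) for tag, i1, i2, j1, j2 in opcodes if tag == "equal"]


a = "qabxcd"
b = "abycdf"

# Each opcode is (tag, i1, i2, j1, j2): a[i1:i2] corresponds to b[j1:j2].
for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
    print(f"{tag:7} a[{i1}:{i2}]={a[i1:i2]!r:6} b[{j1}:{j2}]={b[j1:j2]!r}")

print(equal_spans(a, b))
```

In the GUI, each offset is turned into a text index such as "1.0+3c", which means three characters after the start of the text.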

Initialize the Tkinter root window. Set the window title and define a frame inside it. Pack the frame with padding in both directions. Define two labels that display Text 1 and Text 2. For each label, set the parent element and the text it displays.

Define three text boxes: two for the texts you want to compare and one to show the result (the third is created in a later step). Declare the parent element, width, and height, and set the wrap option to tk.WORD so that the program wraps lines at word boundaries and never breaks a word in the middle.

root = tk.Tk()
root.title("Text Comparison Tool")
frame = tk.Frame(root)
frame.pack(padx=10, pady=10)

text_label1 = tk.Label(frame, text="Text 1:")
text_label1.grid(row=0, column=0, padx=5, pady=5)
text_textbox1 = tk.Text(frame, wrap=tk.WORD, width=40, height=10)
text_textbox1.grid(row=0, column=1, padx=5, pady=5)

text_label2 = tk.Label(frame, text="Text 2:")
text_label2.grid(row=0, column=2, padx=5, pady=5)
text_textbox2 = tk.Text(frame, wrap=tk.WORD, width=40, height=10)
text_textbox2.grid(row=0, column=3, padx=5, pady=5)

Define three buttons: two to load files and one to compare. Specify the parent element, the text each button displays, and the function it runs when clicked. Create two entry widgets for the file paths, and define their parent element and width.

 

Arrange these elements in rows and columns using the grid manager. Use pack to place compare_button and text_textbox_diff. Add padding where needed.

file_entry1 = tk.Entry(frame, width=50)
file_entry1.grid(row=1, column=2, columnspan=2, padx=5, pady=5)
load_button1 = tk.Button(frame, text="Load File 1", command=lambda: load_file_or_display_contents(file_entry1, text_textbox1))
load_button1.grid(row=1, column=0, padx=5, pady=5, columnspan=2)

file_entry2 = tk.Entry(frame, width=50)
file_entry2.grid(row=2, column=2, columnspan=2, padx=5, pady=5)
load_button2 = tk.Button(frame, text="Load File 2", command=lambda: load_file_or_display_contents(file_entry2, text_textbox2))
load_button2.grid(row=2, column=0, padx=5, pady=5, columnspan=2)

compare_button = tk.Button(root, text="Compare", command=show_similarity)
compare_button.pack(pady=5)
text_textbox_diff = tk.Text(root, wrap=tk.WORD, width=80, height=1)
text_textbox_diff.pack(padx=10, pady=10)

Configure the same tag so that matching text appears in red on a light yellow background.

text_textbox1.tag_configure("same", foreground="red", background="lightyellow")
text_textbox2.tag_configure("same", foreground="red", background="lightyellow")

The mainloop() function tells Python to run the Tkinter event loop and listen for events until you close the window.

root.mainloop()

Put it all together and run the code to detect plagiarism.

Example results of plagiarism detection tool

When you run this program, it shows a window. When you click the Load File 1 button, a file dialog opens and asks you to select a file. After you select one, the program displays its contents in the first text box. When you enter a path and click Load File 2, the program displays that file's contents in the second text box. If both files are identical, clicking the Compare button reports 100% similarity and highlights all of the text.


If you add another line to one of the text boxes and then click Compare, the program highlights the matching part and leaves the rest unhighlighted.


If there are very few similarities, the program highlights only a few scattered letters or words, and the similarity percentage is quite low.
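
Scattered single-letter highlights are rarely meaningful. One possible refinement, which is not part of the original program, is to keep only matching blocks above a minimum length. The sketch below uses SequenceMatcher's get_matching_blocks() method; the significant_matches() helper, its min_length parameter, and the sample sentences are all hypothetical.

```python
from difflib import SequenceMatcher


def significant_matches(text1, text2, min_length=5):
    """Return (start1, start2, size) for matching blocks of at least min_length characters."""
    blocks = SequenceMatcher(None, text1, text2).get_matching_blocks()
    # The last block is always a zero-length sentinel, so the size
    # filter drops it too whenever min_length is at least 1.
    return [(m.a, m.b, m.size) for m in blocks if m.size >= min_length]


original = "The quick brown fox jumps over the lazy dog."
suspect = "A quick brown fox leaps over a sleepy dog."

# Print only the longer shared passages from the first text.
for start1, start2, size in significant_matches(original, suspect):
    print(repr(original[start1:start1 + size]))
```

The returned offsets could be passed to tag_add() in the same way as the opcode indices, so that only the longer shared passages are highlighted.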


 

That is how to create a plagiarism detection tool in Python. As you can see, it's pretty simple. Good luck!

Update 10 August 2023