CS 2210b Data Structures and Algorithms
Assignment 2: Plagiarism Detection
Instructor: Olga Veksler
due February 8, 11:59 pm
January 13, 2015
1 Overview
In this assignment we will develop a simple program to detect possible plagiarism in Java program files. Our base
data structure is an unordered dictionary implemented with a hash table. If two files have a long common substring,
this indicates possible cheating. Consider the example in Figure 1, where a common substring is highlighted in bold.
Therefore, to detect cheating we will search for a large common substring between two input files.
Changing whitespace between the programs changes common substrings but does not change program semantics.
For example, in Figure 2, two programs are identical, except we removed all the white space in the program on the
right. Now the longest common substring between the two programs is just the word “while”. To be insensitive to the
spacing issues, we will simply remove all the whitespace from the input files.
Another issue is that by changing variable names with a search-and-replace operation, two programs which have
the same semantics can be made to look quite different. For example, in Figure 3, the only differences between the two programs
are in the whitespace and variable names, which makes the longest common substring just “while (”.
To make our program insensitive to variable name changes, we will use the following simple trick: replace all
the user-defined identifiers (that is, the user-defined variable names, method names, class names, etc.) with the character
’#’. For example, when all the user-defined identifiers are replaced with a ’#’ and the whitespace is removed, the two
programs in Figure 3 appear identical, as shown in Figure 4. Notice that the identifier ”while” is not replaced with a ’#’
since it is a Java built-in identifier.
Thus our first step, explained in detail in Section 3, is to remove whitespace and convert all user-defined
identifiers to ’#’ in the two Java program files that are to be checked for cheating. The result is stored as a sequence of
tokens. For example, the program on the left in Figure 2 is stored as the following sequence:
while, (, # , <, 7, ), {, # , +, +, ;, } .
while (d < 7){
d++;
}
index = 0;
while (d < 7){
d++;
}
Figure 1: The longest common substring between the two programs is in bold.
while (d < 7){
d++;
}
while(d < 7){ d++;}
Figure 2: Due to whitespace difference, the longest common substring between two programs is just the word ”while”.
index = 0;
while (d < 7){
d++;
}
ind=0;
while (m < 7){
m++;
}
Figure 3: Two programs are identical except for variable names. The longest common substring is while (.
After we convert the two input programs into sequences of tokens, we need to find a common subsequence between
them. Here a subsequence always means a run of consecutive tokens. We could look for the longest common subsequence, but to simplify the assignment, we will just look for
a common subsequence of length l tokens. The length l is a user-specified parameter given as a command line
argument. Notice that there may be many common subsequences of length l tokens between the two files; your program
can output any one of them. How to find a common subsequence efficiently using hash tables is explained in Section 5.
Suppose the inputs to the program are the two files in Figure 3 and l = 4, that is, we are looking for a common
subsequence of 4 tokens. We parse both programs into sequences of tokens, and suppose we find the
following common subsequence of length 4: { #, =, 0, ; }. This subsequence corresponds to the text ”index = 0;”
in the first file and to the text ”ind=0;” in the second file. If your program printed ”#=0;”, the result would be hard for a human operator to
analyze. We should instead output the actual text in the program files that corresponds to the common
subsequence:
Found in the first file:
index = 0;
Found in the second file:
ind=0;
In order to output the actual program text, we keep track of the starting and ending position of each token in
the input file. Thus, together with its string value, each token stores its starting and ending positions. Positions count characters from 0, and both ends are inclusive. In
the example above, the common sequence of tokens { #, =, 0, ; } starts at position 0 and ends at position 9 in the first
file, and starts at position 0 and ends at position 5 in the second file, so we can easily go back to the files and print the
file parts between these positions.
The name of your program should be ”CheatDetect”. The inputs to your program are the file ”keywords.txt”,
the names of the two files to check for cheating, and an integer l which specifies the length of the common sequence
of tokens you should search for. All the inputs are given as command line arguments. The program should print to
the standard output the actual program text that corresponds to the common sequence found between the two files,
or report that there is no common sequence of length l. You can use the code I posted for this assignment and the Java
#=0;while(#<7){#++;}
#=0;while(#<7){#++;}
Figure 4: The two programs from Figure 3 with the white space removed and all identifiers replaced by #.
index = 0;
while (d < 7){
d++;
}
ind=0;
while (m > 8){
m = m-7;
}
Figure 5: The two input programs.
built-in classes String, LinkedList, and Iterator. If in doubt about which classes you are allowed to use, post a question
on the discussion board. All the other code has to be written by yourself. You cannot use code from the textbook,
the Internet, or any other source. Note that the actual program that we will use to detect any cheating in the class is
much more sophisticated than the one you are implementing.
2 Example
Suppose the two files are as in Figure 5, and the file names are file1.java and file2.java. After running
CheatDetect keywords.txt file1.java file2.java 7
the output should be:
Found in file1:
index = 0;
while
(d
Found in file2:
ind=0;
while
(m
Even though your program can output any common sequence of length 7, in this case there is only one such
sequence, shown above.
3 Converting a File to a Sequence of Tokens
To help you convert an input file to a sequence of Tokens, I provide you with a program called FileTokenRead.
FileTokenRead takes as input a file name, opens the file, and converts it to a sequence of tokens, together with
their starting and ending positions.
A token computed by FileTokenRead is a string that is either a Java built-in identifier, a user-defined identifier, or a single non-whitespace character. The sequence of tokens computed by FileTokenRead has had
the whitespace removed, but user-defined identifiers have not been replaced by ’#’. Therefore, you should
replace all the user-defined identifiers returned by FileTokenRead with ’#’.
Recall that a user-defined identifier must start with a letter (either lowercase or uppercase) and is a
consecutive string of letter, digit, or underscore (’_’) characters. I provide you with a text file ”keywords.txt” which
contains all Java built-in identifiers. Thus, given a token from FileTokenRead, you check whether it is present in the file
”keywords.txt”. There are three possibilities:
1. The token is not present in ”keywords.txt” and the token does not start with a letter character. In this case the
token is a single non-whitespace character, which you simply insert into your Token sequence without any changes.
2. The token is present in ”keywords.txt”. In this case, the token is a Java built-in identifier and you simply insert
this token into your own Token sequence without any changes.
3. The token is not present in ”keywords.txt” and the token starts with a letter character. In this case the token is a
user-defined identifier. You should replace its string value by ”#” and insert it into your sequence of tokens.
Section 4 describes how to efficiently look up whether a Token is present in the text file ”keywords.txt”.
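The three cases above can be sketched as a single method. This is only an illustration under stated assumptions: the class name TokenClassifierSketch and the method normalize are hypothetical, and java.util.Set stands in for the dictionary of Section 4, which your submission must implement with your own hash table.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class TokenClassifierSketch {
    // Returns the string to store for a token: "#" for a user-defined
    // identifier (case 3), the token itself otherwise (cases 1 and 2).
    static String normalize(String token, Set<String> keywords) {
        if (keywords.contains(token)) {
            return token;   // case 2: Java built-in identifier
        }
        if (!token.isEmpty() && Character.isLetter(token.charAt(0))) {
            return "#";     // case 3: user-defined identifier
        }
        return token;       // case 1: single non-whitespace character
    }

    public static void main(String[] args) {
        // Assumption: a Set stands in for the keyword dictionary.
        Set<String> keywords = new HashSet<>(Arrays.asList("while", "int"));
        String[] tokens = {"while", "(", "d", "<", "7", ")"};
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            sb.append(normalize(t, keywords));
        }
        System.out.println(sb);  // prints while(#<7)
    }
}
```

Checking the keyword dictionary first means that a keyword, which also starts with a letter, is never mistaken for a user-defined identifier.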
4 Searching Keywords
When you read a token from a Java program file, you have to efficiently find out whether it is in the file ”keywords.txt”. A hash-table-based dictionary is a perfect choice. Read all keywords and non-whitespace characters from the file ”keywords.txt”
and insert them into a hash-based dictionary K. This step should be done before you process the two input Java program
files. When processing the input Java files, to look up whether a token is a Java built-in identifier, simply search for it in the
dictionary K. Notice that in this case we only need the key for the dictionary; we do not have any value associated with
the key. Your dictionary implementation requires a value field, though, so you can insert anything (null, for example)
into the value field.
The Java built-in identifiers are each on a separate line in the file ”keywords.txt”. You can write your own program for reading them from the file, or you can reuse the program FileTokenRead. Ignore the positions computed by
FileTokenRead, as they are not needed for dictionary K.
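If you choose to write your own reader, a minimal sketch could look as follows. It assumes one keyword per line, as described above. The class name KeywordLoaderSketch and the method load are hypothetical, and java.util.Set is only a stand-in for dictionary K; with your own dictionary, the call keys.add(line) would become an insert of the key with a null value.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

class KeywordLoaderSketch {
    // Reads one keyword per line and collects the keys only.
    // Assumption: a Set stands in for the hash-based dictionary K.
    static Set<String> load(BufferedReader in) throws IOException {
        Set<String> keys = new HashSet<>();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty()) {   // skip blank lines
                keys.add(line);
            }
        }
        return keys;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader replaces the real file so the example runs on its own;
        // for the real file, wrap new java.io.FileReader("keywords.txt") instead.
        BufferedReader in = new BufferedReader(new StringReader("while\nfor\n\n;\n"));
        Set<String> keys = load(in);
        System.out.println(keys.contains("while"));  // prints true
        System.out.println(keys.contains("index"));  // prints false
    }
}
```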
5 Common Subsequence Detection Algorithm
At this point, we have already parsed the two files into sequences of Tokens, where each Token has a String value
and a pair of integers that specify the start and end positions of this token in the input file. Suppose the sequence of
tokens for the first file is named S1, and the sequence of tokens for the second file is named S2. We need to find a
common subsequence of length l between S1 and S2. The simplest solution would be to perform an exhaustive search
between S1 and S2. That is, we could look at every subsequence of length l in S1 and compare it to every subsequence
of length l in S2. Comparing one subsequence in S1 to every subsequence in S2 is O(n), where n is the number of tokens in a file and l is treated as a constant, and we have to repeat this O(n)
times, that is, once for every subsequence in S1; thus the time complexity would be quadratic, which is too expensive. A more
efficient solution uses a dictionary based on a hash table, as follows.
Take all subsequences of length l from sequence S1 and insert them into a new, initially empty, hash-based dictionary
D. Recall that each Token consists of a String name and the starting and ending positions. For example, here is a
sequence of three Tokens: { (while,10,14), ((,18,18), (#,19,30) }.
We need the key and value fields for dictionary D. The key will simply be the concatenation of the token names.
For example, for the Token sequence above, the key will be the string “while(#”. The value will be the starting position
of the first Token in the sequence and the ending position of the last Token, i.e. (10, 30) in this example.
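The key and value construction for one subsequence can be sketched as follows, using the three-Token example above. The nested Token class is only a minimal stand-in for the provided class (same accessor names), and the names WindowKeySketch, keyOf, and spanOf are hypothetical. An int array is used for the value here; in your program this would be a Pair.

```java
import java.util.Arrays;
import java.util.List;

class WindowKeySketch {
    // Minimal stand-in for the provided Token class, with the same accessor names.
    static class Token {
        private final String value;
        private final int start;
        private final int end;

        Token(String value, int start, int end) {
            this.value = value;
            this.start = start;
            this.end = end;
        }

        String Value() { return value; }
        int startPosition() { return start; }
        int endPosition() { return end; }
    }

    // Key: the names of the l tokens starting at index from, concatenated.
    static String keyOf(List<Token> tokens, int from, int l) {
        StringBuilder key = new StringBuilder();
        for (int i = from; i < from + l; i++) {
            key.append(tokens.get(i).Value());
        }
        return key.toString();
    }

    // Value: {start of the first token, end of the last token}.
    static int[] spanOf(List<Token> tokens, int from, int l) {
        return new int[] {
            tokens.get(from).startPosition(),
            tokens.get(from + l - 1).endPosition()
        };
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("while", 10, 14), new Token("(", 18, 18), new Token("#", 19, 30));
        int[] span = spanOf(tokens, 0, 3);
        System.out.println(keyOf(tokens, 0, 3));      // prints while(#
        System.out.println(span[0] + " " + span[1]);  // prints 10 30
    }
}
```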
After all the subsequences of length l from S1 have been inserted into the dictionary D, we start looking for a
matching subsequence in S2. We repeatedly take subsequences of length l from S2 and try to find them in the
dictionary D. Notice that we construct the key for a subsequence of S2 in exactly the same way as we did for a
subsequence of S1: simply concatenate all the token names into a single key. As soon as we find the first match, we
can stop. If, after examining all subsequences of length l from S2, we cannot find a match, then no common subsequence
of length l exists between S1 and S2.
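The whole search can be sketched as below, working on token names only and returning the starting indices of the matching subsequences. This is an illustration, not the required implementation: java.util.HashMap stands in for dictionary D, the names CommonWindowSketch and findCommon are hypothetical, and keeping only the first occurrence of a repeated key is one possible choice. The example data are the token names of the two programs in Figure 5.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CommonWindowSketch {
    // Concatenates the l token names starting at index from.
    static String keyOf(List<String> names, int from, int l) {
        StringBuilder key = new StringBuilder();
        for (int i = from; i < from + l; i++) {
            key.append(names.get(i));
        }
        return key.toString();
    }

    // Returns {start index in s1, start index in s2} of the first match,
    // or null if no common subsequence of l tokens exists.
    static int[] findCommon(List<String> s1, List<String> s2, int l) {
        if (l <= 0) {
            return null;
        }
        Map<String, Integer> d = new HashMap<>();  // assumption: stand-in for dictionary D
        for (int i = 0; i + l <= s1.size(); i++) {
            String key = keyOf(s1, i, l);
            if (!d.containsKey(key)) {
                d.put(key, i);                     // keep the first occurrence only
            }
        }
        for (int j = 0; j + l <= s2.size(); j++) {
            Integer i = d.get(keyOf(s2, j, l));
            if (i != null) {
                return new int[] { i, j };         // stop at the first match
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> s1 = Arrays.asList("#", "=", "0", ";", "while", "(", "#", "<", "7",
                                        ")", "{", "#", "+", "+", ";", "}");
        List<String> s2 = Arrays.asList("#", "=", "0", ";", "while", "(", "#", ">", "8",
                                        ")", "{", "#", "=", "#", "-", "7", ";", "}");
        System.out.println(Arrays.toString(findCommon(s1, s2, 7)));  // prints [0, 0]
        System.out.println(findCommon(s1, s2, 8) == null);           // prints true
    }
}
```

Each file is scanned once, so the number of dictionary operations grows linearly with the number of tokens rather than quadratically.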
6 Program Output
To output the matching strings, you can use the method getStringfromFile that I provided in the file
CheatDetect.java. This method takes as input the file name and the start and end positions, and returns the
contents of the file between the given positions (inclusive) as a String.
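To make the inclusive convention concrete, here is a small sketch on a string that is already in memory. It does not reproduce the provided getStringfromFile; the class InclusiveRangeSketch and the method between are hypothetical names used only for illustration.

```java
class InclusiveRangeSketch {
    // Returns the characters from start to end, both inclusive.
    // String.substring excludes its second argument, hence the + 1.
    static String between(String text, int start, int end) {
        return text.substring(start, end + 1);
    }

    public static void main(String[] args) {
        String file2 = "ind=0;\nwhile (m > 8){";
        System.out.println(between(file2, 0, 5));  // prints ind=0;
    }
}
```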
7 Classes Provided
7.1 Dictionary.java
This is the interface your dictionary should implement.
7.2 HashCode
This is the interface for the StringHashCode class.
7.3 TestHashDictionary
This is a program which the TA will use to test your hash table implementation. Compile and run it once you have
implemented your HashDictionary class. It will run some tests on your hash table and will let you know which tests
your hash table passes or fails. To get the full score on the assignment, you must pass all the tests. It will also
run your dictionary for different values of the load factor and report the average number of probes. If the average number
of probes is too high (say, above 5), it is a signal that you have not implemented hashCode very effectively.
7.4 Token
This is a class implementing Tokens. It has the following methods:
• public Token(String inputS, int pStart, int pEnd): A constructor which takes the Token’s
String name and the Token’s starting and ending positions of type int.
• public String Value(): Returns the string associated with the token.
• public int startPosition(): Returns the start position.
• public int endPosition(): Returns the end position.
7.5 FileTokenRead
This is a program for reading Tokens from a file. You need two methods from this class:
• FileTokenRead(String name): This is a constructor which takes the name of the file to read tokens
from.
• public Iterator<Token> getIterator(): Returns an iterator over the tokens read from the input file.
Here is how to use the iterator (it requires import java.util.Iterator):
    FileTokenRead words = new FileTokenRead(fileName);
    Iterator<Token> it = words.getIterator(); // grab the iterator into variable "it"
    while (it.hasNext()) {                    // check if anything is left in the iterator
        Token next = it.next();               // get the next item in the iterator
        ....
    }
7.6 Pair
This class is for storing the start and end positions of the token. The following methods are available:
• public Pair(int s, int e): A constructor which takes as an input the start and end positions.
• public int Start(): Returns the start position of Pair.
• public int End(): Returns the end position of the Pair.
You can implement any other methods that you want, but they must be declared as private methods.
8 Classes Partially Implemented
8.1 CheatDetect
This is the main program. It is only partially implemented: the method getStringfromFile(String name, int start, int end) is implemented. Section 6 describes what this method does.
9 Classes to Implement
In this section I describe the classes you have to implement. You may implement any other classes that you find
necessary.
9.1 Entry
This class represents an entry in the dictionary. For this class, you must implement all and only the following public
methods:
• public Entry(String key, Pair value): A constructor which takes a key of type String and
a value of type Pair.
• public String Key(): Returns the key in the Entry.
• public Pair Value(): Returns the value in the Entry.
You can implement any other methods that you want, but they must be declared as private methods.
9.2 DictionaryException
This is the exception class that your dictionary should throw in case of unexpected conditions.
9.3 StringHashCode
This class implements the HashCode interface and should be used to get a hash code for strings. You have to use the
polynomial accumulation hash code for strings we talked about in class. It must have only one public method:
public int giveCode(Object key)
You can implement any other methods that you want, but they must be declared as private methods. You pass
an object of this type to the constructor of the hash table, which then assigns it to a private object (let’s say its name is
hCode), of class HashCode. The hash table uses the object whenever it needs a hash code:
hCode.giveCode(key)
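As a reminder of the general idea, a polynomial accumulation hash code evaluated with Horner's rule can be sketched as follows. The nested HashCode interface is a stand-in for the provided one (assumed to declare only giveCode), and the base 33 is an assumption chosen for illustration; the constant discussed in class may differ.

```java
class PolynomialHashSketch {
    // Stand-in for the provided HashCode interface (assumed shape).
    interface HashCode {
        int giveCode(Object key);
    }

    static class StringHashCode implements HashCode {
        private static final int BASE = 33;  // assumed constant, for illustration only

        // Treats the characters as polynomial coefficients and evaluates the
        // polynomial at BASE using Horner's rule; int overflow simply wraps around.
        public int giveCode(Object key) {
            String s = (String) key;
            int code = 0;
            for (int i = 0; i < s.length(); i++) {
                code = BASE * code + s.charAt(i);
            }
            return code;
        }
    }

    public static void main(String[] args) {
        HashCode hCode = new StringHashCode();
        System.out.println(hCode.giveCode("ab"));  // prints 3299, i.e. 33 * 97 + 98
    }
}
```

Because of overflow the result can be negative, so the hash table should map it into the array range with something like Math.floorMod(code, N) rather than the % operator alone.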
9.4 HashDictionary
This class implements a dictionary based on a hash table, and should implement the provided Dictionary interface.
You should use open addressing with the double hashing strategy. Start with an initial hash table of size 7. When the load factor gets larger
than the maximum allowed load factor, increase the table size to the smallest prime number that is at least twice
the current array size N (the maximum allowed load factor is given to the constructor of the hash
table). You must design your hash function so that it produces few collisions. A bad hash function that induces many
collisions will lower your mark.
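The two numeric ingredients of this design, the probe sequence and the new table size, can be sketched as below. This is a sketch under assumptions, not the required design: the secondary hash 1 + (code mod (N - 1)) is one common choice and the one discussed in class may differ, and all names (DoubleHashingSketch, probe, nextTableSize, isPrime) are hypothetical.

```java
class DoubleHashingSketch {
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Smallest prime that is at least twice the current table size.
    static int nextTableSize(int currentSize) {
        int candidate = 2 * currentSize;
        while (!isPrime(candidate)) {
            candidate++;
        }
        return candidate;
    }

    // Index of the j-th probe (j = 0, 1, 2, ...) in a table of prime size n.
    static int probe(int code, int j, int n) {
        int h1 = Math.floorMod(code, n);          // primary hash, in 0..n-1
        int h2 = 1 + Math.floorMod(code, n - 1);  // assumed secondary hash, in 1..n-1, never 0
        return (int) ((h1 + (long) j * h2) % n);
    }

    public static void main(String[] args) {
        System.out.println(nextTableSize(7));     // prints 17
        StringBuilder sequence = new StringBuilder();
        for (int j = 0; j < 7; j++) {
            sequence.append(probe(10, j, 7)).append(' ');
        }
        System.out.println(sequence.toString().trim());  // prints 3 1 6 4 2 0 5
    }
}
```

Since the table size is prime and the step h2 is never zero, the probe sequence visits every slot of the table before repeating.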
You must implement the following public methods.
• public HashDictionary() throws DictionaryException We don’t want the user to use the
default constructor since the user has to specify the HashCode, and maximum load factor at the construction
time. Thus this default constructor must simply throw DictionaryException if it is ever called.
• public HashDictionary(HashCode inputCode, float inputLoadFactor) This is the constructor for the hash table. It takes a HashCode object and a float inputLoadFactor which specifies the maximum
allowed load factor for the hash table. If the load factor becomes larger than inputLoadFactor, the (private)
rehash() method must be invoked.
• public Entry find(String key) If entry with this key exists in the dictionary, returns this entry,
otherwise returns null.
• public Entry remove(String key) throws DictionaryException Removes and returns
entry with specified key. Throws exception if no such entry exists.
• public void insert(String key, Pair value). This method inserts a new entry in the Dictionary, with the specified key and value.
• public int size() Returns the number of entries in the dictionary.
• public float averNumProbes() This method returns the average number of probes performed by your
hash table so far. You should count the total number of operations performed by the hash table (each of find,
insert, and remove counts as one operation; don’t count any other operations) and also the total number of probes
performed so far by the hash table. When averNumProbes() is called, it should return ((float) number of probes
so far)/(number of operations so far). As you increase the maximum allowed load factor, the average number of probes
should go up. When you run the TestHashDictionary program, it will run your hash table at different load
factors and will print out the average number of probes versus running time. If you see that the average number of probes
goes up as the maximum load factor goes up, you are probably counting probes and implementing the hash table correctly.
You can implement any other methods that you want, but they must be declared as private methods. Also any
member variables must be private.
10 Coding Style
Your mark will be based partly on your coding style.
• Variable and method names should be chosen to reflect their purpose in the program.
• Comments, indenting, and whitespace should be used to improve readability.
• No variable declarations should appear outside methods (“instance variables”) unless they contain data which
is to be maintained in the object from call to call. In other words, variables which are needed only inside
methods, whose value does not have to be remembered until the next method call, should be declared inside
those methods.
• All variables declared outside methods (“instance variables”) should be declared private (not protected)
to maximize information hiding. Any access to the variables should be done with accessor methods, like Key()
and Value() for the Entry.
10.1 Hints
When you call the program with CheatDetect keywords.txt f1.java f2.java 7, the command line arguments will be stored as follows:
• args[0] = ”keywords.txt”
• args[1] = ”f1.java”
• args[2] = ”f2.java”
• args[3] = ”7”
To convert ”7” from a String to int, you can use: int length = (new Integer(args[3])).intValue();
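An alternative that avoids creating an Integer object is Integer.parseInt; the Integer(String) constructor has been deprecated since Java 9. The sketch below shows one possible way to read and validate the four arguments. The class name ArgsSketch, the helper parseLength, and the usage message are hypothetical and not part of the provided code.

```java
class ArgsSketch {
    // Returns the token count given on the command line, or -1 if the
    // argument is not a positive integer.
    static int parseLength(String arg) {
        try {
            int l = Integer.parseInt(arg.trim());
            return l > 0 ? l : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        if (args.length != 4 || parseLength(args[3]) < 0) {
            System.err.println("usage: java CheatDetect keywords.txt file1.java file2.java l");
            return;
        }
        String keywordsFile = args[0];
        String firstFile = args[1];
        String secondFile = args[2];
        int length = parseLength(args[3]);
        System.out.println(keywordsFile + " " + firstFile + " " + secondFile + " " + length);
    }
}
```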
10.2 Grading
Your grade will be computed as follows.
• Program compiles, produces a meaningful output: 15 marks
• TestHashDictionary tests pass: 40 marks (8 tests, each worth 5 marks)
• Coding Style: 10 marks
• Hash table implementation: 20 marks
• CheatDetect program implementation: 15 marks.
11 Sample Files Provided
• keywords.txt: Java keywords
• file1.java: Java program test file 1
• file2.java: Java program test file 2
• output.txt: output for file1.java and file2.java with length = 28
12 Handing In Your Program
Submit your assignment via the online OWL system.